arXivDaily: arXiv daily academic digest, updated Monday through Friday
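2606.02580 2026-06-02 cs.CV (version update)

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor

Affiliations: Cornell University

AI summary: Proposes Staged Executable Inverse Graphics (SEIG), a framework that uses pretrained vision-language models to reconstruct an editable Blender program directly from a single image, without specialized foundation models or differentiable rendering. It improves reconstruction fidelity by progressively refining geometry, materials, composition, and lighting.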

Abstract

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.

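2606.02578 2026-06-02 cs.CV cs.AI (version update)

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee, Jaeyo Shin, Hyunjung Shim

Affiliations: University of California, Berkeley; KAIST

AI summary: Addresses perceptual judgment bias, which arises in multimodal LLM judges when visual evidence conflicts with textual cues, by building a perceptually perturbed dataset and a unified training framework that combines a GRPO-based reward with a batch-ranking objective. The approach substantially improves perceptual fidelity and agreement with human evaluation.

Comments: ICML 2026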

Abstract

Recent multimodal large language models (MLLMs) have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Using controlled visual perturbations, we show that existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.

2606.02577 2026-06-02 cs.RO cs.CV (version update)

RoboDream: Compositional World Models for Scalable Robot Data Synthesis

Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li, Harshitha Rajaprakash, Pavel Tokmakov, Muhammad Zubair Irshad, Vitor Guizilini, Yue Wang

Affiliations: USC Physical Superintelligence (PSI) Lab; Toyota Research Institute

AI summary: Proposes an embodiment-centric compositional world model that decouples trajectory execution from environment synthesis, synthesizing photorealistic demonstrations with novel viewpoints, scenes, and objects, and demonstrates its effectiveness for data scaling and for reducing real-world data requirements.

Comments: Project page: https://junjieye.com/RoboDream/

Abstract

Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling, existing generative approaches are often limited to superficial visual augmentation, or suffer from embodiment hallucinations that yield physically infeasible motions. We present a generalizable embodiment-centric world model that achieves scalable data generation by synthesizing photorealistic demonstrations with novel objects, in novel scenes, and from novel viewpoints. Our approach anchors generation to rendered robot motion while conditioning on explicit scene and object priors, effectively decoupling trajectory execution from environment synthesis. This formulation has the potential to unlock two powerful data scaling capabilities: (1) retrieval and rebirth, which repurposes existing trajectories into entirely new contexts without new motion data; and (2) prop-free teleoperation, where operators manipulate empty air and the model hallucinates the target objects and scene afterwards, eliminating reset time. We demonstrate with real-world experiments that our generated data consistently improves downstream policy performance and significantly reduces real-world data requirements across diverse manipulation tasks.

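2606.02575 2026-06-02 cs.CV (version update)

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

Kiymet Akdemir, Pinar Yanardag

Affiliations: Virginia Tech

AI summary: Proposes SPAWN, which exploits a structural property of image-to-video backbones and swaps the reference-frame anchor with an external concept latent, spawning user-specified visual concepts in world models without any training.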

Abstract

Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model's priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning; introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model's own memory. SPAWN supports concepts from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
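The anchor swap described in this abstract can be pictured as a per-chunk schedule for the first context slot. The sketch below is only an illustration under stated assumptions: the function names, the string placeholders standing in for latents, and the `start`/`length` parameters are hypothetical and are not taken from the paper.

```python
def anchor_schedule(num_chunks, reference, concept, start, length):
    """Return the latent pinned to context slot 0 for each generated chunk.

    Assumption: the concept latent replaces the reference anchor only for
    chunks inside the injection window [start, start + length); afterwards
    the original reference anchor returns.
    """
    return [
        concept if start <= i < start + length else reference
        for i in range(num_chunks)
    ]


def build_context(anchor, recent_latents):
    """Hypothetical context memory: pinned anchor first, then recent history."""
    return [anchor] + list(recent_latents)


schedule = anchor_schedule(6, "reference", "concept", start=2, length=2)
context = build_context(schedule[2], ["z0", "z1"])
```

In this toy schedule the concept occupies slot 0 for two chunks only; any persistence of the concept after the window would have to come from the model's own rolling memory, as the abstract describes.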

2606.02573 2026-06-02 cs.CV (version update)

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang, Jonathan C. Liu, Zhiwen Fan, Kai Wang, Zhangyang Wang, Georgios Pavlakos

Affiliations: University of Texas at Austin; National University of Singapore; Texas A&M University

AI summary: Proposes HumanNOVA, which uses a scalable data generation pipeline and a feed-forward, token-conditioned architecture to generate photorealistic 3D human avatars rapidly from a single RGB image, with no test-time optimization.

Comments: CVPR 2026 Highlight

Abstract

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first is to leverage existing rigged assets and animate them with extensive poses from daily life. The second is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page: https://HumanNOVA.github.io

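2606.02572 2026-06-02 cs.CV (version update)

VISReg: Variance-Invariance-Sketching Regularization for JEPA training

Haiyu Wu, Randall Balestriero, Morgan Levine

Affiliations: Altos Labs; Brown University

AI summary: Proposes the VISReg regularizer, which replaces the covariance term with a sketching objective based on the sliced Wasserstein distance to constrain distributional shape more strongly, preventing embedding collapse while improving robustness and performance.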

Abstract

Self-supervised learning methods prevent embedding collapse via modeling heuristics or explicit regularization of the embedding space. Among the latter, VICReg decomposes regularization into variance and covariance objectives, offering flexibility and interpretability. However, covariance captures only second-order statistics, encouraging decorrelation but failing to enforce the full distributional shape needed for stable training. Sketching-based methods such as SIGReg address this by aligning embeddings to an isotropic Gaussian, but lack flexibility and suffer from vanishing gradients under collapse. We propose Variance-Invariance-Sketching Regularization (VISReg), which replaces covariance with a Sliced-Wasserstein-based sketching objective that enforces full distributional shape, while retaining a variance term for scale control. By decoupling scale and shape, VISReg combines VICReg's flexibility with the distributional rigor of sketching methods, providing robust gradients even under collapse. We show that VISReg scales linearly, outperforms existing regularization on low-quality datasets, and is resilient to long-tailed and low-rank regimes. Pre-trained on ImageNet-1K, VISReg achieves state-of-the-art performance on out-of-distribution datasets. Pre-trained on ImageNet-22K, it matches DINOv2's OOD performance despite the latter using 10x more data (LVD-142M). Project and code: https://haiyuwu.github.io/visreg.
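To make the two ingredients named in this abstract concrete, here is a minimal pure-Python sketch of a sliced-Wasserstein distance to a standard Gaussian and a VICReg-style variance hinge. It is not the authors' implementation: the function names, the random-projection count, the choice of a standard normal target, and the hinge form are all assumptions, and the abstract does not say how the paper normalizes or weights the two terms.

```python
import math
import random
from statistics import NormalDist


def sliced_w2_to_gaussian(embeddings, num_slices=64, seed=0):
    """Mean squared 1D Wasserstein-2 distance between random projections of
    the embeddings and a standard normal (an assumed target distribution)."""
    rng = random.Random(seed)
    n, d = len(embeddings), len(embeddings[0])
    # Standard-normal quantiles at the midpoints (i + 0.5) / n.
    targets = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    total = 0.0
    for _ in range(num_slices):
        direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        # In 1D, optimal transport simply matches sorted samples to quantiles.
        proj = sorted(sum(x * w for x, w in zip(row, direction)) for row in embeddings)
        total += sum((p - t) ** 2 for p, t in zip(proj, targets)) / n
    return total / num_slices


def variance_penalty(embeddings, target_std=1.0, eps=1e-4):
    """Hinge that penalizes per-dimension standard deviations below target_std."""
    n, d = len(embeddings), len(embeddings[0])
    penalty = 0.0
    for j in range(d):
        col = [row[j] for row in embeddings]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        penalty += max(0.0, target_std - math.sqrt(var + eps))
    return penalty / d
```

On fully collapsed embeddings (all rows identical) both terms are large, whereas samples drawn from a standard Gaussian score close to zero on both, which is the qualitative behavior the abstract attributes to shape and scale control.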

2606.02569 2026-06-02 cs.CV cs.AI cs.CL (version update)

AdaCodec: A Predictive Visual Code for Video MLLMs

Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li, Shuai Dong, Kele Shao, Ruilin Li, Dianyi Wang, Nan Duan, Jiaqi Wang

Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; JD.com

AI summary: To address inter-frame redundancy in video, proposes AdaCodec, a predictive visual code that uses the conditional predictive cost to decide whether to send a full reference frame or compact P-tokens. It improves performance at a matched visual-token budget and substantially reduces time-to-first-token.

Comments: 23 pages

Abstract

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.
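The gating rule in this abstract (full tokens for a reference frame when the predictive cost is high, compact P-tokens otherwise) can be sketched as a simple token planner. Everything specific below is an assumption for illustration: the cost is approximated by the mean absolute difference to the last reference frame, and `threshold`, `full_cost`, and `p_cost` are hypothetical parameters; the paper's actual cost and token counts are not given in the abstract.

```python
def plan_tokens(frames, threshold, full_cost=256, p_cost=16):
    """Decide, frame by frame, between a full reference frame and P-tokens.

    frames: list of equal-length lists of floats (toy stand-ins for frames).
    Returns a list of (kind, token_count) pairs.
    """
    plan = []
    reference = None
    for frame in frames:
        if reference is None:
            cost = float("inf")  # nothing to predict from: must send a reference
        else:
            # Assumed proxy for the conditional predictive cost.
            cost = sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)
        if cost > threshold:
            plan.append(("reference", full_cost))
            reference = frame
        else:
            plan.append(("P", p_cost))
    return plan


frames = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
plan = plan_tokens(frames, threshold=1.0)
total_tokens = sum(count for _, count in plan)
```

With these toy frames the scene change at the third frame triggers a new reference, while the near-duplicate frames cost only P-tokens. The 1/7 budget quoted in the abstract corresponds to 32k tokens against the 224k baseline.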

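2606.02565 2026-06-02 cs.CV (version update)

Policy-based Foveated Imaging and Perception

Howard Xiao, Jan Ackermann, Boyang Deng, Gordon Wetzstein

Affiliations: Stanford University, USA

AI summary: Proposes a real-time, predictive, task-aware foveated imaging system that uses a reinforcement-learning policy to dynamically allocate pixel bandwidth to task-relevant regions, achieving high task performance under strict pixel budgets.

Comments: Project website at https://howardxiao.ca/foveated/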

Abstract

Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through acquisition strategies such as spatial or temporal downsampling, which irrevocably discard information before task relevance can be assessed. In this work, we introduce a real-time, predictive, and task-aware foveated imaging system that operates directly at image acquisition time. Leveraging emerging dual-stream sensor architectures, our method dynamically allocates limited pixel bandwidth to task-relevant regions of interest while maintaining a low-resolution global context. We formulate foveated acquisition as a sensor attention policy-learning problem, in which past observations guide actions that determine future measurements, closing the perception-acquisition loop. Through extensive simulation across multiple perception tasks, we demonstrate that our approach achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines operating at the same bandwidth. We further validate our system on a 200-megapixel dual-stream sensor, capturing real-world videos under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.

2606.02564 2026-06-02 cs.CV (version update)

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan, Kun Gai, Jing Liao

Affiliations: City University of Hong Kong; Kling Team, Kuaishou Technology

AI summary: Proposes using a vision-language model (VLM) as a "teacher" that extracts task rules and designs differentiable rewards, guiding a video generation model (VGM) to optimize a lightweight LoRA module online at test time and thereby improving the generalization of video reasoning.

Comments: Project Page: https://VLM-as-Teacher.github.io/

Abstract

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts try to utilize Vision-Language Models (VLMs) as problem pre-solvers to produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even with a valid plan. While VLMs struggle as solvers, they possess strong perception capabilities to evaluate process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that transitions the role of VLMs to "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, guiding a VGM Reasoner via test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends the reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for achieving generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/

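2606.02553 2026-06-02 cs.CV (version update)

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Qixin Hu, Shuai Yang, Wei Huang, Song Han, Yukang Chen

Affiliations: NVIDIA; USC; MIT

AI summary: Proposes the LongLive-RAG framework, which treats historical latents in autoregressive video generation as retrievable memory, retrieves relevant historical latents with a query embedding, and introduces a Window Temporal Delta Loss, mitigating the error accumulation caused by sliding-window attention and improving long-video generation quality.

Comments: 20 pages, 7 figures, 4 tables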

Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
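The retrieval step in this abstract (a query embedding selects relevant historical latents) can be illustrated with a plain top-k cosine-similarity lookup. This is a generic sketch, not the LongLive-RAG code: the `(block_index, embedding, latent)` layout, the `exclude_recent` option, and the string placeholders for latents are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query, history, k=2, exclude_recent=0):
    """Return the k history entries whose embeddings best match the query.

    history: list of (block_index, embedding, latent) tuples, oldest first.
    exclude_recent: skip the newest entries, assumed to be already covered
    by the sliding attention window.
    """
    candidates = history[: len(history) - exclude_recent] if exclude_recent else history
    ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
    return ranked[:k]


history = [
    (0, [1.0, 0.0], "latent_0"),
    (1, [0.0, 1.0], "latent_1"),
    (2, [0.9, 0.1], "latent_2"),
    (3, [1.0, 0.05], "latent_3"),
]
hits = retrieve([1.0, 0.0], history, k=2, exclude_recent=1)
```

Here the query retrieves blocks 0 and 2 even though block 1 is more recent, which is the sense in which retrieval supplies non-local context rather than only the latest window.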

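2606.02552 2026-06-02 cs.CV cs.AI (version update)

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

Siyuan Bian, Congrong Xu, Jun Gao

Affiliations: University of Michigan; NVIDIA

AI summary: Proposes MDA, a mixture-density representation that predicts multiple depth hypotheses and their probabilities for each pixel, addressing flying-point artifacts at object boundaries in depth estimation. It substantially improves boundary reconstruction and largely removes flying points.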

Abstract

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a standard modeling choice: assigning each pixel a single depth hypothesis. At boundaries, a pixel can straddle a foreground and a background surface, so its true depth is ambiguous between the two. A model that predicts a single depth cannot keep both possibilities, so training instead pulls the prediction toward an intermediate depth that lies on neither surface. We address this with MDA, a mixture-density representation that lets the model predict multiple depth hypotheses and their associated probabilities for each pixel. Near boundaries, different hypotheses can align with different surfaces, and the decoded depth is selected from one of these hypotheses rather than placed in the empty space between them. Across different backbones, MDA substantially improves boundary reconstruction and largely removes flying-point artifacts even under severe input blur, while adding negligible runtime overhead. The same mixture-density framework naturally extends to transparent objects, where it predicts multiple depth layers at transparent pixels, and to sky regions, where a dedicated component separates the unbounded sky from finite-depth regions, producing flying-point-free skylines. Project Page: https://biansy000.github.io/mda-site/.
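A two-hypothesis toy example shows why selecting a hypothesis avoids flying points while averaging creates them. The decoding rule used here (pick the most probable hypothesis) is an assumption made for illustration; the abstract only states that the decoded depth is selected from one of the hypotheses. The depths and probabilities are made-up numbers.

```python
def expected_depth(depths, probs):
    """Probability-weighted mean depth: what a single-hypothesis regressor tends toward."""
    return sum(d * p for d, p in zip(depths, probs))


def select_depth(depths, probs):
    """Pick the depth of the most probable hypothesis (assumed decoding rule)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return depths[best]


# A boundary pixel straddling a foreground surface at 2 m and a background at 10 m.
depths = [2.0, 10.0]
probs = [0.55, 0.45]
mean_depth = expected_depth(depths, probs)    # about 5.6 m, on neither surface
selected_depth = select_depth(depths, probs)  # 2.0 m, on the foreground surface
```

The averaged estimate lands in the empty space between the two surfaces, which is exactly the flying-point artifact, whereas the selected estimate stays on a real surface.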

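2606.02551 2026-06-02 cs.RO cs.CV (version update)

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao

Affiliations: University of Michigan; University of California, San Diego; NVIDIA

AI summary: Proposes the AFUN model, which predicts a task-conditional functional mask and a 3D post-contact motion curve from a single RGB-D image and a language task description, achieves open-world generalization through a large-scale standardized data pipeline, and substantially outperforms existing methods on multiple benchmarks.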

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7-61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN

2606.02535 2026-06-02 cs.CV (version update)

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang, Qiang Hu, Guangtao Zhai, Xiaoyun Zhang

Affiliations: Shanghai Jiao Tong University

AI summary: Proposes the LL-Bench benchmark, which contains a large set of real-world degraded images with human preference annotations, systematically evaluates large-scale generative models on low-level vision tasks, and introduces the LL-Score evaluator to align better with human preferences.

Abstract

Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive Benchmark for evaluating the capabilities of large-scale generative models on Low-Level vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models, which are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Built upon LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks, compared with conventional representative restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench, which exhibit significant discrepancy with human ratings. To better align restored-image quality assessment with human preferences, we further propose LL-Score, an MLLM-based evaluator that captures both restoration quality and hallucination existence. Extensive experiments demonstrate that LL-Score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.

2606.02532 2026-06-02 cs.CV (version update)

Improving Combined Detection and Classification of TEM Defects via Mask-Conditioned Latent Diffusion Augmentation

Ni Li, Nuohao Liu, Ryan Jacobs, Ajay Annamareddy, Maciej P. Polak, Kevin Field, Izabela Szlufarska, Dane Morgan

Affiliations: University of Wisconsin-Madison; University of Michigan-Ann Arbor

AI summary: Proposes a generative data augmentation method based on a mask-conditioned latent diffusion model (LDM) that synthesizes TEM images with controllable, automatically labeled multi-class defect masks, in order to improve defect detection and classification by Mask R-CNN models trained on small datasets.

Abstract

Analyzing microstructural defects in transmission electron microscopy (TEM) images, particularly in irradiated metal alloys, is often limited by the availability of high-quality, labeled data. To address this, we introduce a generative data augmentation approach using a mask-conditioned latent diffusion model (LDM) for synthesizing realistic TEM images with controllable, automatically labeled multi-class defect masks. Without requiring manual annotations for generation, our method enables the creation of synthetic image-mask pairs by sampling distributions learned from experimental masks. These generated data were used to augment small experimental datasets of varying sizes (10, 50, and 100 labeled experimental images) to train a Mask Regional Convolutional Neural Network (R-CNN) model for defect detection and classification. Our results show that generative augmentation yields small overall model performance improvements, with up to a 0.02 gain in the harmonic mean of detection and classification F1 scores. However, we also find that the relative contributions to detection and classification improvement depend on the specific train/test data split. These findings highlight the potential of targeted generative models to enhance deep learning performance in data-scarce microscopy-based image quantification tasks.
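
The headline metric in this abstract, the harmonic mean of the detection and classification F1 scores, can be sketched in a few lines. This is a minimal illustration rather than the authors' evaluation code; the function name and the example scores are hypothetical and chosen only to show what a gain of roughly 0.02 looks like.

```python
def harmonic_mean(detection_f1: float, classification_f1: float) -> float:
    """Harmonic mean of two F1 scores; returns 0.0 when both are zero."""
    total = detection_f1 + classification_f1
    if total == 0:
        return 0.0
    return 2 * detection_f1 * classification_f1 / total


# Hypothetical scores (not taken from the paper).
baseline = harmonic_mean(0.80, 0.60)
augmented = harmonic_mean(0.82, 0.62)
gain = augmented - baseline  # roughly 0.02 for these made-up inputs
```

Because the harmonic mean is dominated by the weaker of the two scores, a small gain in the combined number can come from either detection or classification, which fits the abstract's remark that the split between the two depends on the train/test partition.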

2606.02526 2026-06-02 cs.CV cs.AI 版本更新

Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

为什么不采用超参数友好的优化?一种用于长尾识别的单调自适应范数缩放方法

Shuo Zhang, Chenqi Li, Tingting Zhu

发表机构 * University of Oxford(牛津大学)

AI总结 提出一种无需参数正则化的自适应单调归一化方法(SAMN),通过保序回归直接对类别权重范数施加单调性约束,实现超参数友好的长尾识别。

AI中文摘要

长尾识别对深度学习构成了重大挑战。两阶段解耦范式将表示学习与分类器重训练分离,提供了一种有前景的解决方案。在分类器重训练阶段,自适应范数缩放是一种流行技术。它通过参数正则化调整每类权重范数,这不可避免地引入了超参数。然而,许多研究报告指出,长尾识别对这些超参数敏感,因为它们的设置显著影响性能。在本文中,我们首先从类条件分布的角度为范数缩放方法提供支持。此外,我们提出了一种简单而有效的方法,称为自适应单调归一化(SAMN)。SAMN避免了参数正则化的需求。它直接使用保序回归算法对每类权重范数施加单调性,使该方法对超参数友好。SAMN是一种通用策略,可与其他方法无缝集成以提升性能。在基准数据集上的实验表明,我们的方法显著提升了长尾识别性能,通常达到最先进的结果。

英文摘要

Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. During the classifier retraining stage, adaptive norm rescaling is a popular technique. It adjusts the per-class weight norms via parameter regularization, which inevitably introduces hyperparameters. However, many studies report that long-tailed recognition is sensitive to these hyperparameters, as their setup significantly impacts performance. In this paper, we first provide a class-conditional distribution perspective to support norm rescaling methods. Furthermore, we propose a simple but effective approach called Self-Adaptive Monotonic Normalization (SAMN). SAMN avoids the need for parameter regularization. It directly enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm, making the method hyperparameter-friendly. SAMN is a universal strategy that integrates seamlessly with other methods for enhanced performance. Experiments on benchmark datasets demonstrate that our method significantly boosts long-tailed recognition performance, often achieving state-of-the-art results.
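
The Pool Adjacent Violators Algorithm named in the abstract is the standard solver for isotonic regression. The sketch below is a generic, unweighted version that projects a sequence onto the nearest non-decreasing sequence in the least-squares sense. How SAMN orders the classes, and whether the norms are forced to increase or decrease, is not stated in the abstract, so the direction and the function name are assumptions.

```python
def pava_non_decreasing(values):
    """Least-squares projection of `values` onto non-decreasing sequences."""
    blocks = []  # each block is [sum_of_values, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            block_sum, block_count = blocks.pop()
            blocks[-1][0] += block_sum
            blocks[-1][1] += block_count
    fitted = []
    for block_sum, block_count in blocks:
        fitted.extend([block_sum / block_count] * block_count)
    return fitted


# Hypothetical per-class weight norms, listed in an assumed class order.
norms = [1.0, 3.0, 2.0, 4.0]
monotone_norms = pava_non_decreasing(norms)  # [1.0, 2.5, 2.5, 4.0]
```

The procedure has no tunable parameter, which is consistent with the abstract's claim that the method is hyperparameter-friendly. A non-increasing fit can be obtained by negating the input and the output.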

2606.02523 2026-06-02 cs.CL cs.CV cs.CY 版本更新

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

FigSIM:用于自杀迷因的细粒度自杀严重程度和比喻语言数据集

Liuliu Chen, Elise R. Carrotte, Brian E. Chapman, Jo Robinson, Mike Conway

发表机构 * School of Computing and Information Systems, University of Melbourne, Australia(墨尔本大学计算与信息学院) Orygen, The National Centre of Excellence in Youth Mental Health, Australia(奥里根青少年心理健康国家研究中心) Centre for Youth Mental Health, University of Melbourne, Australia(墨尔本大学青少年心理健康中心) O’Donnell School of Public Health, UT Southwestern Medical Center, United States(奥唐奈公共卫生学院,西南医学中心)

AI总结 本文提出FigSIM数据集,包含1049个自杀迷因,标注了细粒度自杀严重程度、比喻现象和自杀相关内容,并评估了16个单模态和多模态模型在比喻语言、自杀严重程度和自杀相关内容检测任务上的表现,揭示了建模和内容审核的独特挑战。

Comments Content warning: contains suicide-related content. Accepted to Findings of the Association for Computational Linguistics: ACL 2026

AI中文摘要

自杀迷因是用于表达自杀相关想法或评论自杀相关问题的迷因。自杀迷因在社交媒体上越来越常见,但仍未被充分理解且可能有害。迫切需要更好地了解其特征,并制定适当的内容审核策略,以限制用户接触潜在有害内容。目前,缺乏注释的自杀迷因数据集仍然是开发和评估自动审核方法的主要障碍。在本文中,我们介绍了FigSIM,这是第一个用于自杀迷因细粒度分析的数据集。该数据集包含1049个迷因,每个迷因都标注了(1)细粒度自杀严重程度级别,(2)比喻现象(例如隐喻),以及(3)自杀相关内容(例如自杀方法描绘)。我们在三个任务上评估了16个单模态和多模态模型:比喻语言、自杀严重程度和自杀相关内容检测。总体而言,FigSIM表明自杀迷因对建模和内容审核都构成了独特的挑战。分析揭示了偏差,例如对较高自杀严重程度级别的预测不足,尤其是对于比喻迷因。该数据集(包括用于分析的分割)是公开可用的。内容警告:本文包含可能引发不适的自杀相关内容。

英文摘要

Suicide memes are memes used to express suicide-related thoughts or comment on suicide-related issues. Suicide memes are increasingly common on social media, yet remain poorly understood and potentially harmful. There is an urgent need to better understand their characteristics and to develop appropriate content moderation strategies that limit users' exposure to potentially harmful content. Currently, the absence of annotated datasets of suicide memes remains a key barrier to developing and evaluating automated moderation approaches. In this paper, we introduce FigSIM, the first dataset designed for fine-grained analysis of suicide memes. The dataset consists of 1049 memes, each annotated for (1) fine-grained suicide severity levels, (2) figurative phenomena (e.g., metaphors), and (3) suicide-related content (e.g., suicide method depiction). We benchmark 16 unimodal and multimodal models across three tasks: figurative language, suicide severity, and suicide-related content detection. Overall, FigSIM demonstrates that suicide memes pose unique challenges for both modeling and content moderation. Analysis revealed biases, such as underprediction of higher suicide severity levels, especially for figurative memes. The dataset (including splits used for analyses) is publicly available. Content Warning: This paper contains suicide-related content that may be triggering.

2606.02522 2026-06-02 cs.CV cs.AI 版本更新

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video: 诊断视频多模态大语言模型在瞬时视觉事件上的时间保真度

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学) Southeast University(东南大学) Tencent Youtu Lab(腾讯优图实验室)

AI总结 提出 Moment-Video 基准,通过瞬时视觉事件理解任务诊断视频 MLLMs 的时间保真度,发现最佳模型准确率仅 39.6%,多数开源模型低于 25%。

Comments 28 pages, 10 figures, 11 tables

AI中文摘要

视频多模态大语言模型(MLLMs)在通用和长视频理解方面取得了快速进展,但它们保留简短答案关键视觉证据的能力仍未得到充分探索。许多实际问题由瞬时视觉事件决定:可能仅持续几帧的局部化动作或状态转换。这种证据可能因稀疏帧采样而跳过、因视觉标记压缩而抑制,或因粗粒度时间聚合而稀释,导致语言端推理无法可靠恢复的失败。我们引入了 Moment-Video,一个通过瞬时视觉事件理解来诊断视频 MLLMs 时间保真度的基准。每个问题都基于局部化、视觉可观察且对采样敏感的事件,要求模型注意、计数、描述或推理瞬态证据,而非依赖持久对象、全局场景上下文或语言先验。Moment-Video 包含 1,000 个人工验证的视频问答对,涵盖 7 个领域和 25 个细分子类别,覆盖四种任务类型:时间发生、时间计数、动作描述和时间推理。我们在 Moment-Video 上评估了 33 个专有和开源 MLLMs。最佳模型 Seed-2.0-Pro 仅达到 39.6% 的整体准确率,而大多数开源模型低于 25%,揭示了瞬时视觉事件理解方面的巨大差距。诊断分析表明,更密集的帧采样改善了一些模型,但并未消除瓶颈,更长的视频带来了更强的时间定位挑战。这些发现表明,当前视频 MLLMs 仍然缺乏时间保真的表示来捕捉、保留和使用简短但决定性的视觉证据。

英文摘要

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
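
The claim that sparse frame sampling can skip a momentary event is easy to make concrete. The sketch assumes uniform sampling at segment midpoints and a hypothetical five-frame event in a 300-frame clip; neither the sampling rule nor the numbers come from the paper.

```python
def uniform_sample_indices(num_frames: int, num_samples: int):
    """Pick one frame index at the midpoint of each of `num_samples` equal segments."""
    segment = num_frames / num_samples
    return [int((i + 0.5) * segment) for i in range(num_samples)]


def event_is_seen(sampled, event_start: int, event_end: int) -> bool:
    """True if any sampled frame falls inside the inclusive event span."""
    return any(event_start <= idx <= event_end for idx in sampled)


# Hypothetical clip: 300 frames, with the decisive event on frames 100-104.
sparse = uniform_sample_indices(300, 8)
dense = uniform_sample_indices(300, 64)
seen_sparse = event_is_seen(sparse, 100, 104)  # False: the event falls between samples
seen_dense = event_is_seen(dense, 100, 104)    # True: frame 100 is sampled
```

Denser sampling makes the event visible to the model in this toy case, but as the abstract notes, seeing the frames does not by itself remove the bottleneck, since token compression and temporal aggregation can still dilute the evidence.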

2606.02518 2026-06-02 cs.CV 版本更新

ToolFG: Towards Well-Grounded Fine-Grained Image Classification

ToolFG:面向良好基础的细粒度图像分类

Yu Xue, Haoxuan Qu, Zhuoling Li, Yihang Lou, Yan Bai, Hossein Rahmani, Jun Liu

发表机构 * Lancaster University(兰卡斯特大学) Peking University(北京大学)

AI总结 提出ToolFG框架,通过MCTS引导的工具使用知识蒸馏和模型-工具协同进化机制,使MLLM自主调用外部工具获取可靠视觉线索,实现细粒度图像分类。

AI中文摘要

细粒度图像分类(FGIC)具有广泛的应用并吸引了大量研究关注。本文通过提出\textbf{ToolFG}探索了一种解决FGIC的新范式,这是首个针对FGIC定制的集成工具的多模态大语言模型(MLLM)框架。ToolFG使MLLM能够在推理过程中自主灵活地使用外部工具,主动与图像交互,并以更\textit{可靠}和\textit{良好基础}的方式收集可验证的视觉线索,以区分高度相似的类别。为了赋予模型这种工具使用能力,我们设计了一种新颖的\textbf{MCTS引导的工具使用知识蒸馏机制},该机制有效地从高级专有MLLM中挖掘与工具使用和FGIC相关的知识用于模型训练。此外,我们提出了一种\textbf{模型-工具协同进化机制},该机制共同优化工具集和模型的工具使用策略,推动它们朝向相互适应且专门针对FGIC的状态发展。大量实验证明了我们框架的有效性。

英文摘要

Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing \textbf{ToolFG}, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more \textit{reliable} and \textit{well-grounded} manner. To equip the model with such tool-use ability, we design a novel \textbf{MCTS-guided tool-use knowledge distillation mechanism}, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a \textbf{model-tool co-evolution mechanism} that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.

2606.02510 2026-06-02 cs.CV cs.RO 版本更新

Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis

并非所有点都同等重要:不确定性感知的4D LiDAR场景合成

Xiang Xu, Alan Liang, Youquan Liu, Xian Sun, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

发表机构 * NUAA(南京航空航天大学) NUS(新加坡国立大学) FDU(复旦大学) Duke(杜克大学) NTU(南洋理工大学) NJUPT(南京邮电大学) SKL-TI(特种信息处理实验室)

AI总结 提出U4D框架,利用空间不确定性引导LiDAR场景生成,通过熵图识别高不确定性区域并优先合成,再补全其余区域,实现高保真4D场景。

Comments CVPR 2026 E2E3D Workshop; GitHub at https://github.com/worldbench/U4D

AI中文摘要

从LiDAR获取的序列构建忠实的4D世界对于具身AI至关重要,但当前的生成框架对所有空间区域采用统一的建模能力。这忽略了单个扫描中感知难度的巨大差异:远距离表面、遮挡边界和小尺度物体比良好观测的结构具有更高的不确定性。我们提出了U4D,一种新的框架,明确利用空间不确定性以“从难到易”的顺序引导LiDAR场景生成。U4D通过预训练分割器的香农熵推导逐点不确定性图,然后应用无条件扩散阶段合成具有精确几何的高熵区域,接着是条件补全阶段,利用这些结构作为先验填充剩余区域。MoST(时空混合)块通过动态平衡空间细节和时间连续性进一步维护跨帧一致性。在nuScenes和SemanticKITTI上的大量实验证明了最先进的场景保真度、时间一致性和下游性能。

英文摘要

Constructing faithful 4D worlds from LiDAR-acquired sequences is crucial for embodied AI, yet current generative frameworks apply uniform modeling capacity across all spatial regions. This ignores that perceptual difficulty varies dramatically within a single scan: distant surfaces, occluded boundaries, and small-scale objects carry far higher uncertainty than well-observed structures. We present U4D, a new framework that explicitly leverages spatial uncertainty to guide LiDAR scene generation in a "hard-to-easy" schedule. U4D derives per-point uncertainty maps via Shannon Entropy from a pretrained segmentor, then applies an unconditional diffusion stage to synthesize high-entropy areas with precise geometry, followed by a conditional completion stage that fills in the remaining regions using these structures as priors. A MoST (Mixture of Spatio-Temporal) block further maintains cross-frame coherence by dynamically balancing spatial detail and temporal continuity. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art scene fidelity, temporal consistency, and downstream performance.
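
The per-point uncertainty described in the abstract is the Shannon entropy of a segmentor's class probabilities. The sketch below computes that entropy and ranks points from most to least uncertain, which is one simple reading of the "hard-to-easy" schedule. The function names and the toy probabilities are hypothetical; the paper's actual thresholding or region grouping is not described in the abstract.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of one point's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hard_to_easy_order(per_point_probs):
    """Indices of points sorted from highest to lowest entropy."""
    entropies = [shannon_entropy(p) for p in per_point_probs]
    return sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)


# Toy softmax outputs for three points over two classes (made-up values).
points = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
order = hard_to_easy_order(points)  # [1, 0, 2]
```

A confidently segmented point has entropy near zero, while a point with evenly split probabilities reaches the maximum of log(K) for K classes, so the ranking puts ambiguous regions such as occlusion boundaries first.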

2606.02498 2026-06-02 cs.CV 版本更新

GloResNet: A lightweight 3D CNN with global topological features for preterm brain injury prediction

GloResNet:一种用于早产儿脑损伤预测的轻量级3D CNN与全局拓扑特征

Boyu Yuan, Jiamiao Lu, Weichuan Zhang, Benqing Wu, Tuo Wang, Changshan Wang, Changming Sun, Liang Guo

发表机构 * Image Computing Laboratory, Shaanxi University of Science and Technology(陕西科技大学图像计算实验室) Department of Neonatology, Shenzhen University of Advanced Technology General Hospital(深圳先进技术医院新生儿科) Department of Neurosurgery, The First Affiliated Hospital of Xi’an Jiaotong University(西安交通大学第一附属医院神经外科) CSIRO Technology(澳大利亚CSIRO技术)

AI总结 提出基于ResNet-10的轻量级3D CNN GloResNet,结合全局流形映射和预处理策略,在dHCP数据集上实现早产儿脑损伤预测,平均准确率75.18%。

AI中文摘要

本研究引入了一个自动化深度学习框架,用于从T2加权MRI(dHCP数据集)预测早产儿脑损伤(BI)。我们提出了GloResNet,一种基于ResNet-10的轻量级3D CNN,并在MedicalNet上预训练以应对数据稀缺。一种全局流形映射策略首先将每个3D体积重采样为128x128x128,然后应用逐样本z分数强度归一化,从而在标准化外观的同时保留全局拓扑。训练集成了mixup、类别加权和测试时增强以提高鲁棒性。在5折交叉验证中,GloResNet达到了75.18%的平均准确率(峰值81.82%),特异性0.81,敏感性0.76。结果表明,拓扑感知的轻量级CNN能够有效预测新生儿脑损伤,提供了一种非侵入性筛查工具。本文源代码可从GitHub仓库获取:https://github.com/ICL-SUST/GloResNet-Preterm-Brain

英文摘要

This study introduces an automated deep learning framework for predicting brain injury (BI) in preterm infants from T2-weighted MRI (dHCP dataset). We propose GloResNet, a lightweight 3D CNN based on ResNet-10, pretrained on MedicalNet to address data scarcity. A global manifold mapping strategy first resamples each 3D volume to 128x128x128 and then applies subject-wise z-score intensity normalization, thereby preserving global topology while standardizing appearance. Training integrates mixup, class weighting, and test-time augmentation for robustness. In 5-fold cross-validation, GloResNet achieved 75.18% average accuracy (peak 81.82%), with specificity 0.81 and sensitivity 0.76. Results demonstrate that a topology-aware lightweight CNN can effectively predict neonatal BI, offering a non-invasive screening tool. The source code of this paper can be obtained from the GitHub repository: https://github.com/ICL-SUST/GloResNet-Preterm-Brain
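
Subject-wise z-score normalization, the second step of the preprocessing described in the abstract, standardizes each volume with its own mean and standard deviation. The sketch operates on a flattened list of voxel intensities and uses the population standard deviation; both choices, and the function name, are assumptions rather than details from the paper or its repository.

```python
from statistics import fmean, pstdev


def zscore_normalize(voxels):
    """Standardize one subject's voxel intensities to zero mean and unit variance."""
    mean = fmean(voxels)
    std = pstdev(voxels)
    if std == 0:
        # A constant volume carries no contrast; map it to zeros.
        return [0.0 for _ in voxels]
    return [(v - mean) / std for v in voxels]


# Toy intensities standing in for a resampled 128x128x128 volume.
normalized = zscore_normalize([1.0, 2.0, 3.0])
```

Because the statistics are computed per subject, intensity offsets between scans are removed without mixing information across subjects, so the step cannot leak data between cross-validation folds.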

2606.02491 2026-06-02 cs.CV 版本更新

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

MORPHOS: 基于时间结构化潜变量的自回归4D生成

Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim

发表机构 * KAIST AI(韩国国立科学技术院人工智能实验室)

AI总结 提出MORPHOS框架,利用时间结构化潜变量(T-SLAT)统一表示4D动态资产,通过自回归因果注意力生成,解决多表示兼容、拓扑变化和长时间一致性问题。

Comments Project page: https://cvlab-kaist.github.io/MORPHOS/

AI中文摘要

我们提出MORPHOS,一种新颖的自回归框架,能够从视频生成动态3D资产,支持多种表示,包括网格、3D高斯和辐射场。现有方法通常局限于单一表示,难以建模拓扑变化,或在长视频中无法保持时间一致性。为解决这些限制,我们引入时间结构化潜变量(T-SLAT),一种统一的4D表示,沿时间维度联合编码几何和外观。利用T-SLAT,MORPHOS通过因果注意力自回归生成动态3D资产,将每一帧条件于其先前历史,以确保时间一致性并处理演化的拓扑。我们还提出一种时间结构增强,以减轻自回归生成中的误差累积。MORPHOS在多个基准测试中实现了外观方面的最先进性能和几何方面的竞争性结果,展示了跨多种表示的卓越泛化能力和长时程生成的鲁棒性。

英文摘要

We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.

2606.02481 2026-06-02 cs.CV 版本更新

Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research

野外场景:一个用于生态有效视觉研究的大规模高分辨率RAW照片数据集

Michelle R. Greene

AI总结 本文提出了一个包含67,574张高分辨率RAW照片的数据集,通过360度视角采样覆盖260个场景类别,支持视角依赖识别、真实场景理解及自然场景统计研究。

Comments 19 pages, 3 tables, 4 figures

AI中文摘要

大规模图像数据集加速了认知神经科学和计算机视觉的进展。然而,大多数数据集是低分辨率、来自互联网的JPEG图像,其拍摄条件未知且空间上下文有限。野外场景数据集包含67,574张高分辨率照片,这些照片在810个物理位置现场采集,涵盖260个基本级场景类别,包括室内、城市和自然环境。在每个位置,安装在全景三脚架上的4500万像素佳能EOS R5相机以5度水平间隔拍摄72张图像,并在不同仰角拍摄12张图像,实现了密集的360度视点采样。所有图像同时记录为14位RAW(CR3)文件和压缩JPEG文件,保留了传感器级别的细节,用于分析亮度、对比度、颜色和其他图像统计信息。该数据集附有完整的EXIF元数据和一套图像质量指标。野外场景数据集支持人类和模型中视角依赖识别的研究、在真实条件下训练和评估场景理解系统、自然场景统计特征的刻画,以及需要近全视野视觉显示的实验。

英文摘要

Large image datasets have accelerated progress in cognitive neuroscience and computer vision. However, most datasets are low-resolution, internet-sourced JPEGs with unknown capture conditions and limited spatial context. Places in the Wild is a dataset of 67,574 high-resolution photographs collected in situ across 810 physical locations spanning 260 basic-level scene categories, including indoor, urban, and natural environments. At each location, a 45-megapixel Canon EOS R5 mounted on a panoramic tripod captured 72 images at 5-degree horizontal intervals plus 12 images at varying elevations, yielding dense 360-degree viewpoint sampling. All images were recorded simultaneously as 14-bit RAW (CR3) files and compressed JPEGs, preserving sensor-level detail for analyses of luminance, contrast, color, and other image statistics. The dataset is accompanied by complete EXIF metadata and a suite of image-quality metrics. Places in the Wild supports research on viewpoint-dependent recognition in humans and models, training and evaluation of scene-understanding systems under realistic conditions, characterization of natural scene statistics, and experiments requiring near-full-field visual displays.

2606.02479 2026-06-02 cs.CV 版本更新

Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

检索缺失内容:面向一致长视频生成的覆盖最大化检索

Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee, Hyunwoo J. Kim

发表机构 * Korea University(高丽大学) KAIST(韩国科学技术院)

AI总结 提出基于深度的覆盖最大化检索增强生成框架COVRAG,利用预训练3D先验构建轻量级覆盖图作为记忆证据,通过迭代检索最大化残差覆盖来提升长视频生成的几何一致性。

Comments 19 pages, 10 figures, 5 tables

AI中文摘要

对于长时域自回归视频生成,保持长期几何一致性仍然具有挑战性。记忆增强生成模型通过检索历史帧来解决这一问题,但其有效性取决于两个关键设计选择:哪些3D几何证据应代表过去的观测,以及如何从这些证据中选择记忆帧。现有方法通常依赖相机位姿或视场重叠,这些方法轻量但过于粗糙,无法推理像素级可见性;或者使用显式3D重建,提供细粒度证据但在长序列中维护成本高昂。我们提出覆盖最大化检索增强生成(COVRAG),一种基于深度的记忆检索框架,利用预训练3D先验构建目标视图覆盖图作为轻量级3D记忆证据。在帧选择方面,COVRAG最大化残差覆盖增益,迭代检索能够解释当前上下文或先前选择的记忆未覆盖的目标视图区域的帧。为了提高长视频生成的可扩展性,我们引入滑动窗口深度缓存以实现高效的几何估计。在RealEstate10K和DL3DV10K上的实验表明,COVRAG在保持低延迟的同时,相比基线方法改善了长时域几何一致性。

英文摘要

Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
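
The frame-selection rule in the abstract, iteratively retrieving the frame that covers the most not-yet-covered target-view regions, reads like a greedy set-cover step. The sketch models each candidate frame's coverage as a set of target-view pixel indices. That representation, the function name, and the tie-breaking by frame id are assumptions; the paper's coverage maps are built from depth and are not reproduced here.

```python
def select_memory_frames(frame_coverage, already_covered, budget):
    """Greedily pick up to `budget` frames that maximize residual coverage gain.

    frame_coverage: dict mapping frame id -> set of target-view pixels it covers.
    already_covered: pixels explained by the current context.
    """
    covered = set(already_covered)
    candidates = dict(frame_coverage)
    selected = []
    for _ in range(budget):
        best_id, best_gain = None, 0
        for frame_id in sorted(candidates):
            gain = len(candidates[frame_id] - covered)
            if gain > best_gain:
                best_id, best_gain = frame_id, gain
        if best_id is None:  # no remaining frame adds new coverage
            break
        selected.append(best_id)
        covered |= candidates.pop(best_id)
    return selected, covered


# Toy example with four target-view pixels (0..3); pixel 0 is already covered.
frames = {"f1": {0, 1, 2}, "f2": {2, 3}, "f3": {0, 1}}
chosen, covered = select_memory_frames(frames, {0}, budget=2)
```

Measuring gain against the residual rather than the full target view is what stops the retriever from picking several near-duplicate frames that all see the same region.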

2606.02463 2026-06-02 cs.CV cs.AI 版本更新

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

MASER: 面向具身3D空间智能的模态自适应专家路由

Hilton Raj, Vishnuram AV

发表机构 * Boston University(波士顿大学)

AI总结 提出MASER框架,通过训练共享VLM骨干的五个模态适配器并学习基于问题选择最佳适配器的神经路由策略,解决具身代理在3D环境中多模态推理时忽略问题语义的问题。

Comments Accepted to CVPR 2026 Foundation Models Meet Embodied Agents Workshop

AI中文摘要

在3D环境中,具身代理通过推理自然语言、RGB图像、点云、深度图和相机位姿等多模态信息来回答空间相关问题。现有的视觉语言模型(VLM)在单一模态上微调,完全忽略了可能偏好不同于微调模态的问题语义。为解决这一问题,我们提出MASER(模态自适应专家路由),一个轻量级框架,训练共享VLM骨干的五个不同模态适配器,并学习一个神经路由策略,在推理时根据问题选择最佳适配器。我们使用冻结的句子变换器对每个问题进行编码,并将嵌入通过一个小型多层感知器(MLP),该感知器在oracle适配器-准确率标签上训练。我们在Open3D-VQA基准上评估我们的方法,评估结果表明没有单一模态是普遍最优的——点云答案在51.5%的情况下最佳。MASER以51.3%的oracle一致性进行路由,优于随机森林消融(43.5%),且每个问题仅调用一次适配器。

英文摘要

In 3D environments, embodied agents answer spatially relevant questions by reasoning over a mixture of modalities, including natural language, RGB images, point clouds, depth maps, and camera poses. Existing vision-language models (VLMs) are fine-tuned on a single modality. This ignores the question semantics, which may favor a different modality than the fine-tuned one. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five modality adapters on a shared VLM backbone and learns a neural routing policy that selects the best adapter for each question at inference time. We encode each question with a frozen sentence transformer and pass the embedding through a small multi-layer perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our method on the Open3D-VQA benchmark; the results show that no single modality is universally optimal: point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.
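
The routing step can be sketched as a one-hidden-layer MLP over a question embedding, followed by an argmax over adapters, with oracle agreement as the fraction of questions routed to the oracle-best adapter. The adapter names, the tiny dimensions, and the hand-set weights are all hypothetical; the real router takes sentence-transformer embeddings and learned weights.

```python
# Hypothetical adapter names; the abstract only says there are five.
ADAPTERS = ["text", "rgb", "point_cloud", "depth", "camera_pose"]


def affine(weights, bias, x):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def route(embedding, w1, b1, w2, b2):
    """Return the adapter with the highest router logit for one question."""
    hidden = [max(0.0, h) for h in affine(w1, b1, embedding)]  # ReLU
    logits = affine(w2, b2, hidden)
    best = max(range(len(logits)), key=logits.__getitem__)
    return ADAPTERS[best]


def oracle_agreement(predicted, oracle):
    """Fraction of questions where the router picks the oracle-best adapter."""
    return sum(p == o for p, o in zip(predicted, oracle)) / len(oracle)


# Toy 2-D "embedding" and hand-set weights, purely for illustration.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
b2 = [0.0] * 5
picked = route([0.2, 0.9], w1, b1, w2, b2)  # "point_cloud"
```

Since the router runs before any adapter is called, inference costs one adapter forward pass per question, which is the efficiency point the abstract makes.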

2606.02459 2026-06-02 cs.CV 版本更新

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

像鸽子一样主动探索:通过智能视觉语言模型强化空间推理

Wei Deng, Xianlin Zhang, Mengshi Qi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种受鸽子认知地图启发的智能视觉语言模型管道,通过动态认知地图和空间断言代码提供密集奖励信号,在MindCube基准上实现80.5%的总体准确率,在Rotation子集上相对提升53.2%。

Comments Accepted by ICML 2026

AI中文摘要

使视觉语言模型(VLM)能够进行空间推理仍然具有挑战性。现有方法将VLM视为被动观察者,这在实际应用中难以奏效。此外,强化学习方法依赖稀疏奖励,限制了其在复杂推理任务中的有效性。受鸽子构建和利用认知地图进行导航的启发,我们提出了一种新颖的智能管道用于空间推理。首先,我们引入了一种新的\emph{动态认知地图},将场景布局参数化为物体位置和朝向,作为新观测的持久记忆。其次,我们提出了一种新颖的\emph{空间断言代码(SAC)},即用Python表达式编程描述空间关系。通过与动态认知地图协作,SAC能够验证中间推理步骤,提供密集的奖励信号。我们通过监督学习和强化微调来优化模型。在MindCube基准上的实验表明,我们的方法达到了\emph{80.5\%}的总体准确率,在具有挑战性的\textsc{Rotation}子集上,比当前最佳方法高出\emph{29.5}个准确率点(相对提升\emph{53.2\%})。我们的代码和数据已在https://github.com/dw-dengwei/active-spatial-reasoning.git开源。

英文摘要

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their usefulness in real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by how pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new \emph{dynamic cognitive map} that parameterizes scene layout as object positions and orientations and serves as persistent memory for new observations. Second, we propose \emph{Spatial Assertion Codes (SAC)}, Python expressions that programmatically describe spatial relationships. Working together with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with \emph{80.5\%} overall accuracy, outperforming the best current method by \emph{29.5} accuracy points (a relative improvement of \emph{53.2\%}) on the challenging \textsc{Rotation} subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
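
One way to picture the interplay between the cognitive map and the assertion codes is a dictionary of object poses plus a list of Python predicates checked against it, with the pass rate used as a dense reward. The map layout, the field names, and the use of callables instead of expression strings are all assumptions made for this sketch; the actual SAC format is defined in the paper and its repository, not in the abstract.

```python
# Hypothetical cognitive map: object name -> position (x, y) and heading in degrees.
cognitive_map = {
    "chair": {"x": 0.0, "y": 1.0, "yaw": 90.0},
    "table": {"x": 2.0, "y": 1.0, "yaw": 0.0},
}

# Hypothetical spatial assertions produced during intermediate reasoning steps.
assertions = [
    lambda m: m["chair"]["x"] < m["table"]["x"],  # chair has the smaller x coordinate
    lambda m: m["chair"]["y"] > m["table"]["y"],  # chair has the larger y (false here)
]


def assertion_reward(scene_map, checks):
    """Fraction of assertions that hold on the map; failed lookups count as false."""
    if not checks:
        return 0.0
    passed = 0
    for check in checks:
        try:
            passed += bool(check(scene_map))
        except KeyError:
            pass  # assertion refers to an object that is not in the map
    return passed / len(checks)


reward = assertion_reward(cognitive_map, assertions)  # 0.5
```

Scoring every intermediate assertion gives the learner feedback at each reasoning step, in contrast to a sparse reward that only reflects whether the final answer was correct.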

2606.02453 2026-06-02 cs.CV cs.AI 版本更新

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

初始化即半程:从引导势后验生成多样图像

Xiang Li, Dianbo Liu, Kenji Kawaguchi

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Tokyo(东京大学)

AI总结 针对生成模型模式崩溃问题,提出从引导势后验中采样初始噪声的DivIn方法,利用朗之万动力学引导初始化远离崩溃区域,提升多样性且兼容扩散与流匹配模型。

Comments Accepted by ICML 2026 Spotlight

AI中文摘要

尽管生成模型具有显著的保真度,但它们经常遭受模式崩溃。现有的增强多样性的策略主要集中于在生成轨迹期间进行干预。我们发现一个关键的疏忽:标准高斯初始化通常导致轨迹崩溃到主导模式,因为它对引导势景观是无关的。在这项工作中,我们从引导势后验中公式化选择初始噪声,这有效地将先验重新加权到多样性丰富的区域。为了高效地从该分布中采样,我们引入了多样性诱导初始化(DivIn),它利用朗之万动力学主动导航初始化景观,将初始噪声引导远离崩溃区域,同时将其锚定到有效的数据流形。我们的方法作为一种推理时多样性增强,与扩散和流匹配模型都兼容。大量实验表明,DivIn在类到图像和文本到图像场景中都表现出优越的性能。此外,我们强调,由于DivIn与基于轨迹的方法是正交的,将它们结合起来显著扩展了多样性-质量帕累托前沿,超越了任何单独方法所能达到的。

英文摘要

Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly focus on intervening during the generation trajectory. We identify a critical oversight: the standard Gaussian initialization often causes trajectories to collapse into dominant modes because it is agnostic to the guidance potential landscape. In this work, we formulate the selection of the initial noise as sampling from a guidance potential posterior, which effectively re-weights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduce Diversity-inducing Initialization (DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering the initial noise away from collapsing regions while anchoring it to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn exhibits superior performance in both class-to-image and text-to-image scenarios. Furthermore, we highlight that because DivIn is orthogonal to trajectory-based methods, combining them significantly expands the diversity-quality Pareto frontier beyond what either achieves in isolation.
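
Langevin dynamics over the initial noise can be sketched with the unadjusted update z ← z + η·score(z) + sqrt(2η)·ξ. The sketch assumes the posterior has the form p(z) ∝ N(z; 0, I)·exp(−U(z)), so its score is −z − ∇U(z). That form, the potential U, and all names below are assumptions for illustration; the abstract does not give DivIn's actual potential or update rule.

```python
import math
import random


def posterior_score(z, grad_potential):
    """Score of the assumed posterior: Gaussian prior term minus the potential gradient."""
    return [-zi - gi for zi, gi in zip(z, grad_potential(z))]


def langevin_sample(z0, grad_potential, step_size, n_steps, rng):
    """Run unadjusted Langevin dynamics starting from the initial noise z0."""
    z = list(z0)
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(n_steps):
        score = posterior_score(z, grad_potential)
        z = [zi + step_size * si + noise_scale * rng.gauss(0.0, 1.0)
             for zi, si in zip(z, score)]
    return z


# Toy potential U(z) = sum(z), whose gradient is 1 in every coordinate (hypothetical).
def toy_grad(z):
    return [1.0 for _ in z]


sample = langevin_sample([1.0, -0.5], toy_grad, step_size=0.1, n_steps=50,
                         rng=random.Random(0))
```

The prior term keeps the noise close to the Gaussian the generator expects, while the potential term pushes it away from regions the potential penalizes, which mirrors the anchoring and steering roles described in the abstract.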

2606.02449 2026-06-02 cs.AI cs.CL cs.CV cs.LG cs.MM 版本更新

HLL: Can Agents Cross Humanity's Last Line of Verification?

HLL:智能体能否跨越人类最后一道验证防线?

Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen, Zhihua Wei, Gongshen Liu, Linfeng Zhang, Dongrui Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学) Tongji University(同济大学)

AI总结 提出HLL基准,通过交互式CAPTCHA验证评估多模态智能体在受保护工作流中替代人类的能力,发现当前智能体在定位、动作校准、状态跟踪和过程一致性方面存在脆弱性。

Comments 27 pages, 14 figures

AI中文摘要

多模态智能体越来越被期望代表用户操作界面,这引发了一个核心部署问题:在服务特意防止自动化的流程中,它们能否真正替代人类?CAPTCHA验证使这个问题具体化。它不仅仅是一个视觉谜题,更是在账户创建、内容访问、表单提交和其他受保护操作之前设置的人类验证边界。我们引入了\textbf{人类最后一道验证防线(HLL)},这是一个受控基准,使用交互式CAPTCHA验证来评估智能体是否能够通过基于环境的类人交互(而非仅识别)跨越这一边界。HLL涵盖了多种CAPTCHA交互,并让智能体暴露于受控的现实压力因素下,包括杂乱的网页、更困难的任务变体以及解决过程的轨迹条件验证。我们在闭环GUI环境中评估了八个前沿多模态智能体。结果表明,当前智能体在这个人类替代边界上仍然脆弱:性能在不同验证类型间差异显著,在现实界面条件下下降,当正确答案必须由有效动作轨迹支持时进一步下降。通过揭示定位、动作校准、状态跟踪和过程一致性方面的差距,HLL为衡量多模态智能体在受保护的真实世界工作流中作为人类替代品有多接近提供了一个具体的测试平台。我们的代码可在https://github.com/XinhaoS0101/HLL获取。

英文摘要

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce \textbf{Humanity's Last Line of Verification (HLL)}, a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL

2606.02443 2026-06-02 cs.CL cs.AI cs.CV 版本更新

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

PaSBench-Video: 面向主动安全预警的流式视频基准

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学)

AI总结 提出PaSBench-Video基准,包含740个视频,评估多模态大模型在危险发生前及时发出预警的能力,发现现有模型在时序精度和低误报率上表现不佳。

AI中文摘要

从危险的第一个可见迹象到事故发生之间,通常存在一个仍可干预的时间窗口。具备视频能力的多模态大语言模型(MLLM)可以作为始终在线的安全监控器,在此窗口内发出警告。然而,当前的基准测试并未检验这一能力:它们依赖静态输入,忽略时间精度,并且省略了对安全场景的误报测量。我们提出了PaSBench-Video,一个包含740个视频的基准测试,涵盖驾驶、医疗、日常生活和工业生产四个领域,其中包含481个风险视频和259个无风险视频。风险视频标注了帧级别的风险起始点和事故边界。模型必须以因果方式观察视频,并发出在时间上校准且内容正确的警告。测试了13个MLLM后,我们发现没有模型在我们的最严格指标上超过20.0%,并且召回率与误报率紧密相关,皮尔逊相关系数为0.64:更高的检测率只能以在大多数安全片段上触发警告为代价。性能按领域显著分化:在日常生活领域,模型在低误报率下实现了中等召回率,因为该领域的风险本质上是异常的;而在驾驶领域,模型不加区分地触发警告,因为常规场景和危险场景看起来相似。这些结果表明,当前模型依赖于场景级别的活动线索,而不是推理正在出现的危害。

英文摘要

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
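
The coupling between recall and false-positive rate is reported as a Pearson correlation across models. A minimal sketch of that statistic follows; the per-model numbers are made up for illustration and are not results from the benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-model recall and false-positive rates (not from the paper).
recall = [0.20, 0.35, 0.50, 0.65, 0.80]
false_positive_rate = [0.15, 0.40, 0.30, 0.70, 0.60]
r = pearson(recall, false_positive_rate)
```

A positive coefficient across models means that those which catch more hazards also tend to raise more alarms on safe clips, which is why the benchmark reports false positives alongside recall.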

2606.02441 2026-06-02 cs.CV Updated version

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Lizhuang Ma, Jiangning Zhang

Affiliations * Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Zhejiang University

AI summary Proposes the ST-DRC framework, which achieves high-fidelity identity-preserving video generation through spatial-temporal decoupled reference conditioning, the TASS-RoPE mechanism, and identity objectives.

Abstract

Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.
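The abstract does not spell out the three-stream guidance rule. The sketch below shows one common form of two-condition classifier-free guidance, assuming three denoiser outputs (unconditional, text only, text plus reference) and two scales. The function name, argument names, and the order in which the conditions are stacked are all assumptions, not the authors' formulation.

```python
def three_stream_cfg(eps_uncond, eps_text, eps_text_ref, text_scale, ref_scale):
    """Combine three denoiser outputs with separate guidance scales.

    eps_uncond:   prediction with neither text nor reference (assumed stream)
    eps_text:     prediction with text only (assumed stream)
    eps_text_ref: prediction with text and reference (assumed stream)
    """
    return [
        # Start from the unconditional prediction, then push along the
        # text direction and the reference direction independently.
        u + text_scale * (t - u) + ref_scale * (tr - t)
        for u, t, tr in zip(eps_uncond, eps_text, eps_text_ref)
    ]
```

With both scales set to 1 the result equals the fully conditioned prediction; raising `text_scale` or `ref_scale` separately strengthens prompt adherence or reference fidelity in this sketch.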

2606.02436 2026-06-02 cs.CV Updated version

Geometry-Aware Implicit Memory for Video World Models

Zhengxuan Wei, Xu Guo, Xinghui Li, Xunzhi Xiang, Min Wei, Yiran Zhu, Qiulin Wang, Xintao Wang, Pengfei Wan, Xiangwang Hou, Qi Fan

Affiliations * School of Intelligence Science and Technology, Nanjing University; Kling Team, Kuaishou Technology; Tsinghua University

AI summary Proposes the GIM-World framework, which uses a lightweight Transformer encoder to compress variable-length history into fixed-size memory tokens and a camera-queryable geometry head to distill 3D scene structure from a frozen foundation model during training, preserving geometric and visual consistency in long-horizon video generation.

Comments Project page: https://gim-world.github.io/

Abstract

Video world models aim to simulate controllable visual environments, but long-horizon rollouts depend on what the model remembers after observations leave its native context window. Explicit memories retain frames or online 3D reconstructions, which can suffer from heuristic retrieval errors, redundant appearance storage, or reconstruction artifacts. Implicit memories compress history into a compact state, but existing designs are not explicitly constrained to encode cross-view scene geometry. We propose GIM-World, a geometry-aware implicit memory framework for video world models. A lightweight transformer encoder compresses variable-length history into fixed-size memory tokens, a camera-queryable geometry head distills 3D scene structure from a frozen foundation model into the memory during training, and an information-guided pruning rule keeps encoding cost bounded as history grows. The geometry teacher is discarded at inference, leaving a lightweight memory module. Experiments on MIND show that GIM-World better preserves long-horizon geometric and visual consistency than both explicit- and implicit-memory baselines.

2606.02424 2026-06-02 cs.CV cs.AI cs.LG Updated version

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita, Kazuya Nishimura, Andreas Dengel, Muhammad Nabeel Asim

Affiliations * Kyushu University; German Research Center for Artificial Intelligence (DFKI GmbH); RPTU University Kaiserslautern-Landau; The University of Osaka; IntelligentX GmbH; Osaka Metropolitan University

AI summary Proposes the GC-MoE model, which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts, together with a cell-type-specific co-expression-aware predictor and a cell-to-cell interaction attention module, to predict single-cell gene expression from histology images and cell locations; it outperforms existing methods on public datasets.

Abstract

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.
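The soft combination of experts described above can be sketched as a softmax over router logits followed by a probability-weighted sum of expert outputs. This is a generic mixture-of-experts illustration, not the GC-MoE implementation; `soft_moe_predict` and its arguments are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe_predict(router_logits, expert_outputs):
    """Weight each expert's gene-expression vector by its routing probability.

    router_logits:  one score per cell type (assumed router output)
    expert_outputs: one predicted expression vector per cell-type expert
    """
    probs = softmax(router_logits)
    n_genes = len(expert_outputs[0])
    return [
        sum(p * out[g] for p, out in zip(probs, expert_outputs))
        for g in range(n_genes)
    ]
```

Because the weights are probabilities rather than a hard argmax, a cell with an ambiguous type receives a blend of the corresponding experts' predictions.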

2606.02406 2026-06-02 cs.CV Updated version

Edge Prediction for Roof Wireframe Reconstruction with Transformers

Gustav Hanning, Ludvig Dillén, Jonathan Astermark, Johanna Lidholm, Viktor Larsson

Affiliations * Centre for Mathematical Sciences, Lund University

AI summary Proposes an end-to-end Transformer encoder-decoder architecture that reconstructs 3D roof wireframes from sparse SfM point clouds and semantic segmentation maps, achieving a Hybrid Structure Score of 0.6476 on the HoHo 22k dataset and second place in the challenge.

Comments Presented at the 3rd Urban Scene Modeling (USM3D) Workshop at CVPR 2026

Abstract

This paper presents a competitive solution to the S23DR Challenge 2026, which aims to reconstruct 3D house roof wireframe models from sparse SfM point clouds, ground-level semantic segmentations, and depth maps. Our proposed method utilizes an end-to-end Transformer encoder-decoder architecture inspired by DETR. To effectively process the geometric and semantic data, the sparse SfM point cloud input is dynamically subsampled based on semantic priority and augmented with Gestalt and ADE20k class features. To further increase segmentation context, we fuse the point features with additional Gestalt feature encodings, which are obtained by projecting the points into latent feature maps produced by a frozen autoencoder. Learned query embeddings are then decoded directly into 3D wireframe edges via cross-attention mechanisms. Evaluated on the "HoHo 22k" dataset, our approach significantly outperforms both handcrafted and learned baselines, achieving a Hybrid Structure Score (HSS) of 0.6476 and securing the second-highest position on the challenge's private leaderboard.

2606.02402 2026-06-02 cs.CV Updated version

Explainable Forensics of Manipulated Segments in Untrimmed Long Videos

Yue Feng, Jingjing Li, Qijia Lu, Wei Ji, Jingrou Zhang, Fei Shen, Xiao Li, Yizhen Jia, Qiang Chen, Limin Wang, Wentong Li, Jie Qin

Affiliations * MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics; Dalian University of Technology; Nanjing University; National University of Singapore

AI summary For the task of localizing and explaining AI-generated segments in long videos, proposes the TASLE benchmark dataset and the coarse-to-fine forensic method MSLoc, covering temporal localization, authenticity detection, and interpretable analysis.

Comments Accepted to ICML 2026

Abstract

The rapid advancement of AI-driven video generation has transformed content creation, while simultaneously increasing the risk of misinformation through localized manipulations in long-form videos. Existing video forensic methods predominantly operate on short, independent clips, and thus fail to capture realistic scenarios where AI-generated content is sparsely embedded within otherwise authentic footage. To bridge this gap, we formulate the task of Temporal AI-Generated Segment Localization and Explanation, which targets authenticity detection, temporal localization, and interpretable analysis of manipulated segments in untrimmed long videos. We further introduce TASLE, a large-scale benchmark comprising 12,472 untrimmed videos with diverse manipulation patterns and rich annotation signals, including temporal boundaries, authenticity labels, and segment-level rationales. In addition, we propose MSLoc, a coarse-to-fine forensic baseline that combines a boundary-sensitive proposal generation module for efficient long-video scanning with an MLLM-based refinement module for precise boundary localization and interpretable reasoning. Experiments validate the effectiveness of the proposed baseline, highlighting the importance of segment-level explainable forensics for long-form AI-generated video analysis. Our dataset and code are publicly available at https://debby-0527.github.io/TASLE.

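2606.02379 2026-06-02 cs.CV Updated version

Honey, I Shrunk the Arc de Triomphe!

Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, Noah Snavely

Affiliations * Cornell University; Shanghai Jiao Tong University

AI summary Addresses the "scale-collapse" phenomenon in monocular metric geometry estimation by building a new dataset, MetricScenes, and improving depth-map quality with a two-stage Poisson completion method; fine-tuning MoGe-2 significantly mitigates scale underestimation.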

Comments Project page: https://metricscenes.github.io/

Abstract

Metric-scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent "scale-collapse" phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogeneous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.

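2606.02366 2026-06-02 cs.CV Updated version

PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation

Xiaohang Yu, Ti Wang, Mackenzie Weygandt Mathis

Affiliations * École Polytechnique Fédérale de Lausanne

AI summary Proposes the PRIMA framework, which uses biological priors (BioCLIP embeddings) and a test-time adaptation strategy to address 3D quadruped mesh recovery under severe species and pose imbalance, achieving strong generalization and building the large-scale pseudo-3D dataset Quadruped3D.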

Abstract

We present PRIMA (*PRI*ors for *M*esh *A*daptation), a framework for robust 3D quadruped mesh recovery under severe species and pose imbalance. Existing animal reconstruction methods often regress toward mean shapes and poses due to limited 3D supervision and long-tailed species distributions, resulting in poor generalization to underrepresented animals and rare articulations. PRIMA addresses this challenge through three key contributions. First, we incorporate BioCLIP embeddings as biological priors to inject semantic and morphological knowledge into the reconstruction process, enabling more accurate and generalizable shape prediction across diverse quadrupeds. Second, we introduce a test-time adaptation (TTA) strategy that refines SMAL predictions using 2D reprojection constraints together with auxiliary keypoint guidance, improving pose and shape estimation while enabling the generation of high-quality pseudo-3D annotations from existing 2D datasets. Third, leveraging this TTA framework, we construct Quadruped3D, a large-scale pseudo-3D dataset that covers diverse species and pose variations to systematically improve model performance. Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom demonstrate that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses. Our results highlight the importance of biological priors and adaptation-driven data expansion for scalable and generalizable animal mesh recovery. Code is available at https://github.com/AdaptiveMotorControlLab/PRIMA.

2606.02357 2026-06-02 cs.CV cs.AI Updated version

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao

AI summary By comparing tool-augmented and tool-free multimodal agents across multiple tasks, finds that tool use does not bring consistent performance gains; the agents learn tool-calling patterns more than they actually use tools to extend their capabilities.

Abstract

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.

2606.02352 2026-06-02 cs.CV Updated version

Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

Affiliations * Fraunhofer IOSB; Karlsruhe Institute of Technology (KIT)

AI summary Proposes a multi-modal global alignment framework that handles faulty negatives and unreliable positives through soft targets and a weighting mechanism, outperforming existing methods on the Drive&Act dataset and enabling robust driver distraction detection.

Comments Accepted at the IEEE ITSC 2026

Abstract

Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the Drive&Act dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and advancing real-world multi-modal video understanding. Our code will be published on GitHub.
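To illustrate how soft targets relax the one-hot assumption in a contrastive loss, the sketch below computes a cross-entropy between a target distribution and the softmax of temperature-scaled similarities for a single anchor. How the targets are derived from cycle-consistency scores is not shown; the target vectors, the function names, and the default temperature are assumptions for illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def soft_target_contrastive_loss(similarities, targets, temperature=0.1):
    """Cross-entropy between `targets` and softmax(similarities / temperature).

    With a one-hot target this reduces to the InfoNCE loss for one anchor;
    a soft target spreads probability mass over likely false negatives.
    """
    log_probs = log_softmax([s / temperature for s in similarities])
    return -sum(t * lp for t, lp in zip(targets, log_probs))

# Hypothetical similarities of one anchor to three candidates.
sims = [0.9, 0.8, 0.1]
hard_loss = soft_target_contrastive_loss(sims, [1.0, 0.0, 0.0])
soft_loss = soft_target_contrastive_loss(sims, [0.7, 0.3, 0.0])
```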

2606.02350 2026-06-02 cs.CV Updated version

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu

Affiliations * National University of Singapore

AI summary Proposes the TROPHIES framework, which jointly estimates dynamic humans, static scenes, and camera poses to achieve globally consistent 4D reconstruction from multi-view videos.

Abstract

Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.

2606.02346 2026-06-02 cs.CV Updated version

VEDAL: Variational Error-Driven Asynchronous Learning for 3D Gaussian Splatting Pruning

Aoduo Li, Jiancheng Li, Huan Ye, Hongjian Xu, Shiting Wu, Xiujun Zhang, Zimeng Li, Xuhang Chen

Affiliations * Guangdong University of Technology; Huizhou Boluo Power Supply Bureau, Guangdong Power Grid Co., Ltd.; Shenzhen Polytechnic University; School of Computer Science and Engineering, Huizhou University

AI summary Proposes the VEDAL framework, which achieves efficient pruning for 3D Gaussian Splatting through variational free energy minimization, a prediction-error gating mechanism, and a variational uncertainty head, losing only 0.31 dB PSNR at 5.2x compression.

Comments 12 pages, 5 figures. Accepted by CGI 2026

Abstract

3D Gaussian Splatting (3DGS) achieves remarkable novel view synthesis quality with real-time rendering, yet suffers from excessive memory consumption due to millions of Gaussian primitives. Existing pruning methods rely on heuristic importance scores or synchronous batch updates, leading to suboptimal compression and training instability. We propose VEDAL, a principled framework that formulates Gaussian pruning as variational free energy minimization. Our approach introduces (1) a prediction-error gating mechanism that asynchronously activates pruning based on per-Gaussian reconstruction uncertainty, and (2) a variational uncertainty head that models pruning decisions as latent variables with learnable priors. The free energy objective naturally balances reconstruction fidelity against model complexity through an information-theoretic lens. Extensive experiments on Mip-NeRF 360, Tanks&Temples, and Deep Blending demonstrate that VEDAL achieves 5.2x compression with only 0.31 dB PSNR drop, outperforming PUP 3D-GS by +0.05 dB at a higher compression ratio and LightGaussian by +0.35 dB at comparable quality, while maintaining real-time rendering at 185 FPS.

2606.02342 2026-06-02 cs.CV Updated version

Detecting Pen-In-Air States from Video: A Proof-of-Concept Toward Complementary Handwriting Analysis

Lauren Sismeiro, Remy Plastre, Binbin Xu, Frederic Puyjarinet, Gerard Dray

Affiliations * IMT Mines Ales; Occitanie Region, France

AI summary Proposes an interpretable hybrid pipeline combining YOLO-based pen-tip tracking, kinematic feature extraction, and machine learning classification to detect pen-contact states from top-view video as a low-cost, non-intrusive complement to digitizing tablets, reaching an F2 score of up to 0.805 on a pilot dataset.

Comments accepted for 12th International Conference on Computer Technology Applications (ICCTA 2026)

Abstract

Dynamic aspects of handwriting are critical for assessing developmental disorders such as dysgraphia and are typically captured using digitizing tablets. However, tablet-based sensing restricts analysis of Pen-Up behavior to a short proximity range above the writing surface, potentially missing high-lift in-air movements. As a proof of concept, we investigate whether top-view video can provide a complementary source of information for inferring pen-contact states without relying on tablet proximity sensing. We propose an interpretable hybrid pipeline combining pen-tip tracking using a YOLO-based detector with kinematic feature extraction and machine learning classification. A pilot dataset of diverse handwriting videos was manually annotated at the frame level and evaluation used a Leave-One-Video-Out (LOVO) protocol. The method achieved reliable event-level detection of Pen-Up segments, with an F_2 score up to 0.805, consistent with the emphasis on recall in a screening-oriented setting. These results support the feasibility of video-based Pen-Up detection as a low-cost and non-intrusive complement to digitizing tablets, and provide a foundation for future large-scale studies.
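For readers unfamiliar with the metric, the F-beta score with beta = 2 weights recall more heavily than precision, which matches the screening-oriented emphasis mentioned above. The sketch below is a generic implementation of the standard formula; the function name and the example values are illustrative and are not taken from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0  # conventional value when both precision and recall are zero
    return (1 + beta_sq) * precision * recall / denom

# Hypothetical detector with perfect recall but only 50% precision.
f2 = f_beta(0.5, 1.0)            # recall-weighted score
f1 = f_beta(0.5, 1.0, beta=1.0)  # balanced harmonic mean, for comparison
```

In this example the F2 value is higher than the F1 value because the detector's strength lies in recall.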

2606.02339 2026-06-02 cs.LG cs.CV Updated version

Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging

Tim Nielen, Sameer Ambekar, Johannes Kiechle, Daniel M. Lang, Julia A. Schnabel

Affiliations * School of Computation, Information and Technology, Technical University of Munich, Germany; Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich, Germany; School of Biomedical Engineering and Imaging Sciences, King’s College London, UK; Munich Center for Machine Learning (MCML); relAI – Konrad Zuse School of Excellence in Reliable AI; TUM University Hospital Rechts der Isar

AI summary To address model collapse caused by entropy minimization in test-time adaptation, proposes Distribution Shift Bias Reduction (DSBR), which corrects prediction bias by equalizing each predicted class's contribution to the unsupervised entropy minimization loss; its stability and effectiveness are validated on four medical-imaging datasets and ImageNet-C.

Abstract

Entropy minimization (EM) is the dominant objective for test-time adaptation, yet its failure mode, model collapse, remains poorly understood. In this work, we show that distribution shifts can cause feature clusters corresponding to distinct classes in the model's representation space to merge, while the decision boundary remains fixed. This induces a systematic skew in the predicted class distribution, which we refer to as prediction bias: some classes are overrepresented and others are suppressed. We show that entropy minimization amplifies this prediction bias by tightening the existing clusters, reinforcing the incorrect groupings until all predictions collapse to a trivial solution. To demonstrate the significance of prediction bias and mitigate it, we propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that specifically targets this failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss. To study this failure mode, we design suitable adaptation settings using four medical-imaging datasets and additionally evaluate on ImageNet-C. We find that DSBR consistently stabilizes test-time adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods. Moreover, DSBR operates solely at test time.
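One way to read "equalizing the contribution of each predicted class" is to average the entropy within each predicted class first and then average across classes, so that an over-predicted class cannot dominate the loss. The sketch below contrasts that reading with the plain batch mean. It is an illustrative interpretation under that assumption, not the authors' DSBR objective, and all names are hypothetical.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(batch_probs):
    """Plain entropy-minimization loss: every sample counts equally."""
    return sum(entropy(p) for p in batch_probs) / len(batch_probs)

def class_balanced_entropy(batch_probs):
    """Average entropy within each predicted class, then across classes."""
    by_class = defaultdict(list)
    for probs in batch_probs:
        predicted = max(range(len(probs)), key=probs.__getitem__)
        by_class[predicted].append(entropy(probs))
    per_class = [sum(v) / len(v) for v in by_class.values()]
    return sum(per_class) / len(per_class)

# Hypothetical skewed batch: three samples predicted as class 0, one as class 1.
batch = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.4, 0.6]]
plain = mean_entropy(batch)
balanced = class_balanced_entropy(batch)
```

In the skewed batch, the single class-1 sample carries half of the balanced loss instead of a quarter of the plain one.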

2606.02331 2026-06-02 cs.CV cs.LG Updated version

Hallucination-Aware Diffusion Sampling for Inverse Problems via Robust Prior Updates

Pengfei Jin, Yiqi Tian, Kailong Fan, Bingjie Qi, Quanzheng Li

Affiliations * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School; Department of Industrial Engineering, University of Pittsburgh

AI summary Proposes a Robust Prior Update module that probes the local stability of the diffusion prior update and re-anchors the displacement, reducing measurement-conditioned hallucination in inverse problem solving and improving instance faithfulness.

Abstract

Diffusion-based inverse problem solvers can produce realistic reconstructions, but realism alone does not ensure that the recovered details are supported by the measurement. We study this failure as measurement-conditioned hallucination: visually meaningful content that is either implausible or inconsistent with the measured instance. Our analysis separates Bayes-rule-based diffusion inverse solvers into a prior update and a measurement-conditioning step, showing that hallucinated content can enter through the prior-side proposal before the measurement correction is applied. Motivated by this view, we propose Robust Prior Update (RPU), a solver-level module that probes the local stability of the diffusion prior update, re-anchors the resulting displacement at the current iterate, and leaves the measurement update unchanged. We instantiate RPU in DPS and evaluate it on FFHQ and ImageNet inverse problems using automatic metrics and human faithfulness studies. On FFHQ, RPU improves PSNR and LPIPS over DPS across box inpainting, Gaussian deblurring, and motion deblurring. In human judgments, RPU receives 91.9% of blind non-tie majority preferences and 91.1% of ground-truth-assisted non-tie preferences on FFHQ box inpainting, while the ImageNet Gaussian reader study is tie-heavy but favors RPU among non-tie cases. These results support a targeted claim: robustifying the prior update can improve instance faithfulness in diffusion inverse solvers, especially when the prior shapes weakly constrained content.

2606.02321 2026-06-02 cs.CV 版本更新

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

基于视觉表示引导的视频-大语言模型推理的无训练组合视频检索

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

Affiliations School of Computer Science and Technology, University of Chinese Academy of Sciences; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

AI summary Proposes a training-free framework that first uses frozen DINOv3 models to shortlist visually relevant candidates, then uses large vision-language models to check each candidate against the modification instruction, and finally applies reasoning-based refinement, reaching 48.78 Recall@1 and 51.48 Recall@5 in the CVPR 2026 challenge.

Comments CVPR 2026, VidLLMs workshop

Details
Abstract

Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and detailed integration between visual representations and language reasoning.
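For readers unfamiliar with the metric, Recall@K can be read as the fraction of queries whose target video appears among the top K ranked candidates. The following is a minimal sketch of that definition, assuming one ground-truth target per query; the function name `recall_at_k` and the toy video ids are hypothetical and do not come from the paper.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears in the top-k ranked candidates."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Toy example (made up): three queries, each with a ranked list of candidate ids.
ranked = [["v3", "v1", "v7"], ["v2", "v9", "v4"], ["v5", "v6", "v8"]]
targets = ["v3", "v4", "v0"]

r1 = recall_at_k(ranked, targets, 1)  # only the first query hits at rank 1
r3 = recall_at_k(ranked, targets, 3)  # the first two queries hit within the top 3
```

The reported 48.78 and 51.48 are presumably this quantity expressed as a percentage over the challenge test set.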

2606.02310 2026-06-02 cs.CV cs.LG Version update

Deep Learning for Remote Sensing to Improve Flood Inundation Mapping

Yogesh Bhattarai, Vijay Chaudhary, Wai Lim Kim, Sanjib Sharma

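Affiliations University of Colorado Boulder

AI summary Proposes a cloud-removal framework for flood imagery based on denoising diffusion probabilistic models and the Masked Diffusion Transformer, generating cloud-free images that preserve hydrological consistency and making flood monitoring more reliable.

Comments This paper was selected as one of the top 10 student finalists in the IGARSS 2026 paper competition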

Details
Abstract

Flooding is the most pervasive natural disaster worldwide. Timely and accurate flood inundation mapping is essential for informing disaster risk management. Optical satellite missions provide high-resolution, multispectral observations critical for flood detection and inundation mapping. However, their operational utility is severely constrained by cloud cover during extreme precipitation events. Conventional cloud-removal techniques based on temporal compositing or interpolation often fail to capture inundation dynamics. In this study, we introduce a cloud-removal framework for flood imagery based on Denoising Diffusion Probabilistic Models, leveraging the Masked Diffusion Transformer architecture. The proposed approach exploits self-attention mechanisms to capture wider spatial context and employs masked token modeling to explicitly learn the reconstruction of cloud-obscured regions. Trained on multispectral Sentinel-2B flood scenes with realistic cloud patterns, the model generates cloud-free image realizations that preserve both visual fidelity and hydrological consistency. Reconstruction performance is evaluated using standard image quality metrics alongside flood-specific hydrological measures, demonstrating improved continuity of water bodies and preservation of spectral signatures critical for water detection indices. The results indicate that diffusion-based generative modeling offers a robust and physically consistent alternative for cloud removal in optical flood monitoring, enabling more reliable, continuous observations to support disaster risk management and flood-related decision making.

2606.02309 2026-06-02 cs.LG cs.CV Version update

Measurement Geometry and Design for Trustworthy Generative Inverse Problems

Pengfei Jin, Na Li, Quanzheng Li

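Affiliations Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School; School of Engineering and Applied Sciences, Harvard University

AI summary Proposes a local measurement-manifold compatibility measure, proves that it controls the stable part of the reconstruction error, and designs fixed and adaptive measurement strategies based on volume preservation; across several imaging tasks the scores predict failure modes, explain measurement-induced hallucinations, and guide sampling.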

Details
Abstract

Generative models are increasingly used as priors for inverse problems, but their ability to produce realistic images creates a basic trust problem: a plausible reconstruction may be supported by the measurements, or it may be filled in by the prior along unobserved directions. This distinction is especially important in medical imaging, where acquisition operators are designed under scan-time, dose, and calibration constraints. We study generative inverse problems from a measurement-geometry perspective. The central question is whether a fixed measurement operator can distinguish nearby images that are plausible under the generative prior, and whether this relationship can guide better measurements. We introduce a local measurement-manifold compatibility measure that quantifies how well the operator observes prior-relevant tangent directions. Under local regularity assumptions, we prove that this quantity controls the stable part of the reconstruction error, while the generative prior controls off-manifold drift. This worst-direction certificate motivates practical fixed and sequential acquisition rules based on overall local volume preservation, including a posterior-cloud design that adapts measurements at test time without training a sampling policy. Across row-sampling, tomographic, and MR acquisition settings, the proposed scores predict failure modes, explain measurement-induced hallucinations, and guide better sampling. In fastMRI Cartesian sampling, posterior-cloud measurement design improves over strong non-learned ACS-preserving baselines, including variable-density and Poisson-like masks.

2606.02303 2026-06-02 cs.CV Version update

Cross-Domain Dead Tree Detection via Knowledge Distillation in Aerial Imagery

Anis Ur Rahman, Mete Ahishali, Einari Heinaro, Samuli Junttila

Affiliations CSC – IT Center for Science Ltd.; Department of Forest Sciences, University of Helsinki; KOKO Forest Ltd.; School of Forest Sciences, University of Eastern Finland

AI summary To address domain shift and scarce labeled data in dead tree detection from aerial imagery, adapts the TreeMort-1T-UNet model with knowledge distillation; feature-level distillation gives robust performance across several target domains and performs best in low-data scenarios.

Comments 14 pages, 6 figures, journal

Details
Abstract

Detecting dead trees in aerial imagery is vital for assessing forest health, especially as tree mortality increases globally due to climate change, but domain variability and scarce labeled data often limit model generalization. This study advances the TreeMort-1T-UNet (Tree Mortality 1-Task U-Net) model, initially trained on Finnish aerial imagery (source domain), by applying knowledge distillation (KD) to adapt it to various target domains, including Polish, German, and Estonian datasets representing diverse forest types. We assess four KD variants: Basic, Self, Feature-level, and Ensemble, against a fine-tuning baseline, using Mean Tree IoU, Instance F1-score, Instance Precision, and Mean Centroid Error as key metrics, alongside representational analyses (e.g., cosine similarity, CKA, SSIM, t-SNE, and linear probing) for domain invariance. Feature-level KD outperforms others, yielding a Mean Tree IoU of 0.106, Instance F1-score of 0.63, Instance Precision of 0.55, and Mean Centroid Error of 3.039 on the Polish dataset, with robust precision across other target domains (e.g., 0.15 on Finnish, 0.67 on Polish, 0.60 on German, 0.59 on Estonian). It excels in low-data scenarios with fewer false positives and shows superior representational invariance (e.g., higher deep-layer CKA/SSIM, better domain mixing in t-SNE, and linear probing AUC of 0.95), making it ideal for precision-critical forestry applications. Additional ablation studies confirm that key components like feature alignment enhance its performance balance across metrics. Our findings demonstrate KD's potential to enhance transfer learning in remote sensing, offering a scalable, domain-robust tool for ecological monitoring and sustainable forest management.

2606.02301 2026-06-02 cs.HC cs.AI cs.CV Version update

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher, Shuangyi Tong, Annina Schmid, Katja Wiech, Anushka Irani, Ben Seymour

Affiliations Nuffield Department of Clinical Neurosciences, University of Oxford; Max Planck Institute of Biological Cybernetics; Oxford Gait Laboratory, University of Oxford; Harvard Medical School; Massachusetts General Hospital; Institute of Biomedical Engineering, University of Oxford; Mayo Clinic

AI summary Proposes Quantitative Movement Testing (QMT), a computer vision method that uses deep learning-based 3D pose estimation to extract kinematic biomarkers from monocular smartphone video; in laboratory validation it agrees closely with optical motion capture (r > 0.85), and it shows reliability and longitudinal monitoring capability in fibromyalgia and chronic sciatica patients.

Details
Abstract

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.
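The agreement figures quoted above (correlation r and mean absolute error) are standard statistics. As a rough illustration of how they might be computed for paired measurements, here is a minimal sketch; the joint-angle values are made up for illustration and are not data from the study, and the helper names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_absolute_error(x, y):
    """Average absolute difference between paired measurements."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical joint angles in degrees: reference system vs. video-based estimate.
mocap = [10.0, 20.0, 30.0, 40.0, 50.0]
video = [12.0, 19.0, 33.0, 41.0, 52.0]

r = pearson_r(mocap, video)
mae = mean_absolute_error(mocap, video)
```

A high r with a non-trivial MAE is possible when the estimate tracks the reference but carries a systematic offset, which is the kind of bias the leave-one-subject-out calibration step is described as correcting.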

2606.02292 2026-06-02 cs.CV Version update

Neural Acquisition & Representation of Subsurface Scattering

Arjun Majumdar, Raphael Braun, Hendrik Lensch

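Affiliations University of Tübingen

AI summary Proposes a method that acquires and estimates subsurface scattering properties at a high level of detail by using a U-Net CNN to learn the pixel footprint response at each point on the object surface, enabling relighting with arbitrary high-resolution projector patterns.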

Comments 8 pages

Details
Abstract

We present a method to acquire and estimate the sub-surface scattering properties of light transport at a highly detailed level by learning the pixel footprint response at each point on the object surface. The reconstruction leverages 3D scanning techniques as input to a U-Net CNN. A stereo projector-camera setup using phase-shifted profilometry (PSP) patterns efficiently captures the data for a variety of scattering objects. Reconstructing dense pixel footprints allows for relighting with arbitrary high-resolution projector patterns. The final output is a relit color image. Qualitative and quantitative comparisons against illuminated real-world captured images demonstrate that the predicted footprints are almost identical to the actual responses. The same model is trained for multiple views across multiple objects so that the learned representations also generalize to unseen sub-surface scattering materials.

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG Version update

Cross-modal linkage risk in clinical vision-language models

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

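Affiliations Lab for AI in Medicine; RWTH Aachen University; Department of Diagnostic and Interventional Radiology

AI summary Studies the risk that clinical vision-language models (VLMs) allow cross-modal re-linkage through cosine similarity when images and reports are kept separate, and shows that differentially private fine-tuning of only the projection heads substantially lowers the re-linkage rate while preserving image-side utility.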

Details
Abstract

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6 × 10^-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
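To make the re-linkage audit concrete, the sketch below shows one way image-to-report retrieval by cosine similarity could be scored against chance. It is an illustration only: the embeddings are toy 2-D vectors, the function names are hypothetical, and chance is assumed to be 1/N for a pool of N candidate reports.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relink_recall_at_1(image_embs, report_embs):
    """Image i counts as re-linked if report i is its most similar report."""
    hits = 0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, rep) for rep in report_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += best == i
    return hits / len(image_embs)


# Toy paired embeddings (made up): image i and report i belong together.
images = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
reports = [[0.9, 0.0], [0.0, 0.8], [0.7, 0.7]]

recall = relink_recall_at_1(images, reports)
chance = 1 / len(reports)       # assumed chance level for Recall@1
times_chance = recall / chance  # how far above random guessing retrieval is
```

Under the same assumption, "15 times chance" at N = 100 would correspond to a Recall@1 of roughly 15%, and "50 times chance" at N = 10,000 to roughly 0.5%.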

2606.02273 2026-06-02 cs.CV Version update

Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

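Affiliations Fraunhofer IOSB; Technische Hochschule Ingolstadt; Karlsruhe Institute of Technology (KIT)

AI summary Creates a detailed natural-language version of the Drive&Act dataset, then evaluates and fine-tunes vision-language models on it to improve recognition of subtle driver actions; the fine-tuned model performs better in cross-dataset evaluation.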

Comments Accepted at IEEE ITSC 2026

Details
Abstract

Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance on the new benchmark shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled Drive&Act dataset, we create a new Drive&Act description dataset containing fine-grained descriptions to train VLMs on driver activity understanding. Cross-dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our new Drive&Act description dataset generalizes well to actions in the DMD dataset. The VLM fine-tuned on our Drive&Act description dataset achieves an ACCR score of 76, outperforming the zero-shot VLM baseline with an ACCR score of 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior, while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our Drive&Act description dataset and code will be publicly available on GitHub.

2606.02268 2026-06-02 cs.CV Version update

From Extrinsic to Intrinsic: Geodesic-Guided Representation Learning for 3D Geometric Data

Yuming Zhao, Junhui Hou, Qijian Zhang, Jia Qin, Ying He

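Affiliations University of Science and Technology of China

AI summary Proposes PRISM, a pre-training paradigm that learns isometric embeddings by recovering the intrinsic surface geodesic metric, addressing the gap between extrinsic spatial structure and intrinsic topology in 3D representation learning; it performs well on geodesic distance prediction and downstream tasks.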

Details
Abstract

Geometric analysis fundamentally distinguishes between extrinsic and intrinsic perspectives. The dominant paradigm in current 3D representation learning relies on either extrinsic spatial structures or high-level semantics, struggling to capture the essence of shape identity and underlying manifold topology. To bridge this gap, we introduce a novel 3D representation learning paradigm, namely PRISM, for Pre-training, which learns isometric embeddings by Recovering the Intrinsic Surface geodesic Metric. PRISM incorporates a topology-enforcing objective that explicitly constrains the structure of latent space, alongside a specialized two-stage training recipe mitigating sample imbalance inherent in the distribution of geodesic distances. Experiments demonstrate that our approach shows satisfactory accuracy, robustness, and high efficiency in geodesic distance prediction and achieves superior performance across diverse downstream tasks, including shape recognition, surface parameterization, and non-rigid correspondence. The code will be publicly available at https://github.com/AidenZhao/PRISM.
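The extrinsic versus intrinsic distinction can be illustrated with a small example: the straight-line distance between two surface points compared with the shortest path that stays on the surface. The sketch below approximates the latter with Dijkstra's algorithm over a sampled curve. It only illustrates the target quantity (geodesic distance) and is not the PRISM method; the half-circle sampling and function names are assumptions made for the example.

```python
import heapq
import math


def geodesic_distances(points, edges, source):
    """Shortest path lengths along graph edges (Dijkstra), edge weight = Euclidean length."""
    adj = {i: [] for i in range(len(points))}
    for i, j in edges:
        w = math.dist(points[i], points[j])
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = [math.inf] * len(points)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Sample 21 points along a half circle of radius 1 and connect neighbours.
n = 21
points = [(math.cos(math.pi * k / (n - 1)), math.sin(math.pi * k / (n - 1))) for k in range(n)]
edges = [(k, k + 1) for k in range(n - 1)]

intrinsic = geodesic_distances(points, edges, 0)[-1]  # along the curve, close to pi
extrinsic = math.dist(points[0], points[-1])          # straight line through space, 2
```

The two endpoints are 2 apart in ambient space but about π apart along the curve, which is the kind of gap an intrinsic (isometric) embedding is meant to capture.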

2606.02267 2026-06-02 cs.LG cs.CV Version update

A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs

Nicolas Stalder, Benjamin F. Grewe, Matteo Saponati, Pau Vilimelis Aceituno

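Affiliations Institute of Neuroinformatics ETH Zürich, University of Zürich

AI summary Proposes a preprocessing method combining Gaussian noise and bilateral filtering, whose complementary mechanisms yield supralinear gains in adversarial robustness, and shows that combined with adversarial training it reaches performance comparable to state-of-the-art defenses at lower compute cost.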

Comments Main: 8 pages, 3 figures, 2 Tables. Supplement: 10 pages, 7 figures, 6 Tables

Details
Abstract

The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding Gaussian noise or filtering images, both of which can boost the network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined. Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only ~35% of the training FLOPs, a model with ~50% fewer parameters, ~33% of the epochs, and ~15% of the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2-8x less total compute across 3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.
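As a rough picture of what such a preprocessor could look like, the sketch below adds Gaussian noise to a grayscale image and then applies a small bilateral filter, written in plain Python for readability. The order of the two steps, the parameter values, and all function names are assumptions for illustration; the paper's actual preprocessor may differ.

```python
import math
import random


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def bilateral_filter(img, radius=1, sigma_space=1.0, sigma_range=0.2):
    """Edge-preserving smoothing: weights fall off with spatial and intensity distance."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        spatial = math.exp(-(dx * dx + dy * dy) / (2 * sigma_space**2))
                        tonal = math.exp(-((img[yy][xx] - img[y][x]) ** 2) / (2 * sigma_range**2))
                        total += spatial * tonal * img[yy][xx]
                        norm += spatial * tonal
            out[y][x] = total / norm  # norm >= 1 because the centre pixel has weight 1
    return out


def preprocess(img, noise_sigma=0.05, seed=0):
    """Assumed order: noise first, then bilateral filtering."""
    rng = random.Random(seed)
    return bilateral_filter(add_gaussian_noise(img, noise_sigma, rng))


# Toy 4x4 image with a vertical edge between dark (0.0) and bright (1.0) halves.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
processed = preprocess(image)
flat = bilateral_filter([[0.5] * 3 for _ in range(3)])
```

Because the tonal weight suppresses contributions from pixels with very different intensity, the edge in the toy image survives the smoothing while the injected noise within each half is averaged down.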

2606.02246 2026-06-02 cs.CV Version update

Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark

Maria Santos-Villafranca, Jesus Bermudez-cameo, Alejandro Perez-Yus, Giovanni Maria Farinella, Antonino Furnari

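Affiliations University of Zaragoza - I3A; Department of Mathematics and Computer Science, University of Catania

AI summary To study energy-aware perception on resource-constrained devices, introduces Ego-METAS, the first egocentric online multimodal energy-efficient temporal action segmentation benchmark, with more than 100 hours of untrimmed video and 5 modalities; models must select sensors dynamically within an energy budget, and evaluation shows that optimal routing is highly scenario-dependent and that existing methods struggle in continuous settings.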

Comments Project Page: https://maria-sanvil.github.io/Ego-METAS-website/

Details
Abstract

To operate in the physical world, embodied agents must perceive their environment in an "always-on" fashion, selectively accessing the most informative sensors to balance energy constraints and task accuracy. Despite its importance for resource-constrained devices, energy-aware perception remains under-explored, with most prior work assuming unlimited compute. To address this, we introduce Ego-METAS: the first Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark. Ego-METAS provides a unified testbed of more than 100 hours of untrimmed egocentric video from EgoExo4D, CMU-MMAC, and CaptainCook4D, spanning 5 modalities (RGB, audio, gaze, IMU, and monochrome camera). We formulate an online temporal action segmentation task where models must dynamically select which sensors to activate at each timestep while strictly adhering to hardware-representative energy budgets. Alongside the benchmark, we release unified splits, cleaned annotations, pre-extracted features, and a diverse suite of baseline routing policies. Our evaluations show that optimal routing is highly scenario-dependent, and that existing policy-learning methods, designed primarily for trimmed clips, struggle to adapt to continuous, untrimmed environments. However, even simple dynamic fusion of complementary modalities (e.g., via random routing) proves critical for balancing predictive accuracy against strict energy budgets. Ultimately, Ego-METAS provides a standardized foundation to develop robust, cost-aware policies for autonomous, always-on embodied AI.

2606.02242 2026-06-02 cs.CV cs.AI cs.LG Version update

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

Karina Kvanchiani, Timur Mamedov

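Affiliations Tevian, Russia; Lomonosov Moscow State University, Russia

AI summary To address suboptimal shared representations caused by modality discrepancies and conflicting objectives in image-based and text-based person re-identification, proposes a decoupled two-stage training pipeline with a single vision encoder that avoids cross-task interference; experiments show that image pre-training and textual supervision improve performance on both tasks.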

Details
Abstract

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental differences between the two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observe that I2I ReID pre-training positively impacts the generalization ability to T2I data. We also find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

2606.02228 2026-06-02 stat.ML cs.CV cs.LG Version update

Bayesian meta-learning for modeling Alzheimer's disease progression

Clara Hoffmann, Nadja Klein

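Affiliations Scientific Computing Center, Karlsruhe Institute of Technology, Germany; Alzheimer’s Disease Neuroimaging Initiative

AI summary Proposes a Bayesian meta-learning method that predicts the distribution of disease scores from an individual's MRI volumes and historical disease trajectory, makes predictions dynamically without retraining, and is less overconfident in long-term predictions.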

Details
Abstract

Predicting whether an individual with Alzheimer's disease will experience mild or severe disease progression is essential for personalized treatment. Typically, practitioners seek to predict the distribution of a discrete disease score, conditional on an individual's current MRI volume and their historical disease trajectory. Classical statistical regression models and single-task neural networks are not well-suited for this purpose because fitting separate models is infeasible (since each individual typically has few observations), while ignoring individual-level correlation leads to poor generalization. Meta-learning, in contrast, provides a natural avenue to dynamically predict distributions without retraining and model nonlinear relationships between the outcome and covariates. Motivated by this, we propose a Bayesian meta-learner that is trained on multiple individuals but tailors the predictive disease score distribution to each individual's historical data. Our model predicts on unseen individuals without retraining, scales linearly with the number of historical observations, and is guaranteed to be less overconfident when predicting long-term disease scores compared to its deterministic counterpart. On real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, our model achieves performance competitive with both single-task models and deterministic meta-learners, while substantially improving performance when predicting long-term disease progression.

2606.02221 2026-06-02 cs.CV cs.LG 版本更新

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

CORE-MTL: 通过因果正交表示重新思考梯度平衡

Chengfeng Wu, Tao Zou, Yanru Wu, Jingge Wang

发表机构 * Tsinghua University(清华大学)

AI总结 提出CORE-MTL框架,通过因果正交表示将共享表示分解为语义流和残差流,以分离任务相关结构与虚假上下文,从而减少负迁移并提升泛化能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

多任务学习旨在通过跨领域共享共同表示来构建联合模型。为实现这一目标,现有的优化中心方法要么平衡任务梯度,要么修改共享架构。然而,由于这些方法对共享表示的内容不可知,它们无法将任务相关结构与虚假上下文分离,导致负迁移和泛化能力差。为克服这一限制,我们提出了用于多任务学习的因果正交表示(CORE-MTL),这是一个因果驱动的表示中心框架,鼓励对共享表示进行结构化的语义-残差分解,将任务相关结构集中在语义流中,而将干扰变化归入残差流。我们通过利用结构化场景的物理先验和属性的统计约束,在视觉领域实例化了该框架。理论上,我们的方法比优化中心方法具有更紧的分布外泛化界,并且无需显式梯度投影或重新加权即可减少任务梯度干扰。实验上,CORE-MTL在视觉多任务基准测试中,在分布内和分布外设置下均持续优于现有方法。代码公开于 https://github.com/Hope-Rita/CORE-MTL。

英文摘要

Multi-task learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify the shared architecture. However, as these approaches remain agnostic to the content of the shared representation, they fail to disentangle task-relevant structure from spurious context, leading to negative transfer and poor generalization. To overcome this limitation, we propose Causal Orthogonal Representations for Multi-Task Learning (CORE-MTL), a causally motivated representation-centric framework that encourages a structured semantic-residual factorization of the shared representation, concentrating task-relevant structure in the semantic stream while relegating nuisance variation to the residual stream. We instantiate this framework in the visual domain by leveraging physical priors for structured scenes and statistical constraints for attributes. Theoretically, our method enjoys a tighter out-of-distribution generalization bound than optimization-centric methods and reduces task gradient interference without explicit gradient projection or reweighting. Empirically, CORE-MTL consistently outperforms existing methods on visual multi-task benchmarks in both in-distribution and out-of-distribution settings. Code is publicly available at https://github.com/Hope-Rita/CORE-MTL.
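
The abstract refers to task gradient interference without defining a measure for it. One common way to quantify interference (an assumption here, not necessarily the metric used in the paper) is the cosine similarity between two tasks' gradients with respect to the shared parameters: a negative value means the tasks pull the shared representation in opposing directions. The function name below is hypothetical.

```python
import math


def gradient_cosine(grad_a, grad_b):
    """Cosine similarity between two flattened task gradients.

    Values near -1 indicate strong conflict, values near 0 indicate
    roughly independent updates, values near +1 indicate agreement.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    return dot / (norm_a * norm_b)


# Hypothetical gradients of two task losses w.r.t. shared weights.
conflicting = gradient_cosine([1.0, 0.0], [-1.0, 0.0])
independent = gradient_cosine([1.0, 0.0], [0.0, 1.0])
```

Gradient-balancing methods typically act on this quantity directly (by projecting or reweighting gradients), whereas the abstract claims that CORE-MTL reduces interference through the structure of the representation itself.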

2606.02219 2026-06-02 cs.CV 版本更新

Symmetry-Aware 9D Pose Estimation with Sim(3)-Consistent Feature and Spherical Inception Convolution

对称感知的9D姿态估计:Sim(3)一致特征与球形Inception卷积

Panfei Cheng, Hongshan Yu, Wenrui Chen, Xiaojun Tang, Jian Liu, Naveed Akhtar

发表机构 * National Engineering Research Center for Robot Visual Perception and Control, School of Robotics and Artificial Intelligence, Hunan University(机器人视觉感知与控制国家工程研究中心,机器人与人工智能学院,湖南大学) Beijing Spacecrafts, China Academy of Space Technology(北京航天器,中国空间技术研究院) School of Computing and Information Systems, The University of Melbourne(计算与信息学院,墨尔本大学)

AI总结 提出一种类别级物体姿态估计方法,通过语义引导的对称感知模块和球形大核Inception卷积融合特征,实现无形状先验的精确平移/尺寸估计和鲁棒旋转估计,在基准和真实场景中达到最优性能。

Comments 12 pages, 7 figures

详情
AI中文摘要

物体姿态估计是智能系统感知或操作图像/视频中物体的基本问题。然而,当前的实例级方法难以泛化到未见物体。类别级方法试图解决这一问题,但仍受限于非线性Sim(3)空间的学习复杂性和类内变化。为应对这些挑战,我们提出一种有效的类别级物体姿态估计方法,包含两项关键创新:(1) 一个平移/尺寸估计器,具有语义引导的对称感知模块,利用大型视觉模型(LVM)的鲁棒泛化能力推断对称点,从而无需形状先验即可获得精确的平移和尺寸。该结果作为旋转估计的预计算线索,降低了在非线性Sim(3)空间学习的难度,并为处理更具挑战性的旋转估计奠定坚实基础。(2) 一个特征融合模块,基于我们提出的球形大核Inception卷积,将LVM的语义特征与系统计算的几何特征融合,通过建模长程依赖关系以较低计算成本从类内变化中提取关键姿态特征。基于这些创新,我们在基准和真实场景中达到最优性能,并开发了一个能够处理多样物体的鲁棒机器人抓取系统。我们的代码将在项目页面提供:https://panfei-cheng.github.io/SSH-Pose。

英文摘要

Object pose estimation is a fundamental problem for an agent system to perceive or manipulate objects in images or videos. However, current instance-level methods struggle with generalization to unseen objects. Category-level methods seek to address this, but remain constrained by the complexities of learning in the non-linear Sim(3) space and intra-class variations. To address these challenges, we propose an effective method for category-level object pose estimation with two key innovations: (1) A translation/size estimator, featuring a semantic-guided symmetry-aware module that leverages the robust generalization capabilities of a large vision model (LVM) to infer symmetry points, resulting in accurate translation and size without shape priors. This result serves as a precomputed cue for rotation estimation, thereby reducing the difficulty of learning in the non-linear Sim(3) space and laying a robust foundation for tackling the inherently more challenging rotation estimation. (2) A feature fusion module, based on our proposed spherical large-kernel inception convolution, that fuses semantic features from the LVM with systematically computed geometric features to extract essential pose features from intra-class variations by modeling long-range dependencies without excessive computational cost. Built on these innovations, we achieve SOTA on benchmarks and real-world scenes, while developing a robust robotic picking system capable of handling diverse objects. Our code will be available at the project page: https://panfei-cheng.github.io/SSH-Pose.

2606.02178 2026-06-02 cs.CV cs.AI 版本更新

Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization

混沌中的秩序:捕捉AI操纵图像伪造定位的内在能量异常

Yiming Wang, Baiqi Wu, Qingming Li, Jiahao Chen, Tong Zhang, Shouling Ji

发表机构 * Zhejiang University(浙江大学)

AI总结 本文提出FLAME框架,利用扩散过程抑制局部高频方差产生的统计能量间隙,结合LAD图和SAM适配器实现像素级伪造定位,并引入EditStream流水线持续合成训练数据,在AI生成伪造数据集上达到最先进性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

近期生成式AI的进展催生了能够产生逼真伪造图像的图像编辑模型,这些伪造图像能规避传统图像伪造定位方法,因为传统方法依赖于合成数据中不存在的物理噪声。为应对这一挑战,我们从理论上证明扩散过程本质上抑制了局部高频方差,产生了与光学成像自然熵可区分的统计能量间隙。受此启发,我们提出FLAME,一个统一框架,利用LAD图捕捉这些内在异常,并结合SAM的参数高效适配器实现精确的像素级伪造定位。此外,为弥合取证基准与不断演变的生成模型之间的滞后,我们引入EditStream,一个基于指令的连续训练数据合成自动化流水线。大量实验表明,FLAME建立了新的最先进水平,在AI生成伪造数据集上显著优于先前方法,同时有效泛化到未见过的生成架构。我们的代码可在https://github.com/phoenixnir/FLAME获取。

英文摘要

Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image forgery localization methods, as these approaches depend on physical noise absent in synthetic data. To address this challenge, we theoretically demonstrate that the diffusion process inherently suppresses local high-frequency variance, creating a statistical energy gap that is distinguishable from the natural entropy of optical imaging. Guided by this insight, we propose FLAME, a unified framework that utilizes a LAD map to capture these intrinsic anomalies, coupled with a parameter-efficient adapter for SAM to achieve precise, pixel-level forgery localization. Furthermore, to bridge the lag between forensic benchmarks and evolving generative models, we introduce EditStream, an automated pipeline for continuous, instruction-based training data synthesis. Extensive experiments demonstrate that FLAME establishes a new state-of-the-art, significantly outperforming previous methods on AI-generated forgery datasets while effectively generalizing to unseen generative architectures. Our code is available at https://github.com/phoenixnir/FLAME.

2606.02172 2026-06-02 cs.LG cs.CV 版本更新

Closing the Alignment-Maturity Gap in Federated Prototype Learning

缩小联邦原型学习中的对齐-成熟度差距

Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas

发表机构 * CITIC, Universidade da Coruña(CITIC,科鲁纳大学)

AI总结 针对联邦学习中原型对齐压力抑制局部判别结构的问题,提出FedSAP框架,通过确定性对齐课程和几何驱动代理分离损失稳定表征学习,在多种异质性条件下提升分类性能。

详情
AI中文摘要

从分布式异质数据中学习判别性视觉表示是联邦学习(FL)中的一个基本挑战。基于原型的方法通过跨客户端共享类级表示来解决统计异质性,但在早期训练轮次中会产生距离依赖的梯度压力,这种压力尤其严重:对从噪声局部表示聚合而来的不成熟全局原型施加的对齐压力会产生大梯度,从而抑制局部判别结构的出现。结果导致嵌入空间组织不良,识别性能下降,尤其是在严重的非独立同分布(non-IID)条件下。我们提出FedSAP,一个通过两种互补机制稳定联邦表示学习的框架:一个确定性对齐课程,将全局对齐延迟到局部表示变得稳定;以及一个几何驱动的代理分离损失,利用现有原型库在单位超球面上强制执行类间结构,而不引入额外参数或通信开销。这些机制共同产生紧凑、分离良好的类簇,而不改变联邦参与者之间的底层通信协议。在三个基准测试和不同程度的异质性下的实验表明,与评估的原型基线相比,性能提升高达4个百分点,在高异质性下改进最为显著。我们框架的表示性质还使其能够直接扩展到半监督设置,其中未标记数据只需最小修改即可纳入,突显了调度对齐作为设计原则的通用性。

英文摘要

Learning discriminative visual representations from distributed, heterogeneous data is a fundamental challenge in Federated Learning (FL). Prototype-based methods address statistical heterogeneity by sharing class-level representations across clients but create a distance-dependent gradient pressure that is particularly severe during early training rounds: alignment pressure applied to immature global prototypes, aggregated from noisy local representations, generates large gradients that suppress the emergence of local discriminative structure. The result is a poorly organized embedding space and degraded recognition performance, particularly under severe non-IID conditions. We propose FedSAP, a framework that stabilises federated representation learning through two complementary mechanisms: a deterministic alignment curriculum that delays global alignment until local representations become stable, and a geometry-driven proxy separation loss that enforces inter-class structure on the unit hypersphere using the existing prototype bank without introducing additional parameters or communication overhead. Together, these mechanisms produce compact, well-separated class clusters without altering the underlying communication protocol between the federation's participants. Experiments across three benchmarks and varying degrees of heterogeneity show gains of up to 4 percentage points over the prototype-based baselines evaluated, with improvements most pronounced under high heterogeneity. The representational nature of our framework further enables a straightforward extension to semi-supervised settings, where unlabelled data is incorporated with minimal modification, underscoring the generality of scheduled alignment as a design principle.
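
The two mechanisms named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the schedule shape (zero during warm-up, then a linear ramp), the penalty form (mean positive pairwise cosine similarity between normalised prototypes), and all function and parameter names. The abstract does not specify the actual FedSAP formulas.

```python
import math


def alignment_weight(round_idx, warmup_rounds, ramp_rounds, max_weight=1.0):
    """Deterministic schedule: no global alignment during warm-up, then a linear ramp."""
    if round_idx < warmup_rounds:
        return 0.0
    progress = (round_idx - warmup_rounds) / max(ramp_rounds, 1)
    return max_weight * min(progress, 1.0)


def normalize(vec):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def separation_penalty(prototypes):
    """Mean positive cosine similarity over all pairs of class prototypes.

    Returns 0.0 when every pair is orthogonal or pointing apart, and
    approaches 1.0 as prototypes collapse onto the same direction.
    """
    unit = [normalize(p) for p in prototypes]
    pair_values = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            pair_values.append(max(0.0, cos))
    if not pair_values:
        return 0.0
    return sum(pair_values) / len(pair_values)
```

In such a setup the client loss at a given round could combine a local classification term, `alignment_weight(...)` times a prototype-alignment term, and the separation penalty. Because the penalty only reads the prototype bank, it adds no parameters and nothing to communicate, in line with the abstract's claim.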

2606.02171 2026-06-02 cs.CV 版本更新

InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

InsightVQA: 高维情感认知视觉问答基准

Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin, Zhongqian Mao, Jing Chen, Jiaqi Song, Yunshi Lan, Yan Wang

发表机构 * East China Normal University(华东师范大学)

AI总结 为解决现有基准仅关注情感识别而缺乏深层认知推理的问题,提出大规模层次化视觉问答数据集InsightVQA,包含725K问答对,并构建评估基准InsightVQA-Bench和基线模型InsightNet。

Comments 16 pages, 22 figures

详情
AI中文摘要

视觉情感理解要求模型不仅识别情感状态,还要理解其产生原因并进行更高层次的认知推理。然而,现有基准主要关注情感识别,对基于依据的理解和面向响应的分析支持有限。为弥补这一差距,我们引入了InsightVQA,一个用于情感理解和认知推理的层次化视觉问答大规模数据集。我们从六个公开来源收集的351K图像出发,应用严格的多阶段过滤流程,筛选出138K高置信度图像。每张图像在三个层次上进行标注:用于情感和效价识别的感知QA、通过约束引导生成从视觉触发提取构建的基于依据的理解QA,以及以响应意图预测和序列洞察推理为中心的认知QA。总计,InsightVQA包含725K个问答对。我们还提出了InsightVQA-Bench,一个包含30K样本的高质量评估基准,用于细粒度评估。为支持评估,我们引入了InsightNet,一个针对多模态大语言模型的情感调优基线。结果表明,InsightVQA对基于依据的情感理解和推理提出了重大挑战。

英文摘要

Visual emotion understanding requires models not only to recognize emotional states, but also to understand why they arise and to perform higher-level cognitive reasoning. However, existing benchmarks mainly focus on emotion recognition, offering limited support for grounded understanding and response-oriented analysis. To address this gap, we introduce InsightVQA, a large-scale dataset for hierarchical visual question answering on emotion understanding and cognitive reasoning. Building from 351K images collected from six public sources, we apply a rigorous multi-stage filtering pipeline to curate 138K high-confidence images. Each image is annotated at three hierarchical levels: perception QA for emotion and valence recognition, grounded understanding QA constructed from visual trigger extraction through constraint-guided generation, and cognition QA centered on response intent prediction and sequential insight reasoning. In total, InsightVQA contains 725K QA pairs. We further present InsightVQA-Bench, a high-quality evaluation benchmark comprising 30K samples for fine-grained evaluation. To support evaluation, we introduce InsightNet, an emotion-tuned baseline for MLLMs. Results demonstrate that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.

2606.02168 2026-06-02 cs.CV cs.LG 版本更新

Disentanglement-Based Equivariant Learning for Compositional VQA

基于解耦的等变学习用于组合式VQA

Zhou Du, Zhaoquan Yuan, Xiao Wu, Changsheng Xu

发表机构 * IEEE Publication Technology Group(IEEE出版技术组) School of Computing and Artificial Intelligence, Southwest Jiaotong University(计算机与人工智能学院,西南交通大学) Engineering Research Center of Sustainable Urban Intelligence Transportation, Ministry of Education, China(可持续智慧城市交通工程研究中心,中华人民共和国教育部) State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统(MAIS)国家重点实验室,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(人工智能学院,中国科学院大学)

AI总结 提出DEAL框架,通过因果干预解耦视觉和文本概念,并利用等变约束增强组合推理能力,在CLEVR-CoGenT和GQA-SGL上超越现有方法。

Comments Accepted by IEEE Transactions on Multimedia

详情
Journal ref
IEEE Trans. Multimedia, vol. 27, pp. 8160-8173, 2025
AI中文摘要

组合式视觉问答(VQA)是一项具有挑战性但基础的任务,要求模型理解先前学习概念的新组合。当前方法往往忽视潜在概念的解耦,并且在有效捕捉组合变化机制方面受到限制。此外,最先进的技术依赖于额外的线索进行训练,这在现实世界的VQA场景中不可行。为了解决这些问题,本文提出了一种新颖的基于解耦的等变学习(DEAL)框架用于组合式VQA,该框架仅由真实答案指导。在DEAL中,我们采用因果启发的干预措施,在重新编码框架内解耦来自视觉和文本输入的概念。基于等变性原理,我们随后对推理输入进行组合变换,并对输出施加等变约束,以增强模型的组合推理能力。在基准数据集CLEVR-CoGenT和GQA-SGL上进行的全面实验验证了我们提出的DEAL方法在视觉和语言泛化设置下均优于现有的最先进方法。

英文摘要

Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.

2606.02162 2026-06-02 cs.CV cs.AI cs.CL cs.IR 版本更新

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

视觉丰富文档类型分类的多模态方法:一项比较分析

Catyana Heyne, Jürgen Frikel, Filippo Riccio

AI总结 针对视觉丰富文档类型分类中多模态建模策略难以系统比较的问题,本文在统一实验框架下对基于Transformer和LLM的四种代表性模型进行受控对比,发现专用多模态Transformer优于LLM方法,且图像信息贡献最大。

详情
AI中文摘要

视觉丰富文档中的文档类型分类仍然具有挑战性,因为相关信息分布在文本、视觉和布局模态中。为了捕捉这种复杂性,当前方法依赖于多样化的多模态建模策略,导致异构架构使得系统比较复杂化。这种变异性也反映在现有的比较研究中,这些研究通常依赖于异构评估设置,进一步复杂化了系统比较,并使得评估进展变得困难。为了解决这些局限性,本文提供了跨基于Transformer和基于LLM架构的多模态设计策略的结构化分析,并结合统一实验框架内的受控实证比较。具体来说,在RVL-CDIP基准上评估了四种代表性模型(LayoutLMv3、Donut、Qwen3-VL-32B-Instruct和Qwen3-32B),以系统分析文本、图像和布局信息对文档类型分类的贡献,特别关注对比OCR依赖和OCR无关的方法。结果表明,专用多模态Transformer在视觉丰富和布局密集型文档上优于基于LLM的方法。图像信息对可靠分类贡献最大,而OCR派生的文本提供有用但次要的支持。这些发现强调,对于具有显著布局结构的文档,多模态处理仍然是必不可少的。总体而言,该研究为比较多模态架构提供了系统基础,并为选择有效的特征组合和模型设计以进行文档类型分类提供了实用指导。

英文摘要

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

2606.02161 2026-06-02 cs.CV cs.CL 版本更新

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

InfoMerge: 信息感知的令牌压缩用于高效视频大语言模型

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie, Sanglu Lu

发表机构 * State Key Laboratory of Novel Software Technology(新型软件技术国家重点实验室)

AI总结 提出InfoMerge,一种无需训练的视觉令牌压缩方法,通过鲁棒冗余估计和内容感知预算分配,在减少85%视觉令牌的同时保持98.8%性能,实现4.24倍预填充加速。

Comments 15 pages, 8 figures

详情
AI中文摘要

视频大语言模型在视频理解中表现出色,但过多的视觉令牌带来了巨大的计算开销。现有的免训练压缩方法通过减少视觉令牌来提高推理效率,但它们通常依赖局部相邻帧相似性进行时间冗余估计,或主要根据片段长度分配令牌预算。这种设计对帧级噪声敏感,且无法捕捉真实视频的非均匀信息分布。为解决这些挑战,我们提出InfoMerge,一种无需训练的视觉令牌压缩方法,通过鲁棒冗余估计和内容感知预算分配来提高令牌利用率。具体来说,我们提出时间指纹差异:一种片段级二阶时间冗余估计策略,用于建模每个片段内相同空间位置令牌的时间相似性结构。我们进一步引入内容感知预算分配(CABA),根据片段独特性和基于谱熵的表征丰富性动态分配片段级令牌预算。通过减少对冗余静态区域的重复保留,并将更多令牌分配给信息丰富的片段,InfoMerge在保持强大性能的同时更好地利用了有限的令牌预算。大量实验表明,InfoMerge在多个基准和骨干网络上实现了强效的精度-效率权衡,在激进压缩下优势更为明显。在LLaVA-OneVision-7B上,InfoMerge保留了原始平均性能的98.8%,同时减少了85%的视觉令牌,并在预填充阶段实现了4.24倍的加速。

英文摘要

Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency-accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while removing 85% of visual tokens and achieving a 4.24-fold speedup in the prefill stage.
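
A rough sketch of content-aware budget allocation as the abstract describes it. It assumes each segment already has a non-negative score and that the total token budget is split in proportion to those scores. The spectral entropy is computed from a given list of singular values. How the paper combines uniqueness with spectral entropy is not stated in the abstract, so the function names and the proportional rule are hypothetical.

```python
import math


def spectral_entropy(singular_values):
    """Shannon entropy of the normalised spectrum.

    A flat spectrum (many comparable directions) gives high entropy,
    a single dominant direction gives entropy close to zero.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def allocate_budget(scores, total_tokens):
    """Split total_tokens across segments in proportion to their scores.

    Uses largest-remainder rounding so the integer budgets sum exactly
    to total_tokens.
    """
    score_sum = sum(scores)
    raw = [total_tokens * s / score_sum for s in scores]
    budgets = [int(r) for r in raw]
    leftover = total_tokens - sum(budgets)
    by_remainder = sorted(
        range(len(scores)), key=lambda i: raw[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical scores for three segments; the third is twice as informative.
budgets = allocate_budget([1.0, 1.0, 2.0], 10)
```

Compared with allocating by segment length, a score-proportional rule of this kind gives a long but static segment few tokens and a short but information-dense segment many, which is the behaviour the abstract attributes to CABA.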

2606.02156 2026-06-02 eess.IV cs.AI cs.CV cs.IR cs.LG 版本更新

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

基于术前肠道血供映射预测结直肠吻合口漏风险

Zahra Tabatabaei, Jon Sporring, Mark Bremholm Ellebæk, Alaa El-Hussuna

发表机构 * Computer Science Department, Københavns Universitet (KU)(哥本哈根大学计算机科学系) University of Southern Denmark(南丹麦大学) Odense University Hospital(欧登塞大学医院) OpenSourceResearch Collaboration(开源研究协作)

AI总结 提出一种基于术前CT影像的AI驱动系统,通过分析血管和组织特征量化吻合口漏风险,并结合内容检索支持临床决策。

详情
AI中文摘要

吻合口漏仍然是结直肠癌手术后最严重的并发症之一,显著影响患者预后、康复轨迹和医疗成本。尽管影像技术有所进步,目前的术前评估仍依赖临床评估,这一过程主观、易出错且高度依赖个人经验。迄今为止,尚无经过验证的基于CT的方法能够在术前预测吻合口漏风险。本方案论文概述了一个全面的框架,用于开发和验证一个AI驱动的系统,该系统利用对比增强前后的CT影像进行术前风险评估。研究描述了数据收集、伦理处理、符合GDPR的患者数据预处理、图像预处理以及旨在生成临床可解释输出的深度学习架构探索等阶段。该工作流程的两个主要成果是:1) 风险评估模块,通过分析CT扫描中的血管和组织特征量化漏液可能性;2) 基于内容的医学图像检索(CBMIR)模块,识别并显示相似历史病例以支持循证手术决策。该方案论文需要医院和大学之间的密切合作;本方案表明,此类系统在现有医疗基础设施内技术上可行且临床可实施。通过遵循所提出的方法论阶段和监管原则,其他机构可以复制此工作流程以开发类似的决策支持工具。最终,这一跨学科框架旨在加强手术规划、减少漏液发生率,并推动向可解释、数据驱动的精准手术的更广泛范式转变。

英文摘要

Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcomes, recovery trajectories, and healthcare costs. Despite advances in imaging technology, current preoperative assessment relies only on clinical assessment, a process that is subjective, error-prone, and highly dependent on individual expertise. To date, no validated CT-based method exists to predict anastomotic leak risk prior to surgery. This protocol paper outlines a comprehensive framework for developing and validating an AI-driven system for preoperative risk assessment using pre- and post-contrast CT imaging. The study describes the stages of data collection, ethical handling, and preprocessing of patient data in accordance with GDPR, image preprocessing, and the exploration of deep learning architectures designed to generate clinically interpretable outputs. Two integrated tools constitute the main deliverables of this workflow: 1) a risk assessment module, which quantifies the likelihood of leakage by analyzing vascular and tissue features in CT scans, and 2) a Content-Based Medical Image Retrieval (CBMIR) module, which identifies and displays similar historical cases to support evidence-based surgical decision making. The protocol paper requires close collaboration between hospitals and universities; this protocol demonstrates that such a system is technically feasible and clinically implementable within existing healthcare infrastructures. By following the proposed methodological stages and regulatory principles, other institutions can reproduce this workflow to develop analogous decision-support tools. Ultimately, this interdisciplinary framework aims to enhance surgical planning, reduce leak incidence, and contribute to a broader paradigm shift toward explainable, data-driven precision surgery.

2606.02153 2026-06-02 cs.CV cs.GR 版本更新

Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances

超扩散姿态估计器:基于扩散的从稀疏惯性传感器和测距传感器间距离的人体运动跟踪

Dominik Hollidt, Tommaso Bendinelli, Christian Holz

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系)

AI总结 提出Ultra Diffusion Poser扩散模型,通过显式建模UWB测距的几何约束(空间布局模块解析重建传感器位置)和引入UWB扩散引导,在扩散采样中强制预测姿态与实测距离对齐,将关节位置误差降低22%。

Comments CVPR 2026 - Computer Vision and Pattern Recognition

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, pp. 7036-7046
AI中文摘要

使用惯性测量单元(IMU)的方法提供了一种可穿戴的替代基于摄像头的运动捕捉方案。为了减轻惯性信号的漂移,最近的稀疏惯性姿态估计器集成了由超宽带(UWB)测距测量的传感器间距离。到目前为止,UWB距离仅被用作额外的输入特征,忽略了它们对传感器位置施加的物理约束。然而,这些距离也可以用于重建底层3D传感器布局,从而为姿态重建提供更具信息性的输入。我们提出了Ultra Diffusion Poser,一种显式建模这些几何约束的扩散模型。它包括一个空间布局模块,该模块从UWB测量中解析地重建3D传感器位置。这些传感器位置与IMU信号和UWB距离一起作为扩散过程中的条件信号。尽管如此,网络预测可能违反传感器间距离测量。为了解决这个问题,我们引入了UWB扩散引导,它在扩散采样过程中鼓励预测姿态与测量距离之间的对齐。这些贡献共同使我们的模型达到了最先进的性能,将关节位置误差相比先前工作降低了高达22%。

英文摘要

Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.
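
The idea behind distance-based guidance can be illustrated with a simple consistency penalty: given predicted 3D sensor positions and measured inter-sensor distances, sum the squared differences between predicted and measured distances. This is only a sketch under assumptions. The function name, the dictionary layout, and the squared-error form are hypothetical, and the abstract does not say how UWB-Diffusion Guidance is actually formulated.

```python
import math


def distance_residual(positions, measured):
    """Sum of squared differences between predicted and measured distances.

    positions: dict mapping sensor name -> (x, y, z) predicted position.
    measured:  dict mapping (name_a, name_b) -> measured distance in metres.
    """
    total = 0.0
    for (a, b), d in measured.items():
        predicted = math.dist(positions[a], positions[b])
        total += (predicted - d) ** 2
    return total


# Hypothetical predicted positions: the two sensors are exactly 5 m apart.
positions = {"head": (0.0, 3.0, 4.0), "wrist": (0.0, 0.0, 0.0)}
consistent = distance_residual(positions, {("head", "wrist"): 5.0})
violated = distance_residual(positions, {("head", "wrist"): 4.0})
```

A guidance step during sampling would plausibly nudge the predicted pose in the direction that lowers such a residual, so that the final pose respects the measured distances instead of merely receiving them as input features.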

2606.02134 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Rethinking Evaluation Paradigms in IBP-based Certified Training

重新思考基于IBP的认证训练中的评估范式

Konstantin Kaulen, Hadar Shavit, Holger H. Hoos

发表机构 * University of Freiburg(弗赖堡大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 针对认证训练中自然精度与认证精度的权衡问题,提出基于Pareto前沿的多目标超参数优化方法,实现公平的方法间比较,并发现先前配置的欠调优现象,建立新的最优性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

深度神经网络在许多监督学习任务上取得了强大性能,但仍易受对抗性扰动的影响。神经网络验证提供了数学上严格的鲁棒性保证,但计算成本高昂。为缓解这一问题,认证训练技术在训练过程中优化可验证的鲁棒性,通常通过方法特定的超参数控制自然精度与认证精度之间的权衡。由于这些指标本质上是冲突的,报告单一配置的常见做法存在问题:它可能误导关于整体性能的结论,并妨碍对最新技术的无偏评估。我们通过基于自然-认证精度权衡的Pareto前沿比较来评估认证训练方法。为了实现公平、方法无关的比较,我们执行高效的自动化多目标超参数优化,为每种方法识别一组Pareto最优配置。这种方法常常揭示先前报告配置中的显著欠调优,从而获得更优性能并建立新的最优水平。利用这些前沿,我们首次对认证训练方法进行了全面的多目标比较,表明先前的进展并不像假设的那样显著,并揭示了先前未报告的性能互补性。

英文摘要

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural-certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.
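
A minimal sketch of the Pareto-front comparison described above, assuming each hyperparameter configuration is summarised by a (natural accuracy, certified accuracy) pair where higher is better on both axes. The function name and the sample numbers are hypothetical illustrations, not results from the paper.

```python
def pareto_front(points):
    """Return the points that no other point dominates.

    A point q dominates p if q is at least as good on both objectives
    and strictly better on at least one (higher is better).
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    # Sort by natural accuracy so the trade-off curve reads left to right.
    return sorted(front)


# Hypothetical (natural, certified) accuracies for four configurations.
configs = [(0.80, 0.50), (0.75, 0.55), (0.78, 0.49), (0.70, 0.60)]
front = pareto_front(configs)
```

In this toy example the configuration (0.78, 0.49) is dropped because (0.80, 0.50) beats it on both metrics, while the remaining three each represent a different trade-off. Reporting only one of them, as the abstract criticises, would hide the other two.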

2606.02129 2026-06-02 cs.CV 版本更新

Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

均衡扩散:面向均衡图像定制的频率感知文本嵌入

Liyuan Ma, Xueji Fang, Guo-Jun Qi

发表机构 * Westlake University(西湖大学) Zhejiang University(浙江大学)

AI总结 提出均衡扩散方法,通过频率空间分解概念特征并独立优化嵌入,实现风格与主体解耦,提升定制图像的保真度和文本对齐。

详情
AI中文摘要

图像定制从参考概念图像中学习目标主体,并根据文本提示生成条件图像,主要修改风格或背景。主流方法采用微调将多样化的概念属性打包到统一的潜在嵌入中,但纠缠的属性阻碍了从风格和背景中消除无关干扰。为解决此问题,我们提出均衡扩散,一种频率驱动的方法,解缠纠缠的概念特征,实现均衡定制和一致的文本-视觉匹配。与使用共享嵌入和统一调优学习完整概念的传统方法不同,我们的工作利用图像频率分量与语义之间的内在联系:低频表示主体内容,高频对应风格。我们在频率空间中分解概念并独立优化每个嵌入。这种分离优化使去噪器能够捕获与主体身份分离的风格,并更好地泛化到未见过的风格提示。合并多频率嵌入保留了模型原有的空间定制能力。我们进一步部署掩码引导扩散以限制无关背景变化并增强文本对齐。将残差参考注意力(RRA)插入空间注意力中以保持主体结构和身份一致性。实验证明,均衡扩散在主体保真度和文本遵循方面超过主流基线,验证了我们方法的优越性。

英文摘要

Image customization learns target subjects from reference concept images and generates conditioned images per text prompts, mainly modifying styles or backgrounds. Prevailing methods adopt fine-tuning to pack diverse concept attributes into a unified latent embedding, yet entangled attributes hinder elimination of irrelevant disturbances from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles tangled concept features for balanced customization and consistent text-visual matching. Unlike conventional methods learning full concepts with shared embeddings and unified tuning, our work utilizes the inherent link between image frequency components and semantics: low frequencies represent subject content and high frequencies correspond to styles. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and generalize better to unseen stylistic prompts. Merging multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and boost text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments prove Equilibrated Diffusion exceeds mainstream baselines on subject fidelity and text adherence, verifying our method's superiority.
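
To make the low-frequency/high-frequency split concrete, the sketch below separates a small grayscale image into a smooth band and a detail band. It uses a spatial box blur as a simple stand-in for a low-pass filter. This is an assumption for illustration only: the abstract says the method decomposes concepts in frequency space but does not describe the transform, and the function names are hypothetical.

```python
def box_blur(image, radius=1):
    """Average each pixel with its neighbours inside a square window."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def split_bands(image, radius=1):
    """Return (low, high) such that low + high reproduces the image."""
    low = box_blur(image, radius)
    high = [
        [pixel - smooth for pixel, smooth in zip(row, low_row)]
        for row, low_row in zip(image, low)
    ]
    return low, high


# A flat image has no detail, so its high band is zero everywhere.
flat = [[2.0, 2.0], [2.0, 2.0]]
flat_low, flat_high = split_bands(flat)
```

Under the abstract's premise, the low band would carry subject content and the high band would carry style, so learning a separate embedding against each band is one way to keep the two from entangling.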

2606.02120 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

理解增强的模型协作用于长尾自我中心错误检测

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS(人工智能安全国家重点实验室,计算技术研究所,中国科学院) School of Computer Science and Tech., University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Institute of Information Engineering, CAS(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

AI总结 提出理解增强的模型协作方法(UE-MCM),结合粗粒度视频理解与细粒度动作推理,通过双分支模型和自适应融合门检测自我中心视频中的错误,并优化长尾分布。

详情
AI中文摘要

在本报告中,我们解决了从自我中心视频数据中判断用户是否错误执行动作的问题。为此,我们提出了一种理解增强的模型协作方法(UE-MCM),该方法将高效的粗粒度视频理解与准确的细粒度动作推理相结合。具体来说,UE-MCM包含一个小模型分支和一个大模型分支。大模型分支关注细粒度动作本身是否执行错误,而小模型分支联合输入粗粒度视频和细粒度片段,以识别可能局部正确但与整体工作流不一致的动作。小模型分支基于CLIP4CLIP视频编码器构建,该编码器从通过扩散对比重建增强的CLIP模型初始化,大模型分支使用Qwen3-VL嵌入模型从细粒度动作片段中提取高容量表示。然后,通过轻量级协作门自适应融合小分支预测和大分支预测。为了处理错误实例的长尾分布,我们通过互补目标优化分类器,包括重加权交叉熵、AUC导向学习和标签感知调整。所得系统平衡了速度和准确性,使其能够有效检测自我中心教学视频中的细微、罕见和模糊错误。

英文摘要

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
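
The gated fusion and the reweighted loss mentioned above can be sketched as follows. The report's abstract does not give the exact gate architecture or loss weights, so the sigmoid gate over a single logit, the convex combination, the `pos_weight` parameter, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(p_small, p_large, gate_logit):
    """Convex combination of the two branch probabilities.

    The gate value g lies in (0, 1): g near 1 trusts the small branch
    (workflow-level context), g near 0 trusts the large branch
    (fine-grained action reasoning).
    """
    g = sigmoid(gate_logit)
    return g * p_small + (1.0 - g) * p_large


def reweighted_bce(p, label, pos_weight):
    """Binary cross-entropy with the rare 'mistake' class up-weighted."""
    eps = 1e-12  # guards against log(0)
    if label == 1:
        return -pos_weight * math.log(p + eps)
    return -math.log(1.0 - p + eps)


# With a neutral gate (logit 0) the two branches are averaged.
fused = fuse(0.2, 0.8, 0.0)
```

Up-weighting the positive class is one standard response to a long-tailed label distribution. The abstract additionally lists AUC-oriented learning and label-aware adjustment, which are not shown here.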

2606.02111 2026-06-02 cs.CV cs.AI cs.CL 版本更新

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

使用多片段视频破解多模态大语言模型

Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

发表机构 * Department of Applied Artificial Intelligence, Sungkyunkwan University(应用人工智能系,成均馆大学) Department of Human-Artificial Intelligence Interaction, Sungkyunkwan University(人类-人工智能交互系,成均馆大学)

AI总结 提出MCV SafetyBench数据集,通过多片段视频评估多模态大语言模型的安全漏洞,发现视频模态比图像更脆弱,动态和多样化上下文增加攻击成功率,并基于图像模态的鲁棒性提出防御策略。

Comments 27 pages, 20 figures, Accepted to the Main Conference of ACL 2026

详情
AI中文摘要

随着多模态大语言模型(MLLMs)发展到处理视频输入,人们开始担忧其被恶意滥用的可能性。先前的越狱研究表明,MLLMs中的安全对齐可以通过视觉输入被绕过,但尚不清楚视频输入的哪些属性导致了这种脆弱性。为填补这一空白,我们引入了Multi-Clip Video (MCV) SafetyBench,一个包含2,920个视频的数据集,旨在评估视频输入的多样性如何影响MLLMs的脆弱性。每个视频由多个短片段组成,描述与有害查询相关的不同上下文。对八个代表性视频MLLMs的实验表明,攻击成功率随着片段数量的增加而持续提高。我们的结果进一步表明,视频模态(1)比图像模态更脆弱,(2)对动态视频比对静态视频更脆弱,(3)当视频包含更多样化的上下文时更脆弱。基于这些发现,我们提出了一种利用图像模态相对鲁棒性的防御策略。

英文摘要

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.

2606.02105 2026-06-02 cs.CV 版本更新

Multimodal Action Diffusion for Robust End-to-End Autonomous Driving

多模态动作扩散用于鲁棒的端到端自动驾驶

Jorge Daniel Rodríguez-Vidal, Diego Porres, Gabriel Villalonga Pineda, Antonio M. López Peña

发表机构 * Computer Vision Center (CVC)(计算机视觉中心) Universitat Autònoma de Barcelona (UAB)(巴塞罗那自治大学)

AI总结 提出动作扩散变换器(ADT),通过多模态动作建模和最近邻匹配,在闭环Bench2Drive基准上超越先前最优方法,同时延迟降低十倍。

Comments Preprint. June 1st, 2026. Corresponding author: Jorge Daniel Rodríguez-Vidal

详情
AI中文摘要

端到端自动驾驶(E2E-AD)系统大多收敛于预测中间轨迹路点,将最终控制委托给具有GPS访问权限的手工控制器。直接控制信号预测(以端到端方式输出油门、转向和刹车)仍未被充分探索,且关键的是,动作多模态性在此类系统中的作用尚未被很好理解。我们认为,超越确定性单动作输出不仅是建模选择,更是驾驶性能、表示质量和训练稳定性的关键驱动因素。为验证这一点,我们引入了动作扩散变换器(ADT),这是一种无锚点扩散变换器,使用MSE目标训练,天然地对合理驾驶动作的多模态分布进行建模。ADT不承诺单一确定性命令,而是生成K个动作候选,并通过最近邻匹配(NNM)在推理时选择最合适的一个。除了强大的基准数值外,我们表明动作多模态性在学习表示和行为一致性方面带来了可衡量的好处,这些效果是确定性架构无法复制的。ADT在具有挑战性的闭环Bench2Drive基准上超越了先前最先进方法,同时实现了十倍更低的延迟,这表明表达性多模态动作建模对于鲁棒的端到端驾驶既实用高效又概念上必不可少。

英文摘要

End-to-End Autonomous Driving (E2E-AD) systems have largely converged on predicting intermediate trajectory waypoints, delegating final control to hand-crafted controllers with GPS access. Direct control-signal prediction (outputting throttle, steer and brake in an end-to-end fashion) remains underexplored, and critically, the role of action multimodality in such systems is not well understood. We argue that moving beyond deterministic, single-action outputs is not merely a modelling choice, but a key driver of driving performance, representational quality, and training stability. To validate this, we introduce the Action Diffusion Transformer (ADT), an anchor-free diffusion transformer trained with an MSE objective that natively models the multimodal distribution of plausible driving actions. Rather than committing to a single deterministic command, ADT generates K action candidates and selects the most suitable one at inference via Nearest Neighbour Matching (NNM). Beyond strong benchmark numbers, we show that action multimodality yields measurable benefits in learned representations and behavioral consistency, effects that deterministic architectures cannot replicate. ADT surpasses previous state-of-the-art on the challenging closed-loop Bench2Drive benchmark while achieving ten times lower latency, demonstrating that expressive, multimodal action modelling is both practically efficient and conceptually essential for robust end-to-end driving.

2606.02096 2026-06-02 cs.CV 版本更新

WebSpline: Structure-Informed Splines for Real-Time 3D Gaussians from Monocular Videos

WebSpline:面向单目视频实时三维高斯的结构化样条

Jongmin Park, Jeonghwan Yun, Minh-Quan Viet Bui, Munchurl Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 提出WebSpline框架,利用结构信息样条(SIS)表示和结构代理图(SPG),实现从单目视频中实时、高保真、结构连贯的动态三维高斯重建。

Comments The first two authors contributed equally to this work (equal contribution). Please visit our project page at https://kaist-viclab.github.io/webspline-site/

详情
AI中文摘要

从单目视频进行动态场景重建仍然极具挑战性,现有方法在有限的视角线索下往往难以平衡全局结构一致性与局部细节。为解决这一问题,我们提出WebSpline,一种新颖的动态三维高斯框架,能够从单目视频中实现结构连贯且高保真的重建,并支持快速渲染。WebSpline的核心是结构信息样条(SIS)表示,它使用可学习的三次埃尔米特样条对每个动态高斯轨迹进行建模,其运动通过辅助的结构代理图(SPG)进行结构化组织。所提出的框架分两个阶段进行优化:(i)第一阶段,从二维点轨迹初始化SPG,并通过时间刚性正则化进行细化,以建立序列中运动物体的结构连贯性;(ii)第二阶段,从细化后的SPG初始化SIS表示,并在空间和结构邻域约束下进行优化。推理时,仅通过评估学习到的SIS即可获得高斯运动,从而实现快速渲染。在具有挑战性的单目动态场景基准iPhone和NVIDIA上的大量实验表明,我们的WebSpline达到了最先进的渲染质量,同时在iPhone数据集上渲染速度比第二名WorldTree快10倍以上。

英文摘要

Dynamic scene reconstruction from monocular videos remains highly challenging, as existing methods often struggle to balance global structural coherence and local fine-grained details under limited multi-view cues. To address this challenge, we propose WebSpline, a novel dynamic 3D Gaussian framework that enables structurally coherent and high-fidelity reconstruction from monocular videos with fast rendering. The core of WebSpline is the Structure-Informed Spline (SIS) representation, which models each dynamic Gaussian trajectory using a learnable cubic Hermite spline whose motion is structurally organized with an auxiliary Structural Proxy Graph (SPG). The proposed framework is optimized in two stages: (i) in the first stage, the SPG is initialized from 2D point tracks and refined with temporal rigidity regularization to establish structural coherence for moving objects across the sequence; and (ii) in the second stage, the SIS representation is initialized from the refined SPG and optimized under both spatial and structural neighborhood constraints. At inference, Gaussian motion is obtained solely by evaluating the learned SIS, enabling fast rendering. Extensive experiments on the challenging monocular dynamic scene benchmarks, iPhone and NVIDIA, demonstrate that our WebSpline achieves state-of-the-art rendering quality while rendering over 10 times faster than WorldTree, the second-best method on the iPhone dataset.
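
The abstract's central primitive, a cubic Hermite spline per Gaussian trajectory, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names `hermite` and `trajectory`, the knot layout, and the idea that tangents are stored as velocities per unit time are not taken from the paper, and the SPG-based structural constraints are not modelled at all.

```python
def hermite(p0, p1, m0, m1, t):
    """Evaluate one cubic Hermite segment at t in [0, 1].

    p0, p1 are endpoint positions and m0, m1 endpoint tangents (3-tuples).
    """
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1      # weight of p0
    h10 = t3 - 2 * t2 + t          # weight of m0
    h01 = -2 * t3 + 3 * t2         # weight of p1
    h11 = t3 - t2                  # weight of m1
    return tuple(h00 * a + h10 * b + h01 * c + h11 * d
                 for a, b, c, d in zip(p0, m0, p1, m1))


def trajectory(knots, positions, tangents, time):
    """Position of one (hypothetical) Gaussian centre at `time`.

    knots: increasing timestamps; positions/tangents: one 3-tuple per knot.
    Tangents are assumed to be velocities, so they are scaled by the
    segment duration before the unit-interval basis is applied.
    """
    time = min(max(time, knots[0]), knots[-1])   # clamp to the spline range
    for i in range(len(knots) - 1):
        if time <= knots[i + 1]:
            dt = knots[i + 1] - knots[i]
            t = (time - knots[i]) / dt
            m0 = tuple(dt * v for v in tangents[i])
            m1 = tuple(dt * v for v in tangents[i + 1])
            return hermite(positions[i], positions[i + 1], m0, m1, t)
```

Evaluating such a closed-form curve needs no network forward pass, which is consistent with the abstract's remark that motion at inference is obtained solely by evaluating the learned spline.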

2606.02092 2026-06-02 eess.IV cs.AI cs.CV 版本更新

LALE: Lightweight-Transformer Architecture for Land-Cover Estimation

LALE:用于土地覆盖估计的轻量级Transformer架构

Ümit Mert Çağlar, Alptekin Temizel

发表机构 * Middle East Technical University(中东技术大学)

AI总结 提出LALE架构,通过分辨率分支编码器(轻量级ConvMixer处理高分辨率局部特征,Transformer处理低分辨率全局上下文)和全MLP多尺度解码器,在遥感图像分割中实现高效性能与计算成本的平衡。

详情
AI中文摘要

遥感图像的语义分割需要模型在严格的计算预算下同时捕捉全局上下文和局部细节。先前的工作通常针对这些轴之一进行优化:注意力用于全局上下文,卷积用于局部细节,或紧凑性用于效率。虽然混合方法旨在同时捕捉两者,但它们需要架构更改和带有计算开销的编码器骨干,限制了效率和性能。我们提出了LALE(用于土地覆盖估计的轻量级Transformer架构),一种端到端的遥感图像分割架构,它按分辨率将编码器分为两支:轻量级ConvMixer阶段处理高分辨率局部特征,而Transformer阶段处理低分辨率全局上下文,从而将自注意力的二次成本限制在深层、下采样的特征图上。全MLP多尺度解码器,以及贯穿始终的RMSNorm和StarReLU,进一步减少了计算量和参数数量。在大型ARAS400k遥感分割基准上,LALE相对于CNN、Transformer和混合基线建立了强大的效率-性能权衡。我们最小的变体(仅1.6M参数)在F1分数上达到最佳基线(UPerNet)的2.6分以内,同时使用4.5倍更少的参数、7倍更少的存储、17倍更少的GMACs,并提供1.8倍更高的吞吐量。

英文摘要

Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational budgets. Prior work typically optimizes for one of these axes: attention for global context, convolution for local detail, or compactness for efficiency. While hybrid approaches aim to capture both, they require architectural changes and encoder backbones with computational overhead, limiting efficiency and performance. We present LALE (Lightweight-transformer Architecture for Land-cover Estimation), an end-to-end remote sensing image segmentation architecture that bifurcates its encoder by resolution: lightweight ConvMixer stages handle high-resolution local features, while transformer stages handle low-resolution global context, confining the quadratic cost of self-attention to deep, downsampled feature maps. An all-MLP multi-scale decoder, together with RMSNorm and StarReLU throughout, further reduces compute and parameter count. On the large-scale ARAS400k remote-sensing segmentation benchmark, LALE establishes a strong efficiency-performance trade-off against CNN, transformer, and hybrid baselines. Our smallest variant (just 1.6M parameters) reaches within 2.6 F1 points of the best baseline (UPerNet) while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and delivering 1.8x higher throughput.

2606.02080 2026-06-02 cs.MA cs.AI cs.CV 版本更新

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

Agentic-J:用于生物显微镜图像分析的AI智能体

Lukas Johanns, Marilin Moor, Davide Panzeri, Yu Zhou, Xinyi Chen, Nora F. K. Pauly, Zixuan Pan, Matthias Gunzer, Andreas Müller, Yiyu Shi, Hedi Peterson, Jianxu Chen

AI总结 提出基于容器的多智能体AI助手Agentic-J,通过自然语言接口集成ImageJ/Fiji工具,实现从细胞分割到多条件量化的可追溯、可复现生物图像分析工作流。

Comments Presented at Cell Biology at Scale 2026 (Poster). The Agentic-J project is available at https://mmv-lab.github.io/Agentic-J/

详情
AI中文摘要

生物图像分析日益需要整合异构工具、编程环境和领域知识,而很少有研究人员能同时掌握这些。我们提出Agentic-J,一个容器化的多智能体AI助手,主要面向ImageJ/Fiji,使生物学家能够用自然语言指定分析任务,从细胞核分割、细胞追踪到多条件量化。该智能体生成可执行的脚本,并组织成有文档记录的项目结构,因此每个分析决策都是可追溯的,工作流可以复现或共享。专门的子智能体负责插件管理、代码生成、调试、质量保证和统计报告。本文介绍系统的设计,展示真实的生物显微镜图像分析工作流,并详细说明技术实现。

英文摘要

Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that few researchers can command simultaneously. We present Agentic-J, a containerised, multi-agent AI assistant, primarily for ImageJ/Fiji, that enables biologists to specify analysis tasks in natural language, from nuclei segmentation and cell tracking to multi-condition quantification. The agent generates executable scripts organised into a documented project structure, so every analysis decision is traceable and the workflow can be reproduced or shared. Specialised sub-agents handle plugin management, code generation, debugging, quality assurance, and statistical reporting. In this paper we introduce the system's design, demonstrate real biological microscopy image analysis workflows, and detail the technical implementation.

2606.02079 2026-06-02 cs.CV 版本更新

FACT: A Simple and Efficient Framework for Active Finetuning

FACT:一种简单高效的主动微调框架

Wenshuai Xu, You Song, Yuzhuo Cui, Minjie Ren, Qingjie Liu, Zhenghui Hu

资助项目 * Zhejiang (No. 2024C01020)(浙江(No. 2024C01020)) National Natural Science Foundation of China (No. 62302031)(中国国家自然科学基金(No. 62302031)) Zhejiang Provincial Natural Science Foundation of China (Nos. LQ23F020024 and LZJMZ24D050009)(浙江省自然科学基金(Nos. LQ23F020024 and LZJMZ24D050009))

AI总结 针对主动微调中全量微调导致预训练特征失真和过拟合的问题,提出FACT三阶段分层微调框架,通过冻结特征增强和参数高效微调,在多种数据集和架构上显著提升性能,尤其在低采样率下实现超过20%的增益。

Comments ACCEPTED for publication as a REGULAR paper in the IEEE Transactions on Image Processing (T-IP)

详情
AI中文摘要

主动微调的主要目标是通过使用精心挑选的信息性或挑战性数据对预训练模型进行微调,以提高其在特定任务或领域上的性能。先前的研究主要关注主动方面(即数据选择),同时统一采用全量微调进行模型适应,这不可避免地因分布偏移而扭曲预训练特征。当模型大小相对于微调数据量较大时,这个问题变得尤为突出,导致过拟合风险增加。为了解决这一关键差距,我们正式概述了FiAF任务,该任务强调在主动学习中系统探索微调方法。我们提出了FACT,一个三阶段分层微调框架,兼具高效性和简洁性,专门为主动微调场景设计。我们的综合实验涵盖:(1)三大数据集类别,包括经典(CIFAR10、CIFAR100、ImageNet-1k)、不平衡(CIFAR10-LT、CIFAR100-LT)和细粒度(StanfordCars、FGVCAircraft)图像分类数据集,每个在3-5种不同采样率下评估;(2)多样化的预训练架构,包括卷积神经网络(ConvNeXt)、视觉变换器(ViT)和视觉LSTM(ViL)网络;(3)对冻结特征增强(FroFA)策略的系统研究;(4)对效率和泛化性的全面严格分析。结果表明,我们的框架具有显著改进,并具备强大的泛化性和鲁棒性。值得注意的是,在低采样率下,我们的框架在CIFAR10、CIFAR100和ImageNet-1k基准测试中,ViT模型实现了超过20%的显著性能提升。这种系统性的方法在保持参数效率的同时建立了新的最先进性能,在标记数据稀缺时尤其有效。

英文摘要

The main goal of active finetuning is to improve a pretrained model's performance on a specific task or domain by finetuning it with carefully selected informative or challenging data. Previous research has predominantly focused on the active aspect (i.e., data selection) while uniformly employing full finetuning for model adaptation, which inevitably distorts pretrained features due to distribution shift. This issue becomes particularly pronounced when the model size is large relative to the finetuning data quantity, leading to heightened overfitting risks. To address this critical gap, we formally outline the FiAF task that emphasizes systematic exploration of finetuning methodologies in active learning. We propose FACT, a three-phase hierarchical finetuning framework featuring both efficiency and simplicity, specifically designed for active finetuning scenarios. Our comprehensive experiments span: (1) Three major dataset categories encompassing classic (CIFAR10, CIFAR100, ImageNet-1k), imbalanced (CIFAR10-LT, CIFAR100-LT), and fine-grained (StanfordCars, FGVCAircraft) image classification datasets, each evaluated under 3-5 distinct sampling ratios; (2) Diverse pretrained architectures including Convolutional Neural Network (ConvNeXt), Vision Transformer (ViT), and Vision LSTM (ViL) networks; (3) A systematic investigation of frozen feature augmentation (FroFA) strategies; (4) A comprehensive and rigorous analysis of efficiency and generalizability. The results demonstrate significant improvements with strong generalization and robustness. Notably, under low sampling ratios, our framework achieves remarkable performance gains of over 20% on the ViT model for CIFAR10, CIFAR100, and ImageNet-1k benchmarks. This systematic approach establishes new state-of-the-art performance while maintaining parameter efficiency, proving particularly effective when labeled data is scarce.

2606.02068 2026-06-02 cs.CV cs.AI 版本更新

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

基于可微多平面图像的快速轻量级新视角合成

Kaidi Zhang, Guanxu Zhu

发表机构 * Universiti Malaya(马来大学) Wuhan University(武汉大学)

AI总结 针对现有方法在速度、模型大小和稀疏视角下的不足,提出基于可微多平面图像(MPI)的快速轻量级新视角合成方法,利用点图进行几何初始化并引入一步扩散处理空洞和伪影。

详情
AI中文摘要

近年来,新视角合成取得了显著进展,主流方法如神经辐射场(NeRF)和3D高斯泼溅(3DGS)产生了令人印象深刻的结果。然而,这些方法往往难以平衡渲染速度和模型大小,且其基于优化的训练可能非常耗时。此外,它们通常依赖于密集观测,在稀疏视角条件下往往无法产生令人满意的结果。尽管前馈重建显著减少了3DGS的优化时间,但其像素对齐公式从单张图像生成数百万个高斯,严重限制了其在移动设备上的实际部署。为了解决这些限制,我们重新审视了多平面图像(MPI)表示,该表示使用一组紧凑的平面层来表示场景,以实现高效的新视角合成。利用视觉基础模型的最新进展,我们使用预测的点图进行可靠的几何初始化,然后进行可微优化。为了解决稀疏初始化MPI中的空洞和伪影问题,我们引入了一步扩散,该扩散既参与MPI的可微优化,也参与渲染结果的后处理。与代表性的基于GS的方法相比,我们的方法速度快30.7%,模型大小仅为其14.8%,同时在前视场景中实现了具有竞争力的合成质量。

英文摘要

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) delivering impressive results. However, these approaches often struggle to balance rendering speed and model size, and their optimization-based training can be highly time-consuming. Furthermore, they typically rely on dense observations, often failing to produce satisfactory results under sparse-view conditions. Although feed-forward reconstruction significantly reduces the optimization time of 3DGS, its pixel-aligned formulation generates millions of Gaussians from a single image, severely limiting its practical deployment on mobile devices. To address these limitations, we revisit the Multiplane Image (MPI) representation, which represents scenes using a compact set of planar layers for efficient novel view synthesis. Leveraging recent advances in visual foundation models, we utilize predicted point maps for reliable geometric initialization, followed by differentiable optimization. To address the issues of holes and artifacts in sparsely initialized MPI, we introduce one-step diffusion, which participates in both the differentiable optimization of MPI and the postprocessing of rendering results. Compared with a representative GS-based method, our approach is 30.7% faster and uses only 14.8% of its model size, while achieving competitive synthesis quality on front-view scenarios.
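
To make the MPI representation concrete, the sketch below shows the standard back-to-front "over" compositing of planar layers for a single pixel. It is an illustration only: the function name `composite_mpi` is hypothetical, the per-plane warping into the novel view is omitted, and nothing here reflects the paper's point-map initialization or one-step diffusion.

```python
def composite_mpi(planes):
    """Composite MPI layers for one pixel with the 'over' operator.

    planes: list of (rgb, alpha) pairs ordered from the farthest plane
    to the nearest one; rgb is a 3-tuple in [0, 1], alpha is in [0, 1].
    """
    out = (0.0, 0.0, 0.0)                      # start from a black background
    for rgb, alpha in planes:
        # A nearer plane covers what is behind it in proportion to alpha.
        out = tuple(alpha * c + (1.0 - alpha) * o for c, o in zip(rgb, out))
    return out
```

Because every operation is a multiply-add, the rendered colour is differentiable with respect to each plane's colour and alpha, which is what makes gradient-based optimization of the layers possible.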

2606.02058 2026-06-02 cs.CV cs.RO 版本更新

TIDES: Time-Derivative Event Simulation via Deformable Reconstruction

TIDES:基于可变形重建的时间导数事件模拟

Christopher Thirgood, Dipon Kumar Ghosh, Simon Hadfield

发表机构 * University of Surrey(萨里大学)

AI总结 提出TIDES,一种基于动态高斯泼溅的连续时间事件模拟器,通过显式3D场景表示推导逐像素强度动态,实现精确的阈值交叉预测,并利用遮挡引导自适应时间步长,达到最先进的事件流保真度。

详情
AI中文摘要

事件相机响应环境外观变化而发出异步事件。真实世界事件数据集的稀缺使得模拟至关重要。然而,大多数模拟器从帧序列推断事件时间戳,迫使许多阈值交叉共享一小组离散时间;我们将这种失效模式称为时间戳批处理,它在快速运动和遮挡下会恶化。我们提出TIDES,一种基于动态高斯泼溅的连续时间事件模拟器。由于TIDES在具有学习几何和运动的显式3D场景表示上运行,它可以直接从场景推导每像素强度动态,而不是通过渲染帧的差分。这使得能够精确预测阈值交叉,包括每个渲染步骤的多次交叉,而无需时间上采样或帧插值。相同的3D场景模型揭示了物体之间部分遮挡的位置;TIDES利用这一点来指导自适应时间步长,仅将计算集中在遮挡动力学使简单亮度变化模型不可靠的区域。最后,我们使用瓦片级仲裁器对有限传感器带宽进行建模,其吞吐量、抖动和事件丢失再现了真实的传感器伪影。在配对的RGB-事件基准测试中,TIDES达到了最先进的事件流保真度。我们还表明,TIDES模拟的事件比竞争对手更有效地转移到真实下游任务。

英文摘要

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times, a failure mode we term timestamp batching that worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than competitors'.
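
The idea of several threshold crossings inside one rendering step, each with its own timestamp, can be sketched under a deliberately simple assumption: the pixel's log-intensity changes linearly within the step. This is the textbook contrast-threshold event model, not the TIDES derivation; the function name `events_in_step` and all parameters are hypothetical.

```python
def events_in_step(ref, l0, l1, t0, t1, c):
    """Events for one pixel during a single step [t0, t1].

    ref: log-intensity at the last emitted event (assumed |l0 - ref| < c)
    l0, l1: log-intensity at t0 and t1, assumed to vary linearly in between
    c: contrast threshold (> 0)
    Returns (list of (timestamp, polarity), updated ref).
    """
    events = []
    if l1 == l0:
        return events, ref                    # no change, no events
    pol = 1 if l1 > l0 else -1
    level = ref + pol * c                     # next threshold to cross
    while (l1 - level) * pol >= 0:            # threshold reached in this step
        # Solve the linear ramp for the exact crossing time.
        t = t0 + (level - l0) / (l1 - l0) * (t1 - t0)
        events.append((t, pol))
        ref = level
        level += pol * c
    return events, ref
```

A frame-differencing simulator would stamp both events of a two-crossing step with the same time; here each crossing receives a distinct timestamp, which is the behaviour the abstract contrasts with timestamp batching.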

2606.02048 2026-06-02 cs.AI cs.CV physics.bio-ph 版本更新

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

动态酪蛋白凝胶化显微图像拓扑纹理分析及其与流变学性质的关系

Zahra Tabatabaei, Diana Soto Aguilar, Jose C. Bonilla, Mathias P. Clausen, Jon Sporring

发表机构 * Department of Computer Science, University of Copenhagen, Denmark(哥本哈根大学计算机科学系) Department of Green Technology, University of Southern Denmark, Denmark(南丹麦大学绿色技术系) Department of Food Science, University of Copenhagen, Denmark(哥本哈根大学食品科学系)

AI总结 提出结合拓扑数据分析、差分盒计数、多重分形分割和局部二值模式的工具箱,分析STED显微图像中酪蛋白凝胶化的拓扑与纹理特征,揭示与流变学性质相关的微观结构转变。

详情
AI中文摘要

我们提出了一种新颖的计算工具箱,集成了拓扑数据分析(TDA)、差分盒计数(DBC)、多重分形分割(MFP)和局部二值模式(LBP),应用于由葡萄糖酸-δ-内酯(GDL)在30°C和40°C以及两种GDL浓度(1.8%和3.5% w/v)下诱导的酪蛋白酸钠凝胶化的时间序列超分辨率STED显微图像。TDA通过最大Betti-1曲线追踪拓扑环,即反映蛋白质网络互连性的封闭环状结构,揭示了分散聚集体的滞后阶段、与网络渗透和流变学观察到的溶胶-凝胶转变相一致的急剧衰减,以及对应于网络重排的凝胶后增加。这些拓扑转变通过DBC和MFP得到证实,因为这些方法能够解析结构复杂性和空间异质性的变化。该工具箱在实验应用前在模拟分形图像上进行了验证。总之,这些描述符对体相流变学作为平均体相力学响应捕获的细微微观结构转变具有敏感性。这种集成方法为表征食品和材料科学中具有演化微观结构动力学的复杂微观结构提供了稳健的定量工具。代码可在https://github.com/Zahratabatabaei/Delifood_CV_paper.git获取。

英文摘要

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP), applied to time-lapse super-resolution STED microscopy images of sodium caseinate gelation induced by glucono-delta-lactone (GDL) at 30 °C and 40 °C and two GDL concentrations (1.8% and 3.5% w/v). TDA tracked topological loops, closed ring-like structures reflecting protein network interconnectivity, via max-Betti-1 curves, which revealed a lag phase of dispersed aggregates, a sharp decay coinciding with network percolation and the rheologically observed sol-gel transition, and a post-gelation increase corresponding to network rearrangements. These topological transitions were corroborated by DBC and MFP as these methods were able to resolve changes in structural complexity and spatial heterogeneity. The toolbox was validated on simulated fractal images prior to experimental application. Together, these descriptors provided sensitivity to subtle microstructural transitions that bulk rheology captured as averaged bulk mechanical responses. This integrated approach provides a robust quantitative tool for characterizing complex microstructure in food and material science with evolving microstructural dynamics. Code is available at https://github.com/Zahratabatabaei/Delifood_CV_paper.git

2606.02042 2026-06-02 cs.CV 版本更新

Normality-Preserving Continual Industrial Anomaly Detection via Orthogonal LoRA Banks

通过正交LoRA库保持正态性的持续工业异常检测

Weibai Fang, Haijun Che, Feiyang Ren, Qiancheng Lao

发表机构 * Yisu University(Yorkshire University)

AI总结 提出基于历史冻结正交LoRA库和分层新颖性自适应库增长模块的框架,解决扩散模型在持续工业异常检测中的历史正常性先验漂移和灾难性遗忘问题。

Comments 33 pages,6 figures,Submitted to Advanced Engineering Informatics

详情
AI中文摘要

基于扩散模型的持续工业异常检测面临历史正常性先验漂移和灾难性遗忘问题。现有的持续扩散方法通过回放或约束优化保留先前知识,但缺乏在顺序适应过程中隔离和保护类别特定正常性先验的显式机制。尽管低秩适应提供了模块化残差更新,但标准LoRA既未冻结历史正常性子空间,也未阻止新适配器干扰先前适配器。为解决此问题,我们提出基于两个模块的正常性保持持续异常检测框架:历史冻结正交LoRA库(HF-OLB)和分层新颖性自适应库增长模块(HNABG)。HF-OLB冻结预训练的U-Net主干和已学习的LoRA库,并将新任务特定的正常性残差约束到历史LoRA子空间的正交补空间中。HNABG进一步分配层依赖的残差容量,并仅在残差正常性新颖度超过现有库的表达容量时扩展库。在MVTec和VisA上的大量实验证明了所提方法的有效性。在具有挑战性的VisA 2x6设置下,我们的方法实现了83.6/91.8的图像和像素级A-AUROC,以及3.8/3.9的FM,将像素级A-AUROC较现有最优方法提升了3.2个百分点,同时将像素级FM降低了1.3。这些结果表明,我们的方法在长时间跨度的持续类别序列中有效保留了历史正常性先验。

英文摘要

Continual industrial anomaly detection with diffusion models suffers from historical normality prior drift and catastrophic forgetting. Existing continual diffusion methods preserve previous knowledge through replay or constrained optimization, but they lack an explicit mechanism for isolating and protecting category-specific normality priors during sequential adaptation. Although low-rank adaptation provides modular residual updates, standard LoRA neither freezes historical normality subspaces nor prevents new adapters from interfering with previous ones. To address this issue, we propose a normality-preserving continual anomaly detection framework based on two modules: History Frozen Orthogonal LoRA Bank (HF-OLB) and Hierarchical Novelty Adaptive Bank Growth module (HNABG). HF-OLB freezes both the pre-trained U-Net backbone and the learned LoRA banks, and constrains new task-specific normality residuals to the orthogonal complement of historical LoRA subspaces. HNABG further allocates layer-dependent residual capacity and expands the bank only when the residual normality novelty exceeds the expressive capacity of existing banks. Extensive experiments on MVTec and VisA demonstrate the effectiveness of the proposed method. On the challenging VisA 2x6 setting, our method achieves 83.6/91.8 image and pixel level A-AUROC with 3.8/3.9 FM, improving pixel level A-AUROC over the state of the art by 3.2 points while reducing pixel level FM by 1.3. These results show that our method effectively preserves historical normality priors in long horizon continual category sequences.
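
The constraint "new residuals live in the orthogonal complement of historical subspaces" can be illustrated on plain vectors. The sketch below is an assumption-laden toy: the abstract does not say whether HF-OLB enforces orthogonality by hard projection or by a penalty, nor on which LoRA factor it acts, and the names `orthonormal_basis` and `project_to_complement` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:                         # skip linearly dependent vectors
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, history):
    """Remove from `update` every component lying in span(history)."""
    w = list(update)
    for b in orthonormal_basis(history):
        coeff = dot(w, b)
        w = [wi - coeff * bi for wi, bi in zip(w, b)]
    return w
```

After projection the update has zero inner product with every historical direction, so applying it cannot change the model's response along those frozen directions; that is the intuition behind using orthogonality to limit forgetting.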

2606.02022 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

排名 vs. 分配:多视角目标关联中的度量不匹配

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

发表机构 * Tevian Moscow(莫斯科Tevian) Lomonosov Moscow State University(莫斯科国立罗蒙诺索夫大学)

AI总结 本文揭示了多视角目标关联中常用的排名度量(如AP、FPR-95)与分配目标之间的根本性不匹配,并提出了基于Sinkhorn归一化的后处理方法以缓解该问题。

详情
AI中文摘要

多视角目标关联是一个重要的计算机视觉问题,是许多多相机感知任务的基础。虽然该任务自然被表述为受约束的一对一匹配问题,但最近的工作严重依赖成对排名度量(如AP和FPR-95)进行模型评估。我们强调了这些度量与实际分配目标之间的根本性不匹配。理论上,我们表明即使分配已经正确,AP和FPR-95也可能不完美,而基于Sinkhorn的归一化可以使它们完美。相反,最优的成对排名仍然可能导致错误的分配。我们通过使用基于Sinkhorn的归一化作为受控的后处理压力测试,在实践中验证了这种不匹配。我们表明,仅优化几个后处理参数就能显著提升AP和FPR-95,而分配级别的度量(如ACC和IPAA)却没有相应改进。

英文摘要

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
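
Sinkhorn normalization, which the abstract uses as a post-processing stress test, alternately rescales the rows and columns of a positive score matrix until it is approximately doubly stochastic. The sketch below shows only that generic iteration; the paper's specific variant (temperature, handling of unmatched objects, tuned parameters) is not described in the abstract, and the function name `sinkhorn` is an assumption.

```python
def sinkhorn(scores, iters=200):
    """Alternately normalise rows and columns of a positive square matrix."""
    m = [list(row) for row in scores]
    n = len(m)
    for _ in range(iters):
        for i in range(n):                     # make each row sum to 1
            s = sum(m[i])
            m[i] = [x / s for x in m[i]]
        for j in range(n):                     # make each column sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m
```

The rescaling changes pairwise scores without changing which one-to-one matching an assignment solver would select, which is one way to see how ranking metrics such as AP can move while assignment-level metrics stay flat.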

2606.02021 2026-06-02 cs.CV 版本更新

PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation

PerBite: 一种用于咬合感知食物体积估计的精选诊断工作流

Ahmad AlMughrabi, Farid Al-Areqi, David Fernández Gómez, Umair Haroon, Marc Bolaños, Ricardo Marques, Petia Radeva

发表机构 * University of Barcelona(巴塞罗那大学) LogMeal Universitat Pompeu Fabra(庞培法布拉大学)

AI总结 提出PerBite工作流,通过分割、三维重建、尺度校准和网格后处理等步骤,从餐前餐后状态估计食物体积,在MetaFood挑战中排名第一。

详情
AI中文摘要

一个视觉上合理的食物网格能否被信任来估计消耗食物的体积?PerBite 使用来自MetaFood CVPR 2026连续三维重建与进食挑战的选定配对餐前和餐后状态来研究这个问题。提交的工作流遵循一个精选的重建协议:SAM 3分割食物和盘子区域;Hunyuan3D/SAM 3D生成无量纲食物网格;盘子直径提供度量尺度;在Blender中移除盘子几何形状;剩余的网格进行孔洞填充、水密化并积分以估计体积。MoGe-2仅作为辅助线索用于初始菜肴直径估计,当直接盘子测量不确定时;它不是报告挑战结果的主要尺度来源。PerBite 排名第一,在34个网格上使用刚性ICP(无尺度校正)的平均Chamfer距离为8.31。在17个餐前餐后对上,它实现了33.87%的状态级体积MAPE和零单调性违规,而消耗体积MAPE为53.74%。结果表明,表面重建、度量尺度、受控网格清理、水密体积积分和物理消耗一致性应分别评估以用于饮食评估。源代码和评估脚本将在 https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026 提供。

英文摘要

Can a visually plausible food mesh be trusted to estimate the volume of consumed food? PerBite investigates this question using selected paired before- and after-consumption states from the MetaFood CVPR 2026 Continuous 3D Reconstruction While Eating Challenge. The submitted workflow follows a curated reconstruction protocol: SAM 3 segments the food and plate regions; Hunyuan3D/SAM 3D generates a dimensionless food mesh; the plate diameter provides the metric scale; the plate geometry is removed in Blender; and the remaining mesh is hole-filled, made watertight, and integrated to estimate volume. MoGe-2 is used only as an auxiliary cue for initial dish-diameter estimation when direct plate measurement is uncertain; it is not the primary scale source for the reported challenge result. PerBite ranks first, with an average Chamfer distance of 8.31 across 34 meshes using rigid ICP without scale correction. On 17 before- and after-pairs, it achieves 33.87% state-level volume MAPE and zero monotonicity violations, while consumed-volume MAPE remains 53.74%. The results show that surface reconstruction, metric scale, controlled mesh cleanup, watertight volume integration, and physical depletion consistency should be evaluated separately for dietary assessment. Source code and evaluation scripts will be available at https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026.
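
Two steps of the workflow lend themselves to a short worked sketch: integrating the volume of a watertight triangle mesh and converting it to metric units with the plate diameter. The code below uses the standard signed-tetrahedron formula and a plain MAPE; the function names, the centimetre unit, and the exact metric definitions are assumptions, not the challenge's official evaluation code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def mesh_volume(vertices, faces):
    """Volume of a closed, consistently oriented triangle mesh.

    Sums signed volumes of tetrahedra spanned by the origin and each face.
    """
    total = 0.0
    for i, j, k in faces:
        total += dot(vertices[i], cross(vertices[j], vertices[k]))
    return abs(total) / 6.0


def metric_volume(vertices, faces, plate_diameter_mesh, plate_diameter_cm):
    """Rescale a dimensionless mesh volume using the known plate diameter."""
    scale = plate_diameter_cm / plate_diameter_mesh
    return mesh_volume(vertices, faces) * scale ** 3   # volume scales cubically


def mape(pred, true):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)
```

The cubic dependence on scale means a 10% error in the estimated plate diameter becomes roughly a 33% error in volume, and a consumed volume computed as a before-minus-after difference compounds the errors of both states. Both effects are consistent with the abstract's point that scale and consumption should be evaluated separately from surface quality.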

2606.02002 2026-06-02 cs.CV 版本更新

Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

面向盲图像质量评估的统计与视觉-语言特征的失真感知融合

Bishr Omer Abdelrahman Adam, Xu Li

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 提出一种失真感知融合框架,通过乘法门控机制动态加权NSS统计特征与VLM嵌入,在三个基准上取得最优或竞争性能,并揭示NSS对不同失真的贡献差异。

详情
AI中文摘要

盲图像质量评估(BIQA)旨在无参考图像的情况下预测感知图像质量。经典的自然场景统计(NSS)描述符和现代视觉语言模型(VLM)嵌入从根本不同的角度解决这一问题,但两者结合是否能产生互补优势以及如何根据输入图像加权其贡献尚待探索。我们提出一种失真感知融合框架,通过乘法门控机制将138维NSS描述符与两种互补的VLM嵌入(SigLIP和CLIP-H)集成,该门控机制学习基于图像内容的每输入流权重。与静态拼接融合不同,所提出的门控网络根据输入抑制或放大每个流的贡献,产生的权重与在KADID-10k上通过独立消融测量的每失真NSS贡献呈正相关(Spearman秩相关系数ρ=0.33)。该框架无需对VLM骨干网络进行端到端微调,并使用结合均方误差、Pearson线性相关和成对排序目标的混合损失进行训练。我们在三个标准基准上评估:KonIQ-10k(SROCC=0.9142,PLCC=0.9279)、KADID-10k(SROCC=0.9715,PLCC=0.9733,超越近期最先进方法)和LIVE Challenge in-the-Wild(通过跨数据集预训练和微调,SROCC=0.8527,PLCC=0.8802)。在KADID-10k上的每失真分析表明,NSS特征对噪声和色彩偏移失真(像素统计直接影响)贡献最大,对感知失真(如色彩饱和度变化)贡献最小。学习到的门控值验证了这些发现,确认模型自主发现了与手动每失真研究一致的失真-流亲和模式。

英文摘要

Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
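
The difference between static concatenation and input-conditioned multiplicative gating can be shown in a few lines. This is a minimal sketch under stated assumptions: one scalar sigmoid gate per stream computed from the concatenated features, with hypothetical names `gated_fusion`, `gate_weights`, and `gate_bias`. The abstract does not specify the gate network's architecture or whether gates are scalar or per-dimension.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(streams, gate_weights, gate_bias):
    """Fuse feature streams with one input-dependent gate per stream.

    streams: list of feature vectors (e.g. NSS, SigLIP, CLIP-H features)
    gate_weights[i]: weights over the concatenated input for stream i
    gate_bias[i]: bias of gate i
    Returns (fused vector, list of gate values in (0, 1)).
    """
    concat = [x for s in streams for x in s]
    gates = [sigmoid(dot_w + b)
             for dot_w, b in ((sum(w * x for w, x in zip(gw, concat)), b)
                              for gw, b in zip(gate_weights, gate_bias))]
    # Each stream is scaled by its gate before concatenation.
    fused = [g * x for g, s in zip(gates, streams) for x in s]
    return fused, gates
```

With all gate weights fixed at zero the gates are constant and the scheme reduces to a rescaled static concatenation; the gates only become informative once their weights depend on the input, which is the property the abstract's correlation analysis examines.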

2606.02000 2026-06-02 cs.CV cs.AI eess.IV 版本更新

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

迈向3D感知视频扩散模型:基于网格标记化的无渲染人体运动控制

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang

发表机构 * DAMO Academy, Alibaba Group(阿里巴巴集团达摩院) Hupan Lab(湖畔实验室) Zhejiang University(浙江大学) INSAIT

AI总结 提出一种无渲染框架,通过压缩的3D人体网格标记直接条件化视频生成,实现精确的人体运动控制,减少2D引导伪影并提升3D结构建模能力。

Comments Project page: https://jingyunliang.github.io/MeshToken/

详情
AI中文摘要

扩散模型在视频生成方面取得了显著成功。然而,这类模型是否真正感知视觉观察背后的3D结构,而不仅仅是生成合理的2D投影,仍是一个开放问题。本文通过人体运动控制这一任务来探究该问题,该任务需要对人体3D几何、运动、相机视角和场景上下文进行精确建模。与依赖渲染的2D运动引导视频的先前方法不同,我们提出了一种无渲染框架,直接基于压缩的3D人体网格标记条件化视频生成。该表示保留了完整的3D几何信息,同时实现了统一的基于标记的生成流程,在DiT架构中联合处理视频标记和运动标记。这种设计要求模型在视频生成过程中联合推理外观、3D结构和相机视角。实验结果表明,该方法在人体运动控制基准上表现强劲,同时减少了由视角依赖的2D引导和编辑过程中轨迹-姿态不匹配引起的伪影。这些发现表明,配备网格标记化的视频扩散模型能够更好地捕捉复杂的3D人体结构及其与周围环境的交互。

英文摘要

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.

2606.01992 2026-06-02 cs.CV cs.AI cs.LG 版本更新

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

文本引导异常检测的结构化基准:当语言停止条件化决策时

Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

发表机构 * Politecnico di Milano, AIRLab(米兰理工学院,AIRLab) S&H – Software & Hardware(S&H – 软件与硬件)

AI总结 提出结构化基准TGAD,通过三个场景逐步增加语言功能角色,评估多模态异常检测系统的文本引导能力,发现当前系统仅表面受语言条件化,标准基准高估了其能力。

详情
AI中文摘要

工业异常检测历来是单模态任务。最近的多模态视觉-语言模型产生了接受文本输入和图像的系统,并被呈现为支持文本引导的零样本和少样本检测。然而,这些方法使用继承自单模态基准的协议进行评估,这些协议保持文本条件不变,因此无法衡量语言是否条件化决策;报告的性能提升是否反映文本引导或强大的预训练视觉特征仍是开放问题。我们引入文本引导异常检测(TGAD),这是一个结构化基准,通过三个场景逐步增加语言的功能角色:MVTec AD上的受控提示敏感性设置;MVTec AD的组件标记扩展,要求模型将其评估限制在指定部件;以及新的组装面板数据集(APD),这是一个需要缺陷类型和组件位置知识的现实工业场景。我们评估每个范式的代表性模型:生成式大视觉-语言、无训练判别式和嵌入自适应判别式。在所有三个模型中,文本接口仅表面条件化决策:除非移除对象名词,否则提示内容被吸收(生成模型的I-AUROC从97.4降至82.6);一旦指令部件外的缺陷被视为正常,组件级指令不约束决策(从90.3降至66.3);当两者在APD上结合时,图像级判别崩溃至MVTec水平以下,一种情况低于随机水平(71.2、50.5、31.5)。这些结果表明,标准基准夸大了当前多模态异常检测系统的文本引导能力,并且此类协议是能够通过语言可靠控制以用于工业部署的模型的先决条件。

英文摘要

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.
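The "below chance" reading of the 31.5 figure follows from how image-level AUROC is defined: it is the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one, so 50 corresponds to random ranking and anything lower means the ranking is partly inverted. A minimal sketch of that definition, using made-up toy scores rather than any data from the paper:

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (anomalous, normal) pairs ranked correctly.

    labels: 1 = anomalous, 0 = normal. Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not results from the benchmark).
labels = [1, 1, 0, 0]
perfect = auroc([0.9, 0.8, 0.2, 0.1], labels)   # anomalies ranked on top
inverted = auroc([0.1, 0.2, 0.8, 0.9], labels)  # anomalies ranked at the bottom
constant = auroc([0.5, 0.5, 0.5, 0.5], labels)  # uninformative scores
```

Multiplying the returned value by 100 gives the percentage scale used in the abstract.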

2606.01985 2026-06-02 cs.CV 版本更新

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

MT-EditFlow:基于流匹配的多轮图像编辑强化学习

Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu

发表机构 * Apple(苹果公司) University of California, Los Angeles(加州大学洛杉矶分校) University of Texas at Austin(德克萨斯大学奥斯汀分校) Lambda, Inc(Lambda公司)

AI总结 提出MT-EditFlow框架,通过流匹配强化学习优化多轮图像编辑的奖励信号,解决单轮编辑模型在多轮交互中的失败和误差传播问题,显著提升多轮编辑性能。

详情
AI中文摘要

近年来,基于指令的图像编辑取得了重大突破,模型现在能够处理现实世界中的编辑需求,满足日常用户的实用性要求。然而,主要为单轮编辑训练的编辑模型在多轮编辑中常常失败——在这种自然的交互设置中,用户基于模型自身之前的输出迭代地细化图像。这种失败源于“全有或全无”的要求,即单次失败会破坏整个序列,以及误差传播,即暴露偏差导致编辑误差累积。为了解决这些挑战,我们引入了MT-EditFlow,一个流匹配强化学习框架,旨在优化序列图像编辑的奖励信号。MT-EditFlow整合了多轮视角和多奖励公式,为基于GRPO和NFT的强化学习方法提供了统一的结构。我们通过研究有效的轮次级聚合评分策略、权衡奖励偏差与方差的VLM推理模式以及防止奖励破解的优势融合级别,系统地分析和优化了奖励信号。我们的发现表明,将聚合优势广播到整个编辑轨迹中,有效地弥合了局部规划与全局多轮任务成功之间的差距。大量实验表明,MT-EditFlow在多种基础模型上显著提升了性能。值得注意的是,它在FLUX.1-Kontext-dev上将第3轮整体性能提升了6.85分,超越了Qwen-Image-Edit等最先进的开源模型。通过保持高边际成功率和减少暴露偏差,MT-EditFlow为视觉内容创作中更可靠、更自然的人机协作奠定了基础。

英文摘要

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing, the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
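The idea of broadcasting one aggregated advantage over a whole editing trajectory can be illustrated with a small sketch. This is not the authors' implementation: it assumes a GRPO-style group normalization (reward minus group mean, divided by group standard deviation), and the mean over turn scores is a placeholder for the aggregation strategies the paper compares. The names `trajectory_reward` and `broadcast_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def trajectory_reward(turn_scores):
    # Assumption: aggregate turn-level scores with a plain mean.
    return mean(turn_scores)


def broadcast_advantages(group_turn_scores, eps=1e-8):
    """Return one advantage per turn, identical within each trajectory.

    group_turn_scores: list of trajectories sampled for the same prompt,
    each a list of per-turn reward scores.
    """
    rewards = [trajectory_reward(t) for t in group_turn_scores]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Group-normalized advantage per trajectory (GRPO-style assumption).
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Broadcast: every turn of a trajectory receives the same value.
    return [[a] * len(t) for a, t in zip(advantages, group_turn_scores)]


# Three hypothetical 3-turn trajectories for one editing task.
group = [[1.0, 1.0, 1.0], [1.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
adv = broadcast_advantages(group)
```

In this toy group, the trajectory that succeeds on every turn gets a positive advantage on all three turns, while the one that fails throughout gets a negative advantage on all three, including its first turn.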

2606.01981 2026-06-02 cs.CV 版本更新

Generalization Limits in Vehicle Re-Identification

车辆再识别中的泛化极限

Anis Yassine Ben Mabrouk, Antoine Tadros, Rafael Grompone von Gioi, Gabriele Facciolo, Axel Davy, Rodrigo Verschae

AI总结 针对车辆再识别任务中模型对未见车辆类型泛化能力差的问题,提出了一种新的评估方法,并通过视角分割分析揭示了现有方法在视角鲁棒性和细节关注上的局限性。

详情
AI中文摘要

车辆再识别关注于根据查询图像从图库中检索同一车辆的图像。通过仔细检查常用数据集,我们观察到视觉差异很小的车辆——例如相同的品牌、型号和颜色——同时出现在训练集和测试集中。因此,有效记忆训练数据的方法在这些测试集上表现良好,但难以泛化到其他数据集。在本文中,我们通过提出一种新的评估方法来解决这个问题,该方法能更有效地衡量对未见车辆类型的泛化能力。为了进一步研究泛化性能,我们还提出基于视角进行分割评估,从而区分视角鲁棒性与同视角再识别的影响。我们的发现表明,大多数最先进的方法在处理未见车辆类型时存在困难,并且它们对视角变化的鲁棒性和对细节的关注仅限于训练中见过的车辆类型。

英文摘要

Vehicle re-identification focuses on retrieving images of the same vehicle from a gallery given a query image. Upon closer inspection of commonly used datasets, we observe that vehicles with few visual differences (e.g., the same make, model, and color) appear in both the training and test sets. As a result, methods that effectively memorize the training data tend to perform well on these test sets but struggle to generalize to other datasets. In this paper, we address this issue by proposing a novel evaluation approach that more effectively measures generalization capability to unseen vehicle types. To further study generalization performance, we also propose splitting the evaluation based on view, allowing us to differentiate the effect of viewpoint robustness from that of same-view re-identification. Our findings reveal that most state-of-the-art methods struggle with unseen vehicle types, and that their robustness to viewpoint changes and attention to detail are limited to vehicle types seen during training.

2606.01973 2026-06-02 cs.LG cs.CV 版本更新

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

开放集测试时自适应中分布内与分布外准确率的深入分析

Zefeng Li, Evan Shelhamer

发表机构 * University of British Columbia and Vector Institute(不列颠哥伦比亚大学和向量研究所)

AI总结 本文通过基准测试和提出新基线,揭示了当前开放集测试时自适应方法在平衡分布内准确率和分布外检测能力上的不足。

Comments TMLR 2026

详情
AI中文摘要

开放集测试时自适应(TTA)在存在输入偏移和未知输出类别的情况下更新模型。尽管近期方法在提高已知类别的分布内(InD)准确率方面取得了进展,但它们准确检测分布外(OOD)未知类别的能力仍未得到充分探索。我们在小规模CIFAR-10-C和大规模ImageNet-C的标准损坏基准上,对鲁棒和开放集TTA方法(SAR、OSTTA、UniEnt和SoTTA)进行了基准测试。对于CIFAR-10-C,我们使用来自SVHN和CIFAR-100的OOD数据,分别对应其损坏形式SVHN-C和CIFAR-100-C。对于ImageNet-C,我们使用来自ImageNet-O和Textures的OOD数据,分别对应其损坏形式ImageNet-O-C和Textures-C。ImageNet-O更接近ImageNet,包含未知但相关的物体类别(如食物类的“蒜香面包”与“热狗”,基础设施类的“高速公路”与“水坝”),而Textures则远离ImageNet,包含非物体图案(如“开裂”的泥土、“多孔”的海绵、“带叶脉”的树叶)。我们评估了TTA方法在CIFAR-10-C和ImageNet-C上对InD与OOD识别的准确率和置信度。我们在CIFAR-10-C上验证了每种方法自身OOD检测技术的准确率。我们还在ImageNet-C上进行了评估,并报告了准确率和标准OOD检测指标。我们进一步考察了更现实的设置,其中OOD数据的比例和速率可以变化。为了探索InD识别与OOD拒绝之间的权衡,我们提出了一种新的基线,将softmax/多类输出替换为sigmoid/多标签输出。我们的分析首次表明,当前的开放集TTA方法难以平衡InD和OOD准确率,并且它们仅能不完全地过滤OOD数据以进行自身的自适应更新。

英文摘要

Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like "garlic bread" vs. "hot dog" for food, or "highway" vs. "dam" for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like "cracked" mud, "porous" sponge, "veined" leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
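Why swapping softmax for sigmoid outputs can matter for OOD rejection is easy to see on a toy example. Softmax probabilities must sum to one, so an input for which every class logit is low can still receive a confident prediction; independent sigmoid scores can all stay low, which leaves room to reject. The sketch below is only an illustration of that general property: the 0.5 threshold and the helper `predict_or_reject` are assumptions, not details of the baseline in the paper.

```python
import math


def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def predict_or_reject(logits, threshold=0.5):
    """Return the predicted class index, or None to reject the input as OOD.

    Assumption: reject when no per-class sigmoid score reaches the threshold.
    """
    scores = sigmoid(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


ood_logits = [-6.0, -6.0, -3.0]   # hypothetical: no class fits the input
ind_logits = [-6.0, 4.0, -6.0]    # hypothetical: class 1 clearly fits

softmax_conf = max(softmax(ood_logits))      # high despite weak evidence
sigmoid_conf = max(sigmoid(ood_logits))      # stays low
ood_decision = predict_or_reject(ood_logits)
ind_decision = predict_or_reject(ind_logits)
```

Here the softmax output assigns more than 0.9 to the third class for the OOD-like logits, while the largest sigmoid score stays below 0.05, so the sigmoid rule rejects the input and still accepts the in-distribution one.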

2606.01955 2026-06-02 cs.RO cs.CV 版本更新

WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM:在事件关节处雕刻世界动作建模

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

发表机构 * X Square Robot Team(X Square机器人团队)

AI总结 提出WALL-WM世界动作模型,通过事件级视觉-语言-动作预训练解决固定长度动作块与语言、视觉、动作之间的粒度不匹配问题,实现跨语言、场景和任务的泛化,在大规模真实世界评估中达到最先进性能。

详情
AI中文摘要

WALL-WM是一种世界动作模型,它将视频-动作学习从以块为中心的优化转变为以事件为基础的视觉-语言-动作预训练,使用语义连贯的动作事件作为学习的基本单元。现有的WAM通常从多模态或视频基础模型初始化,然后直接基于当前观测和指令优化固定长度的动作块。尽管方便,但这种以块为中心的公式造成了基本的粒度不匹配。语言描述语义目标和事件,视觉通过连续场景动态演变,动作在控制级时间尺度上运行;将三者强制纳入相同的固定长度预测窗口,使得VLA训练变成短视的相关性拟合。WALL-WM通过围绕语义事件组织监督和数据来解决这种不匹配。具体来说,它将基于事件的VLA预训练与由事件级标题和聚类平衡采样构建的数据生态系统配对,从而实现对多样化行为、场景和任务结构的可扩展学习。从相同的事件预训练骨干出发,WALL-WM支持两种互补的推理模式。事件模式消耗下一事件描述并实现可变长度的执行块,而统一模式使用带有阶梯式解码的VLM来调节传统的固定长度块推理,同时保留梯度连续的VLA路径。结合基于Muon优化器的大规模预训练基础设施,WALL-WM为通用WAM提供了实用的规模化方案。实验表明,WALL-WM在语言、场景和任务上广泛泛化,在大规模真实世界泛化评估中达到了最先进的性能。

英文摘要

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

2606.01950 2026-06-02 cs.RO cs.CV cs.LG 版本更新

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

面向刚性物体的学习动作条件与对象中心高斯溅射世界模型

Jens U. Kreber, Lukas Mack, Joerg Stueckler

发表机构 * Intelligent Perception in Technical Systems Group(技术系统智能感知组)

AI总结 提出MRO-GWM模型,通过对象中心高斯表示和时空变换器架构,学习刚性物体在3D中的动作条件动力学,支持多物体场景和部分观测下的未来运动预测。

详情
AI中文摘要

世界模型使智能体能够预测其动作对环境的影响。在本文中,我们提出了多刚性物体高斯世界模型(MRO-GWM),一种学习刚性物体在3D中动作条件动力学的新模型。通过用对象中心高斯表示场景,我们可以表示任意物体形状和多物体场景。我们开发了一种新颖的时空变换器架构,该架构根据物体高斯的历史和未来动作预测未来的刚体运动。物体通过其在规范坐标系中的高斯表示,从而可以将物体运动描述为刚体变换。我们的模型在多视角重建上进行训练,这要求模型处理因遮挡导致的物体部分观测。我们分析了该方法在由典型家庭物体组成的合成数据集上的预测性能,这些数据集包含多物体动力学和机器人末端执行器的交互。我们还在模拟中评估了模型在非抓取操作中的模型预测控制性能。

英文摘要

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.
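Describing object motion as a rigid body transformation of canonical-frame Gaussians amounts to rotating and translating each Gaussian mean and rotating each covariance. The following sketch shows that standard operation for a single Gaussian; it is not the MRO-GWM architecture, and the function names are hypothetical.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix for an angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def transform_gaussian(mean, cov, rotation, translation):
    """Apply x -> R x + t to a Gaussian: mean' = R mean + t, cov' = R cov R^T."""
    new_mean = [sum(rotation[i][k] * mean[k] for k in range(3)) + translation[i]
                for i in range(3)]
    new_cov = matmul(matmul(rotation, cov), transpose(rotation))
    return new_mean, new_cov


# One canonical-frame Gaussian, elongated along x (toy values).
canonical_mean = [1.0, 0.0, 0.0]
canonical_cov = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Predicted object pose: quarter turn about z, then lift by 2 along z.
world_mean, world_cov = transform_gaussian(
    canonical_mean, canonical_cov, rot_z(math.pi / 2), [0.0, 0.0, 2.0])
```

After the quarter turn, the mean sits at approximately (0, 1, 2) and the long axis of the Gaussian points along y. Because every Gaussian of an object shares the same rotation and translation, a motion model only needs to predict one pose per object and time step.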

2606.01947 2026-06-02 cs.CV cs.AI 版本更新

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

大型预训练模型在实例分割任务中的参数高效微调

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 本研究针对实例分割任务,探索了适配器和低秩适应(LoRA)两种参数高效微调方法,在仅微调约1-6%参数的情况下取得竞争性能,并发现每个Transformer块使用2-3个适配器可达到性能与效率的最佳平衡。

Comments Published by the Machine Learning and Knowledge Extraction Journal

详情
Journal ref
Abou Baker N, Rohrschneider D, Handmann U. Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks. Machine Learning and Knowledge Extraction. 2024; 6(4):2783-2807
AI中文摘要

近年来,随着大型预训练模型的兴起,人工智能的研究和应用发生了转变,这些模型在众多任务中取得了最先进的结果。然而,参数的大量增加引入了对参数高效训练策略的需求。尽管取得了显著进展,但针对基于Transformer的模型在实例分割任务中的参数高效微调(PEFT)方法的研究仍然有限。为填补这一空白,本研究调查了PEFT方法的有效性,特别是适配器和低秩适应(LoRA),并将其应用于两个模型和四个基准数据集。通过集成顺序排列的适配器模块并将LoRA应用于可变形注意力(本文首次探索),在仅微调约1-6%模型参数的情况下取得了竞争性能,相比传统微调所需的40-55%有显著改进。关键发现表明,每个Transformer块使用2-3个适配器可实现性能与效率的最佳平衡。此外,LoRA在应用于可变形注意力时表现出强大的参数效率,并在某些情况下超越了适配器配置。这些结果表明,PEFT技术的影响因数据集复杂性和模型架构而异,强调了上下文特定调优的重要性。总体而言,这项工作展示了PEFT在实例分割任务中实现可扩展、可定制且计算高效的迁移学习的潜力。

英文摘要

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention (explored here for the first time) achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.
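The parameter savings of LoRA come from replacing a full weight update with the product of two thin matrices. The sketch below shows the generic LoRA formulation on a single linear layer, y = W x + (alpha / r) B A x, with W frozen; it does not reproduce the paper's deformable-attention integration, and the per-layer fraction it computes is only an illustration, not the whole-model 1-6% figure quoted above. All names and sizes are hypothetical.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Share of parameters trained when a d_out x d_in weight gets a rank-r update."""
    full = d_out * d_in                # parameters in the frozen weight W
    low_rank = rank * (d_out + d_in)   # parameters in B (d_out x r) and A (r x d_in)
    return low_rank / full


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy identity)
A = [[0.5, 0.5]]               # rank 1, shape 1 x 2
B = [[0.0], [0.0]]             # zero-initialized, shape 2 x 1
y = lora_forward([2.0, 3.0], W, A, B, alpha=1.0, rank=1)

fraction = lora_trainable_fraction(256, 256, 4)
```

With B initialized to zero the adapted layer reproduces the pretrained output exactly, so fine-tuning starts from the original model. For a 256 x 256 weight and rank 4, the trainable share of that layer is 2048 / 65536, about 3.1%.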

2606.01945 2026-06-02 cs.CV 版本更新

Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization

超越低秩:通过脉冲神经网络和提示分解实现低秩稀疏提示

Yumiao Zhao, Bo Jiang, Beibei Wang, Xixi Wan, Xiao Wang, Jin Tang

发表机构 * Information Materials and Intelligent Sensing Laboratory of Anhui Province(安徽省信息材料与智能感知实验室) Anhui Provincial Key Laboratory of Multimodal Cognitive Computation(安徽省多模态认知计算重点实验室) School of Computer Science and Technology, Anhui University(安徽大学计算机科学与技术学院)

AI总结 提出LoRSP框架,利用脉冲神经元的稀疏发放机制和低秩分解,生成实例特定的稀疏视觉提示,实现高效且鲁棒的视觉提示学习。

详情
AI中文摘要

视觉提示(VP)已成为一种高效范式,通过在输入层引入可学习提示来适应大规模预训练视觉模型到下游任务。然而,现有的VP方法通常采用密集的像素级提示,往往存在冗余扰动、泛化能力有限和能效低的问题。为克服这些限制,我们提出将脑启发脉冲学习融入视觉提示学习任务。我们知道,脉冲神经元可以通过将输入数据转换为离散脉冲序列并返回稀疏输出来进行低成本信息处理。受此启发,我们提出低秩视觉脉冲提示(LoRSP),一种新颖框架,通过脉冲神经元学习机制自然地学习动态低秩稀疏视觉提示。LoRSP的核心思想是利用脉冲神经元的脑启发稀疏发放机制为每个实例生成像素级稀疏提示。具体而言,我们首先通过低秩分解构建一系列提示因子以捕获不同的提示子空间。然后将这些提示因子输入SNN架构,执行整合-发放过程以发射脉冲。因此,我们的LoRSP在保持低秩约束的同时生成稀疏视觉提示。这种设计实现了实例特定的选择性提示,从而在多样化的下游任务中实现更紧凑和鲁棒的适应。在五个异构视觉骨干网络和多个基准上的大量实验表明,与现有VP方法相比,LoRSP在需要更少可调参数的情况下实现了具有竞争力的性能。

英文摘要

Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization, and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. It is well known that a spiking neuron can perform inexpensive information processing by converting the input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. Specifically, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters compared to existing VP methods.
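How an integrate-and-fire step can turn a dense low-rank prompt into a sparse one can be sketched on a tiny example. Everything below is a simplified assumption for illustration, not the LoRSP architecture: the prompt is built as a product of two factor matrices, each pixel drives its own neuron with a constant input equal to the prompt magnitude, the neuron uses a hard reset, and a pixel keeps its prompt value only if its neuron fires at least once.

```python
def low_rank_prompt(U, V):
    """Dense prompt P = U V^T, with U of shape H x r and V of shape W x r."""
    return [[sum(u * v for u, v in zip(U[i], V[j])) for j in range(len(V))]
            for i in range(len(U))]


def integrate_and_fire(current, steps=4, threshold=1.0):
    """Count spikes of one neuron driven by a constant input current."""
    potential, spikes = 0.0, 0
    for _ in range(steps):
        potential += current           # integrate the input
        if potential >= threshold:     # fire when the threshold is reached
            spikes += 1
            potential = 0.0            # hard reset after a spike
    return spikes


def sparse_prompt(U, V, steps=4, threshold=1.0):
    """Keep a prompt pixel only where its neuron fired (assumed gating rule)."""
    P = low_rank_prompt(U, V)
    return [[p if integrate_and_fire(abs(p), steps, threshold) > 0 else 0.0
             for p in row] for row in P]


# Rank-1 toy factors for a 2 x 2 prompt (hypothetical values).
U = [[1.0], [0.1]]
V = [[0.5], [0.2]]
dense = low_rank_prompt(U, V)
sparse = sparse_prompt(U, V)
```

In this toy case only the strongest pixel accumulates enough potential within four steps to fire, so three of the four prompt entries are zeroed while the number of learnable values stays at the size of the two factors.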

2606.01940 2026-06-02 cs.CV 版本更新

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

SCAPO: 从单次3D观测中自监督学习类别级关节物体姿态估计

Can Zhang, Gim Hee Lee

发表机构 * Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系)

AI总结 提出SCAPO框架,通过自监督方式从单张RGB-D图像中估计关节物体的规范几何、刚性部件分割和关节参数,无需真实标签或类别特定模型。

详情
AI中文摘要

现有的从单次3D观测中估计类别级物体关节的方法通常依赖密集监督、多帧输入或CAD模板,并且仍然难以从关节中解耦几何或恢复显式关节参数。我们提出SCAPO,一个自监督框架,从单张RGB-D观测中估计规范几何、刚性部件分割以及关节枢轴、轴和关节状态,无需真实标签或类别特定模型。我们的SCAPO首先使用SE(3)-等变向量神经元自编码器来分解全局姿态并将不同实例对齐到共享规范空间。在此对齐形状上,设计了一个关节感知的混合蒙皮模块来建模部件运动。我们通过观测形状和规范形状之间的循环重建以及可学习规范模板的跨空间对齐来学习这种表示,该模板将共享类别几何与实例特定残差形状解耦。在合成和真实关节物体数据集上的实验表明,我们的SCAPO恢复了一致的部件结构和准确的关节参数,并优于所有自监督基线。

英文摘要

Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.

2606.01939 2026-06-02 cs.CV 版本更新

SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video

SAVMap: 基于结构辅助的全景视频大规模2.5D曼哈顿线框视觉映射

Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng

发表机构 * Nokia Bell Labs(诺基亚贝尔实验室) NYU(纽约大学)

AI总结 提出SAVMap方法,利用全景视频和语义分割网络,结合曼哈顿网格几何约束,从仓库场景生成语义线框地图,实现高精度大规模3D重建。

Comments IEEE ICRA 2026

详情
AI中文摘要

工业环境的精确3D表示能够支持机器人定位和数字孪生生成等任务。我们提出SAVMap,一种仅使用全景视频相机作为传感器输入,生成仓库货架和灯光结构语义线框地图的方法。从沿仓库通道拍摄的全景视频中提取一系列带有货架和天花板视角的校正图像。通过语义分割网络前端,从每张图像中提取一组稀疏的语义结构特征点(例如货架结构的角点、灯光的中心),并在序列中跟踪这些点。通过考虑点之间的真实世界几何关系(如曼哈顿网格),一种受约束的运动恢复结构算法生成构成线框地图的3D点。我们在一个拥有46排货架的仓库中展示了我们方法的可扩展性和准确性,每排货架的面尺寸为55米×7米。从一小时的视频内容中,我们为超过5000个货架元素创建了线框地图,与真实值相比,总体平均绝对误差为4.8厘米。

英文摘要

Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground-truth.

2606.01933 2026-06-02 cs.CV 版本更新

3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

CVPR 2026 CASTLE挑战赛第三名:基于层次化知识图谱检索的智能多视角长视频理解

Raghad Albusayes, Munirah Alyahya

发表机构 * TAHAKOM(塔哈科姆)

AI总结 提出一种免训练的智能框架,通过视频知识图谱和层次化检索索引,解决大规模多视角视频中的复杂时空推理问题,在CASTLE挑战赛中获得第三名。

详情
AI中文摘要

本文介绍了我们在CVPR 2026 EgoVis研讨会举办的CASTLE 2026挑战赛中的获胜方法,我们的团队在全球获得了第三名。该挑战要求参与者在海量多模态视频流中回答高度复杂的视觉、时空和语言问题,包括视觉计数、动作定位、多视角跟踪和说话者时间推理。底层数据集包含由15个自我和外部摄像头源捕获的超过600小时的同步视频。为了应对这种极端规模和长上下文的需求,我们引入了一种无需训练的智能框架,专门针对长视频理解进行了优化。我们的框架引入了两个核心架构组件:i) 视频知识图谱,映射静态和动态实体、它们的时间关系以及交叉事件,以实现多跳关系推理;ii) 自适应智能工作流,通过层次化检索和索引解决复杂查询。实验结果表明,我们的框架在长上下文多视角流上实现了高零样本推理精度。我们的代码将在https://github.com/RaghadKhaled/CASTLE-Challenge-Framework发布。

英文摘要

This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking, and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours of synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.

2606.01920 2026-06-02 cs.CV 版本更新

Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement

Pool-Select-Refine: 基于软标签引导潜在精化的分配感知生成式数据集蒸馏

Wenmin Li, Shunsuke Sakai, Zhongkai Zhao, Tatsuhito Hasegawa

发表机构 * Graduate School of Engineering, University of Fukui(福井大学工学研究科) College of Computer Science and Artificial Intelligence, Southwest Minzu University(西南民族大学计算机科学与人工智能学院)

AI总结 提出Pool-Select-Refine两阶段框架,通过过完备候选池选择与软标签引导潜在精化解耦生成、选择和精化,提升扩散模型数据集蒸馏的预算利用效率。

详情
AI中文摘要

基于扩散的数据集蒸馏最近作为一种有前景的范式出现,用于将大规模数据集压缩为紧凑的合成集。通过利用预训练的生成先验,这些方法可以比传统的基于匹配的方法更高效地生成逼真的类别条件样本。然而,大多数现有的基于扩散的方法仍然采用僵化的“生成即用”策略,其中生成的样本在固定的每类图像预算下直接被视为最终的蒸馏集。这种设计将候选生成与最终预算分配紧密耦合,可能导致有限预算的冗余浪费或信息不足的样本。在本文中,我们提出“Pool-Select-Refine”,一个用于分配感知生成式数据集蒸馏的两阶段框架。首先,我们不直接使用固定数量的生成样本,而是构建一个过完备的候选池,并在目标预算下选择一个紧凑的子集。其次,我们使用从教师模型导出的软标签监督在潜在空间中精化所选样本,提高语义对齐同时保留生成先验。这种设计明确地将生成、选择和精化解耦,从而更有效地利用蒸馏预算。在大规模和细粒度图像分类基准上的实验表明,所提出的框架在基于扩散的基线上取得了一致的改进。结果表明,在精化之前引入一个筛选阶段是改进基于扩散的数据集蒸馏的一种简单而有效的方法。

英文摘要

Diffusion-based dataset distillation has recently emerged as a promising paradigm for condensing large-scale datasets into compact synthetic sets. By leveraging pretrained generative priors, these methods can produce realistic class-conditional samples more efficiently than traditional matching-based approaches. However, most existing diffusion-based methods still adopt a rigid "Generate-and-Use" strategy, where the generated samples are directly treated as the final distilled set under a fixed images-per-class budget. Such a design tightly couples candidate generation with final budget allocation, which may result in redundant waste of the limited budget or insufficiently informative samples. In this paper, we propose "Pool-Select-Refine", a two-stage framework for allocation-aware generative dataset distillation. First, instead of directly using a fixed number of generated samples, we construct an over-complete candidate pool and select a compact subset under the target budget. Second, we refine the selected samples in latent space using soft-label supervision derived from the teacher model, improving semantic alignment while preserving the generative prior. This design explicitly decouples generation, selection, and refinement, enabling more effective use of the distillation budget. Experiments on large-scale and fine-grained image classification benchmarks show that the proposed framework delivers consistent gains over diffusion-based baselines. The results suggest that introducing a curation stage before refinement is a simple yet effective way to improve diffusion-based dataset distillation.

2606.01914 2026-06-02 cs.CL cs.CV 版本更新

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

多模态大语言模型空间推理中空间词汇偏差的机制诊断

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

发表机构 * Kyoto University(京都大学) NII LLMC(日本国立情报学研究所大语言模型研究开发中心) RIKEN AIP(日本理化学研究所革新智能统合研究中心) Case Western Reserve University(凯斯西储大学) The Hong Kong Polytechnic University(香港理工大学) The University of Osaka(大阪大学) University of Tokyo(东京大学)

AI总结 本文发现多模态大语言模型存在空间词汇偏差,即添加空间关系词会吸引模型选择该选项,并通过机制可解释性工具揭示偏差主要源于语言侧而非视觉侧,最后提出轻量级LLM-only DPO更新可有效缓解偏差。

详情
AI中文摘要

多模态大语言模型(MLLMs)在空间多项选择题上仍不可靠,其失败常归因于视觉信息关注不足。本文识别了一种互补的失败模式,即空间词汇偏差:向答案选项添加空间关系词会吸引模型决策,使新添加的选项更可能被选中。使用九个开放权重的MLLMs,我们证明该现象广泛存在。特别地,模型能正确回答二元空间问题,但一旦向答案集添加第三个空间选项,模型便持续选择错误的第三选项。我们将这种二元稳定但三元脆弱的案例隔离为诊断示例,并利用机制可解释性工具,揭示失败的很大一部分源于语言侧而非视觉侧:视觉注意力分析和残差流探针表明,在这些失败中,正确的空间关系在内部仍然可用,而无关选项对照、激活修补和稀疏组件干预将偏差追溯到特定的LLM侧通道和神经元。基于此发现,我们证明在微小的单对象对合成数据上进行轻量级仅LLM的DPO更新可缓解偏差,在合成数据上将四路鲁棒准确率提升高达100个百分点,在更广泛的评估数据集WhatsUp、SpatialMQA-Direct和VSR上分别提升68.0、32.6和20.1个百分点。

英文摘要

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show that the correct spatial relation remains internally available in these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR, respectively.

2606.01911 2026-06-02 cs.CV 版本更新

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

残差解码器适配器:用于自回归文本渲染的身份保持分词器适配

Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan

发表机构 * Central South University(中南大学) University of Oxford(牛津大学) Microsoft Research(微软研究院)

AI总结 提出残差解码器适配器(RDA),通过引入配对码本和平行分支学习像素空间残差,在不重新训练分词器和自回归模型的情况下显著提升文本渲染性能。

Comments CVPR 2026 poster

详情
AI中文摘要

视觉自回归(AR)模型通过预测由视觉分词器解码的离散标记来生成图像。尽管展示了强大的整体图像生成能力,但在文本渲染方面仍表现不佳,出现模糊笔画和破坏字母形状。在这项工作中,我们将这一限制追溯到视觉分词器,它难以重建细粒度细节。改进分词器直接但昂贵,因为它需要重新训练分词器和AR模型。我们能否在不重新训练现有分词器和AR模型的情况下提高AR模型的文本渲染性能?为实现这一目标,我们提出了残差解码器适配器(RDA),它在不改变标记空间的情况下事后升级现有分词器。具体来说,它通过引入两个新颖组件来细化视觉分词器的解码器输出:(i)一个与原始标记分布共享的配对码本;(ii)一个并行分支,用于学习像素空间中重建图像与真实图像之间的微小差异(残差)。这种残差设计使我们能够非侵入性地增强分词器,同时保持与先前AR模型的兼容性。RDA大幅提升了文本渲染性能。例如,在具有竞争力的TextAtlas基准测试上,我们使微调后的Janus-Pro OCR准确率从24.52%提高到58.26%(TextVisionBlend),从12.75%提高到36.81%(StyledTextSynth)。代码可在https://github.com/CSU-JPG/RDA获取。

英文摘要

Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and disrupted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post-hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the tiny differences (residual) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA substantially improves text rendering. For instance, on the competitive TextAtlas benchmark, we boost the OCR accuracy of fine-tuned Janus-Pro from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth). The code is available at https://github.com/CSU-JPG/RDA.

2606.01910 2026-06-02 cs.GR cs.CV 版本更新

Single-Line Drawing Generation via Semantics-Driven Optimization

基于语义驱动的单线图生成

Tanguy Magne, Alexandre Binninger, Ruben Wiersma, Olga Sorkine-Hornung

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种基于语义驱动的方法,通过文本提示或输入图像自动生成矢量格式的单线图,利用分数蒸馏采样优化均匀有理B样条曲线参数,并引入额外损失项控制艺术风格,生成结果优于现有方法且支持下游制造。

Comments 18 pages, published in Computer Graphics Forum 2026

详情
AI中文摘要

线条画是一种高度表现力的艺术形式,要求艺术家抽象和提炼其主题的本质。我们提出了第一种语义驱动的方法,用于自动生成矢量格式的单线图,该方法可由描述概念的文本提示或描绘概念的输入图像引导。我们的方法利用分数蒸馏采样来优化均匀有理B样条(URBS)曲线的参数,确保绘图由单一连续笔画组成。这种表示提供了对细节水平的精细控制,而额外的损失项使我们能够引导最终的艺术风格。我们证明,我们的方法在此任务上优于最先进的文本到图像模型和优化流程,产生的结果在美学上更令人愉悦,并且更忠实于连续线条画艺术家的风格。此外,由于我们的方法生成矢量化的曲线,它直接支持下游制造过程,如刺绣、激光雕刻和弯线。我们的代码和结果可在 https://github.com/tanguymagne/SLDgen 获取。

英文摘要

Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve, ensuring that the drawing consists of a single continuous stroke by design. This representation provides fine-grained control over the level of detail, while additional loss terms allow us to steer the final artistic style. We demonstrate that our method outperforms state-of-the-art text-to-image models and optimization pipelines for this task, producing results that are both more aesthetically pleasing and more faithful to the style of continuous line drawing artists. Furthermore, because our method generates a vectorized curve, it directly supports downstream fabrication processes such as embroidery, laser engraving and wire bending. Our code and results are available at https://github.com/tanguymagne/SLDgen.

2606.01908 2026-06-02 cs.LG cs.CV 版本更新

Private and Stable Test-Time Adaptation with Differential Privacy

具有差分隐私的私有且稳定的测试时自适应

Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出将多种测试时自适应方法转化为差分隐私形式,通过逐样本梯度裁剪和高斯噪声保护测试数据隐私,在ImageNet-C上实现隐私与精度的平衡,并发现裁剪机制能提升连续自适应的准确性和稳定性。

Comments ICML 2026

详情
AI中文摘要

测试时自适应(TTA)可以通过在推理过程中更新模型来减少在新数据上的误差。然而,这些更新引发了关于测试数据隐私的问题,因为模型参数现在依赖于所有过去的输入。为了控制这种隐私风险,我们将多种流行的TTA方法(Tent、EATA、SAR、DeYO和COME)转化为差分隐私(DP)形式,对所有更新应用逐样本梯度裁剪和高斯噪声。在ImageNet-C上,我们的DP-TTA方法在精度损失较小的情况下提供了足够的隐私,并且在低隐私机制下,DP的裁剪机制甚至可以改善连续设置中自适应的准确性和稳定性。这些对隐私和精度的改进仅带来适度的计算开销。这些关于私有TTA的初步结果提高了对该问题的认识,为开发更私密的测试时更新提供了信息,并确定了逐样本裁剪作为提高自适应准确性和稳定性的有效技术。

英文摘要

Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.
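
The two DP ingredients named in the abstract, per-sample gradient clipping and Gaussian noise, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dp_noisy_mean_gradient`, the flat-list gradient layout, and the choice of noise standard deviation `noise_multiplier * clip_norm` (the common DP-SGD convention) are assumptions made for illustration.

```python
import math
import random


def dp_noisy_mean_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Hypothetical helper: clip each per-sample gradient to L2 norm <= clip_norm,
    sum the clipped gradients, add Gaussian noise, and average over the batch."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale factor is 1 for small gradients and clip_norm / norm for large ones.
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Assumed convention: noise std is proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
# With the noise switched off, only the clipping effect is visible.
noiseless = dp_noisy_mean_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_noisy_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

In the noiseless call the first gradient is rescaled to norm 1 while the second is left untouched, so a single outlier sample cannot dominate the update. That bounded influence is one plausible reading of why the abstract reports clipping as stabilizing continual adaptation.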

2606.01901 2026-06-02 cs.CV cs.AI cs.CL 版本更新

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

图像重建游戏:通过迭代多模态对话建立共同基础

Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

发表机构 * Computational Linguistics, Department of Linguistics, University of Potsdam(波茨坦大学语言学系计算语言学部) German Research Center for Artificial Intelligence (DFKI), Berlin(德国人工智能研究中心(DFKI)柏林)

AI总结 提出图像重建游戏基准,通过多轮迭代中视觉语言模型向图像生成器发出纠正指令,使累积的共同基础直接可视化为重建图像,发现描述器是重建质量的主导因素,而生成器决定迭代改进的效果。

详情
AI中文摘要

我们引入了图像重建游戏,这是一个全自动基准测试,其中视觉语言模型在多轮迭代中向图像生成器发出纠正指令,使得累积的共同基础直接可视化为渲染图像。通过对七个图像类别中的两个描述器模型与两个生成器模型进行交叉基准测试,我们发现描述器是重建质量的主导因素,而生成器决定迭代改进是否有益。数学和几何图像构成了最大的挑战。描述器的令牌预算强烈影响收敛性:较短的预算产生更稀疏的初始渲染,有更多可见改进的空间,而较长的预算提高了绝对质量,但留下的修复空间较少。更强的描述器使用更丰富的纠正词汇,涵盖空间、数值和结构类别,而较弱的描述器则集中于表面属性,并且往往在几轮后停止。人工验证表明,最佳自动评判器与人类偏好之间仅达到轻微到中等的一致性,并且自动评分需要人工重新校准才能可靠使用。

英文摘要

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

2606.01896 2026-06-02 cs.CV cs.AI 版本更新

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

训练、测试、重新评估:用于手部检测的生成数据的调度敏感评估

Atmika Bhardwaj, Silvia Vock, Nico Steckhan

发表机构 * Federal Institute for Occupational Safety and Health(联邦职业安全与卫生研究所)

AI总结 本研究通过多阶段训练调度实验,评估生成性图像修补数据对安全关键场景下手部检测性能的影响,发现适当的训练流程能显著提升真实部署效果。

Comments 16 pages, 4 figures

详情
AI中文摘要

生成(或合成)图像数据越来越多地被用于增强或替代真实训练数据集,当目标图像稀缺、昂贵或存在偏差时。在手部检测中,特别是在职业安全设置中,公共数据集大多包含裸手。这低估了手套、纹身、珠宝和其他个人防护装备引入的手部外观变化,造成了安全关键应用在部署时遇到的分布偏移。我们测试生成性修补,即仅编辑真实照片的手部区域以引入配饰,是否能缩小这种偏移差距。在一个由真实图像及其合成对应物组成的配对数据集上,我们在六种训练和调度方案(实验A-F,每种三个随机种子)下训练YOLOv8n手部检测器,在真实测试集和仅真实手套测试子集上评估每个检测器,报告两个重叠阈值(mAP@0.5和mAP@0.5:0.95)下的平均精度(mAP),并进行配对统计检验。一个两阶段实验:在真实+合成数据上训练,然后在较低学习率下仅用真实数据微调得到的权重,与标准真实测试集上的仅真实基线模型相比,提高了mAP@0.5,并改善了真实手套的分布外差距。另一个三阶段实验最好地保持了框的紧密度,达到了研究中任何其他实验的最高mAP@0.5:0.95。合成数据对安全关键手部检测的效用由训练过程决定,简单的多阶段实验从修补的配饰数据中提取了实质性的真实部署收益。

英文摘要

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment (training on real ∪ synthetic data, then fine-tuning the resulting weights on real-only data at a lower learning rate) increases mAP@0.5 compared to the real-only baseline model on the standard real test set and narrows the real-gloves out-of-distribution gap. A three-stage experiment preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.
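
To make the two overlap thresholds concrete, here is a small sketch of the intersection-over-union (IoU) test that decides whether a predicted box counts as a match. The box format `(x_min, y_min, x_max, y_max)` and the helper names are assumptions for illustration, not taken from the paper. mAP@0.5:0.95 follows the COCO convention of averaging over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is why it rewards tighter boxes.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """A prediction matches the ground truth if its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


truth = (0.0, 0.0, 10.0, 10.0)
loose = (2.0, 0.0, 12.0, 10.0)  # shifted by 2 units: IoU = 80 / 120

loose_iou = iou(loose, truth)
match_at_050 = is_match(loose, truth, 0.5)   # counted as a hit for mAP@0.5
match_at_075 = is_match(loose, truth, 0.75)  # missed at a stricter threshold
```

The shifted box is accepted at the 0.5 threshold but rejected at 0.75, so a detector that produces such loose boxes can score well on mAP@0.5 while losing ground on mAP@0.5:0.95.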

2606.01895 2026-06-02 cs.CV cs.AI 版本更新

Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations

LEO星座中基于多卫星视角的协作空间目标检测

Xingyu Qu, Wenxuan Zhang, Peng Hu

发表机构 * Government of Canada(加拿大政府) Natural Sciences and Engineering Research Council of Canada(加拿大自然科学和工程研究理事会)

AI总结 针对LEO星座中空间目标检测的挑战,提出基于深度学习框架的多视角观测融合方法,使用YOLO检测器处理多视角数据,实验表明多视角融合显著提升检测精度。

详情
AI中文摘要

随着低地球轨道(LEO)星座中卫星数量的增加,近地空间环境日益拥挤,使得空间目标检测(SOD)成为空间安全和可持续性面临的紧迫挑战。为了降低碰撞风险并确保空间操作的连续性,SOD系统必须在严格的星载约束下提供快速准确的检测。在本文中,我们研究了深度学习(DL)框架内多视角观测融合的潜力,以增强SOD性能。我们设计了一个实用的多视角流水线和几种输入表示,用于将多视角数据输入基于YOLO的检测器。我们的实验表明,在大多数情况下使用多视角输入是可行的,并且通常能在mAP50和mAP50-95上产生更好的结果。例如,在模型YOLOv9-m中,单视角与三视角融合RGB设置相比,mAP50从0.638增加到0.732,而mAP50-95从0.227提高到0.276。与单视角设置相比,最佳的三视角灰度配置将mAP50提高了36.3%,mAP50-95提高了46.5%。这些发现确立了多视角融合作为SOD的一种可行且有效的策略,对LEO星座部署中的空间态势感知具有广泛意义。

英文摘要

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly congested, making space object detection (SOD) a pressing challenge for space safety and sustainability. To mitigate collision risks and ensure the continuity of space operations, SOD systems must deliver fast and accurate detection under stringent onboard constraints. In this paper, we investigate the potential of multi-viewpoint observation fusion within a deep learning (DL) framework to enhance SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and typically produces better results for mAP50 and mAP50-95. For example, with YOLOv9-m, moving from the single-view setting to a three-view fused RGB setting increases mAP50 from 0.638 to 0.732 and mAP50-95 from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.
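
The abstract mixes absolute scores (the RGB setting) with percentage gains (the grayscale setting). The short calculation below converts the reported RGB scores into relative gains so the two can be read on the same scale. It assumes the percentages in the abstract are relative improvements over the single-view baseline, which the abstract does not state explicitly; the function name is hypothetical.

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Scores reported in the abstract for YOLOv9-m, single-view vs. three-view fused RGB.
map50_gain = relative_gain_percent(0.638, 0.732)
map5095_gain = relative_gain_percent(0.227, 0.276)

print(f"mAP50: +{map50_gain:.1f}%  mAP50-95: +{map5095_gain:.1f}%")
# prints "mAP50: +14.7%  mAP50-95: +21.6%"
```

Under that assumption, the three-view RGB setting gains roughly 15% and 22% in relative terms, noticeably less than the 36.3% and 46.5% quoted for the best grayscale configuration.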

2606.01892 2026-06-02 cs.CV 版本更新

Adversarial Attacks on Robot Localization Systems via Deep Feature Perturbation

通过深度特征扰动对机器人定位系统的对抗攻击

Zhenyu Li, Tianyi Shang

发表机构 * Shandong Academy of Sciences(山东科学院) Fuzhou University(福州大学)

AI总结 提出一种基于轻量级乘积量化网络(LPQN)的对抗攻击框架,通过扰动查询特征编码来误导视觉定位系统中的检索过程,从而暴露深度学习定位管道的脆弱性。

Comments 11 pages

详情
AI中文摘要

机器人定位系统对于自主导航和安全至关重要。对抗性扰动可能误导这些系统,导致定位错误、导航失误或不安全交互,尤其是在关键任务场景中。本文研究了基于深度学习的定位管道对对抗攻击的脆弱性。我们提出了一种新颖的框架,用于生成专门针对视觉定位系统中乘积量化(PQ)的对抗查询。我们的方法采用轻量级乘积量化网络(LPQN)来扰动查询特征编码,通过返回语义无关的数据库条目来误导检索过程。对抗查询通过两阶段过程生成:前向传播扰动特征分布,后向传播通过优化细化扰动。LPQN的轻量级设计允许以最小的计算开销创建微妙但高效的扰动。在受控和真实机器人环境中的大量实验表明,我们的方法显著降低了PQN的性能,暴露了实际应用中的关键脆弱性。

英文摘要

Robot localization systems are critical for autonomous navigation and safety. Adversarial perturbations can mislead these systems, resulting in mislocalization, navigation errors, or unsafe interactions, especially in mission-critical scenarios. This paper investigates the vulnerability of deep-learning-based localization pipelines to adversarial attacks. We propose a novel framework for generating adversarial queries that specifically target Product Quantization (PQ) in visual localization systems. Our method employs a Lightweight Product Quantization Network (LPQN) to perturb query feature encodings, misleading the retrieval process so that it returns semantically irrelevant database entries. Adversarial queries are generated via a two-phase procedure: a forward pass that perturbs feature distributions and a backward pass that refines the perturbation through optimization. The lightweight design of LPQN allows the creation of subtle yet highly effective perturbations with minimal computational overhead. Extensive experiments in both controlled and real-world robotic environments demonstrate that our approach substantially degrades PQN performance, exposing critical vulnerabilities in practical applications.

2606.01885 2026-06-02 cs.CV 版本更新

Divide and Conquer: Reliable Multi-View Evidential Learning for Deepfake Detection

分而治之:用于深度伪造检测的可靠多视图证据学习

Xiaolu Kang, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Zhanhe Lei, Gang Wu, Qin Zou, Qian Wang

发表机构 * School of Computer Science, Wuhan University, Wuhan, China(武汉大学计算机学院) School of Integrated Circuits, Peking University, Beijing, China(北京大学集成电路学院) School of Information, Huazhong Agricultural University, Wuhan, China(华中农业大学信息学院) College of Cyber Security, Tarim University, Alaer, China(塔里木大学网络安全学院) School of Cyber Science and Engineering, Wuhan University, Wuhan, China(武汉大学网络安全与工程学院)

AI总结 提出分治多视图证据学习框架(DiCoME),通过几何视图净化解耦语义与伪影特征,并利用不确定性感知证据学习融合视图,提升深度伪造检测的泛化性和可靠性。

Comments Accepted to ICML 2026

详情
AI中文摘要

随着生成模型的演进,深度伪造已实现近乎完美的语义真实性,仅在细微结构异常中留下取证痕迹。然而,现有的单视图范式通常难以泛化,因为主导的语义特征在纠缠表示中掩盖了微弱的伪影线索。这种不平衡导致过度自信但脆弱的预测——我们称之为语义掩蔽效应。为应对这一挑战,我们提出一个可靠的框架,称为分治多视图证据学习(DiCoME),用于深度伪造检测。在“分”阶段,我们采用几何视图净化,通过原则性的几何投影分解纠缠的表示空间。该过程抑制了伪影敏感表示中的语义干扰,为去相关但互补的语义和伪影视图奠定基础。在“治”阶段,我们利用不确定性感知证据学习来合成这些不同的视图。通过显式建模语义和伪影线索之间的“认知冲突”,该机制提供校准的不确定性估计,而不是强制僵化的确定性决策。跨多个基准的大量实验表明,我们的方法在泛化性能上持续优于现有方法,同时为可信的深度伪造检测提供可靠的不确定性估计。代码可在 https://github.com/kxl0825/DiCoME.git 获取。

英文摘要

With the evolution of generative models, deepfakes have achieved near-perfect semantic realism, leaving forensic traces only in subtle structural anomalies. However, existing single-view paradigms often fail to generalize, as dominant semantic features overwhelm subtle artifact cues within entangled representations. This imbalance leads to overconfident yet brittle predictions -- a phenomenon we term the Semantic Masking Effect. To address this challenge, we propose a reliable framework called Divide-and-Conquer Multi-View Evidential Learning (DiCoME) for Deepfake Detection. In the "Divide" phase, we employ Geometric View Purification to decompose the entangled representation space through principled geometric projection. This process suppresses semantic interference within artifact-sensitive representations, forming the foundation for decorrelated yet complementary semantic and artifact views. In the "Conquer" phase, we leverage Uncertainty-Aware Evidential Learning to synthesize these distinct views. By explicitly modeling the "epistemic conflict" between semantic and artifact cues, this mechanism provides calibrated uncertainty estimates instead of forcing rigid deterministic decisions. Extensive experiments across multiple benchmarks demonstrate that our method consistently outperforms existing approaches in generalization performance, while providing reliable uncertainty estimation for trustworthy deepfake detection. Code is available at https://github.com/kxl0825/DiCoME.git.

2606.01883 2026-06-02 cs.LG cs.CV 版本更新

Beyond the Simplex: Balanced Prototype Geometry for Scorer-Agnostic Open-Set Recognition

超越单纯形:用于评分器无关的开放集识别的平衡原型几何

Mayank Sharma, Rohit Kumar Mourya

发表机构 * Indian Institute of Technology Jodhpur(印度理工学院乔浦尔)

AI总结 本文提出平衡等范数原型几何理论,统一分析不同嵌入维度下的开放集识别,证明评分器性能依赖于评分规则而非单纯形结构。

Comments 20 pages, 2 figures, 6 tables

详情
AI中文摘要

开放集识别(OSR)要求分类器拒绝来自未见类别的输入,这在医学成像等安全关键场景中至关重要。基于单纯形的方法将类原型固定在正则单纯形的顶点,然后通过距离比分数进行拒绝,这些方法在经验上表现良好但缺乏理论依据,且现有分析仅适用于嵌入维度d至少为C-1的情况,这是正则单纯形存在的条件。我们给出了在任意嵌入维度(包括d < C-1)下单纯形比OSR的理论解释。我们的分析集中于平衡等范数编码:具有等长和零和的原型配置,存在于所有d >= 2的情况,并包含正则单纯形作为特例。对于这些编码,我们证明辅助平方比分数的子水平集是欧几里得球的精确并集,进而包围了操作分数的接受区域;并且我们证明了一个尖锐的二分法:当且仅当d >= C-1时,原型达到等距对称性,行为类似于正则单纯形,低于该阈值时,由显式缺陷参数控制退化程度。我们进一步证明,在自然各向同性假设下,错误接受率随d指数衰减,并且操作分数是全局Lipschitz的,具有紧致接受区域。在实验上,我们将平衡原型几何作为分析工具和表示学习先验进行研究,而非作为独立的先进检测器。在CIFAR和MedMNIST开放集划分上,几何结构提供了有用的结构,但OSR性能仍然强烈依赖于评分规则:原始比率分数通常不如基于最近邻和logit的替代方案。

英文摘要

Open-set recognition (OSR) requires a classifier to reject inputs from unseen classes, which is essential in safety-critical settings such as medical imaging. Simplex-based methods, which fix class prototypes at the vertices of a regular simplex and then reject via a distance-ratio score, perform well empirically but lack theoretical justification, and existing analysis applies only when the embedding dimension d is at least C-1, which is the regime in which a regular simplex exists. We give a theoretical account of simplex-ratio OSR that holds in every embedding dimension, including d < C-1. Our analysis centers on balanced equal-norm codes: prototype configurations with equal lengths and zero sum, which exist for all d >= 2 and include the regular simplex as a special case. For these codes we show that an auxiliary squared ratio score has sublevel sets that are exact unions of Euclidean balls, which in turn bracket the acceptance region of the operational score; and we prove a sharp dichotomy: the prototypes attain one-distance symmetry, behaving like a regular simplex, if and only if d >= C-1, with controlled degradation governed by an explicit defect parameter below that threshold. We further show that the false-acceptance rate decays exponentially in d under natural isotropy assumptions, and that the operational score is globally Lipschitz with compact acceptance regions. Empirically, we study balanced prototype geometry as both an analytic tool and a representation-learning prior, rather than as a stand-alone state-of-the-art detector. Across CIFAR and MedMNIST open-set splits, the geometry provides useful structure, but OSR performance remains strongly dependent on the scoring rule: raw ratio scores typically underperform nearest-neighbor and logit-based alternatives.
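
A balanced equal-norm code in the sense described above (equal lengths, zero sum) can be built in d = 2 by spacing prototypes evenly on the unit circle. The sketch below uses that construction, which is one simple example rather than the authors' own, to illustrate the stated dichotomy: with C = 3 (so d = C-1) all pairwise distances coincide, while with C = 5 (so d < C-1) they do not. The function names are hypothetical.

```python
import math


def circle_code(num_classes):
    """Unit-norm prototypes evenly spaced on a circle (d = 2).
    For num_classes >= 2 the prototypes sum to zero."""
    return [
        (math.cos(2 * math.pi * k / num_classes), math.sin(2 * math.pi * k / num_classes))
        for k in range(num_classes)
    ]


def distinct_pairwise_distances(prototypes, ndigits=9):
    """Sorted set of pairwise Euclidean distances, rounded to absorb float error."""
    distances = set()
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            distances.add(round(math.dist(prototypes[i], prototypes[j]), ndigits))
    return sorted(distances)


triangle = circle_code(3)  # d = 2 = C - 1: a regular simplex
pentagon = circle_code(5)  # d = 2 < C - 1 = 4: balanced, but not a simplex

triangle_distances = distinct_pairwise_distances(triangle)  # one distance
pentagon_distances = distinct_pairwise_distances(pentagon)  # two distances
```

The pentagon still satisfies both balance conditions, so it is a valid code below the C-1 threshold, but its side and diagonal lengths differ, which is the loss of one-distance symmetry that the abstract's defect parameter is said to quantify.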

2606.01871 2026-06-02 cs.CV 版本更新

Deep Learning for Generating Computational PIN-4 Immunohistochemistry Staining from Prostate Biopsy H&E Images

深度学习从前列腺活检H&E图像生成计算PIN-4免疫组织化学染色

Vietbao Tran, Pratik Shah

发表机构 * Biomedical Engineering, University of California, Irvine, Irvine, CA, USA(加州大学尔湾分校生物医学工程系,美国加州尔湾) Laboratory Medicine Biomedical Engineering Electrical Engineering(实验室医学 生物医学工程 电气工程) Computer Science, University of California, Irvine, Irvine, CA, USA(加州大学尔湾分校计算机科学系,美国加州尔湾)

AI总结 本研究使用条件生成对抗网络(cGAN)从H&E图像合成PIN-4 IHC染色,实现了直接的空间对应,并在病理评估中取得了良好效果。

详情
AI中文摘要

免疫组织化学(IHC)常用于解决苏木精和伊红(H&E)染色组织上诊断不明确的前列腺癌活检结果。然而,PIN-4 IHC染色通常在相邻组织切片上进行,限制了H&E形态与相应免疫表型信号之间的直接空间比较。从常规临床前列腺活检全切片图像(WSI)构建了一个配对、配准的H&E/PIN-4数据集,并训练了一个条件生成对抗网络(cGAN)直接从原始H&E图像块合成PIN-4染色模式。最终数据集包含来自93名患者的172对配准WSI和27,298对配准的1024x1024图像块,涵盖腺癌阳性和良性病例,并代表了不同年龄、种族和民族群体。模型在来自17张WSI的1,814对图像块的保留测试集上进行了评估,平均峰值信噪比(PSNR)为21.88 dB,结构相似性指数(SSIM)为0.667,皮尔逊相关系数(PCC)为0.684,学习感知图像块相似度(LPIPS)为0.417。由委员会认证的病理学家进行的定性审查显示,生成的图像捕获了诊断相关的PIN-4染色模式,包括AMACR/消旋酶表达和基底细胞相关染色,同时保持了与源H&E形态的空间对应。在形态复杂的区域(包括高级别癌和导管内癌)中,合成的准确性有所变化。这些结果支持从常规采集的明场H&E前列腺活检图像进行监督式PIN-4合成的可行性。该方法能够在源前列腺H&E结构的背景下直接解释预测的PIN-4标记模式,解决了传统相邻切片IHC当前的空间局限性。

英文摘要

Immunohistochemistry (IHC) is frequently used to resolve diagnostically ambiguous prostate cancer biopsy findings on hematoxylin and eosin (H&E)-stained tissue. However, PIN-4 IHC staining is typically performed on adjacent tissue sections, limiting direct spatial comparison between the H&E morphology and the corresponding immunophenotypic signal. A paired, registered H&E/PIN-4 dataset was constructed from routine clinical prostate biopsy whole-slide images (WSIs), and a conditional generative adversarial network (cGAN) was trained to synthesize PIN-4 staining patterns directly from native H&E image patches. The final dataset comprised 172 paired WSIs from 93 patients and 27,298 registered 1024x1024 patch pairs, spanning adenocarcinoma-positive and benign cases with representation across age, race, and ethnicity groups. The model was evaluated on a held-out test set of 1,814 patch pairs from 17 WSIs, achieving a mean peak signal-to-noise ratio (PSNR) of 21.88 dB, structural similarity index measure (SSIM) of 0.667, Pearson correlation coefficient (PCC) of 0.684, and learned perceptual image patch similarity (LPIPS) of 0.417. Qualitative review by a board-certified pathologist showed that generated images captured diagnostically relevant PIN-4 staining patterns, including AMACR/racemase expression and basal-cell-associated staining, while preserving spatial correspondence with the source H&E morphology. Synthesis accuracy varied across morphologically complex regions, including high-grade carcinoma and intraductal carcinoma. These results support the feasibility of supervised PIN-4 synthesis from routinely acquired brightfield H&E prostate biopsy images. The approach enables direct interpretation of predicted PIN-4 marker patterns in the context of the source prostate H&E architecture, addressing a current spatial limitation of conventional adjacent-section IHC.
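
For readers less familiar with PSNR, the sketch below shows the standard definition, 10·log10(MAX² / MSE), and inverts it to give a rough sense of what 21.88 dB means in pixel terms. The 8-bit intensity range (MAX = 255) is an assumption, since the abstract does not state the image range, and the helper names are hypothetical.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_from_psnr(psnr_db, max_value=255.0):
    """Root-mean-square pixel error implied by a PSNR value."""
    return max_value / (10 ** (psnr_db / 20.0))


# Toy check: every pixel is off by 10 intensity levels, so the MSE is 100.
toy_psnr = psnr([0.0, 0.0], [10.0, 10.0])

# Under the 8-bit assumption, the reported mean PSNR corresponds to this RMSE.
implied_rmse = rmse_from_psnr(21.88)
```

Under the 8-bit assumption, 21.88 dB corresponds to a root-mean-square error of about 20.5 intensity levels per pixel. This is a property of the metric only and says nothing about where in the image the error concentrates.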

2606.01858 2026-06-02 cs.CV 版本更新

Polaris: Scaling Up Instruction-Guided Image Generation Towards Millions of Personalized Style Needs

Polaris: 将指令引导的图像生成扩展到数百万个性化风格需求

Zhi-Kai Chen, Jun-Peng Jiang, Jun-Jie Tao, De-Chuan Zhan, Han-Jia Ye

发表机构 * Tsinghua University(清华大学)

AI总结 提出Polaris智能检索框架,通过索引和检索超过6500个检查点和75000个适配器,自动选择和集成最相关的模型组件,实现无需额外训练的可扩展、可控且对齐的指令驱动图像生成。

详情
AI中文摘要

用户越来越期望图像生成模型能够快速适应高度多样化和个性化的需求,例如生成具有独特风格或特征的图像。传统方法依赖于微调,成本高昂且难以扩展。为了应对这些限制,社区积累了一个不断增长的微调模块和适配器库,其中每个组件针对特定的生成需求,并共同作为处理新需求的基础。这自然引出一个问题:与其重复训练新模型,我们能否系统地利用这个不断扩展的生态系统来更好地满足用户指令?为此,我们提出了Polaris,一个智能检索框架,根据用户的指令自动从模型库中选择和集成合适的模型。关键见解是,利用如此庞大和异构的库不仅需要在数千个候选中找到最相关的模块,还需要将它们有效地对齐以进行指令驱动的生成和编辑。Polaris通过索引超过6500个检查点和75000个适配器,并根据用户的输入和指令检索最相关的组件来解决这一挑战。通过这种方式,它提供了可扩展、可控且良好对齐的生成——无需任何额外训练。

英文摘要

Users increasingly expect image generation models to quickly adapt to highly diverse and personalized requirements, such as producing images with distinctive styles or characteristics. Traditional approaches rely on fine-tuning, which is costly and difficult to scale. To cope with these limitations, the community has accumulated a growing library of fine-tuned modules and adapters, where each component targets specific generation needs and collectively serves as a foundation for handling new demands. This naturally raises a question: instead of repeatedly training new models, can we systematically exploit this expanding ecosystem to better fulfill user instructions? To this end, we present Polaris, an intelligent retrieval framework that automatically selects and integrates suitable models from the model library based on a user's instructions. The key insight is that harnessing such a massive and heterogeneous pool requires not only finding the most relevant modules among thousands of candidates, but also aligning them effectively for instruction-driven generation and editing. Polaris addresses this challenge by indexing over 6,500 checkpoints and 75,000 adapters, and retrieving the most relevant components given a user's input and instruction. In doing so, it delivers scalable, controllable, and well-aligned generation -- without any additional training.

2606.01848 2026-06-02 cs.CV 版本更新

RescueBench: Can Embodied Agents Save Lives in the Wild?

RescueBench: 具身智能体能否在野外拯救生命?

Kui Wu, Beiyu Guo, Hao Chen, ShuHang Xu, Yuling Li, Yongdan Zeng, Zhoujun Li, Yizhou Wang, Fangwei Zhong

发表机构 * Beihang University(北京航空航天大学) Beijing Normal University(北京师范大学) Peking University(北京大学) City University of Macau(澳门城市大学) ATEC2025 Challenge Committee(ATEC2025挑战委员会)

AI总结 本文提出 RescueBench,一个四阶段流水线的逼真诊断基准,用于评估具身智能体在搜索与救援任务中的探索、记忆和交互能力,并揭示探索和记忆失败如何传播。

详情
AI中文摘要

搜索与救援(SAR)要求具身智能体在多模态不确定性下探索陌生环境,执行多阶段交互,并在长时域内检索空间记忆。现有基准通常孤立评估这些能力,当它们必须在现实工作流中组合时,失败如何叠加尚不清楚。我们提出 RescueBench,一个逼真的诊断基准,将 SAR 实例化为四阶段流水线:多模态探索、目标救援、记忆引导返回和最终交接。通过将顺序任务组合与阶段级评估相结合,RescueBench 能够分析探索和记忆失败如何在具身救援工作流中传播。它包含五个渐进难度级别,在环境复杂性、线索模糊性和空间层次上有所不同,并配有自动化的情节生成和标注流水线,用于可扩展的评估和训练。我们评估了七个基线、一个 oracle 参考和人类玩家,结果显示没有基线能在最大难度下完成全部任务。阶段级诊断将自主探索识别为主要失败模式,空间记忆为第二个独立瓶颈,表明当前拓扑视觉语言导航或基于地图的方法无法解决这些局限。代码见 https://github.com/wukui-muc/RescueBench。

英文摘要

Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memory over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving unclear how failures compound when they must be composed in realistic workflows. We introduce RescueBench, a photo-realistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, showing that no baseline completes the full task at the highest difficulty level. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological visual-language navigation or map-based methods. Code is available at https://github.com/wukui-muc/RescueBench.

2606.01843 2026-06-02 cs.CV cs.AI 版本更新

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

抑制伪造特定捷径以实现可泛化的深度伪造检测

Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu, Le Wu, Tat-Seng Chua

发表机构 * Hefei University of Technology(合肥工业大学) National University of Singapore(新加坡国立大学)

AI总结 提出Shortcut Subspace Suppression (S^3)框架,通过子空间建模显式表征并抑制方法特定捷径,以提升深度伪造检测的跨方法泛化能力。

详情
AI中文摘要

深度伪造检测在跨伪造方法泛化方面表现不佳,因为现有模型倾向于依赖虚假的方法特定捷径,这些捷径无法迁移到未见过的篡改操作。尽管近期方法试图改进泛化性,但它们缺乏明确的机制来识别和抑制学习表示中的此类捷径。在这项工作中,我们提出了捷径子空间抑制(S^3)框架,通过子空间建模显式表征并抑制方法特定捷径。我们的关键洞察是,区分不同伪造方法的变体捕获了方法特定的伪影,因此可作为方法特定捷径的有效代理。为此,我们训练一个轻量级线性探针进行伪造方法分类,并执行奇异值分解(SVD)以提取主导的捷径子空间。基于此公式,我们开发了两种互补策略来减少对捷径的依赖。在训练期间,我们软性抑制特征表示中的捷径子空间,鼓励模型依赖更可泛化的线索进行真/假判别。在推理时,我们引入一个无需训练的对应方法,衰减与识别出的捷径方向对齐的神经元,从而实现即插即用的泛化增强,并提高可解释性。在多个基准上的大量实验表明,我们的方法显著改善了跨方法泛化,同时保持了强大的域内性能。代码将在论文被接收后发布。

英文摘要

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and suppress such shortcuts in learned representations. In this work, we propose the Shortcut Subspace Suppression (S^3) framework, which explicitly characterizes and suppresses method-specific shortcuts via subspace modeling. Our key insight is that variations distinguishing different forgery methods capture method-specific artifacts and thus serve as an effective proxy for method-specific shortcuts. To this end, we train a lightweight linear probe for forgery method classification and perform Singular Value Decomposition (SVD) to extract the dominant shortcut subspace. Building on this formulation, we develop two complementary strategies to reduce shortcut reliance. During training, we softly suppress the shortcut subspace in feature representations, encouraging the model to rely on more generalizable cues for real/fake discrimination. At inference time, we introduce a training-free counterpart that attenuates neurons aligned with the identified shortcut directions, enabling plug-and-play generalization enhancement with improved interpretability. Extensive experiments on multiple benchmarks demonstrate that our method significantly improves cross-method generalization while maintaining strong in-domain performance. The code will be released upon acceptance of the submission.
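
The probe-then-SVD idea can be sketched in a few lines under strong simplifying assumptions. The abstract does not say which matrix is decomposed or how suppression is applied, so this sketch assumes the SVD is taken of the probe's weight matrix, keeps only a rank-1 subspace, finds it by power iteration instead of a full SVD, and suppresses it by subtracting a scaled projection. All names and the toy weights are hypothetical.

```python
import math


def top_right_singular_vector(matrix, iterations=100):
    """Dominant right singular vector of `matrix`, via power iteration on W^T W."""
    dim = len(matrix[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        wv = [sum(row[i] * v[i] for i in range(dim)) for row in matrix]  # W v
        wtwv = [
            sum(matrix[r][i] * wv[r] for r in range(len(matrix))) for i in range(dim)
        ]  # W^T (W v)
        norm = math.sqrt(sum(x * x for x in wtwv))
        v = [x / norm for x in wtwv]
    return v


def suppress(feature, direction, strength):
    """Remove a fraction `strength` of the feature's component along `direction`."""
    coeff = sum(f * d for f, d in zip(feature, direction))
    return [f - strength * coeff * d for f, d in zip(feature, direction)]


# Toy probe weights (rows: forgery methods, columns: feature dimensions).
# The probe separates methods almost entirely along the first feature dimension.
probe_weights = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 0.1, 0.0]]
shortcut = top_right_singular_vector(probe_weights)

feature = [3.0, 1.0, 2.0]
soft = suppress(feature, shortcut, strength=0.5)  # partial removal
hard = suppress(feature, shortcut, strength=1.0)  # full projection out
```

With `strength=1.0` the first coordinate, the one the toy probe relies on, is zeroed while the others are untouched. Intermediate values correspond to the "soft" suppression the abstract describes for training.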

2606.01834 2026-06-02 cs.CV cs.AI 版本更新

Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition

轻量级TCN中的物理引导注意力用于高效基于WiFi CSI的人体活动识别

Chinthaka Ranasingha, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Harshala Gammulle

发表机构 * Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) Research Group, School of Electrical Engineering and Robotics, Queensland University of Technology (QUT)(信号处理、人工智能与视觉技术(SAIVT)研究组,电气工程与机器人学院,昆士兰科技大学(QUT))

AI总结 提出一种紧凑的TCN框架,通过多普勒能量引导的时间注意力和方差驱动的通道注意力机制,显式引入运动感知归纳偏置,在减少参数和计算成本的同时实现优于深度基线模型的性能。

详情
AI中文摘要

基于WiFi信道状态信息(CSI)的人体动作识别(HAR)因其非接触、低成本及保护隐私的特性而受到越来越多的关注。然而,现有的基于学习的方法主要依赖深度、计算密集的架构来隐式地从CSI测量中捕捉运动动态,从而增加了模型复杂度并降低了效率。相反,我们认为,结合针对CSI信号物理特性的适当归纳偏置能够实现更高效和有效的学习。在这项工作中,我们提出一个紧凑的基于时间卷积网络(TCN)的框架,将运动感知的归纳偏置显式地融入特征学习。具体地,我们在特征空间中引入多普勒能量引导的时间注意力机制以强调运动显著的时间段,以及一个方差驱动的通道注意力模块,根据时间运动统计自适应地加权信息子载波。通过整合这些领域特定的先验知识,所提模型在不增加架构深度的情况下有效捕捉运动动态。在多个基准数据集上的大量实验表明,我们的方法相比更深的基线模型取得了优越的性能,同时显著减少了参数数量和计算成本。

英文摘要

Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost, and privacy-preserving nature. However, existing learning-based approaches largely rely on deep, computationally intensive architectures to implicitly capture motion dynamics from CSI measurements, thereby increasing model complexity and reducing efficiency. Instead, we argue that incorporating appropriate inductive biases tailored to the physical characteristics of CSI signals enables more efficient and effective learning. In this work, we propose a compact temporal convolutional network (TCN)-based framework that explicitly incorporates motion-aware inductive biases into feature learning. Specifically, we introduce a Doppler-energy-guided temporal attention mechanism in feature space to emphasize motion-salient time segments, and a variance-driven channel attention module that adaptively weights informative subcarriers based on temporal motion statistics. By integrating these domain-specific priors, the proposed model effectively captures motion dynamics without increasing architectural depth. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves superior performance compared to deeper baselines, while significantly reducing parameter count and computational cost.
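
One way to read "variance-driven channel attention" is that subcarriers whose amplitude fluctuates more over time receive larger weights. The sketch below implements that reading with a softmax over per-subcarrier temporal variance. The softmax form, the `temperature` parameter, and the function name are all assumptions; the abstract does not specify how the module is built.

```python
import math


def variance_channel_weights(csi, temperature=1.0):
    """csi: list of time steps, each a list of per-subcarrier amplitudes.
    Returns one weight per subcarrier, a softmax over temporal variance."""
    num_steps = len(csi)
    num_channels = len(csi[0])
    variances = []
    for c in range(num_channels):
        series = [csi[t][c] for t in range(num_steps)]
        mean = sum(series) / num_steps
        variances.append(sum((x - mean) ** 2 for x in series) / num_steps)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(variances)
    exps = [math.exp((v - peak) / temperature) for v in variances]
    total = sum(exps)
    return [e / total for e in exps]


# Toy CSI: subcarrier 0 is static, subcarrier 1 oscillates (variance 1.0).
csi = [[1.0, 0.0], [1.0, 2.0], [1.0, 0.0], [1.0, 2.0]]
weights = variance_channel_weights(csi)
weighted = [[a * w for a, w in zip(step, weights)] for step in csi]
```

The static subcarrier receives the smaller weight, which matches the intuition that motion shows up as temporal variation in CSI. A learned module would likely add trainable parameters on top of such statistics.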

2606.01825 2026-06-02 cs.CV cs.MM 版本更新

ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search

ROGLE: 基于自动区域监督的鲁棒全局-局部对齐用于文本行人搜索

Zequn Xie, Xibei Jia, Sihang Cai, Shulei Wang, Tao Jin

发表机构 * Zhejiang University(浙江大学)

AI总结 提出ROGLE框架,通过自动区域-句子匹配策略和多粒度学习,解决文本行人搜索中细粒度对齐不足的问题,并在新基准P-VLG上取得最优性能。

Comments 12 pages, 5 figures

详情
AI中文摘要

文本行人搜索(TBPS)旨在使用自然语言查询检索行人图像。然而,现有的TBPS模型,尤其是基于CLIP的模型,由于从短标题训练中继承的全局表示偏差和语义稀疏性,在细粒度理解方面存在困难。这导致弱细粒度对齐,而区域级标注的稀缺加剧了这一问题。为此,我们提出了ROGLE(鲁棒全局-局部嵌入),一个统一的框架,通过自动区域-句子匹配(RSM)策略克服了对昂贵人工标注的依赖。RSM自动挖掘伪区域-句子对,用于可扩展的细粒度监督。此外,ROGLE采用多粒度学习策略,融合全局对比学习和区域级局部对齐。我们还引入了P-VLG基准,这是一个通过从现有公共基准中整理和丰富图像构建的大规模数据集。它包含超过10万个标注区域和丰富的长标题,是第一个同时支持全局和局部评估协议的TBPS基准。大量实验表明,ROGLE显著优于现有方法,特别是在具有挑战性的长查询上。代码和P-VLG基准将公开提供。

英文摘要

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.

2606.01819 2026-06-02 cs.CV eess.IV 版本更新

Hist2Style: Histogram-Guided Stylization with Bilateral Grids

Hist2Style: 基于双边网格的直方图引导风格化

Dekel Galor, Adam Pikielny, Zhoutong Zhang, Ke Wang, Laura Waller, Jiawen Chen, Ilya Chugunov

发表机构 * Adobe Nextcam University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Hist2Style,利用双边网格实现快速、边缘感知的逼真风格迁移,通过蒸馏大模型为轻量网络,并基于直方图嵌入提供可解释的用户控制。

Comments 10 pages, 8 figures. Extended results are at https://www.dekelgalor.com/hist2style

详情
AI中文摘要

逼真风格迁移旨在匹配输入图像与风格目标的颜色和色调,同时保留原始场景的内容和细节。尽管现有的大图像模型可以促进这类外观编辑,但它们的高计算需求、潜在的幻觉以及有限的用户控制使其不适合高分辨率、实时工作流。我们引入Hist2Style,一种双边网格公式,用于快速、边缘感知的风格化,通过将操作限制在双边空间中的局部仿射变换来保持视觉保真度。我们的模型通过在大规模监督语料库上训练(该语料库由语言和视觉语言模型生成),将大图像编辑模型蒸馏为轻量网络,针对空间变化的颜色编辑。网络以风格目标的直方图嵌入为条件,提供可解释的接口,通过修改目标颜色分布来调整输出风格。总体而言,Hist2Style通过构造保持内容结构,避免幻觉,并支持实时、高分辨率的逼真风格化,具有交互式用户可控的颜色和色调调整。

英文摘要

Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.

2606.01818 2026-06-02 cs.CV 版本更新

Unsupervised Collaborative Domain Adaptation for Driving Scene Parsing

无监督协作域自适应用于驾驶场景解析

Jiahe Fan, Shaolong Shu, Mingjian Sun, Tiehua Zhang, Bohong Xiao, Hanli Wang, Rui Fan

发表机构 * College of Electronic and Information Engineering, Tongji University(同济大学电子与信息学院) Department of Control Science and Engineering, Harbin Institute of Technology(控制科学与工程系,哈尔滨工业大学) Department of Vehicle Control System and Software Development, NIO(车辆控制系统与软件开发部,蔚来汽车) School of Computer Science and Technology, Tongji University(计算机科学与技术学院,同济大学) Key Laboratory of Embedded System and Service Computing (Ministry of Education), Tongji University(嵌入式系统与服务计算重点实验室(教育部),同济大学)

AI总结 提出无监督协作域自适应框架UCDA,通过多源模型协作优化和知识蒸馏,在无源数据条件下提升目标域驾驶场景解析的鲁棒性和泛化能力。

详情
AI中文摘要

可靠的驾驶场景解析是自动驾驶车辆在开放动态环境中运行的基本能力。然而,将感知模型适应新的部署域仍然具有挑战性,因为像素级标注成本高昂,且由于隐私、安全或所有权限制,源域数据通常无法访问。现有的无源无监督域自适应方法通常依赖于单个预训练源模型,这使得自适应后的感知系统容易受到源特定偏差的影响,并在不同的道路布局、光照条件、天气模式和交通状况下限制其鲁棒性。本文提出了一种无监督协作域自适应(UCDA)框架,用于无源设置下的驾驶场景解析,该框架将多个预训练源模型的互补知识迁移到统一的目标模型,而无需访问任何原始源样本。为了比较独立训练模型的预测,UCDA构建了一个类级原型记忆库,并通过原型相似性估计跨模型预测可靠性,从而减少源模型间不一致置信度尺度的影响。基于由此产生的互补监督,UCDA采用两阶段迁移策略:首先通过正负一致性约束的协作优化,在无标签的目标域驾驶数据上精炼多个源模型,然后将它们经过验证的专业知识蒸馏到单个可部署的目标模型中。在公开驾驶场景数据集和从自动驾驶车辆平台收集的真实世界数据上的全面评估表明,UCDA有效地整合了互补的多源知识,提高了目标域场景解析的可靠性和在不同驾驶环境中的泛化能力。

英文摘要

Reliable driving scene parsing is a fundamental capability for autonomous vehicles operating in open and dynamic driving environments. However, adapting perception models to new deployment domains remains challenging because pixel-level annotations are expensive to obtain, while source-domain data are often inaccessible due to privacy, security, or ownership constraints. Existing source-free unsupervised domain adaptation methods typically rely on a single pre-trained source model, which makes the adapted perception system vulnerable to source-specific biases and limits its robustness under diverse road layouts, illumination conditions, weather patterns, and traffic conditions. This article presents an unsupervised collaborative domain adaptation (UCDA) framework for driving scene parsing in a source-free setting, which transfers complementary knowledge from multiple pre-trained source models to a unified target model without accessing any original source samples. To compare predictions from independently trained models, UCDA constructs a class-level prototype memory bank and estimates cross-model prediction reliability through prototype similarity, reducing the effect of inconsistent confidence scales across source models. Based on the resulting complementary supervision, UCDA adopts a two-stage transfer strategy: multiple source models are first refined on unlabeled target-domain driving data through collaborative optimization with positive and negative consistency constraints, and their validated expertise is then distilled into a single deployable target model. Comprehensive evaluations on public driving-scene datasets and real-world data collected from an autonomous vehicle platform demonstrate that UCDA effectively consolidates complementary multi-source knowledge, improving target-domain scene parsing reliability and generalization across diverse driving environments.

2606.01808 2026-06-02 cs.CV 版本更新

Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI for Cardiac Digital Twins

基于电影MRI的个性化三维心肌梗死几何重建用于心脏数字孪生

Yilin Lyu, Mark YY Chan, Ching-Hui Sia, Lei Li

发表机构 * Department of Biomedical Engineering, National University of Singapore(新加坡国立大学生物医学工程系) Department of Medicine, National University of Singapore(新加坡国立大学医学系) Department of Cardiology, National University Heart Centre Singapore(新加坡国立大学心脏中心心脏科)

AI总结 提出一种显式几何-运动嵌入模型,从多视角电影MRI中全自动重建个性化、可仿真的三维心肌梗死几何结构,采用双分支自适应融合和AHA-17引导的多尺度监督,实现无对比剂梗死表征。

Comments 14 pages

详情
AI中文摘要

准确的三维心肌梗死(MI)几何表征对于构建心脏数字孪生(CDT)以精确模拟梗死相关电生理至关重要。晚期钆增强磁共振成像(LGE MRI)是定位MI的临床参考,但其对造影剂的依赖限制了在肾功能受损患者中的使用,并限制了纵向随访。作为替代,无对比剂电影MRI可可视化异常心室壁运动,这高度指示梗死区域。在本研究中,我们提出了一种新颖的显式几何-运动嵌入模型,直接从多视角电影MRI中全自动重建个性化、可仿真的三维MI几何结构。具体地,我们构建了一个4D(3D+t)双心室网格,以显式提取和解耦几何感知和运动感知特征。我们进一步设计了一个双分支模块用于自适应几何-运动融合,以捕获时空依赖性来映射梗死区域。此外,我们引入了一种利用AHA-17节段引导的交叉注意力机制的多尺度监督来指导预测,确保生物物理一致的重建。在225例电影MRI上的实验结果表明,所提出的三维MI重建实现了高性能,平均Dice得分为0.678±0.011。在下游的计算机电生理模拟评估中,结果与LGE衍生的真实情况高度一致,突显了所提出模型在无对比剂瘢痕表征和无缝集成到CDT建模中的巨大潜力。代码将在稿件被接受发表后公开。

英文摘要

Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to fully automatically reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion to capture spatiotemporal dependencies for mapping infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of 0.678 $\pm$ 0.011. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
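For readers unfamiliar with the Dice score reported above, the standard definition is 2|A∩B| / (|A| + |B|) for a predicted region A and a reference region B. The sketch below applies it to flattened binary labels. The function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions; the abstract does not say whether the score is computed over voxels or mesh nodes.

```python
def dice_score(pred, truth):
    """pred, truth: equal-length sequences of 0/1 labels (flattened masks)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / size_sum

# Toy masks: one shared positive, two positives in each mask.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])  # 2 * 1 / (2 + 2) = 0.5
```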

2606.01790 2026-06-02 cs.CV cs.AI 版本更新

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

STaR-KV: 面向GUI视觉语言模型的时空自适应KV缓存压缩重加权方法

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang

发表机构 * EPIC Lab, SJTU(上海交通大学EPIC实验室) HKUST (GZ)(香港科技大学(广州)) The University of Sydney(悉尼大学) UESTC(电子科技大学) ZJU(浙江大学)

AI总结 提出STaR-KV,一种无需训练的KV缓存压缩框架,通过子空间感知评分、时间稳定性折扣和熵驱动温度三个维度自适应校准令牌重要性,在GUI任务中实现高精度和近40%的峰值GPU内存节省。

详情
AI中文摘要

基于视觉语言模型的图形用户界面(GUI)代理展现出广泛的自动化能力,但其部署受限于随交互步骤线性增长的键值(KV)缓存。例如,UI-TARS-1.5-7B在仅五个屏幕截图上消耗76 GB的GPU内存,接近主流80 GB加速器的容量。现有的KV压缩方法共享两个结构假设:将视觉令牌重要性聚合为单个共享显著性图,并对融合的分数分布应用固定的top-B截断。初步测量反驳了这两点:空间专门化存在于注意力子空间层面并在层间迁移,而分数分布沿轨迹漂移。我们提出STaR-KV(时空自适应重加权),一种无需训练的KV缓存压缩框架,沿三个维度校准令牌重要性:(i)由在线空间互信息驱动的子空间感知评分;(ii)时间稳定性折扣,抑制来自持续关注子空间的冗余缓存条目;(iii)熵导出的温度,自适应重塑分数分布。在四个GUI基准测试中,STaR-KV在匹配预算下实现了最先进的KV压缩方法(如GUIKV、SnapKV)中最强的平均准确率,无压缩阶段FLOPs开销(-0.07%),并在20% KV缓存预算下削减近40%的峰值GPU内存。代码可在https://github.com/kawhiiiileo/STaR-KV获取。

英文摘要

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.
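As a point of reference, the fixed top-B cutoff that the abstract attributes to existing KV compression methods can be sketched as follows. This is the baseline assumption being criticised, not STaR-KV's scoring; the function name and the toy scores are hypothetical.

```python
def top_b_keep(scores, budget_ratio):
    """Return the sorted indices of tokens kept under a fixed top-B cutoff."""
    budget = max(1, int(len(scores) * budget_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Hypothetical importance scores for ten cached tokens, with a 20% budget.
toy_scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3, 0.7, 0.2, 0.6, 0.15]
kept = top_b_keep(toy_scores, 0.20)  # keeps indices [0, 3]
```

The cutoff keeps the same number of entries regardless of how flat or peaked the scores are, which is the rigidity that an adaptively reshaped score distribution is meant to address.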

2606.01788 2026-06-02 cs.CV 版本更新

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

PlatonicNav: 借助柏拉图式拓扑地图揭示导航中的语义对应

Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao

发表机构 * USYD(悉尼大学) Maincode UNSW(新南威尔士大学) La Trobe(拉特罗布大学)

AI总结 提出PlatonicNav框架,通过自监督视觉编码器构建柏拉图式拓扑地图,无需跨模态训练即可统一视觉目标导航、跨模态目标导航和视觉语言导航任务。

详情
AI中文摘要

具身视觉导航中,智能体感知复杂环境并从原始感官输入出发行动以到达目标,支撑了家庭服务机器人、辅助机器人和大规模自主探索等广泛应用。然而,最近统一视觉语言导航(VLN)和物体目标导航(ObjNav)的尝试仍停留在架构融合、混合任务训练和大规模视觉语言预训练层面,未检验独立训练的视觉和语言编码器是否已共享共同的语义结构。此外,即使是以物体为中心的拓扑地图,仍通过显式跨模态监督(如CLIP或大型视觉语言模型)来锚定语言目标,尚不清楚这种锚定是否可能仅从纯视觉构建的地图实现。为解决这些挑战,我们将柏拉图式表示假说扩展到具身导航,并将纯视觉ObjNav、跨模态ObjNav和VLN重新解释为同一以物体为中心的语义流形的三种不同接口。我们进一步引入PlatonicNav,一个无需训练的框架,其柏拉图式拓扑地图融合来自自监督视觉编码器的几何和语义节点距离,并通过盲匹配(无需任何配对视觉语言数据)锚定语言目标。在HM3D-IIN、OVON和R2R-CE(基于MP3D)等仿真基准以及宇树Go2机器人上的实验表明,PlatonicNav无需显式跨模态训练即可跨任务、模态和具身形式泛化。代码:https://github.com/AIGeeksGroup/PlatonicNav。网站:https://aigeeksgroup.github.io/PlatonicNav。

英文摘要

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.

2606.01757 2026-06-02 cs.CV 版本更新

PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

PillarDETR:基于YOLO骨干和RT-DETR头的实时3D目标检测

Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出PillarDETR架构,结合YOLOv8的CSP骨干和RT-DETR解码器,实现无需NMS的端到端实时3D目标检测,在KITTI和nuScenes上取得精度与速度的良好平衡。

Comments 6 pages, 1 figure, 8 tables

详情
AI中文摘要

实时3D目标检测是自动驾驶系统和机器人安全运行的关键组成部分。虽然LiDAR点云提供准确的空间信息,但高效处理它们仍然是一个重大挑战。传统方法依赖于复杂的3D卷积或基于锚点的范式,难以平衡检测精度与推理速度。在本文中,我们提出PillarDETR,一种新颖的端到端3D目标检测架构,它将基于柱体的LiDAR编码的效率与现代2D视觉模型的表示能力相结合。具体来说,PillarDETR用源自YOLOv8的跨阶段局部(CSP)网络替代标准卷积骨干,从而能够从伪图像中提取更丰富的特征。此外,我们摒弃了传统的基于锚点或基于中心的检测头,转而采用实时检测Transformer(RT-DETR)解码器。这种混合设计使网络能够捕获全局上下文并直接预测3D边界框,而无需依赖非极大值抑制(NMS)。在KITTI和nuScenes基准上的大量实验表明,PillarDETR在平均精度(mAP)和推理延迟之间实现了令人信服的权衡。我们的消融研究证实,集成YOLOv8骨干和RT-DETR头相比PointPillars基线带来了显著改进,使PillarDETR成为实时3D感知的高效解决方案。

英文摘要

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.
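The pillar-based LiDAR encoding mentioned above starts by grouping points into vertical columns on an x-y grid. The sketch below shows only that grouping step, under stated assumptions: the function name, the cell size, and the toy points are hypothetical, and the learned per-pillar feature encoder that produces the pseudo-image is omitted.

```python
from collections import defaultdict

def pillarize(points, cell_size):
    """Group (x, y, z) points into vertical pillars indexed by their x-y grid cell."""
    pillars = defaultdict(list)
    for x, y, z in points:
        # Floor division keeps negative coordinates in the correct cell.
        key = (int(x // cell_size), int(y // cell_size))
        pillars[key].append((x, y, z))
    return dict(pillars)

# Hypothetical toy point cloud with a 1.0-unit grid.
toy_points = [(0.1, 0.1, 0.5), (0.2, 0.3, 1.5), (1.2, 0.1, 0.2), (-0.1, 0.0, 0.0)]
toy_pillars = pillarize(toy_points, cell_size=1.0)
```

Height is ignored when assigning cells, so the two points at different z values above cell (0, 0) land in the same pillar. This is what lets the result be treated as a 2D image by a 2D backbone.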

2606.01756 2026-06-02 cs.CV 版本更新

EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

EvoCut:面向高效大型视觉语言模型的多层演化感知视觉标记压缩

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Pengfei Zhang, Yao Hu, Jiawei Li, Shikai Jiang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Xiaohongshu(小红书) Fudan University(复旦大学)

AI总结 提出一种无需训练和注意力的视觉标记压缩方法EvoCut,通过分析多层演化偏差估计标记重要性,在LLaVA-1.5-7B上仅保留11.1%的视觉标记即可保持94.4%的平均性能。

Comments Preprint. 12 pages, 6 figures, 7 tables

详情
AI中文摘要

大型视觉语言模型(LVLMs)在图像和视频理解任务上取得了强大性能,但其推理效率受到视觉编码器产生的大量视觉标记的限制。现有大多数视觉标记压缩方法从特定层的注意力分数或表示属性估计标记重要性,忽略了视觉标记在视觉编码器中的演化过程。这种逐层标准可能提供不完整的重要性估计,并限制压缩后的性能保持。为解决此问题,我们分析了逐层视觉标记演化方向,并观察到标记在视觉编码器各层形成多个组演化方向。进一步分析表明,信息性标记往往表现出与共同组演化方向的持续偏离。基于这一观察,我们提出了EvoCut,一种无需训练和注意力的视觉标记压缩方法,通过多层演化偏差估计标记重要性。实验结果表明,EvoCut在LLaVA-1.5-7B上仅保留11.1%的视觉标记即可保持94.4%的平均性能,展示了其在平衡效率和准确性方面的有效性。

英文摘要

Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing visual token compression methods estimate token importance from attention scores or representation properties at specific layers, overlooking how visual tokens evolve across the vision encoder. Such layer-specific criteria may provide incomplete importance estimates and limit performance preservation after compression. To address this issue, we analyze layer-wise visual token evolution directions and observe that tokens form multiple group evolution directions across vision-encoder layers. Our analysis further shows that informative tokens tend to exhibit persistent deviations from common group evolution directions. Based on this observation, we propose EvoCut, a training-free and attention-free visual token compression method that estimates token importance from multi-layer evolution deviation. Experimental results show that EvoCut can retain only 11.1% of the visual tokens on LLaVA-1.5-7B while preserving 94.4% of the average performance, demonstrating its effectiveness in balancing efficiency and accuracy.
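To make the retention figure concrete, the short calculation below converts the reported fraction into a token count. The figure of 576 visual tokens per image is an assumption not stated in the abstract: it corresponds to the 24 x 24 patch grid commonly cited for LLaVA-1.5.

```python
# Assumption (not from the abstract): 576 visual tokens per image (24 x 24 patches).
FULL_VISUAL_TOKENS = 576
RETAINED_FRACTION = 0.111  # the 11.1% reported in the abstract

kept_tokens = round(FULL_VISUAL_TOKENS * RETAINED_FRACTION)  # 64
exact_fraction = kept_tokens / FULL_VISUAL_TOKENS            # 64 / 576 = 1/9
```

Under that assumption the reported ratio corresponds to keeping 64 tokens per image.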

2606.01753 2026-06-02 cs.CV 版本更新

Quality-Guided Semi-Supervised Learning for Medical Image Segmentation

质量引导的半监督学习用于医学图像分割

Kumar Abhishek, Ghassan Hamarneh

发表机构 * School of Computing Science, Simon Fraser University, Canada(加拿大西蒙弗雷泽大学计算科学学院)

AI总结 提出一种质量引导的半监督学习框架,通过专用网络估计分割质量,并利用质量感知正则化和伪标签重加权提升医学图像分割性能。

Comments Early Accept at MICCAI 2026, 13 pages, 2 figures

详情
AI中文摘要

训练准确的医学图像分割模型需要大量密集标注的数据,这既昂贵又耗时。半监督学习通过从大量未标注数据和少量标注数据中学习来缓解这一问题。然而,大多数现代半监督学习方法依赖未标注数据的伪标签,并通常通过模型置信度或不确定性来评估其可靠性,这些度量是自我指涉的,缺乏对分割质量的明确基础。相反,我们提出了一种质量引导的半监督学习框架,训练一个专用网络从图像-掩膜对中估计分割质量。该预测器在通过合成损坏生成的变质量掩膜上进行训练,这些损坏结合了部分训练分割模型产生的不完美输出,捕捉训练中遇到的真实错误模式。我们通过两种互补机制将质量预测器集成到半监督学习中:质量感知正则化损失和基于质量的伪标签样本重新加权方案。我们表明,我们的方法可以作为现有半监督学习框架的即插即用增强。在五个数据集和多种架构上的大量实验表明,与竞争性的半监督学习方法相比,我们的方法取得了一致的改进,推进了半监督医学图像分割的最新水平。

英文摘要

Training accurate medical image segmentation models requires large amounts of densely annotated data, which is costly and time-consuming to obtain. Semi-supervised learning (SSL) alleviates this by learning from both abundant unlabeled data and limited labeled data. However, most modern SSL methods rely on pseudolabels for unlabeled data, and typically assess their reliability through model confidence or uncertainty, measures that are self-referential and lack explicit grounding in segmentation quality. Instead, we propose a quality-guided SSL framework that trains a dedicated network to estimate segmentation quality from image-mask pairs. The predictor is trained on variable-quality masks generated through synthetic corruptions augmented with imperfect outputs from partially trained segmentation models, capturing realistic error patterns encountered during training. We integrate the quality predictor into SSL through two complementary mechanisms: a quality-aware regularization loss and a quality-based pseudolabel sample reweighting scheme. We show that our method serves as a drop-in enhancement to existing SSL frameworks. Extensive experiments across five datasets and multiple architectures demonstrate consistent improvements over competing SSL methods, advancing the state-of-the-art in semi-supervised medical image segmentation.
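The quality-based pseudolabel reweighting mentioned above can be pictured as a weighted average of per-sample losses. The sketch below is only an illustration: the function name, the toy numbers, and the normalisation by the sum of the quality scores are assumptions, and the quality values stand in for the output of the dedicated quality-prediction network.

```python
def quality_weighted_loss(losses, qualities):
    """Weighted mean of per-sample pseudolabel losses, with quality scores in [0, 1]."""
    total_weight = sum(qualities)
    if total_weight == 0:
        return 0.0  # assumed convention: no trusted pseudolabels, no loss signal
    return sum(l * q for l, q in zip(losses, qualities)) / total_weight

# Hypothetical batch: a well-segmented sample and a poorly segmented one.
toy_losses = [0.2, 1.0]
toy_quality = [0.9, 0.1]
weighted = quality_weighted_loss(toy_losses, toy_quality)
```

The low-quality pseudolabel contributes little to the result, so the weighted loss falls well below the unweighted mean of 0.6.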

2606.01746 2026-06-02 cs.CV cs.LG 版本更新

Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness

敏感性是一把双刃剑:判别性与对抗鲁棒性之间的权衡

Kai Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文发现全连接分类器的高敏感性带来判别性但也导致脆弱性,而ℓ2距离分类器的不敏感性带来鲁棒性但限制性能,为此提出基于混合原型混合框架的ℓ2重分类器,通过融合稳定原型和动态原型实现判别性与鲁棒性的平衡,并设计混合替代攻击评估协议。

Comments 13 pages including reference, 4 figures

详情
AI中文摘要

现代神经网络极易受到对抗性扰动的影响。在这项工作中,我们指出这种脆弱性部分源于广泛使用的全连接分类器对此类扰动的敏感性。相比之下,简单的基于ℓ2距离的分类器表现出显著更强的鲁棒性。我们提供了充分的理论和实证分析,表明全连接分类器的高敏感性使其具有判别性,但也使其脆弱;相反,ℓ2分类器的不敏感性赋予了鲁棒性但限制了性能。受这种权衡的启发,我们提出了一种基于混合原型混合框架的新型ℓ2重分类器。该方法保留了全连接分类器的判别能力,同时利用了ℓ2距离的鲁棒性。它通过融合两种原型类型来产生基于ℓ2距离的预测:(1)通过指数移动平均更新的稳定数据集级原型,以及(2)使用直通估计器从全连接分类器预测生成的动态批量级原型。然而,这种基于直通估计器的动态架构给评估带来了重大挑战,例如梯度混淆和前向不连续性。为了解决这个问题,我们提出了一种新的严格评估协议——混合替代攻击,该协议使用多个替代模型以及强大的AutoAttack,以确保公平和稳健的评估。大量实验表明,我们的轻量级即插即用模块只需极少的微调,就能有效增强各种现有最先进对抗训练模型的对抗鲁棒性。

英文摘要

Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple $\ell_2$ distance-based classifiers exhibit significantly greater robustness. We provide thorough theoretical and empirical analysis showing that while FC classifiers' high sensitivity makes them discriminative, it also makes them vulnerable. Conversely, $\ell_2$-classifiers' insensitivity grants robustness but limits performance. Motivated by this trade-off, we propose a novel $\ell_2$-reclassifier based on a Hybrid Prototype Mixing (HPM) framework. This method retains the discriminative power of FC classifiers while leveraging the robustness of $\ell_2$ distance. It yields $\ell_2$-distance-based predictions by fusing two prototype types: (1) stable, dataset-level prototypes updated via EMA, and (2) dynamic, batch-level prototypes generated from the FC classifier's predictions using a Straight-Through Estimator (STE). However, this dynamic, STE-based architecture introduces significant challenges for evaluation, such as gradient obfuscation and forward discontinuity. To address this, we propose a new, rigorous evaluation protocol, the Mixed Surrogate Attack (MSA), which uses multiple surrogates along with powerful AutoAttack to ensure a fair and robust assessment. Extensive experiments demonstrate that our lightweight, plug-and-play module, with minimal fine-tuning, effectively enhances the adversarial robustness of various existing SOTA adversarially trained models.
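The two ingredients named above, an $\ell_2$ distance-based prediction and EMA-updated dataset-level prototypes, can be sketched in a few lines. This is not the HPM framework itself: the function names, the momentum value, and the toy prototypes are assumptions, and the dynamic batch-level prototypes and the Straight-Through Estimator are left out.

```python
def ema_update(prototype, feature, momentum=0.9):
    """Move a class prototype slightly toward a new feature (exponential moving average)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]

def l2_predict(feature, prototypes):
    """Return the class whose prototype is nearest in squared l2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda c: sq_dist(feature, prototypes[c]))

# Hypothetical two-class example in a 2D feature space.
prototypes = {"cat": [0.0, 0.0], "dog": [4.0, 4.0]}
prototypes["cat"] = ema_update(prototypes["cat"], [1.0, 1.0])
predicted = l2_predict([1.0, 0.5], prototypes)
```

A small perturbation of the input feature changes the prediction only if it crosses the midpoint between prototypes, which gives some intuition for the insensitivity the abstract describes.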

2606.01734 2026-06-02 cs.CV cs.LG cs.RO 版本更新

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

FlatVPR: 用于基础模型特征流形几何校正的即插即用Geo-linear残差适配器

Rai Hisada, Kanji Tanaka

发表机构 * Fundamental Engineering for Knowledge-Based Society, Graduate School of Engineering, University of Fukui(知识社会基础工程,工程研究生院,福井大学)

AI总结 提出FlatVPR范式,通过可学习残差适配器和Pullback Flatness Loss抑制特征流形曲率,实现稀疏锚点下的线性插值重建,在NCLT数据集上显著提升视觉位置识别精度。

Comments 5 pages, 1 figure, technical report

详情
AI中文摘要

本文提出“FlatVPR”,一种新颖的几何校正范式,通过强制特征流形结构,使得两个相邻锚点 $\mathbf{z}_A$ 和 $\mathbf{z}_B$ 之间的任何描述符都可以通过线性插值 $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$(其中 $t \in [0,1]$ 表示相对位置)精确重建,从而有效平衡视觉位置识别(VPR)中地图轻量化和定位精度之间的权衡。尽管最先进的基础模型(如DINOv2-ViT-S/14)提供了鲁棒的语义特征,但其潜在流形表现出显著的曲率,将物理空间中的均匀线性运动投影到特征空间中高度非线性的轨迹上,这阻碍了稀疏锚点条件下的可靠重建。为了实现上述基于插值的重建,我们对原始基础特征 $\mathbf{z}$ 引入残差变换 $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$,其中 $\text{Res}(\cdot)$ 表示可学习的适配器。我们的方法通过数学上严谨的Pullback Flatness Loss显式抑制流形曲率,该损失最小化中间特征与连接相邻锚点的线性段之间的偏差,从而最小化流形的内在曲率。通过这种空间展平,地图构建被公式化为期望最大化(EM)框架,解耦为用于流形适应的连续M步和用于最优锚点选择准则的概念性E步。在NCLT数据集上的实验表明,即使在100米间隔的极端稀疏锚点和极端季节变化条件下,应用我们的适配器也能带来显著的性能提升。

英文摘要

This paper proposes "FlatVPR," a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100 m intervals and extreme seasonal changes.
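The interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$ and the idea of penalising deviation from the segment between anchors can be sketched as below. The abstract does not give the exact form of the Pullback Flatness Loss, so the squared $\ell_2$ deviation used here is an assumed stand-in, and the function names and toy descriptors are hypothetical.

```python
def interpolate(z_a, z_b, t):
    """Pseudo-descriptor on the segment between two anchors, with t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

def flatness_deviation(z_mid, z_a, z_b, t):
    """Squared l2 distance between an intermediate descriptor and its interpolated point."""
    z_hat = interpolate(z_a, z_b, t)
    return sum((m - h) ** 2 for m, h in zip(z_mid, z_hat))

# Toy 2D descriptors: anchors at the ends of a segment, queries at the midpoint t = 0.5.
z_a, z_b = [0.0, 0.0], [2.0, 0.0]
on_chord = flatness_deviation([1.0, 0.0], z_a, z_b, 0.5)   # 0.0, lies on the segment
off_chord = flatness_deviation([1.0, 0.5], z_a, z_b, 0.5)  # 0.25, bent away from it
```

A learnable residual adapter trained to reduce this kind of deviation would push intermediate descriptors toward the segment, which is what makes reconstruction from sparse anchors plausible.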

2606.01710 2026-06-02 cs.CV cs.LG 版本更新

Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs

零样本VLM中虚假相关性的密度感知转换

Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani

发表机构 * School of Computing and Information Systems, The University of Melbourne, Victoria, Australia(计算与信息系统学院,墨尔本大学,维多利亚,澳大利亚)

AI总结 提出密度感知转换(DAT)方法,利用局部几何密度项修正图像-文本相似度,以缓解CLIP等视觉语言模型在零样本分类中因虚假相关性导致的性能下降。

Comments ICML 2026

详情
AI中文摘要

视觉语言模型(如CLIP)实现了强大的零样本分类。然而,它们的预测仍然对虚假相关性敏感,即上下文线索主导语义内容。早期的解决方案通常依赖于微调或提示工程,这要么削弱了预训练模型的优势,要么容易产生幻觉。在这项工作中,我们提出了密度感知转换(DAT),它使用从组参考集导出的局部几何密度项来细化图像-文本相似度分数。我们的方法受到以下现象的启发:CLIP嵌入表现出模态间隙,并位于特征空间中的各向异性壳上:常见模式聚集在均值附近,而罕见模式被推向外围。这种几何结构产生了不均匀的对齐,其中虚假相关性被放大,而语义上有意义但罕见的线索被边缘化。为了解决这个问题,我们采用相对度量根据嵌入密度重新缩放相似度,抑制扩散区域中过度自信的分数,同时保留密集、语义一致的匹配。在基准数据集上的实验结果表明,最差组和平均准确率持续提高,突出了密度感知转换作为一种简单有效的校准机制,用于使用多模态模型进行可靠的零样本分类。

英文摘要

Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.

2606.01703 2026-06-02 cs.SD cs.AI cs.CV 版本更新

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

JenBridge: 跨场景转换的自适应长视频配乐

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

发表机构 * Jen Music AI

AI总结 提出JenBridge框架,通过基于Transformer的生成模型、双文本-视觉条件对齐和LLM代理驱动的自适应过渡机制,实现长视频配乐的高保真生成与场景转换自然连贯。

详情
AI中文摘要

我们解决了在场景转换中生成高保真、长格式配乐并保持连贯性的挑战。现有的AI音乐系统主要针对短片段设计,缺乏确保叙事连续性的机制。我们提出了JenBridge,一个模块化且可解释的自适应长视频配乐框架,确保高保真音频生成和转换自然性。核心架构是一个基于Transformer的生成模型,采用流匹配目标训练,遵循两阶段范式:在大规模文本-音频语料库上进行预训练以建立稳健的音乐先验,然后通过双文本-视觉条件适应视频领域以实现精确的跨模态对齐。关键的是,为了实现跨不同场景变化的长格式连贯性,JenBridge引入了一种新颖的自适应过渡机制。该系统具有一个多功能的过渡风格工具包,包括一种生成式过渡方法,并独特地采用了一个大型语言模型(LLM)代理,作为导演智能地为每个叙事转变选择最合适的过渡。为了严格评估这一任务,我们提出了LVS基准,这是一个新基准,包含一个精选数据集和新的评估指标,侧重于整体和过渡感知评估。在提出的基准上进行的大量实验表明,JenBridge在客观和主观指标上均显著优于现有方法,特别是在转换自然性和整体叙事连贯性方面。JenBridge代表了向全自动、专业质量的视频配乐迈出的重要一步。

英文摘要

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

2606.01701 2026-06-02 cs.CV 版本更新

Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding

时空相关性引导的几何划分用于多功能视频编码

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

发表机构 * Institute of Digital Media, Department of Electronics Engineering and Computer Science, Peking University(数字媒体研究所,电子工程与计算机科学系,北京大学) Information Technology R&D Innovation Center of Peking University(北京大学信息科技研发创新中心) Peng Cheng Laboratory(鹏城实验室) School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院)

AI总结 针对VVC中几何划分开销大的问题,提出时空相关性引导的几何划分(STGEO)方案,通过模式预测和运动候选选择减少边信息比特,提升编码效率。

详情
Journal ref
IEEE Transactions on Image Processing, vol. 31, pp. 30-42, 2022
AI中文摘要

几何划分因其在混合视频编码框架中卓越的运动场描述能力而受到越来越多的关注。然而,多功能视频编码(VVC)中现有的几何划分(GEO)方案给边信息的信令带来了不可忽视的负担,从而限制了编码效率。鉴于此,我们提出了一种时空相关性引导的几何划分(STGEO)方案,以有效描述视频编码运动场中的物体信息。所提方法可以节省用于边信息信令的比特,包括划分模式和运动信息。我们首先以统计合理的方式分析了划分模式决策和运动矢量选择的特性。基于观察到的时空相关性,我们设计了一种模式预测和编码方法,以减少表示上述边信息的开销。主要思想是预测具有较高选择可能性的STGEO模式和运动候选,这可以指导熵编码,即用更少的比特表示预测的高概率模式和运动候选。特别地,高概率STGEO模式基于边缘信息和相邻STGEO编码块的历史模式进行预测。相应的运动信息由合并候选列表中的索引表示,该索引基于离线训练的合并候选选择概率自适应地推断。仿真结果表明,与未使用GEO的VTM-8.0相比,所提方法在随机接入和低延迟B配置下平均分别节省了0.95%和1.98%的比特率。

英文摘要

Geometric partitioning has attracted increasing attention for its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) causes a non-negligible burden for signaling the side information. Consequently, the coding efficiency is limited. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method can economize the bits consumed for side information signaling, including the partitioning mode and motion information. We first analyze the characteristics of partitioning mode decision and motion vector selection in a statistically sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above-mentioned side information. The main idea is to predict the STGEO modes and motion candidates that have higher selection probabilities, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.

2606.01700 2026-06-02 cs.CV 版本更新

MixerSENet: A Lightweight Framework for Efficient Hyperspectral Image Classification

MixerSENet: 一种用于高效高光谱图像分类的轻量级框架

Mohammed Q. Alkhatib, Swalpa Kumar Roy, Ali Jamali

发表机构 * College of Engineering and IT, University of Dubai(迪拜大学工程与信息技术学院) Department of Computer Science and Engineering, Alipurduar Government Engineering and Management College(阿利普杜尔政府工程与管理学院计算机科学与工程系) Department of Geography, Simon Fraser University(西蒙·弗雷泽大学地理系)

AI总结 提出轻量级框架MixerSENet,通过解耦空间与通道维度混合并引入挤压激励模块,在保持低参数量的同时实现高光谱图像分类的高精度与高效率。

Comments Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

详情
AI中文摘要

本文提出了一种新颖的框架MixerSENet,用于高光谱图像(HSI)分类,旨在解决计算效率和有限标注数据带来的挑战。所提出的模型处理高光谱图像块,同时在整个网络中保持一致的尺寸和分辨率,有效解耦了空间和通道维度的混合。值得注意的是,MixerSENet轻量且计算高效,与传统模型相比所需参数更少,适用于资源受限环境。模型中嵌入了挤压激励块以细化特征提取,增强网络捕获更多信息特征的能力。在两个基准数据集上的实验结果表明,MixerSENet实现了优越的性能,在Houston13数据集上达到82.47%的总体精度(OA),在Qingyun数据集上达到96.70%,优于包括3D-CNN、HybridKAN、HSIFormer、SimPoolFormer和MorphMamba在内的最先进方法。此外,对计算效率的详细分析表明,MixerSENet在准确性和效率之间实现了良好的平衡,仅需53,146个参数和较低的推理时间,证实了其在实际应用中的实用性。发布时,源代码将在https://github.com/mqalkhatib/MixerSENet公开。

英文摘要

In this paper, a novel framework, MixerSENet, is introduced for hyperspectral image (HSI) classification, designed to address the challenges of computational efficiency and limited labeled data. The proposed model processes hyperspectral image patches while maintaining consistent size and resolution throughout the network, effectively decoupling the mixing of spatial and channel dimensions. Notably, MixerSENet is lightweight and computationally efficient, requiring fewer parameters compared to traditional models, making it suitable for resource-constrained environments. A squeeze and excitation block is incorporated into the model to refine feature extraction, enhancing the network's ability to capture more informative features. Experimental results on two benchmark datasets demonstrate that MixerSENet achieves superior performance, reaching an overall accuracy (OA) of 82.47% on the Houston13 dataset and 96.70% on the Qingyun dataset, outperforming state-of-the-art methods including 3D-CNN, HybridKAN, HSIFormer, SimPoolFormer, and MorphMamba. Furthermore, a detailed analysis of computational efficiency shows that MixerSENet achieves a favorable balance between accuracy and efficiency, with only 53,146 parameters and a low inference time, confirming its practicality for real-world applications. At publication, source code will be publicly available at https://github.com/mqalkhatib/MixerSENet.
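The squeeze and excitation block mentioned in the abstract can be sketched in a few lines. This is a minimal illustration of the generic technique, not the MixerSENet implementation: the function name `squeeze_excite`, the weight shapes, the omission of bias terms, and the all-zero demo weights are assumptions made here for a predictable example.

```python
import math

def squeeze_excite(x, w1, w2):
    """Apply a squeeze-and-excitation gate to a feature map.

    x  : list of C channels, each a flat list of spatial values
    w1 : R x C weight matrix of the reduction layer (hypothetical shape)
    w2 : C x R weight matrix of the expansion layer (hypothetical shape)
    Bias terms are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    s = [sum(ch) / len(ch) for ch in x]
    # Excitation, step 1: reduce to R hidden units with a ReLU.
    h = [max(0.0, sum(w * v for w, v in zip(row, s))) for row in w1]
    # Excitation, step 2: expand back to C gates in (0, 1) with a sigmoid.
    g = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Scale: reweight every channel by its gate.
    return [[g_c * v for v in ch] for g_c, ch in zip(g, x)]

x = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]   # 3 channels, 2 spatial positions
w1 = [[0.0, 0.0, 0.0]]                      # R = 1, all-zero weights for the demo
w2 = [[0.0], [0.0], [0.0]]
y = squeeze_excite(x, w1, w2)               # every gate is sigmoid(0) = 0.5
```

With trained weights the gates differ per channel, so informative spectral channels are amplified and the rest are damped, at a cost of only 2·C·R extra parameters.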

2606.01698 2026-06-02 cs.CV 版本更新

Learning Label-Efficient Interpretable Medical Image Diagnosis via Semi-supervised Hypergraph Concept Bottleneck Model

通过半监督超图概念瓶颈模型实现标签高效的医学图像可解释诊断

Yijun Yang, Ruiqiang Xiao, Lijie Hu, Angelica I Aviles-Rivero, Yunzhu Wu, Jing Qin, Lei Zhu

发表机构 * HKUST(GZ)(香港科技大学(广州)) Joy Future Academy(未来正义学院) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Tsinghua University(清华大学) Sichuan University(四川大学) PolyU(香港理工大学)

AI总结 提出一种半监督超图概念瓶颈模型,利用双层超图学习建模高阶概念依赖并生成领域自适应伪标签,在胎盘植入谱系等医学图像诊断中实现高可解释性和性能。

详情
AI中文摘要

深度学习在医学图像分析中取得了革命性进展,在多种应用中提供了卓越的诊断准确性。然而,其决策缺乏可解释性阻碍了临床采纳,特别是在高风险医疗场景中,透明度对可信度至关重要。例如,在胎盘植入谱系(PAS)中,超声图像中的细微线索挑战了可靠诊断,使得黑盒模型难以获得准确的评分信任。为了解决这一问题,概念瓶颈模型(CBM)通过将临床上有意义的中间概念嵌入诊断流程,提供了一种有前景的途径,使临床医生能够审查和优化模型输出。然而,传统的CBM在捕捉复杂的概念间依赖关系方面表现不佳,并且需要昂贵、专家驱动的概念注释,限制了其可扩展性。本研究引入了一种新颖的半监督CBM框架,专为医学成像设计,利用双层超图学习来建模高阶概念依赖并生成领域自适应伪标签。我们的方法通过集成概念级超图以增强推理和图像级超图以生成鲁棒的伪标签,实现了卓越的可解释性和性能。在新标注的PAS超声数据集和乳腺超声公共数据集上的实验证明了所提出的概念标签高效可解释框架的有效性。其通用性在皮肤镜图像数据集SkinCon上得到了进一步验证。代码可在https://github.com/scott-yjyang/HyperCBM获取。

英文摘要

Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at https://github.com/scott-yjyang/HyperCBM.

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

通过场景级一致性理解热视频中的身份连续性

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

发表机构 * Department of Electrical and Computer Engineering, Information Processing Lab, University of Washington, USA(电气与计算机工程系,信息处理实验室,华盛顿大学,美国)

AI总结 针对热行人多目标跟踪中身份碎片化问题,提出轻量级后处理方法,通过在线短间隙重映射和离线轨迹重链接恢复身份连续性,在PBVS热行人MOT基准上提升IDF1。

Comments Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419
AI中文摘要

热行人多目标跟踪仍然具有挑战性,因为弱外观线索和频繁的检测中断导致严重的轨迹碎片化。我们研究轻量级后处理是否可以在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。从YOLOv8和SORT基线开始,我们添加了一个模块化的身份修复后端,包括基于时间、空间、运动和边界线索的在线短间隙重映射和离线轨迹重链接。在固定验证集上的受控消融实验和在官方PBVS热行人MOT基准上的评估表明,主要身份增益来自保守的重链接,将IDF1从82.25提升到84.93,同时保持MOTA,而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明,在低信息热图像中,通过高精度轨迹重链接比增加跟踪器复杂性更能有效地实现鲁棒的身份恢复。这些结果提供了对热视频中身份恢复的受控分析,表明与局部帧到帧关联相比,场景级时空一致性在身份连续性中起主导作用。

英文摘要

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
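The offline tracklet relinking step can be illustrated with a small greedy matcher. This is a sketch under stated assumptions, not the authors' backend: it uses only the temporal and spatial cues (the motion and border cues from the abstract are left out), and the function name `relink`, the dict layout, and the thresholds `max_gap` and `max_dist` are hypothetical.

```python
import math

def relink(tracklets, max_gap=30, max_dist=50.0):
    """Greedily merge tracklet IDs across short gaps.

    Each tracklet is a dict with keys: id, start, end (frame numbers),
    first, last ((x, y) box centers at the first and last frame).
    Returns a mapping from each original id to its merged id.
    """
    order = sorted(tracklets, key=lambda t: t["start"])
    merged = {t["id"]: t["id"] for t in order}
    used_as_predecessor = set()
    for cur in order:
        best, best_dist = None, max_dist
        for prev in order:
            if prev["id"] == cur["id"] or prev["id"] in used_as_predecessor:
                continue
            gap = cur["start"] - prev["end"]
            if not 0 < gap <= max_gap:        # temporal cue: short forward gap only
                continue
            dist = math.dist(prev["last"], cur["first"])  # spatial cue
            if dist <= best_dist:
                best, best_dist = prev, dist
        if best is not None:
            # Conservative one-to-one linking: each tracklet has at most one successor.
            merged[cur["id"]] = merged[best["id"]]
            used_as_predecessor.add(best["id"])
    return merged

tracklets = [
    {"id": 1, "start": 0, "end": 40, "first": (90.0, 100.0), "last": (100.0, 100.0)},
    {"id": 2, "start": 45, "end": 90, "first": (104.0, 103.0), "last": (150.0, 100.0)},
    {"id": 3, "start": 50, "end": 80, "first": (400.0, 50.0), "last": (420.0, 50.0)},
]
id_map = relink(tracklets)
```

Tight thresholds keep the linker high-precision: tracklet 2 reappears 5 frames later and 5 pixels away from where tracklet 1 ended, so it inherits ID 1, while tracklet 3 overlaps in time and stays separate.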

2606.01689 2026-06-02 cs.CV cs.AI 版本更新

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

RPCASSM: 基于鲁棒主成分分析的状态空间模型用于红外小目标检测

Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou

发表机构 * College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University(教育部符号计算与知识工程重点实验室) College of Software, Jilin University(吉林大学软件学院) School of Geosciences, Yangtze University(长江大学地球科学学院) College of Communication Engineering, Jilin University(吉林大学通信工程学院)

AI总结 针对红外小目标检测中主流状态空间模型难以准确建模目标边缘的问题,提出基于鲁棒主成分分析(RPCA)的RPCASSM网络,通过设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)分别利用空间异质信号显著性和目标稀疏局部高亮特性进行状态空间建模,有效解决了边缘建模难题。

Comments 12 pages, 8 figures, under review

详情
AI中文摘要

红外小目标的检测与分割在监控安防、海上救援等领域具有重要的应用意义。由于这些目标在远距离成像中占据像素少,主流的视觉状态空间模型效率低下且难以准确建模目标边缘。现有的红外状态空间模型并未从红外小目标的结构特性出发偏离主流视觉状态空间结构框架。为了解决这一问题,本文基于鲁棒主成分分析(RPCA)的模型范式提出了RPCASSM网络,旨在通过红外小目标在空间域的性质设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)。BSSM旨在利用空间异质信号的显著性设计空间探测扫描机制(SPCM)来建模背景信息。TSSM利用目标的稀疏性和局部高亮特性设计可变形提示扫描机制(DPCM),聚焦于目标的可变形空间进行状态空间建模。通过上述设计,我们有效解决了现有主流视觉状态空间模型难以准确建模红外小目标边缘结构的问题。在现有基准数据集上的实验结果证明了RPCASSM设计的有效性。我们的代码将在https://github.com/PepperCS/RPCASSM公开。

英文摘要

The detection and segmentation of infrared small targets are important in applications such as surveillance and security and maritime rescue. Because these targets occupy very few pixels in long-distance imaging, mainstream visual state space models are inefficient and struggle to model target edges accurately. Existing infrared state space models stay within the mainstream visual state space framework rather than building on the structural properties of infrared small targets. To solve this problem, this paper proposes the RPCASSM network, based on the model paradigm of robust principal component analysis (RPCA), which designs a background state space module (BSSM) and a target state space module (TSSM) according to the properties of infrared small targets in the spatial domain. The BSSM uses the saliency of spatially heterogeneous signals to design a spatial probe scanning mechanism (SPCM) for modeling background information. The TSSM uses the sparsity and local highlight of the target to design a deformable prompt scanning mechanism (DPCM) that focuses state space modeling on the deformable space of the target. With this design, we effectively address the difficulty that existing mainstream visual state space models have in accurately modeling the edge structure of infrared small targets. Experimental results on existing benchmark datasets demonstrate the effectiveness of the RPCASSM design. Our code will be made public at https://github.com/PepperCS/RPCASSM.

2606.01652 2026-06-02 eess.SP cs.CV 版本更新

Physics-Aware Linearized ADMM and Its Unrolling

物理感知线性化ADMM及其展开

Satoshi Takabe, Shunta Arai, Tadashi Wadayama

发表机构 * Japan Science and Technology Agency (JST), CRONOS(日本科学技术振兴机构(JST)、CRONOS)

AI总结 针对基于PDE测量过程的逆问题,提出物理感知线性化ADMM算法,通过子问题线性化实现高效更新,并利用深度展开训练内部参数,在光纤通信压缩感知和噪声各向异性扩散图像恢复中验证有效性。

Comments 5 pages, 3 figures

详情
AI中文摘要

近年来,偏微分方程(PDE)已被用于直接建模信号处理中的测量过程,尽管其评估成本高昂。本文提出一种新颖的基于交替方向乘子法(ADMM)的算法,称为物理感知线性化ADMM(PA-LADMM),用于基于PDE测量过程的逆问题。关键思想是对包含PDE的子问题进行线性化,从而得到一种成本高效的更新规则,每次迭代仅需调用PDE求解器及其梯度评估。该算法在特定条件下具有理论收敛保证。此外,我们将其与深度展开相结合,展开PA-LADMM并使用监督数据训练其内部参数。两个不同的实验——光纤通信压缩感知和噪声各向异性扩散图像恢复——证明了所提算法的有效性。

英文摘要

Recently, partial differential equations (PDEs) have been used to directly model the measurement process in signal processing, although their evaluation is costly. In this paper, we propose a novel alternating direction method of multipliers (ADMM)-based algorithm called physics-aware linearized ADMM (PA-LADMM) for inverse problems from PDE-based measurement processes. The key idea is the linearization of the subproblem with PDEs, leading to a cost-efficient update rule that calls only a PDE solver and its gradient evaluation per iteration. The algorithm has a theoretical convergence guarantee under certain conditions. In addition, we combine it with deep unfolding to unroll the PA-LADMM and train its internal parameters using supervised data. Two distinct experiments, compressed sensing with optical fiber communication and image restoration from noisy anisotropic diffusion, demonstrated the effectiveness of the proposed algorithms.
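The linearization idea can be sketched on a toy sparse-recovery problem. This is a generic linearized ADMM under stated assumptions, not the PA-LADMM algorithm itself: the l1 regularizer, the splitting x = z, the step sizes `tau` and `rho`, and the callables `forward` and `vjp` (standing in for the PDE solver and its gradient evaluation) are all choices made here for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|, applied elementwise."""
    return [(abs(a) - t) * (1 if a > 0 else -1) if abs(a) > t else 0.0 for a in v]

def linearized_admm(y, forward, vjp, n, lam=0.5, rho=1.0, tau=0.5, iters=200):
    """Minimize 0.5*||y - forward(x)||^2 + lam*||z||_1 subject to x = z.

    forward(x)      : measurement model (a PDE solve in the paper's setting)
    vjp(x, residual): Jacobian-transpose times residual, i.e. the data-term gradient
    """
    x, z, u = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(iters):
        # One forward evaluation and one gradient evaluation per iteration.
        residual = [f - m for f, m in zip(forward(x), y)]
        grad = vjp(x, residual)
        # Closed-form minimizer of the x-subproblem after linearizing the data term.
        x = [(xi / tau - gi + rho * (zi - ui)) / (1.0 / tau + rho)
             for xi, gi, zi, ui in zip(x, grad, z, u)]
        # z-update: proximal step on the regularizer.
        z = soft_threshold([xi + ui for xi, ui in zip(x, u)], lam / rho)
        # Scaled dual update.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z

# Toy stand-in for the measurement process: the identity map.
y = [3.0, 0.1, -2.0]
z_hat = linearized_admm(y, forward=lambda x: list(x), vjp=lambda x, r: list(r), n=3)
```

For the identity map the exact minimizer is the soft-thresholded measurement, [2.5, 0, -1.5], which gives a quick sanity check. In a deep-unfolded variant, quantities such as `tau`, `rho`, and `lam` would become per-iteration trainable parameters.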

2606.01651 2026-06-02 cs.CV 版本更新

Restoring Initial Noise Sensitivity in Text-to-Image Distillation via Geometric Alignment

通过几何对齐恢复文本到图像蒸馏中的初始噪声敏感性

Huayang Huang, Ruoyu Wang, Jinhui Zhao, Wei Deng, Daiguo Zhou, Jian Luan, Yu Wu, Ye Zhu

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 提出几何感知蒸馏(GAD)框架,通过匹配雅可比-向量积来对齐教师和学生模型的局部功能行为,从而恢复文本到图像蒸馏中丢失的初始噪声敏感性,提升下游噪声驱动控制任务的性能。

Comments ICML 2026

详情
AI中文摘要

生成式蒸馏通过将多步轨迹压缩为少步学生模型,在保持感知质量的同时显著加速文本到图像(T2I)生成。然而,现有方法主要优化效率和输出保真度,往往忽略了原始轨迹的关键属性。在这项工作中,我们识别出一个缺失的关键属性:对初始噪声的敏感性,其退化会损害依赖噪声优化和操作的下游控制方法。我们将此问题追溯到标准的蒸馏目标,这些目标强制逐点输出对齐,无意中压平了输入-输出景观并抑制了教师的局部几何结构。为了解决这个问题,我们提出了几何感知蒸馏(GAD),一种保持敏感性的框架,用于对齐教师和学生模型的局部功能行为。具体而言,GAD匹配关于输入噪声的雅可比-向量积,使学生能够再现教师对扰动的微分响应。在多个T2I范式和噪声驱动控制任务上的大量实验表明,GAD显著恢复了敏感性并提高了多样性,同时保持了高视觉保真度。代码可在 https://github.com/Hannah1102/GAD 获取。

英文摘要

Generative distillation significantly accelerates text-to-image (T2I) generation by compressing multi-step trajectories into few-step student models while preserving perceptual quality. However, existing methods primarily optimize efficiency and output fidelity, often neglecting critical properties of the original trajectory. In this work, we identify a key missing property: sensitivity to initial noise, whose degradation impairs downstream control methods relying on noise-based optimization and manipulation. We trace this issue to standard distillation objectives that enforce pointwise output alignment, inadvertently flattening the input-output landscape and suppressing the teacher's local geometric structure. To address this, we propose Geometry-Aware Distillation (GAD), a sensitivity-preserving framework that aligns the local functional behavior of teacher and student models. Specifically, GAD matches Jacobian-vector products with respect to input noise, enabling the student to reproduce the teacher's differential response to perturbations. Extensive experiments across multiple T2I paradigms and noise-driven control tasks demonstrate that GAD significantly restores sensitivity and improves diversity while maintaining high visual fidelity. Code is available at https://github.com/Hannah1102/GAD.
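A toy example shows why pointwise output alignment can miss noise sensitivity while a Jacobian-vector product (JVP) term detects it. This is an illustrative sketch, not the GAD objective: `teacher` and `flat_student` are hypothetical two-dimensional stand-ins for generators, and the JVP is estimated by central finite differences where a real implementation would use automatic differentiation.

```python
def jvp_fd(f, z, v, eps=1e-4):
    """Central finite-difference estimate of the Jacobian-vector product J_f(z) v."""
    plus = f([zi + eps * vi for zi, vi in zip(z, v)])
    minus = f([zi - eps * vi for zi, vi in zip(z, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def teacher(z):
    """Hypothetical stand-in for a multi-step generator that reacts to its noise input."""
    return [z[0] ** 2, z[0] * z[1]]

def flat_student(z):
    """Hypothetical student that matches the teacher at z0 but ignores the noise."""
    return [1.0, 2.0]

z0, v = [1.0, 2.0], [1.0, 0.0]          # initial noise and a perturbation direction
output_loss = sq_dist(teacher(z0), flat_student(z0))
jvp_loss = sq_dist(jvp_fd(teacher, z0, v), jvp_fd(flat_student, z0, v))
```

Here `output_loss` is zero, so a pointwise objective sees a perfect student, while `jvp_loss` is about 8 because the teacher responds to the perturbation with [2, 2] and the student does not respond at all.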

2606.01643 2026-06-02 cs.CV 版本更新

Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument

手语生成中的条件坍塌:诊断与缩放论证

Rui Hong, Jana Košecká

发表机构 * George Mason University(乔治·梅森大学)

AI总结 本文通过提出三个独立评估层级(初始姿态条件、输出多样性、目标忠实度)并利用冻结运动自编码器的潜在表示计算成对距离比,诊断手语生成模型中的条件坍塌问题,并论证句子级配对数据集规模是瓶颈。

详情
AI中文摘要

手语生成(SLP)是从自然语言文本生成虚拟人物手语动作的任务。生成动作的质量通常通过运动空间弗雷歇距离(FID)和反向翻译(BT)BLEU分数在How2Sign等基准上进行评估。这两个指标可能大幅提升,而底层生成器未能忠实表示手语手势。在这项工作中,我们提出在三个独立层级上评估生成的动作:(τ1)初始姿态条件,(τ2)输出多样性,以及(τ3)目标忠实度。我们使用冻结运动自编码器(MoAE)的潜在表示计算这些成对距离比。我们在How2Sign数据集上评估了14个SLP模型检查点,包括重新实现的Neural Sign Actors(NSA),并表明τ3忠实度从未达到,而FID变化近两个数量级且与忠实度不相关。我们表明,在孤立词汇数据集ASL3DWord上可以达到有利的τ3,因此将句子级配对数据集的大小确定为瓶颈。

英文摘要

Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: ($\tau1$) initial-pose conditioning, ($\tau2$) output diversity, and ($\tau3$) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that $\tau3$ faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We show that on the isolated gloss dataset ASL3DWord favorable $\tau3$ can be attained, hence isolating the size of the sentence-level paired-dataset as the bottleneck.

2606.01641 2026-06-02 cs.CV 版本更新

Edge-directed geometric partitioning for versatile video coding

面向多功能视频编码的边缘导向几何划分

Xuewei Meng, Xinfeng Zhang, Chuanmin Jia, Xia Li, Shanshe Wang, Siwei Ma

AI总结 针对VVC标准,提出基于时空边缘信息构建最可能模式列表的几何划分模式预测策略,以降低索引开销并提升编码效率,平均BD-rate增益0.58%-1.00%。

Comments This paper has been published in IEEE ICME

详情
Journal ref
IEEE International Conference on Multimedia and Expo (ICME), 2020, pp. 1-6
AI中文摘要

为了提升编码性能,针对即将到来的VVC标准提出了几何划分(GEO)。GEO提供140个划分候选。最优GEO模式的索引需要显式地信令。考虑到不同CU的结构特性以及空间相邻块与时序同位块之间的相关性,我们提出了一种GEO模式预测策略,通过构建最可能模式(MPM)列表来减少GEO索引的开销并提高编码效率。基于划分模式与物体边界高度相关的观察,提出了一种边缘导向的几何划分方案,根据时空边缘信息构建MPM列表。与VTM-6.0相比,所提方法在RA和LDB配置下平均提供了0.58%和1.00%的客观BD-rate增益。此外,它还提升了物体边界的视觉质量。

英文摘要

To improve the coding performance, geometric partition (GEO) was proposed for the upcoming VVC standard. GEO provides 140 partition candidates. The index of optimal GEO mode needs to be signaled explicitly. Considering different structural characteristics of different CUs and the correlation between spatial adjacent blocks and temporal collocated blocks, we propose a GEO mode prediction strategy by constructing a Most Probable Mode (MPM) list to reduce the overhead of GEO index and improve coding efficiency. Based on the observation of the high correlation between the partition mode and object boundaries, an edge-directed geometric partition scheme is proposed to construct the MPM list according to spatio-temporal edge information. The proposed method provides an objective BD-rate gain of 0.58% and 1.00% on average for RA and LDB configurations compared to VTM-6.0. Besides, it also promotes the visual quality of object boundaries.
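The overhead argument can be made concrete with a simplified bit count. This is a toy model, not the VTM binarization: it assumes a 1-bit MPM flag followed by a fixed-length index, and the list size of 4 and the 75% hit rate are hypothetical values chosen for illustration. Only the 140 candidates come from the abstract.

```python
import math

def geo_index_bits(in_mpm, mpm_size=4, num_modes=140):
    """Bits to signal one GEO mode: a 1-bit MPM flag plus a fixed-length index."""
    remaining = mpm_size if in_mpm else num_modes - mpm_size
    return 1 + math.ceil(math.log2(remaining))

def expected_bits(hit_rate, mpm_size=4, num_modes=140):
    """Average bits per block when the MPM list holds the chosen mode with probability hit_rate."""
    return (hit_rate * geo_index_bits(True, mpm_size, num_modes)
            + (1 - hit_rate) * geo_index_bits(False, mpm_size, num_modes))

baseline = math.ceil(math.log2(140))   # 8 bits without any prediction
with_mpm = expected_bits(0.75)         # 0.75 * 3 + 0.25 * 9 = 4.5 bits
```

Under this model a hit costs 3 bits and a miss costs 9, so the list pays off once the hit rate exceeds 1/6. An actual codec entropy-codes these symbols, so the real numbers differ, but the trade-off has the same shape.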

2606.01638 2026-06-02 cs.CV 版本更新

CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation

CanonCGT:基于参考的颜色分级通过规范枢轴表示

Jinwon Ko, Keunsoo Ko, Chang-Su Kim

发表机构 * Korea University(高丽大学) The Catholic University of Korea(韩国天主教大学)

AI总结 提出一种基于规范枢轴的两阶段框架CanonCGT,通过去除内在色调偏差并匹配参考风格,实现稳定、真实的颜色分级。

Comments CVPR 2026 accepted

详情
AI中文摘要

基于参考的颜色分级旨在再现参考图像的色调和光照,同时保持色彩和谐与场景结构。现有的逼真和基于滤镜的方法通常产生不稳定的色调映射(过度偏移或不一致地保留颜色),导致不自然的结果。我们提出CanonCGT,一个基于规范枢轴的两阶段框架,规范枢轴是一种风格中立的中间表示,用于稳定的颜色映射。第一阶段通过去除内在色调偏差来规范化输入,第二阶段对其进行颜色分级以匹配参考风格。一种双阶段训练方案DP-CGT结合了监督预设学习和非配对照片上的自监督细化。CanonCGT在多种数据集上产生逼真且色调一致的结果,在稳定性和视觉保真度上超越了最先进的方法。我们的代码可在https://github.com/Jinwon-Ko/CanonCGT获取。

英文摘要

Reference-based color grading aims to reproduce the tonal mood and lighting of a reference while preserving color harmony and scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings -- over-shifting or inconsistently retaining colors -- leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot -- a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity. Our code is available at https://github.com/Jinwon-Ko/CanonCGT.

2606.01636 2026-06-02 cs.CV 版本更新

Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition

Pave-GRPO:通过原则性平均速度分解超越瞬时引导

Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu, Zihan Zhang, Yi Jin, Huaian Chen, Yuhang Zang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学) Harbin Institute of Technology(哈尔滨工业大学) Beihang University(北京航空航天大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出Pave-GRPO方法,通过原则性平均速度分解将粗粒度过渡分解为细粒度子轨迹,在不增加生成成本的情况下将奖励反馈传播到更多中间步骤,实现更全面的偏好对齐。

Comments 8 pages, 5 figures

详情
AI中文摘要

通过群体相对策略优化(GRPO)的后训练已成为将基于流的生成模型与人类偏好对齐的强大范式。然而,流模型的迭代去噪性质在生成用于策略梯度更新的群体展开时会产生巨大成本,迫使现有方法使用极少的去噪步骤进行训练。这种时间稀疏性严重限制了偏好优化:奖励反馈只能到达每个轨迹的少数阶段,使得绝大多数中间去噪步骤缺乏直接监督,从而损害了对齐的粒度。为了解决这个问题,我们提出了Pave-GRPO,它通过原则性平均速度分解重新表述了GRPO目标。我们不生成昂贵的高步数展开,而是保持高效的少步数群体采样,但将每个粗粒度转换分解为跨越多个中间时间步的等效细粒度子轨迹集合。这将奖励反馈传播到更密集的时间阶段集,从而实现更全面的偏好对齐,而无需额外的生成成本。这种设计有两个好处:(i)零成本视野扩展:通过直接重用分段群体样本及其相关奖励,Pave-GRPO在固定采样预算下显著拓宽了有效优化范围;(ii)全面的时间监督:通过将瞬时速度目标等效分解为多时间步集合,它将奖励信号分布到去噪过程的更多中间阶段,从而实现更细粒度、更彻底的偏好优化。大量实验验证了Pave-GRPO在不同奖励设置下有效推进了偏好对齐,提供了全面的性能提升。

英文摘要

Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.

2606.01620 2026-06-02 cs.CV 版本更新

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

基于参考引导深度压缩VAE的流式说话人肖像视频实时生成

Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo

发表机构 * Microsoft Research(微软研究院) Microsoft AI(微软人工智能)

AI总结 提出一种结合因果视频VAE和自回归潜在去噪模型的流式说话人肖像视频生成框架,通过参考图像引导实现实时高质量生成。

Comments CVPR 2026 (Highlight) Camera ready

详情
AI中文摘要

视频扩散模型显著推动了肖像视频生成的发展,但其高计算需求限制了在交互式应用中的使用。本文提出一个框架,用于生成以语音音频和参考图像为条件的可流式说话人肖像视频。该框架专为流式场景精心设计,包含一个用于深度潜在压缩的因果视频VAE和一个自回归潜在去噪模型。我们的因果VAE集成了可变数量的参考图像作为引导,使网络能够专注于动态信息而非静态外观,从而提升压缩效率和重建质量。此外,我们扩展了残差自编码范式,以改善VAE中的时空因果处理。生成器基于Rectified Flow Transformer架构,并以块状自回归方式生成视频潜在表示。我们的方法能够实时生成高质量的说话人肖像视频,速度显著快于基线模型。此外,综合实验表明,在逼真度、生动性和视频质量方面,该方法与这些大型模型相当甚至更优。

英文摘要

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

2606.01615 2026-06-02 cs.CV cs.MM 版本更新

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

图灵模式用于多媒体:反应-扩散多模态融合用于语言引导的视频时刻检索

Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua

发表机构 * Nanyang Technological University(南洋理工大学) National University of Singapore(新加坡国立大学)

AI总结 提出基于反应-扩散过程的多模态融合框架RDMF,通过模拟生物模式形成机制实现视频与文本的动态对齐,用于视频时刻检索与高亮检测。

Comments Published in ACM MM 2025. Address some typos

详情
AI中文摘要

视频-语言模型对于时刻检索和高亮检测等任务至关重要,但它们通常难以捕捉时间视频序列与文本语义之间的动态、非线性交互。现有方法依赖静态交叉注意力或提示调优机制,无法自适应地建模模态间的演化关系,导致对齐次优和泛化受限。受系统生物学启发,我们提出反应-扩散多模态融合(RDMF),这是一个新颖的框架,将视频-语言对齐重新构想为反应-扩散(RD)过程,借鉴了Alan Turing引入的模式形成原理。在RDMF中,视频特征随时间扩散以捕捉时间上下文,而文本-视频交互被建模为非线性反应,放大相关特征并抑制噪声,形成类似于生物系统的涌现模式。利用Gray-Scott RD模型,我们设计了一个计算高效的融合模块,集成视频和文本表示,并通过图灵不稳定性准则对稳定性和收敛性进行严格的数学分析。我们的框架具有理论依据,采用先进的数学工具确保稳定的模式形成,并且实际可行,集成了标准组件如预训练编码器和DETR风格的头用于时刻检索和显著性预测。RDMF代表了一种开创性的跨学科方法,桥接了系统生物学和多媒体研究,以解决传统多模态融合的局限性。初步实验表明,它在识别显著视频时刻方面具有超越现有方法的潜力,为视频-语言任务提供了新的范式。

英文摘要

Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.
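For readers unfamiliar with the Gray-Scott model named in the abstract, one explicit time step of the classic two-species system is sketched below. This shows the textbook model only: how RDMF maps video and text features onto the two fields is not stated in the abstract, and the 1D grid, periodic boundaries, and parameter values (`Du`, `Dv`, `F`, `k`, `dt`) are illustrative assumptions.

```python
def gray_scott_step(u, v, Du=0.16, Dv=0.08, F=0.035, k=0.060, dt=1.0):
    """One explicit Euler step of the 1D Gray-Scott model with periodic boundaries.

    du/dt = Du * lap(u) - u*v^2 + F*(1 - u)
    dv/dt = Dv * lap(v) + u*v^2 - (F + k)*v
    """
    n = len(u)

    def lap(a, i):
        # Discrete Laplacian with unit grid spacing.
        return a[(i - 1) % n] + a[(i + 1) % n] - 2.0 * a[i]

    new_u, new_v = [], []
    for i in range(n):
        reaction = u[i] * v[i] * v[i]   # non-linear term: v grows by consuming u
        new_u.append(u[i] + dt * (Du * lap(u, i) - reaction + F * (1.0 - u[i])))
        new_v.append(v[i] + dt * (Dv * lap(v, i) + reaction - (F + k) * v[i]))
    return new_u, new_v

# Uniform background (u = 1, v = 0) with a small perturbed patch in the middle.
u = [1.0] * 64
v = [0.0] * 64
for i in range(28, 36):
    u[i], v[i] = 0.5, 0.25
u1, v1 = gray_scott_step(u, v)
```

The diffusion terms spread each field to its neighbours, while the `u * v * v` term amplifies regions where both are present. The uniform state u = 1, v = 0 is a fixed point, so structure only appears around perturbations.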

2606.01612 2026-06-02 cs.CV cs.LG 版本更新

Self-Improving Small Object Grounding in LVLMs

LVLMs中的自改进小目标定位

Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu, Jin Sun

发表机构 * University of Georgia(佐治亚大学)

AI总结 利用LVLMs内部注意力模式,通过轻量级IoU回归器或无需训练的注意力熵选择器,从多个候选框中选出最佳框,实现小目标定位的自改进。

Comments 29 Pages, 15 Figures

详情
AI中文摘要

大型视觉语言模型(LVLMs)中的内部注意力模式能否在无需微调的情况下识别可靠的小目标框?在这项工作中,我们给出了肯定的答案。LVLMs中的注意力结构编码了定位质量——一个仅基于注意力图训练的轻量级IoU回归器实现了强IoU预测(Pearson r > 0.67)。该回归器驱动了我们基于注意力的候选选择(ACS)框架的回归器变体,称为ACS-Learned,它从多个采样候选中选择最佳框以改进目标定位。通过分析回归器学习的内容,我们揭示了哪些Transformer层和头最为关键,并推导出ACS-Free:一个无需训练的选择器,它根据这些判别性头上的注意力熵对候选进行排序,推理时无需任何学习组件。在COCO和Objects365上的实验表明,小目标定位的自改进高达19%,其中ACS-Free在所有无需训练的方法中排名最佳,表明有用的注意力结构提高了LVLMs中定位的可靠性和可解释性。

英文摘要

Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, demonstrating that useful attention structure improves both localization reliability and interpretability in LVLMs.
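Ranking candidates by attention entropy can be sketched as follows. This is a hedged illustration rather than ACS-Free itself: the abstract does not say which direction of entropy is preferred or how heads are aggregated, so the sketch assumes that lower entropy (more concentrated attention) marks a better box, and the names `attention_entropy`, `select_candidate`, and the sample weights are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of attention weights after normalizing them to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def select_candidate(candidates):
    """Return the name of the candidate box with the most concentrated attention.

    candidates maps a box name to the attention weights collected for that box.
    Assumption made for this sketch: lower entropy ranks higher.
    """
    return min(candidates, key=lambda name: attention_entropy(candidates[name]))

candidates = {
    "box_a": [0.25, 0.25, 0.25, 0.25],   # diffuse attention over four patches
    "box_b": [0.85, 0.05, 0.05, 0.05],   # attention concentrated on one patch
}
best = select_candidate(candidates)
```

Because the score is computed directly from attention maps, the selector needs no training and adds only a normalization and a sum per candidate at inference.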

2606.01608 2026-06-02 cs.CV 版本更新

Exploiting Semantic and Pixel Representations for Ultra-Low Bitrate Image Compression

利用语义和像素表示进行超低比特率图像压缩

Hao Wei, Yanhui Zhou, Chenyang Ge, Saeed Anwar, Ajmal Mian

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University(人机混合增强智能国家重点实验室,人工智能与机器人研究院,西安交通大学) School of Information and Telecommunication, Xi’an Jiaotong University(信息与电信学院,西安交通大学) Department of Computer Science and Software Engineering, The University of Western Australia(计算机科学与软件工程系,西澳大学)

AI总结 提出SPRDiff扩散压缩方法,通过三重编码器架构和失真感知重建模块,在超低比特率下同时保持语义一致性和像素级保真度,实现率-失真-感知权衡最优。

详情
AI中文摘要

大多数现有的极端压缩方法未能实现最优的率-失真-感知权衡,因为它们通常优先考虑感知保真度和视觉真实性而非像素级精度。因此,重建结果往往与原始图像有明显偏差。超低比特率图像压缩因此至关重要——不仅要产生极其紧凑的表示,还要确保重建图像在语义上与源图像保持一致,并在像素级忠实于源图像。为此,我们提出了SPRDiff,一种基于扩散的压缩方法,充分利用语义和像素表示,从而在超低比特率约束下增强重建保真度。具体来说,我们开发了一个三重编码器架构,利用预训练的面向失真和面向语义编码器的高保真特征来补偿冻结的VAE编码器提取的有限表示,从而改善潜在压缩和熵建模。为了进一步提高扩散模型的重建保真度,我们引入了一个具有双特征提取的失真感知重建模块。该模块不仅生成保留主要结构的粗略重建,还提供实用且准确的语义级和像素级条件信号来指导扩散模型。在基准数据集上的大量实验表明,我们的方法在极低比特率(低于0.03 bpp)下在率-失真-感知权衡方面优于最先进的方法,有效保持了重建图像中的感知质量和像素级保真度。我们将在https://github.com/cshw2021/SPRDiff发布源代码和训练模型。

英文摘要

Most existing extreme compression methods fail to achieve an optimal rate-distortion-perception trade-off, as they typically prioritize perceptual fidelity and visual realism over pixel-level accuracy. Consequently, the resulting reconstructions often deviate noticeably from the originals. Ultra-low bitrate image compression is therefore crucial, not only for producing extremely compact representations but also for ensuring that reconstructed images remain semantically coherent and faithful to the source at the pixel level. To this end, we propose SPRDiff, a diffusion-based compression method that fully leverages both semantic and pixel representations, thereby enhancing reconstruction fidelity under ultra-low bitrate constraints. Specifically, we develop a triple-encoder architecture that utilizes high-fidelity features from the pretrained distortion-oriented and semantic-oriented encoders to compensate for the limited representations extracted by the frozen VAE encoder, thereby improving latent compression and entropy modeling. To further enhance the reconstruction fidelity of diffusion models, we introduce a distortion-aware reconstruction module with dual feature extraction. This module not only generates a coarse reconstruction that preserves the main structures, but also provides practical and accurate semantic- and pixel-level conditional signals to guide the diffusion model. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in the rate-distortion-perception trade-off at extremely low bitrates (below 0.03 bpp), effectively preserving both perceptual quality and pixel-wise fidelity in the reconstructed images. We will release the source code and trained models at https://github.com/cshw2021/SPRDiff.

2606.01604 2026-06-02 cs.CV 版本更新

Paving the Way for Point Cloud Video Representation Learning Using A PDE Model

使用PDE模型为点云视频表示学习铺平道路

Zhuoxu Huang, Zhenkun Fan, Jungong Han, Josef Kittler

发表机构 * Department of Computer Science, Aberystwyth University(阿伯里斯特威斯大学计算机科学系) Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University(自动化系、北京信息科学与技术国家研究中心、清华大学) Department of Electrical Engineering, Surrey University(萨里大学电子工程系)

AI总结 提出MotionPDE方法,通过将时空相关性学习建模为可解的偏微分方程(PDE),并利用对比学习结构优化,作为即插即用模块提升点云视频表示学习性能。

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) in 2026

详情
AI中文摘要

研究时空相关性,特别是空间点随时间的变化,对于理解点云视频至关重要。传统方法,尤其是基于流的技术,由于顺序点云数据的无序空间排列,难以处理这些相关性。为了解决这一挑战,我们提出了一种新方法,通过将问题建模为可解的偏微分方程(PDE)来正则化时空相关性学习。虽然PDE在物理领域长期有效,但其在点云视频等新型序列数据上的应用仍未充分探索。受流体分析启发,我们构建了一个简化的PDE,并通过时间嵌入和空间嵌入之间的对比学习结构来指导和优化PDE的求解过程。借助这种额外的监督,我们的方法MotionPDE作为现有骨干模型的有效、即插即用的增强模块,仅增加极少的计算开销和参数。利用对比学习过程,我们进一步挖掘了MotionPDE的自监督能力,取得了有希望的结果,突显了其在点云视频数据解释中的实用性和适应性。带有训练检查点的代码仓库将在https://github.com/zhh6425/motionpde.git提供,以促进未来研究。

英文摘要

Investigating spatial-temporal correlations, specifically how spatial points vary over time, is crucial for understanding point cloud videos. Traditional methods, particularly flow-based techniques, struggle with these correlations due to the unordered spatial arrangement of sequential point cloud data. To address this challenge, we propose a novel approach that regularizes spatial-temporal correlation learning by formulating the problem as a solvable Partial Differential Equation (PDE). While PDEs have long been effective in the physical domain, their application to novel sequential data like point cloud video remains underexplored. Inspired by fluid analysis, we construct a simplified PDE, and the process of solving the PDE is guided and refined by a contrastive learning structure between the temporal embeddings and the spatial embeddings. With this extra supervision, our method, named MotionPDE, serves as an effective, plug-and-play enhancement module for existing backbone models, adding minimal computational overhead and parameters. Capitalizing on the contrastive learning process, we delve deeper into the self-supervised capabilities of MotionPDE, yielding promising results that underscore its utility and adaptability in point cloud video data interpretation. The code repo with trained checkpoints will be available at https://github.com/zhh6425/motionpde.git to facilitate future research.

2606.01601 2026-06-02 cs.CV 版本更新

EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers

EIVE: 面向检测Transformer的端到端实例特定视觉解释

Jianlin Xiang, Yanshan Li, Linhui Dai

发表机构 * Institute of Intelligent Information Processing, Shenzhen University(智能信息处理研究院,深圳大学) Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University(广东省智能信息处理重点实验室,深圳大学) Shenzhen Key Laboratory of Modern Communications and Information Processing, Shenzhen University(深圳现代通信与信息处理重点实验室,深圳大学)

AI总结 提出EIVE框架,通过重新公式化解码器交叉注意力为实例级特征归因路径,直接生成实例级显著性图,无需梯度计算或输入扰动,高效解释DETR类检测器。

Comments 17 pages, 11 figures

详情
AI中文摘要

由于目标检测的多实例特性,其视觉可解释性仍然具有挑战性。现有方法主要采用事后范式(如基于梯度或扰动的解释方法)来解释预训练检测器。然而,这些方法需要额外的梯度计算或重复模型推理,导致效率有限。为解决此问题,我们提出了一种端到端实例特定视觉解释框架(EIVE),该框架在Detection Transformer(DETR)类模型的前向传播后直接生成实例级显著性图。具体而言,我们将解码器中的交叉注意力机制重新公式化为实例级特征归因路径,使得每个目标查询的交叉注意力对应于其预测实例的视觉归因。基于此公式,我们设计了一个跨层混合共识融合(CLHCF)模块,聚合解码器各层的交叉注意力信号,生成稳定且紧凑的解释。EIVE的解释过程既不需要梯度计算也不需要输入扰动,具有高计算效率,并适用于单尺度和多尺度的DETR类目标检测器。最后,我们提出了一种注意力感知联合训练策略(AAJTS)作为面向训练的应用,该策略对交叉注意力模式施加空间约束,以鼓励稳定且集中的归因表示,从而提高可解释性和检测性能。在MS COCO 2017、ExDark和Cityscapes上的实验表明,EIVE生成高质量的实例级显著性图,在标准指标上达到与最先进事后方法相当或更好的性能,同时显著提高了解释效率。代码可在https://github.com/xjlDestiny/EIVE.git获取。

英文摘要

Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.

2606.01600 2026-06-02 cs.CV cs.CL cs.RO 版本更新

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

RoboTrustBench:机器人操作视频世界模型的可信度基准测试

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu

发表机构 * Singapore Management University(新加坡管理大学) Fudan University(复旦大学) Princeton University(普林斯顿大学)

AI总结 针对视频世界模型在机器人操作中的可信度问题,提出RoboTrustBench基准,包含正常、约束敏感、反事实和对抗四种场景,通过专家验证的指令-图像对和六维评估协议,发现当前模型在约束推理、反事实基础、物理交互和不安全指令抑制方面存在不足。

Comments Project: https://huiqiongli.github.io/RoboTrustBench/

详情
AI中文摘要

视频世界模型越来越多地用于机器人操作,然而现有基准大多在有效、可行和安全的指令下评估它们。我们引入了RoboTrustBench,一个用于评估视频世界模型在四种场景下可信度的基准:正常、约束敏感、反事实和对抗。基于真实世界的DROID片段构建,RoboTrustBench包含1,207个专家验证的指令-图像对和一个六维评估协议,包含13个细粒度标准。通过人类和MLLM评估七个代表性的视频世界模型,我们发现当前模型通常生成视觉上连贯的视频,但在约束推理、反事实基础、物理交互和不安全指令抑制方面存在困难。这些结果表明,视觉质量和表面级别的指令遵循不足以实现可信赖的机器人视频世界建模。

英文摘要

Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.

2606.01591 2026-06-02 cs.CV cs.LG 版本更新

TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

TLG: 通过源标注重建和类别目标推理实现视频问答的时间逻辑基础

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出TLG三层系统,通过重建动作时间线、解析问题为时间逻辑程序并确定性执行,结合强视觉语言模型和前沿推理模型,将视频问答准确率从46.9%提升至71.37%。

详情
AI中文摘要

TimeLogic挑战评估对视频的形式时间逻辑推理,涵盖16个算子(之前、之后、直到、自从、总是、共现、排序等),采用布尔和四选一形式。端到端视频语言模型在此任务上接近随机水平,因为它们将视频视为帧的集合,无法定位动作发生的时间。我们提出TLG(时间逻辑基础),一个三层系统:(i)从生成基准测试的公共源数据集标注中重建每个视频的动作时间线,将每个问题解析为时间逻辑程序,并确定性执行;(ii)在没有标注的情况下回退到强大的开放视觉语言模型;(iii)仅将视觉语言模型经验上最弱的问题类别路由到前沿推理模型。TLG将测试准确率从46.9%的视觉语言模型基线提升到71.37%,绝对增益+24.5,距排行榜榜首不到3分。我们报告了广泛的消融实验,包括三种基于模型的时间线重建变体,它们都低于整体视觉语言模型,将时间基础隔离为不可约的瓶颈,并表明驱动准确率的是真实标注,而非更大的模型。

英文摘要

The TimeLogic Challenge evaluates formal temporal-logic reasoning over video - 16 operators (before, after, until, since, always, co-occur, ordering, ...) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, a +24.5 absolute gain, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations - not larger models - drive accuracy.
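
The first tier, deterministic execution of a parsed question over an annotated action timeline, can be sketched as follows. The `Interval` type, the three operators shown, their exact semantics ("some occurrence" matching, touching intervals counting as "before"), and the tuple-shaped program are hypothetical choices made for illustration. The abstract lists 16 operators without defining them.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """One annotated action occurrence (hypothetical timeline format)."""
    action: str
    start: float
    end: float


def occurrences(timeline, action):
    return [iv for iv in timeline if iv.action == action]


def before(timeline, a, b):
    """True if some occurrence of `a` ends no later than some occurrence of `b` starts."""
    return any(x.end <= y.start
               for x in occurrences(timeline, a)
               for y in occurrences(timeline, b))


def after(timeline, a, b):
    """True if `a` happens after `b`, defined here as before(b, a)."""
    return before(timeline, b, a)


def co_occur(timeline, a, b):
    """True if some occurrence of `a` overlaps in time with some occurrence of `b`."""
    return any(x.start < y.end and y.start < x.end
               for x in occurrences(timeline, a)
               for y in occurrences(timeline, b))


OPERATORS = {"before": before, "after": after, "co_occur": co_occur}


def execute(timeline, program):
    """Run a parsed question of the assumed form (operator, action_a, action_b)."""
    op, a, b = program
    return OPERATORS[op](timeline, a, b)
```

Because the answer is computed from interval endpoints rather than predicted, the same timeline and program always give the same result. That determinism is the property the abstract attributes to the annotation-backed tier.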

2606.01590 2026-06-02 cs.CV cs.GR 版本更新

Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis

面向街景新视角合成的有效多传感器条件控制

Zhengfei Kuang, Adam Sun, Liyuan Zhu, Tong Wu, Shengqu Cai, Jonathan Tremblay, Iro Armeni, Ehsan Adeli, Lior Yariv, Gordon Wetzstein

发表机构 * Stanford University(斯坦福大学) NVIDIA

AI总结 提出StreetNVS视频扩散框架,通过参考增强相机注意力模块和相对射线级位置编码联合利用LiDAR、环视图像和相机位姿,实现稀疏LiDAR条件下的高质量街景新视角合成。

详情
AI中文摘要

现代车辆平台配备了丰富的传感器套件,包括LiDAR、标定多相机系统和精确的自车运动,这原则上为从新视角重新渲染驾驶场景提供了强信号。最近一系列工作利用视频扩散模型完成此任务,通过其生成先验从稀疏车辆观测中合成合理的新视角。然而在实践中,现有方法仅利用了该信号的一部分,且其质量往往随着目标轨迹偏离记录驾驶路径而下降。我们认为这本质上是一个多传感器融合问题:稀疏LiDAR重投影提供准确但不完整的度量几何,环视参考图像提供密集外观但不提供度量深度,而相机位姿将两者跨视图连接起来。我们引入StreetNVS,一种视频扩散框架,通过基于相对射线级位置编码的参考增强相机注意力模块,联合对所有三种信号进行条件控制。我们开发了一种两阶段课程训练策略,逐步使模型适应越来越稀疏的LiDAR。在Waymo Open数据集上,StreetNVS在稀疏LiDAR条件下显著优于最先进的基线,与依赖密集10-100倍点云的方法性能相当。我们进一步展示了沿极端轨迹外路径(如高程、车道偏移、拉回和旋转)合成连贯视频的能力。我们的网站:https://streetnvs.github.io

英文摘要

Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers a strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10-100 times denser point clouds. We further show that it can synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Our website: https://streetnvs.github.io

2606.01577 2026-06-02 cs.CV 版本更新

FLAME: Physics-Guided Neural Operators for Onboard Satellite Methane Detection in Hyperspectral Imagery

FLAME:物理引导的神经算子用于高光谱图像中星载甲烷检测

Junhyuk Heo, Junhwan Park, Sancheol Sim, Beomkyu Choi, Woojin Cho

发表机构 * KAIST(韩国科学技术院)

AI总结 提出FLAME,一种将甲烷吸收物理直接嵌入架构的物理引导神经算子,在星载甲烷检测中实现最高精度,像素级假阳性率降低近3倍,参数最少且满足星载硬件延迟预算。

详情
AI中文摘要

甲烷是近期气候变化的主要驱动因素,快速识别其排放源是一项关键的气候干预措施。星载高光谱成像是完成此任务的主要工具,但每个传感器产生的数据量使得地面检测不切实际,因此需要星载检测。经典方法在星载硬件上产生过高的计算成本,而深度学习模型速度快但检测质量不足。我们提出FLAME,一种物理引导的神经算子,将甲烷吸收的物理直接构建到其架构中。在甲烷检测基准上,FLAME在所有评估方法中实现了最高的检测精度,将像素级假阳性率相比最强神经基线降低了近3倍,在学习基线中使用参数最少,并且在星载卫星硬件的延迟预算内运行。

英文摘要

Methane is a major driver of near-term climate change, and rapidly identifying its emission sources is a critical climate intervention. Spaceborne hyperspectral imagery is the primary tool for this task, but the volume of data produced by each sensor makes ground-based detection impractical and necessitates onboard detection. Classical methods incur prohibitive computational cost on onboard hardware, while deep learning models are fast but fall short on detection quality. We propose FLAME, a physics-guided neural operator that builds the physics of methane absorption directly into its architecture. On the methane detection benchmark, FLAME achieves the highest detection accuracy among all evaluated methods, reduces the pixel-level false positive rate by nearly $3\times$ over the strongest neural baseline, uses the fewest parameters among learned baselines, and runs within the latency budget of onboard satellite hardware.

2606.01576 2026-06-02 cs.CV 版本更新

Deformable Wiener Filter for Future Video Coding

可变形维纳滤波器用于未来视频编码

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

发表机构 * National Engineering Research Center of Visual Technology, School of Computer Science, Peking University(视觉技术国家工程研究中心,北京大学计算机科学学院) Core Media Technology, Disney Streaming(核心媒体技术,迪士尼流媒体) Wangxuan Institute of Computer Technology, Peking University(王选计算机研究所,北京大学) Information Technology R&D Innovation Center of Peking University(北京大学信息技术研发创新中心) Peng Cheng Laboratory, Shenzhen(鹏城实验室,深圳)

AI总结 提出一种结合局部与非局部特征的可变形维纳滤波器(DWF),通过监督训练和自适应融合实现高效环路滤波,在VVC标准上平均节省1.16%~2.67%的码率。

Comments This paper has been published in IEEE Transactions on Image Processing

详情
Journal ref
IEEE Transactions on Image Processing, vol. 31, pp. 7222-7236, 2022
AI中文摘要

环路滤波器由于在混合视频编码框架中显著的降噪能力而受到越来越多的关注。然而,现有通用视频编码(VVC)中的环路滤波器主要利用图像局部相似性。尽管一些基于非局部的环路滤波器可以弥补这一不足,但非局部滤波器广泛使用的无监督参数估计方法限制了性能。鉴于此,我们提出了一种可变形维纳滤波器(DWF)。它结合了局部和非局部特性,并基于维纳滤波器理论以有监督方式训练滤波器系数。在滤波过程中,首先为每个感兴趣样本导出局部相邻样本和非局部相似样本。然后,基于块级噪声和样本级特征将待滤波样本分类到特定组中。每组样本共享相同的滤波器系数。之后,根据分类结果自适应融合局部和非局部参考样本。最后,对每个待滤波样本进行带有异常值数据约束的滤波操作。此外,详细分析了所提出的DWF在不同参考样本导出方案下的性能。仿真结果表明,与VTM-11.0相比,所提方法在全帧内(All Intra)、随机访问(Random Access)和低延迟B(Low-Delay B)配置下平均分别节省1.16%、1.92%和2.67%的码率。

英文摘要

In-loop filters have attracted increasing attention due to their remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly take advantage of local image similarity. Although some non-local in-loop filters can make up for this shortcoming, the unsupervised parameter estimation widely used by non-local filters limits their performance. In view of this, we propose a Deformable Wiener Filter (DWF). It combines local and non-local characteristics and trains the filter coefficients in a supervised manner based on Wiener filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. Then the to-be-filtered samples are classified into specific groups based on patch-level noise and sample-level characteristics. Samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, the filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed in detail under different reference sample derivation schemes. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to VTM-11.0 for the All Intra, Random Access, and Low-Delay B configurations, respectively.

2606.01572 2026-06-02 eess.IV cs.CV 版本更新

PINNOCHIO: Physics-Informed Neural Network for Coupled Hyperelastic Interface-Volume Simulation in Orthognathic Surgery

PINNOCHIO: 用于正颌手术中耦合超弹性界面-体积模拟的物理信息神经网络

Jungwook Lee, Daeseung Kim, Kevin Gu, Zhangfeng Hu, Tianshu Kuang, Finn Hopeman, Michael A. K. Liebschner, Jaime Gateno, Pingkun Yan

发表机构 * Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute(生物医学工程系和生物技术与跨学科研究中心,伦斯勒理工学院) Department of Oral and Maxillofacial Surgery, Houston Methodist Research Institute(口腔颌面外科系,休斯敦卫理公会研究所) Department of Neurosurgery, Baylor College of Medicine(神经外科系,贝勒医学院)

AI总结 提出PINNOCHIO框架,通过混合顺序分解解耦不连续骨-软组织界面运动与连续体积超弹性变形,实现稳定训练和物理启发的模拟到真实适应策略,在40名患者队列中优于现有基线,解决了精度-效率权衡问题。

Comments This work has been submitted to MICCAI 2026

详情
AI中文摘要

预测患者特定面部软组织变形对于迭代正颌手术规划至关重要。然而,当前计算方法面临严格的精度-效率权衡:高保真有限元方法计算成本过高,而纯深度学习模型往往产生生物力学不一致的结果。尽管物理信息神经网络提供了一条有前景的途径,但在仅有部分临床监督(即外表面)下学习骨-软组织相互作用的复杂异质力学仍然高度不稳定。为克服这些挑战,我们提出了PINNOCHIO,一种用于面部软组织模拟的新型物理信息框架。PINNOCHIO引入了一种混合顺序分解,明确地将不连续的骨-软组织界面运动与连续的体积超弹性变形解耦。这种结构分离实现了稳定训练,并促进了物理启发的模拟到真实适应策略,确保内部生物力学一致性而无需体积真实数据。在40名患者临床队列上的评估表明,PINNOCHIO在表面精度和物理有效性方面均优于现有基线。此外,它实现了比有限元方法显著的加速,成功解决了精度-效率权衡,为交互式手术规划提供了高度可靠和实用的工具。

英文摘要

Predicting patient-specific facial soft-tissue deformation is critical for iterative orthognathic surgery planning. However, current computational methods face a strict accuracy-efficiency trade-off: high-fidelity Finite Element Methods (FEM) are computationally prohibitive, whereas pure deep learning models often produce biomechanically inconsistent results. While Physics-Informed Neural Networks (PINNs) offer a promising avenue, learning the complex heterogeneous mechanics of bone--soft-tissue interactions with only partial clinical supervision (i.e., outer facial surfaces) remains highly unstable. To overcome these challenges, we present PINNOCHIO, a novel physics-informed framework for facial soft-tissue simulation. PINNOCHIO introduces a hybrid sequential decomposition that explicitly decouples discontinuous bone--soft-tissue interface movements from continuous volumetric hyperelastic deformation. This structural separation enables stable training and facilitates a physics-enabled sim-to-real adaptation strategy, ensuring internal biomechanical consistency without requiring volumetric ground truth. Evaluated on a 40-patient clinical cohort, PINNOCHIO outperforms existing baselines in both surface accuracy and physical validity. Furthermore, it achieves a substantial speedup over FEM, successfully resolving the accuracy-efficiency trade-off to provide a highly reliable and practical tool for interactive surgical planning.

2606.01565 2026-06-02 cs.RO cs.CV 版本更新

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

层级语义增强导航:面向视觉语言导航的最优传输与图驱动推理

Xiang Fang, Wanlong Fang, Changshuo Wang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore(新加坡南洋理工大学交叉学科研究生项目) University College London(伦敦大学学院)

AI总结 提出层级语义增强导航框架,通过动态层级语义场景图、基于最优传输的拓扑规划器与图感知强化学习策略,解决连续环境中的视觉语言导航难题,实现最优性能。

Comments Published in NeurIPS 2025, address some typos

详情
AI中文摘要

连续环境中的视觉语言导航(VLN-CE)对自主智能体构成严峻挑战,要求无缝整合自然语言指令与视觉观察以在复杂3D室内空间导航。现有方法在长程任务中常因场景理解有限、规划效率低下及缺乏稳健决策框架而表现不佳。我们引入层级语义增强导航(HSAN)框架,这是一种开创性方法,通过三项协同创新重新定义VLN-CE。首先,HSAN构建动态层级语义场景图,利用视觉语言模型捕捉从物体到区域再到分区的多级环境表示,实现细粒度空间推理。其次,它采用基于最优传输的拓扑规划器,以Kantorovich对偶为基础,通过平衡语义相关性与空间可达性来选择长期目标,并具有理论最优性保证。第三,图感知强化学习策略确保精确的低层控制,在稳健避障的同时导航子目标。通过整合谱图理论、最优传输和先进的多模态学习,HSAN解决了先前工作中静态地图和启发式规划器的缺陷。在多个具有挑战性的VLN-CE数据集上的大量实验表明,HSAN实现了最先进的性能,在导航成功率和泛化到未见环境方面均有显著提升。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.

2606.01558 2026-06-02 cs.CV 版本更新

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

注意力引导的多模态大语言模型微调提升思维链推理能力

Sanchit Sinha, Guangzhi Xiong, Bohan Liu, Zhenghao He, Aidong Zhang

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 针对多模态大语言模型中思维链推理效果不佳的问题,提出注意力引导的微调目标Attentive-CoT,通过延迟答案承诺和维持视觉令牌访问来提升推理性能。

详情
AI中文摘要

思维链提示在多模态大语言模型中的有效性仍不确定:在多个视觉推理基准上,与直接提示相比,思维链提示常常降低性能。在本文中,我们对三个现代多模态大语言模型系列在不同模型规模下,针对需要逐步视觉证据的数据集进行了思维链行为的系统分析。我们的分析识别出两种反复出现的失败模式:过早的答案承诺和推理生成过程中有限的直接视觉令牌访问。我们进一步发现,标准的思维链式监督微调只能部分缓解这些问题,同时往往增加对文本先验的依赖并减少反事实视觉依赖。受这些发现的启发,我们提出了Attentive-CoT,一种注意力引导的微调目标,它鼓励思维链轨迹延迟答案承诺,同时维持持续的视觉令牌访问。Attentive-CoT可以插入任何思维链式监督微调训练中,无需架构更改。在六个多模态大语言模型上的三个视觉推理基准实验表明,Attentive-CoT相比标准微调提升了思维链性能。

英文摘要

The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.

2606.01549 2026-06-02 cs.CV 版本更新

ForestMamba: Sparse Mamba with Geometry-guided Queries for 3D Forest Point Cloud Segmentation

ForestMamba: 基于几何引导查询的稀疏Mamba用于3D森林点云分割

Trung Thanh Nguyen, Tuan-Anh Vu, Duc Viet Le, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide, Teja Kattenborn

发表机构 * Nagoya University(名古屋大学) RIKEN Seika(日本理化学研究所Seika研究中心) University of California, Los Angeles(加州大学洛杉矶分校) University of Twente(特文特大学) Ritsumeikan University(立命馆大学)

AI总结 提出ForestMamba方法,通过稀疏编码器、几何引导查询初始化和Mamba查询解码器,实现高效且结构感知的森林点云分割,在七个森林区域上优于现有方法,推理速度提升3倍,GPU内存降低2.3倍。

详情
AI中文摘要

基于AI的地面和无人机LiDAR点云语义和实例分割正成为一种变革性方法,将森林的复杂3D结构转化为可操作的信息,用于森林监测和生物多样性评估。然而,由于数据量大、采样密度不规则、冠层结构复杂重叠以及地理变异性,森林LiDAR场景仍然极具挑战性。基于稀疏卷积或Transformer的现有方法取得了有希望的结果,但存在两个关键限制:注意力的二次复杂度难以扩展到大型森林场景,以及通用上下文建模未利用森林结构先验,限制了复杂区域中的树木分离。为了解决这些挑战,我们提出了ForestMamba,一种结构感知方法,将森林特定先验融入特征编码、查询生成和查询细化中,同时用线性时间状态空间建模替代二次注意力。首先,我们引入了一个具有垂直优先 slab 序列化的稀疏编码器,将稀疏体素组织成垂直连贯的序列,以实现高效的长程上下文建模。其次,我们提出了一种基于实时多尺度冠层高度模型(CHM)的几何引导查询初始化策略,其中冠层最大值提供了生态学上有意义的查询种子,并通过最远点采样(FPS)补充以覆盖林下树木。第三,我们设计了一个基于Mamba的查询解码器,将局部kNN体素聚合与空间双路径Mamba相结合,以线性计算复杂度进行查询细化。在七个森林区域上的大量实验表明,ForestMamba在分割任务中始终优于现有基线,同时实现比基于Transformer的方法快3倍的推理速度和低2.3倍的GPU内存。

英文摘要

AI-based semantic and instance segmentation of terrestrial and drone LiDAR point clouds is emerging as a transformative approach for converting the complex 3D structure of forests into actionable information for forest monitoring and biodiversity assessment. However, forest LiDAR scenes remain highly challenging due to their large data volumes, irregular sampling density, overlapping and complex canopy structure, and geographic variability. Existing methods based on sparse convolutions or Transformers achieve promising results, but suffer from two key limitations: the quadratic complexity of attention scales poorly to large forest scenes, and generic context modeling does not exploit forest structural priors, limiting tree separation in complex regions. To address these challenges, we propose ForestMamba, a structure-aware method that incorporates forest-specific priors into feature encoding, query generation, and query refinement, while replacing quadratic attention with linear-time state-space modeling. First, we introduce a sparse encoder with vertical-priority slab serialization that organizes sparse voxels into vertically coherent sequences for efficient long-range context modeling. Second, we propose a geometry-guided query initialization strategy based on an on-the-fly multi-scale Canopy Height Model (CHM), where canopy maxima provide ecologically meaningful query seeds, supplemented by Farthest Point Sampling (FPS) to cover understory trees. Third, we design a Mamba-based query decoder that combines local kNN voxel aggregation with a spatial dual-path Mamba for query refinement with linear computational complexity. Extensive experiments across seven forest regions demonstrate that ForestMamba consistently outperforms existing baselines in both segmentation tasks, while achieving 3 times faster inference and 2.3 times lower GPU memory than Transformer-based methods.
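
Farthest Point Sampling, which the abstract uses to supplement canopy-maxima seeds, is a standard greedy procedure and can be sketched on its own. The version below is the generic algorithm on plain coordinate tuples. How ForestMamba combines it with CHM seeds is not shown, and the function name and `start` parameter are illustrative.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the already chosen set.

    `points` is a sequence of equal-length coordinate tuples; returns `k` indices.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen
```

Each iteration costs one pass over the points, so selecting `k` samples from `n` points takes O(n·k) distance evaluations. The result spreads samples across the scene instead of clustering them in dense regions.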

2606.01543 2026-06-02 cs.CV 版本更新

PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images

PathAR: 结构优先的多模态病理图像自回归合成

Yuan Zhang, Jiahao Xia, Junzhang Huang, Meng Wang, Feng Chen, Guanyu Yang, Huazhu Fu

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education(新一代人工智能技术及其交叉应用重点实验室(东南大学),教育部) Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore(创新与精准眼健康中心,新加坡国立大学 Yong Loo Lin 医学院) Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore(眼科学系,新加坡国立大学 Yong Loo Lin 医学院) Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University(生物统计学系,全球健康中心,南京医科大学) Institute of High-Performance Computing, Agency for Science, Technology and Research(高性能计算研究所,科技研究局)

AI总结 提出PathAR,一种结构优先的自回归合成框架,通过显式分解结构与外观并使用交错自回归Transformer,实现模态标签条件下的病理图像生成,改善结构一致性和模态保真度。

Comments 12 pages, 7 figures

详情
AI中文摘要

多模态病理学中的数据稀缺推动了统一生成模型的发展,这些模型在保持解剖学一致结构的同时合成模态特定的外观。尽管模态在外观统计上存在差异,但细胞拓扑和组织边界等形态结构在不同采集协议中基本保持不变。然而,现有方法通常将这些因素建模在均匀的token流中,隐式地将结构与外观耦合,削弱了模态变化下的结构可控性。为解决这一问题,我们提出病理自回归建模(PathAR),一种结构优先的自回归合成框架,显式分解结构和外观,用于模态标签条件下的病理生成。PathAR采用双向量量化(Dual-VQ)分词器将样本分解为掩码引导的结构和外观token,以及一个具有非对称注意力可见性的交错自回归(IAR)Transformer,以强制执行结构到外观的依赖关系。PathAR在异质模态特定外观下稳定形态,并支持空间对齐的图像-掩码对生成。大量实验表明,PathAR在结构一致性和模态保真度上优于基线,保持样本多样性,支持数据稀缺情况下的下游分割,并展现出对更细粒度器官标签变化的可扩展性。

英文摘要

Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, implicitly coupling structure with appearance and weakening structural controllability under modality shifts. To address this, we propose Pathology Autoregressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology generation. PathAR employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image--mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.

2606.01518 2026-06-02 cs.CV cs.GR 版本更新

MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes

MotionDreamer: 面向3D绑定形状的通用骨骼运动生成

Ye Tao, Yuxin Yao, Kendong Liu, Dapeng Wu, Junhui Hou

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 提出基于扩散的框架MotionDreamer,通过结构-语义注入机制从2D视频生成类别无关的骨骼动画,并构建大规模动态数据集,实现跨形态的高保真运动合成。

Comments 18 pages, 7 figures

详情
AI中文摘要

绑定形状的运动生成对于可扩展的4D资产制作至关重要。然而,基于模板的方法受限于特定拓扑结构,无法泛化到不同形态。相反,逐案例优化计算成本高,易陷入局部最优,且对视角引起的歧义高度敏感。在本文中,我们提出MotionDreamer,一个基于扩散的框架,旨在从2D视频指导中生成类别无关的骨骼动画。为了克服高质量训练数据的稀缺性,我们整理了一个大规模动态数据集,包含约20,000个多样化的3D模型,每个模型具有完整的纹理、骨骼绑定和广泛的动画序列。为了弥合2D视觉运动线索与异构3D骨骼结构之间的运动学差距,我们提出了一种结构-语义注入机制。我们的模型将纹理和语义属性直接集成到骨骼关节表示中,使其能够将感知的视觉动态映射到特定的关节层次及其功能角色。这使得MotionDreamer能够合成高保真动画,在从现有生物物种到幻想生物的广泛未见类别中保持解剖一致性。大量实验表明,我们的方法显著优于现有方法,为鲁棒且高效的4D资产生成设立了新的最先进基准。代码将在接收后公开。

英文摘要

Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation. The code will be made publicly available upon acceptance.

2606.01503 2026-06-02 cs.CV cs.AI cs.CL 版本更新

On the Limits of Token Reduction for Efficient Unified Vision Language Training

论高效统一视觉语言训练中令牌缩减的极限

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

发表机构 * University of Michigan(密歇根大学) Sony AI(索尼人工智能)

AI总结 本文通过分析层注意力分配,发现视觉理解与视觉生成在令牌冗余上存在不对称性,设计任务特定加速器,但统一训练中任务特定令牌丢弃导致协同损失,表明高效统一建模需保留共享跨任务结构。

详情
AI中文摘要

统一视觉语言模型(VLM)在单个自回归骨干中集成了视觉理解和视觉生成,但其联合训练计算成本高昂且从效率角度常被忽视。在这项工作中,我们研究了基于令牌缩减的加速在统一VLM训练中的可行性和极限。通过对逐层注意力分配的系统分析,我们揭示了一个基本的不对称性:视觉理解在后期层表现出显著的视觉冗余,而视觉生成在深度上对图像令牌保持持续依赖。受此观察启发,我们设计了任务特定的加速器,针对每个目标选择性地减少图像令牌计算。虽然这些方法在孤立设置中实现了显著的效率提升,但我们在统一训练下观察到一致的协同损失——任务特定的令牌丢弃需要不同的参数路径,并消除了联合优化中通常观察到的相互性能增益。我们的发现表明,高效统一建模需要保留共享的跨任务结构,强调了需要协同感知的加速策略。项目页面:https://chicychen.github.io/TokenReductionUnifiedVLM/。

英文摘要

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.
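The layerwise attention analysis described above can be illustrated with a small sketch. The abstract does not say which statistic the authors use, so the metric below (mean attention mass that queries place on image-token keys, per layer) is an assumption, and `image_attention_share` is a hypothetical helper name.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows = queries, columns = keys, each row sums to 1


def image_attention_share(attn_by_layer: Sequence[Matrix],
                          image_positions: Sequence[int]) -> List[float]:
    """For each layer, average over queries the attention mass on image-token keys."""
    image_set = set(image_positions)
    shares = []
    for attn in attn_by_layer:
        per_query = [
            sum(w for j, w in enumerate(row) if j in image_set)
            for row in attn
        ]
        shares.append(sum(per_query) / len(per_query))
    return shares


# Toy input: 2 layers, 2 queries, 4 keys; keys 0 and 1 stand in for image tokens.
early_layer = [[0.4, 0.4, 0.1, 0.1], [0.3, 0.3, 0.2, 0.2]]
late_layer = [[0.05, 0.05, 0.5, 0.4], [0.0, 0.1, 0.6, 0.3]]
shares = image_attention_share([early_layer, late_layer], image_positions=[0, 1])
```

In this toy input the share falls sharply in the later layer, which is the kind of late-layer redundancy the abstract reports for understanding. For generation, the abstract reports that dependence on image tokens persists across depth.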

2606.01493 2026-06-02 cs.CV 版本更新

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

Splatshot: 从单张非约束照片生成3D人脸头像

Hao Liang, Zhixuan Ge, Soumendu Majee, Joanna Li, Ashok Veeraraghavan, Guha Balakrishnan

发表机构 * Rice University(莱斯大学) Samsung Research America(三星美国研究院)

AI总结 提出SplatShot,一种无需训练的方法,通过将3D高斯泼溅与扩散模型去噪过程耦合,从单张照片生成多视图一致的逼真3D人脸头像。

Comments 28 pages, 15 figures

详情
AI中文摘要

从单张非约束照片重建逼真的3D人脸头像具有挑战性:前馈3D高斯泼溅(3DGS)模型在分布外输入上性能下降,而预训练扩散模型生成高保真图像但缺乏多视图一致性。我们观察到这些范式本质上是互补的:显式3D表示保证几何一致性,而2D扩散先验确保逼真度。基于此,我们提出SplatShot,一种无需训练的框架,直接在去噪过程中耦合这些表示。给定一个基础3DGS人脸模型和一张参考图像,我们使用每步3D反馈循环联合去噪所有目标视图。在每个时间步,我们从噪声潜变量预测干净图像,将3DGS重新拟合到这些多视图预测,并将3DGS重新渲染与2D预测之间的光度差异反向传播到噪声估计中。这将采样轨迹引导向严格3D一致、身份保真的输出。在各种野外图像上的实验表明,SplatShot生成的3D头像具有优越的身份保持、逼真度和多视图一致性。

英文摘要

Reconstructing a photorealistic 3D face avatar from a single unconstrained photograph is challenging: feed-forward 3D Gaussian Splatting (3DGS) models degrade on out-of-distribution inputs, while pretrained diffusion models produce high-fidelity images but lack multi-view consistency. We observe that these paradigms are fundamentally complementary: explicit 3D representations guarantee geometric consistency, whereas 2D diffusion priors ensure photorealism. Building on this, we propose SplatShot, a training-free framework that couples these representations directly within the denoising process. Given a base 3DGS face model and a single reference image, we jointly denoise all target views using a per-step 3D feedback loop. At each timestep, we predict clean images from the noisy latents, refit the 3DGS to these multi-view predictions, and back-propagate the photometric discrepancy between the 3DGS re-renderings and 2D predictions into the noise estimate. This steers the sampling trajectory toward strictly 3D-coherent, identity-faithful outputs. Experiments on diverse in-the-wild images demonstrate that SplatShot produces 3D avatars with superior identity preservation, photorealism, and multi-view consistency.

2606.01485 2026-06-02 cs.CV cs.LG 版本更新

Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering

感知优先:具有自一致性的前沿原生视频模型用于隐式视频问答

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 本文通过系统实验发现隐式视频问答基准是感知受限而非推理受限,并指出提升基础模型感知能力和轻量级测试时去噪是唯一可靠手段。

详情
AI中文摘要

我们描述了提交至CVPR 2026 VRR挑战赛的方案,该方案基于ImplicitQA / VRR-QA基准:一种多项选择视频问答任务,其中答案有意地不在任何单帧中可观察,必须从创意视频的不连续帧中的空间布局、运动、深度、视角、因果关系和社会背景推断。我们对开源视频大语言模型(Qwen2.5-VL、Qwen3-VL、InternVL3、Gemma-3以及经过强化学习训练的视频推理器Video-R1和VideoChat-R1.5)和一系列推理时策略(思维链、问题分解、描述-推理级联、音频转录、空间状态提示、自一致性、多模型集成和类别路由)进行了系统的、无需训练的研究。我们的核心发现是,该基准是感知受限而非推理受限:推理侧的增强是中性的甚至有害的,而基础模型的感知能力和轻量级测试时去噪是唯一可靠的杠杆。按类别的错误分析将困难定位到低级感知——相对深度、视角和计数是最困难的类别,而因果和社会推理几乎已解决——一个明确注入单目深度线索以攻击最弱类别的提示将测试准确率降低了5.8个百分点,证实了模型需要更好的感知,而非更好的过程。

英文摘要

We describe our submission to the VRR Challenge @ CVPR 2026, built on the ImplicitQA / VRR-QA benchmark: multiple-choice video question answering in which answers are deliberately not observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1 and VideoChat-R1.5) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency, multi-model ensembling, and category routing). Our central finding is that this benchmark is perception-bound rather than reasoning-bound: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. A per-category error analysis localizes the difficulty to low-level perception -- relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved -- and a prompt that explicitly injects monocular depth cues to attack the weakest category lowers test accuracy by 5.8 points, confirming that the model needs a better percept, not a better procedure.
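Self-consistency, one of the inference-time strategies listed above, is commonly implemented as a majority vote over several sampled answers to the same question. The sketch below shows that generic form for multiple-choice answers. The tie-breaking rule and the function name `self_consistent_answer` are assumptions, not details taken from the submission.

```python
from collections import Counter
from typing import Sequence


def self_consistent_answer(sampled_answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer sampled first."""
    if not sampled_answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(sampled_answers)
    best = max(counts.values())
    for answer in sampled_answers:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


# Five hypothetical samples of the option letter for one question.
vote = self_consistent_answer(["B", "A", "B", "C", "B"])
```

Voting only averages out sampling noise in the final choice; it cannot recover information the model never perceived, which fits the abstract's description of it as lightweight test-time denoising.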

2606.01481 2026-06-02 cs.CV 版本更新

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

SafeGen-Bench: 图像条件文本到视频生成中的安全性基准测试

Yingzi Ma, Xiaogeng Liu, Yawen Zheng, Chaowei Xiao

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Tsinghua University(清华大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对图像条件文本到视频生成中安全文本和图像组合仍可能产生有害内容的问题,提出SafeGen-Bench基准,定义10个恶意类别并评估现有模型,发现当前模型难以避免生成恶意内容,且单模态护栏防御不足。

Comments 8 pages, 7 figures, 2 tables

详情
AI中文摘要

随着文本到图像扩散模型的快速发展,像Sora这样的生成式视频模型(T2V模型)现在可以从文本提示或初始图像生成短视频。然而,合成视频生成——尤其是在初始图像引导下——常常带来风险,包括可能创建非法、政治敏感或不道德的内容。现有基准已开始考虑生成视频的安全性,但它们主要关注用恶意文本提示测试模型,忽略了文本提示和图像组合仍可能导致有害视频内容的场景。在实践中,这是一个常见且具有挑战性的问题:从安全文本和图像输入生成的视频仍可能传达有害信息。为弥补这一差距,我们引入了SafeGen-Bench,一个专门设计用于评估条件T2V模型安全性的基准。我们的基准定义了10个恶意类别,重点关注与时间序列和描绘行为相关的风险。SafeGen-Bench包含从多样图像和视频源中精心选择的起始帧,并配以相应的文本提示以模拟真实输入。我们在SafeGen-Bench上评估了多种条件T2V模型,结果表明当前模型难以持续避免生成恶意内容,不安全分数高达44.5,尤其是在需要高质量的条件下。此外,我们评估了基于文本和基于图像的护栏在我们的基准上的有效性,发现单模态护栏单独不足以提供稳健防御,在七个恶意类别中失败率达80%。我们希望SafeGen-Bench能促进更安全、更可控的条件T2V模型的开发。

英文摘要

With the rapid advancements in text-to-image diffusion models, generative video models (T2V models) like Sora can now produce short synthetic videos from a text prompt or an initial image. However, synthetic video generation -- especially when guided by an initial image -- often poses risks, including the potential creation of illegal, politically sensitive, or unethical content. Existing benchmarks have started to consider the safety of generated videos, but they primarily focus on testing models with malicious text prompts, ignoring the scenario where a combination of text prompt and image may still lead to harmful video content. In practice, this is a common and challenging issue: videos generated from safe text and image inputs can nonetheless convey harmful information. To bridge this gap, we introduce SafeGen-Bench, a benchmark specifically designed to evaluate the safety of conditional T2V models. Our benchmark defines 10 malicious categories, concentrating on risks related to both temporal sequences and depicted behaviors. SafeGen-Bench consists of carefully selected start frames from diverse image and video sources, paired with corresponding text prompts to simulate realistic inputs. We evaluate a variety of conditional T2V models on SafeGen-Bench, and the results indicate that current models struggle to consistently avoid generating malicious content, with unsafety scores reaching up to 44.5, especially under conditions requiring high quality. Furthermore, we assess the effectiveness of both text-based and image-based guardrails on our benchmark, finding that unimodal guardrails alone are insufficient to provide a robust defense, with an 80% failure rate across seven malicious categories. We hope that SafeGen-Bench will foster the development of safer and more controllable conditional T2V models.

2606.01443 2026-06-02 cs.LG cs.AI cs.CV 版本更新

UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures

UR-JEPA:均匀可整流性作为联合嵌入预测架构的正则化器

Triet M. Le

发表机构 * Spatiolyx LLC(Spatiolyx公司)

AI总结 提出UR-JEPA,通过高斯核平滑的Carleson型平方函数实现均匀n-可整流测度正则化,防止表示坍塌,在多个数据集上达到与LeJEPA相当的峰值精度但具有更低的种子方差。

详情
AI中文摘要

训练联合嵌入预测架构(JEPA)的一个核心困难是防止表示坍塌。LeJEPA通过素描各向同性高斯正则化(SIGReg)对嵌入施加各向同性高斯目标来解决这一问题。该目标与流形假设相矛盾,流形假设期望嵌入集中在环境空间的低维子集上。我们提出UR-JEPA,其目标是在小尺度上具有局部切向维度$n$的均匀$n$-可整流测度,通过高斯核平滑的Carleson型平方函数$\mathcal{L}^{\text{CGLT}}$实现,并辅以Jones $\beta$数公式。在Inet10上,UR-JEPA($\mathcal{L}^{\text{CGLT}}$)达到$0.9141 \pm 0.0014$,相比LeJEPA($\mathcal{L}^{\text{SIGReg}}$)提高了$+0.83$个百分点,种子标准差降低约$30\%$;在匹配配方的Galaxy10 SDSS、单种子ImageNet-$100$运行和3种子EuroSAT遥感运行中,两种方法在收敛时处于相同的峰值精度区间,UR-JEPA保持其较低的种子方差特征。在EuroSAT上,域内训练的两种方法以$96.0$到$96.1\%$的精度与大型遥感基础模型迁移相当,而骨干网络小$25$倍。区别在于几何结构:对投影器输出分布的直接可视化显示,在所有四个数据集上,UR-JEPA($\mathcal{L}^{\text{CGLT}}$)产生的全局PCA谱在索引$\sim 20$到$25$(共$D=32$)处出现$4$到$5$个数量级的下降,而LeJEPA的谱接近平坦(顶部到底部比率最多为$3.6$)。两种方法的每维度边缘分布同时接近高斯分布(平均Shapiro-Wilk $W \in [0.992, 0.996]$),这是Diaconis-Freedman结果的一个推论。因此,在匹配精度下,两种正则化器产生结构上不同的投影表示。

英文摘要

A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses this by enforcing an isotropic Gaussian target on the embeddings via Sketched Isotropic Gaussian Regularization (SIGReg). This target is in tension with the manifold hypothesis, which expects embeddings to concentrate on a low-dimensional subset of the ambient space. We propose UR-JEPA, which targets a uniformly $n$-rectifiable measure of local tangent dimension $n$ at small scales, realized through a Gaussian-kernel smoothed Carleson-type square function $\mathcal{L}^{\text{CGLT}}$, with a complementary Jones $\beta$-number formulation. On Inet10, UR-JEPA($\mathcal{L}^{\text{CGLT}}$) attains $0.9141 \pm 0.0014$, a $+0.83$ pp gain over LeJEPA($\mathcal{L}^{\text{SIGReg}}$) with $\sim 30\%$ lower seed standard deviation; on matched-recipe Galaxy10 SDSS, a single-seed ImageNet-$100$ run, and a $3$-seed EuroSAT remote-sensing run, the two methods lie in the same peak-accuracy band at convergence, with UR-JEPA retaining its lower-seed-variance signature. On EuroSAT, the in-domain pair reaches $96.0$ to $96.1\%$, competitive with large remote-sensing foundation-model transfer while using a $25\times$ smaller backbone. The distinction is geometric: direct visualization of the projector output distribution shows that on all four datasets UR-JEPA($\mathcal{L}^{\text{CGLT}}$) produces a global PCA spectrum with a $4$ to $5$ order-of-magnitude drop at index $\sim 20$ to $25$ out of $D = 32$, while LeJEPA's spectrum is near-flat (top-to-bottom ratio at most $3.6$). Per-dimension marginals are simultaneously near-Gaussian for both methods (mean Shapiro-Wilk $W \in [0.992, 0.996]$) as a Diaconis-Freedman consequence. At matched accuracy the two regularizers therefore yield structurally distinct projected representations.
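For readers unfamiliar with the Jones β-number mentioned above, the classical $L^\infty$ form for a set $E \subset \mathbb{R}^D$ is recalled below. The abstract does not state which variant UR-JEPA uses (measure-based $L^2$ variants are common in the uniform rectifiability literature), so this is background, not the paper's exact regularizer.

```latex
% Classical Jones beta-number of a set E at location x and scale r.
% The infimum runs over all affine n-planes L in R^D.
\beta_E(x, r) \;=\; \inf_{L} \; \sup_{y \in E \cap B(x, r)} \frac{\operatorname{dist}(y, L)}{r}
```

A small $\beta_E(x, r)$ means that $E$ is close to a single $n$-plane inside the ball $B(x, r)$. Uniform rectifiability is characterized by square-summability (Carleson-type) control of such flatness coefficients across locations and scales, which is the kind of structure the abstract contrasts with an isotropic Gaussian target.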

2606.01419 2026-06-02 cs.CV 版本更新

DENSER: Depth-Guided Ensemble with Staged EFA-GS Reconstruction for Soccer Novel View Synthesis

DENSER:面向足球新视角合成的深度引导集成与分阶段EFA-GS重建

Parthsarthi Rawat

发表机构 * GameChanger by Dick’s Sporting Goods(Dick’s Sporting Goods 旗下 GameChanger)

AI总结 提出DENSER方法,通过深度引导集成和分阶段EFA-GS重建,结合相机高度损失加权、单目深度监督和三模型像素平均集成,提升足球场景新视角合成质量。

Comments CVPR 2026 SoccerNet Novel View Synthesis Challenge, Rank 1

详情
AI中文摘要

我们提出DENSER,一种面向足球新视角合成的深度引导集成与分阶段EFA-GS重建方法。DENSER在EFA-GS基础上做出三项关键贡献:(1)基于相机高度的损失加权,优先考虑地面级广播视角;(2)来自Depth-Anything-V2的单目深度监督,用于在无纹理区域正则化几何结构;(3)三模型像素平均集成,其成员通过改变训练长度和高斯尺度限制从共享基础检查点发散。在五个保留的挑战场景上,我们实现了平均PSNR为29.89 dB,SSIM为0.791,LPIPS为0.366。

英文摘要

We propose DENSER, a Depth-guided ENSemble with Staged EFA-GS Reconstruction for soccer novel view synthesis. DENSER extends EFA-GS with three key contributions: (1) camera-height-based loss weighting that prioritises ground-level broadcast views, (2) monocular depth supervision from Depth-Anything-V2 to regularise geometry in textureless regions, and (3) a three-model pixel-average ensemble whose members diverge from a shared base checkpoint by varying training length and Gaussian scale clamping. On five held-out challenge scenes we achieve a mean PSNR of 29.89 dB, SSIM of 0.791, and LPIPS of 0.366.
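The pixel-average ensemble and the PSNR metric reported above can be sketched in a few lines. This is a minimal illustration on flat lists of intensities in [0, 1], not the authors' pipeline. The function names and toy values are hypothetical, and PSNR uses its standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$.

```python
import math
from typing import List, Sequence


def pixel_average(renders: Sequence[Sequence[float]]) -> List[float]:
    """Average same-sized renders pixel by pixel (one render per ensemble member)."""
    n = len(renders)
    return [sum(pixels) / n for pixels in zip(*renders)]


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy two-pixel "images": three member renders and a ground-truth view.
target = [0.5, 0.5]
members = [[0.6, 0.4], [0.5, 0.7], [0.4, 0.5]]
ensemble = pixel_average(members)
ensemble_psnr = psnr(ensemble, target)
member_psnr = psnr(members[0], target)
```

Averaging helps when member errors partly cancel, as they do in this toy case. It is not guaranteed to beat the best single member.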

2606.01414 2026-06-02 cs.CV 版本更新

Agent Skills Should Go Beyond Text: The Case for Visual Skills

Agent技能应超越文本:视觉技能的必要性

Binxiao Xu, Ruichuan An, Bocheng Zou, Hang Hua

发表机构 * Peking University(北京大学) University of Wisconsin(威斯康星大学) MIT-IBM Watson AI Lab(麻省理工-IBM沃森人工智能实验室)

AI总结 针对现有技能学习方法仅存储文本经验导致视觉任务瓶颈的问题,提出多模态技能范式,结合文本逻辑与视觉支持,通过自动系统将经验转化为可复用的视觉技能,在GUI等视觉任务中显著优于纯文本技能。

详情
AI中文摘要

可复用技能是扩展智能体能力的关键机制,使智能体能够积累经验并解决日益复杂的任务。然而,现有大多数技能学习方法仅将可复用经验存储为文本资产,如指令、推理轨迹或总结的轨迹。我们认为,这种纯文本范式为视觉中心任务造成了根本性瓶颈,因为可复用知识通常依赖于空间布局、视觉定位、细粒度外观和局部状态变化。为解决这一局限,我们提出\NAME,一种结合声明式文本逻辑与显式视觉支持的多模态技能范式。我们区分三种可复用形式:静态先验(用于稳定的空间惯例)、动态先验(用于现场视觉工作记忆)以及交错视觉技能(将有序文本步骤绑定到源帧、截图或页面区域,以证明其合理性)。视觉技能不仅描述要做什么,还编码了在哪里看、如何检查以及如何验证视觉结果。为了规模化构建视觉技能,我们引入\SYSTEM,一种自动系统,通过保留任务轨迹中的文本推理、空间引用、视觉边界和交互模式,将智能体经验转化为可复用的多模态技能。在GUI和其他视觉中心任务上的实验表明,视觉技能始终优于纯文本技能,尤其是在成功需要空间对应、视觉证据和状态感知交互时。这些结果支持我们的核心立场:可复用智能体技能应超越文本,成为未来多模态智能体的多模态资产。

英文摘要

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \textbf{\NAME}, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \textbf{\SYSTEM}, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.

2606.01399 2026-06-02 cs.CV 版本更新

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

PAI-Studio: 具有相机感知运动的电影级视频背景替换

Heyuan Gao, Bangxun Tang, Yiren Song, Guian Fang, Zijian He, Jie Yang, Mike Zheng Shou

发表机构 * Utopai Studios(Utopai工作室) Nanyang Technological University(南洋理工大学) University of California, Irvine(加州大学尔湾分校) Show Lab, National University of Singapore(新加坡国立大学Show实验室)

AI总结 提出PAI-Studio,一种基于扩散变换器的视频合成任务,通过双向注意力机制统一处理前景动态与背景参考,实现运动一致的背景生成、高保真前景重光照和身份保持。

详情
AI中文摘要

我们提出PAI-Studio,一种新的参考条件视频合成任务,解决了电影级背景替换中长期存在的挑战:生成与前景运动对齐的动态背景,同时保持前景身份、匹配参考场景外观,并实现具有真实前景重光照的全局一致光照。现有的开源系统和商业API无法同时确保运动一致的背景生成、高保真前景重光照和前景身份保持,常常导致静态背景、不一致边界和明显的合成伪影。为弥补这一差距,我们基于扩散变换器视频骨干,将问题重新表述为上下文条件生成任务。通过双向注意力,我们的模型在统一架构中联合捕获前景动态和背景参考信息。我们进一步构建了一个来自高质量电影和在线视频的30K规模数据集来支持此任务。大量评估表明,我们的方法显著优于现有的开源和商业API解决方案。

英文摘要

We present PAI-Studio, a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching reference scene appearance, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting and foreground identity preservation, often resulting in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build upon a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source and commercial API solutions.

2606.01393 2026-06-02 cs.CL cs.AI cs.CV 版本更新

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Dr. DocBench:专家级与困难文档解析的综合基准

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo, Zhenting Qi, Konwoo Kim, Longtian Ye, Xiaolong Luo, Jinhe Bi, Henry Zhang, Haris Riaz, Xuan Zhang, Yunze Xiao, Bangya Liu, Tom Tang, Yunfei Zhao, Qunshu Lin, Zihan Wang, Minghao Liu, Michael Lingzhi Li, Yilun Du, Jesse Thomason, Rogerio Feris, Alex Pentland, Zexue He

发表机构 * Stanford University(斯坦福大学) MIT(麻省理工学院) Carnegie Mellon University(卡内基梅隆大学) University of Southern California(南加州大学) Harvard University(哈佛大学) IBM Research(IBM研究院) University of Arizona(亚利桑那大学) Duke University(杜克大学) UC Berkeley(加州大学伯克利分校) LMU Munich(慕尼黑路德维希-马克西米利安大学)

AI总结 提出Dr. DocBench基准,通过基于解析器失败的采样从多语言书籍语料库中选取挑战性文档,包含52个BISAC主题领域和65k高质量标注,用于评估专家级文档解析能力。

Comments 27 pages, 13 figures, 14 tables

详情
AI中文摘要

文档解析和识别是视觉语言模型(VLM)和文档处理系统的基本能力。然而,现有的光学字符识别(OCR)和文档解析基准在覆盖范围和难度上日益受限:许多基准专注于常见文档类型或均匀采样的页面,现代解析器在这些页面上已表现良好,而对专家领域结构(如化学公式、乐谱、复杂表格和跨页布局)的标注有限。我们引入了Dr. DocBench,一个面向专家级文档解析的难度感知基准。Dr. DocBench基于大规模多语言书籍语料库构建,涵盖52个BISAC主题领域,并通过基于解析器失败的采样选择挑战性文档,针对多个最先进系统难以处理的案例。它包含来自平均约100页的长文档的4,514个标注页面,具有65k高质量的页面级和块级标注,涵盖布局、阅读顺序、层次关系和特定领域的视觉内容。对基于流水线的解析器和通用VLM的评估表明,在现有基准上的强性能并不能迁移到我们的专家级文档解析中。我们的分析揭示了跨主题、内容类型和结构属性的重大失败,突显了Dr. DocBench作为诊断和推进文档智能的综合测试平台的作用。

英文摘要

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.

2606.01380 2026-06-02 cs.CV 版本更新

Training-free image inversion for one-step diffusion models

无需训练的一步扩散模型图像反演

Tao Wu, Senmao Li, Yaxing Wang, Shiqi Yang, Kai Wang, Joost van de Weijer

发表机构 * CVC, Universitat Autònoma de Barcelona(巴塞罗那自治大学计算机视觉中心) Machine Intelligence Institute, Masdar Institute of Science and Technology(马斯达尔科学技术学院机器智能研究所) Jilin University(吉林大学) City University of Hong Kong, Department of Geography(香港城市大学地理系)

AI总结 提出一种无需训练的反演框架TFinv,通过迭代噪声对齐和后缀学习解决一步扩散模型中真实图像反演与编辑的关键挑战,实现高效编辑。

Comments Accepted to Pattern Recognition

详情
AI中文摘要

在这项工作中,我们为一步扩散模型引入了一种新颖的无需训练的反演(TFinv)框架,解决了真实图像反演和编辑中的关键挑战。我们首先确定了阻碍真实图像反演和编辑的两个关键因素:(1)初始潜在可编辑性,与初始噪声与理想高斯分布之间的距离有关;(2)描述差距,即文本描述与图像表示之间的对齐。这两个因素都影响一步扩散模型的反演效率和可编辑性。然后,我们提出了两种新颖的技术:迭代噪声对齐(iterNA),它最小化分布差距以与正态高斯分布对齐;以及后缀学习(suffL),它通过引入学习到的后缀提示令牌来增强文本到图像的描述对齐。这些技术能够将输入图像精确反演为其初始噪声表示,并促进图像编辑。此外,我们提出了一种基于掩码的编辑技术,用于局部编辑同时保持背景完整性。在PIE-Bench数据集上的全面实验验证了我们的方法TFinv不仅在一步扩散编辑中实现了最先进的性能,而且在效率上显著优于现有的多步方法。代码可在https://github.com/tttao-uwu/TFinv.git获取。

英文摘要

In this work, we introduce a novel training-free inversion (TFinv) framework for one-step diffusion models, addressing key challenges in real image inversion and editing. We first identify two critical factors hampering real-image inversion and editing: (1) Initial Latent Editability, which is related to the distance between the initial noise and the ideal Gaussian distribution, and (2) Caption Gap, which means the alignment between text captions and image representations. Both factors influence inversion efficiency and the editability of one-step diffusion models. Then, we propose two novel techniques: iterative noise alignment (iterNA), which minimizes the distribution gap to align with the normal Gaussian distribution, and suffix learning (suffL), which enhances text-to-image caption alignment by introducing learned suffix prompt tokens. These techniques enable precise inversion of input images into their initial noise representations and facilitate image editing. Furthermore, we propose a mask-based editing technique for localized edits while preserving background integrity. Comprehensive experiments on the PIE-Bench dataset validate that our method TFinv not only achieves state-of-the-art performance in one-step diffusion editing, but also significantly outperforms existing multistep approaches in efficiency. The code is available at https://github.com/tttao-uwu/TFinv.git.

2606.01372 2026-06-02 cs.LG cs.AI cs.CV 版本更新

BRo-JEPA: Learning Modular Arithmetic in Latent Space

BRo-JEPA:在潜空间中学习模算术

Divyansh Jha, Yuanfang Xie, Varan Mehra, Brennen Yu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) NYU Langone Health(纽约大学Langone医疗中心)

AI总结 本文提出BRo-JEPA模型,通过在潜空间中施加模10算术的循环结构,实现零样本泛化,解决了标准模型无法外推未见操作的问题。

Comments 10 pages, 14 figures

详情
AI中文摘要

神经网络能否学习抽象的代数规则,还是仅仅记忆训练模式?我们使用MNIST数字作为状态,模算术运算作为动作,在JEPA风格的潜世界模型中进行研究。标准监督基线和带有加法操作嵌入的JEPA模型能够学习已见操作,但无法可靠地外推到未见操作。为了弥补这一差距,我们引入了一个块旋转预测器,在潜空间中施加模10算术的循环结构。这使得模型具有强大的零样本泛化能力,最佳的基于ResNet的JEPA块旋转模型达到了99.46%的零样本准确率和99.46%的展开准确率。我们的结果表明,当架构与问题结构匹配时,潜世界模型可以学习符号变换规则。我们的代码可以在此处访问:https://github.com/DL-World-Models/mnist-math。

英文摘要

Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as states and modular arithmetic operations as actions in a JEPA-style latent world model. Standard supervised baselines and JEPA models with additive operation embeddings fit seen operations but fail to extrapolate reliably to unseen ones. To bridge this gap, we introduce a block-rotation predictor that imposes the circular structure of modulo-10 arithmetic in latent space. This enables strong zero-shot generalization, with the best ResNet-based JEPA block-rotation model achieving 99.46% zero-shot and 99.46% rollout accuracy. Our results suggest that latent world models can learn symbolic transformation rules when architecture matches the structure of the problem. Our code is available at https://github.com/DL-World-Models/mnist-math.
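One way to picture a block-rotation predictor is as a block-diagonal matrix of 2D rotations whose angle depends on the operation. The sketch below assumes every 2D block rotates by $2\pi k / 10$ for the action "add $k$". The paper's actual parameterization (for example per-block frequencies or learned components) is not given in the abstract, and `block_rotate` is a hypothetical name.

```python
import math
from typing import List, Sequence


def block_rotate(z: Sequence[float], k: int, modulus: int = 10) -> List[float]:
    """Rotate each consecutive 2D block of the latent z by the angle 2*pi*k/modulus."""
    if len(z) % 2 != 0:
        raise ValueError("latent dimension must be even")
    theta = 2.0 * math.pi * k / modulus
    c, s = math.cos(theta), math.sin(theta)
    out: List[float] = []
    for i in range(0, len(z), 2):
        x, y = z[i], z[i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


# Toy 4-dimensional latent (two 2D blocks).
z = [1.0, 0.0, 0.5, -0.5]
add3_then_add7 = block_rotate(block_rotate(z, 3), 7)  # (3 + 7) mod 10 == 0
add13 = block_rotate(z, 13)                           # 13 mod 10 == 3
add3 = block_rotate(z, 3)
```

Because rotations compose by adding angles, applying "add 3" and then "add 7" returns the latent to its starting point, and "add 13" coincides with "add 3". The cyclic group structure is built into the operator instead of being learned per operation, which is one plausible reason such a predictor can handle operations never seen in training.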

2606.01367 2026-06-02 cs.RO cs.CV 版本更新

ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo

ActMVS:基于单目多视图立体的主动场景重建

Guo Pu, Yixuan Han, Zhouhui Lian

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所)

AI总结 提出ActMVS框架,通过视图因子图构建和全局深度优化,实现单目相机在线生成高质量、全局一致的密集深度图,支持机器人/UAV的主动场景重建与安全轨迹规划。

Comments ICRA 2026

详情
AI中文摘要

主动场景重建使机器人/UAV能够自主规划轨迹并重建环境,无需昂贵的手动数据采集。与被动方法不同,主动重建需要实时构建高置信度占据地图以实现无碰撞导航。现有方法依赖深度传感器更新占据地图,增加了平台成本和重量。为推进空间智能,我们旨在实现纯视觉单目解决方案。然而,当前单目场景重建方法离线运行,无法在机器人/UAV导航所需的帧率下提供全局一致的密集深度。为弥补这一差距,我们引入ActMVS,这是首个单目主动重建框架。我们的框架集成了用于信息多视图立体深度预测的视图因子图构建,以及全局深度优化,从而实现在线生成高质量、全局一致的密集深度图。这使得单目机器人/UAV能够在重建过程中维护可靠的占据地图,以实现安全的轨迹规划。在Replica数据集上的实验表明,其性能与RGB-D方法相当。我们的代码和数据可在https://github.com/TrickyGo/ActMVS获取。

英文摘要

Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on the Replica dataset demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.

2606.01362 2026-06-02 cs.GR cs.CV 版本更新

AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance

AlbedoEdit: 基于反照率引导的统一实例级视频编辑

Xilong Zhou, Bao-Huy Nguyen, Zheng Zeng, Jacob Munkberg, Jon Hasselgren, Thomas Leimkühler, Nima Kalantari, Miloš Hašan, Christian Theobalt

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息研究所) University of California Santa Barbara(加州大学圣巴巴拉分校) NVIDIA Research(NVIDIA研究院) Texas A&M University(德克萨斯A&M大学)

AI总结 提出 AlbedoEdit,一个统一框架,利用反照率图实现对象插入、移除和纹理编辑,通过微调视频基础模型,在合成数据集上训练,实现编辑内容的和谐融合与复杂视觉效果模拟。

详情
AI中文摘要

视频生成模型在合成逼真视频序列方面取得了显著进展。然而,实现更广泛和更具创造性的下游应用需要细粒度的实例级视频编辑,包括对象插入、对象移除和纹理编辑,这已成为一个突出但具有挑战性的问题。现有方法要么提出仅具有粗略语义控制的统一生成框架,要么为单个编辑任务设计特定任务框架,限制了它们在多样化真实场景中的灵活性和适用性。为解决这些限制,我们提出了 AlbedoEdit,一个统一的生成式视频编辑框架,同时支持对象插入、对象移除和纹理编辑。我们的关键洞察是,内在反照率图(对光照不变且不包含镜面反射、阴影和相互反射效应)为指定细粒度外观编辑提供了一种有效且用户友好的机制。基于视频基础模型,AlbedoEdit 被微调以将源 RGB 视频转换为编辑后的 RGB 视频,条件为用户编辑的第一帧反照率。在覆盖所有三种编辑任务的新配对合成数据集上训练后,AlbedoEdit 隐式学习协调编辑内容并模拟由编辑操作触发的复杂真实世界视觉效果,包括镜面高光、软阴影和镜面反射。AlbedoEdit 在定性和定量上均优于最先进的视频编辑方法。项目网页为 https://vcai.mpi-inf.mpg.de/projects/AlbedoEdit/。

英文摘要

Video generative models have achieved remarkable progress in synthesizing photorealistic video sequences. However, enabling broader and more creative downstream applications requires fine-grained instance-level video editing, including object insertion, object removal, and texture editing, which has emerged as a prominent yet challenging problem. Existing approaches either propose unified generative frameworks with only coarse semantic control, or design task-specific frameworks for individual editing tasks, limiting their flexibility and applicability across diverse real-world scenarios. To address these limitations, we propose AlbedoEdit, a unified generative video editing framework that jointly supports object insertion, object removal, and texture editing. Our key insight is that the intrinsic albedo map, which is invariant to lighting and contains no specularity, shadowing, or inter-reflection effects, provides an effective and user-friendly mechanism for specifying fine-grained appearance edits. Built upon video foundation models, AlbedoEdit is fine-tuned to translate source RGB videos into edited RGB videos, conditioned on a user-edited first-frame albedo. Trained on a new paired synthetic dataset covering all three editing tasks, AlbedoEdit implicitly learns to harmonize edited content and simulate complex real-world visual effects triggered by editing operations, including specular highlights, soft shadows, and mirror reflections. AlbedoEdit demonstrates superior performance over state-of-the-art video editing approaches, both qualitatively and quantitatively. The project webpage is https://vcai.mpi-inf.mpg.de/projects/AlbedoEdit/.

2606.01361 2026-06-02 cs.CV 版本更新

Diamonds in the Sky: Pareidolic Animals in Clouds

天空中的钻石:云中的空想性动物

Miriam Horovicz, Yacov Hel-Or, Yael Moses

发表机构 * Reichman University, Israel(里奇曼大学,以色列)

AI总结 提出基于扩散模型的方法,预测人们可能在云中感知到的空想性动物,并通过生成相似形状的动物图像和变形视频辅助识别。

详情
AI中文摘要

人们常在云中看到动物形状,这种现象被称为空想性错视。我们提出一种基于AI的方法,旨在预测人们可能在云中感知到哪些动物,尽管最先进的识别方法通常无法检测到此类动物。此外,我们引入一种方法帮助个体感知特定的空想性动物,即使他们最初未能识别。我们的方法使用扩散模型将云片段转换为视觉上类似于原始云的动物形状。这种扩散技术的灵感来源于观察:扩散过程仅在目标动物与云形状相似时成功,且微妙的视觉线索通常足以帮助个体识别特定的空想性动物。从扩散模型成功生成的图像随后用于预测空想性动物。此外,使用从生成图像过渡回原始云片段的短变形视频进一步增强人类对空想性动物的感知。

英文摘要

People often see animal shapes in clouds, a phenomenon known as pareidolia. We propose an AI-based method that aims to predict which animals people are likely to perceive in clouds, even though state-of-the-art recognition methods typically fail to detect such animals. Additionally, we introduce a method to assist individuals in perceiving specific pareidolic animals, even if they did not recognize them initially. Our approach uses a diffusion model to transform cloud segments into animal shapes that visually resemble the original cloud. This diffusion technique is inspired by the observation that the diffusion process succeeds only when the target animal resembles the shape of the cloud, and that subtle visual hints often suffice to help individuals recognize specific pareidolic animals. A generated image, successfully derived from the diffusion model, is then used to predict the pareidolic animal. Additionally, a short morphing video transitioning from the generated image back to the original cloud segment is employed to further enhance human perception of the pareidolic animals.

2606.01339 2026-06-02 cs.LG cs.AI cs.CL cs.CV cs.ET 版本更新

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

FreqLite:一种轻量级频率分解线性模型,具有自适应可逆归一化,用于稳健的长期时间序列预测

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

发表机构 * Hamdard University(哈姆达德大学)

AI总结 提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器,通过可学习的无损谱滤波器进行频带分解和线性预测,并引入自适应可逆实例归一化(A-RevIN)处理非平稳性,在长期预测基准上以更少参数和计算资源超越PatchTST等模型。

Comments 26 pages, 5 figures

详情
AI中文摘要

长期时间序列预测需要既准确又能在商用硬件上高效运行的模型。轻量级线性预测器在此领域表现出色,但仍存在两个问题:可逆实例归一化(RevIN)使用单一回溯统计量对整个预测区间进行去归一化,在非平稳性下不准确;时域趋势/季节分解依赖于固定的非自适应滤波器。我们提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器:一个可学习的、无损的单位划分谱滤波器将输入分割成多个频带,由每个频带的线性头进行预测,与低通截断方法不同,高频带被保留并建模。FreqLite在标准长期预测基准上是最佳的轻量级模型,在长回溯(L=336)时,其平均误差低于PatchTST Transformer(0.3244 vs 0.3587 MSE),同时参数减少4倍,内存减少2.2倍,在单块4 GB笔记本GPU上每轮时间减少2.2倍;尽管幅度不大,但在所有匹配单元上的配对Wilcoxon检验中,其改进具有统计显著性(p < 1e-5)。我们进一步引入自适应可逆实例归一化(A-RevIN),一种自适应可逆归一化,严格推广了RevIN(在其门关闭时完全恢复),在非平稳性下起作用,并在平稳数据上无害地退化为RevIN。我们在一个真实的强非平稳数据集(ILI,MSE降低约5%)和一个受控合成漂移扫描中验证了这一点,其中A-RevIN的收益及其学习门都随注入的非平稳性单调增加。每个组件均可独立消融(Linear和RLinear是FreqLite的特例),所有结果均可在商用硬件上复现。

英文摘要

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p < 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN's benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.
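The "lossless, partition-of-unity spectral filter" described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the softmax parameterisation of the masks, the choice of three bands, and the function name `split_bands` are assumptions made for illustration. The point it shows is that when the per-band masks sum to 1 at every frequency bin, the band signals add back up to the input, so no frequency content is discarded.

```python
import numpy as np

def split_bands(x, logits):
    """Split a 1-D series into frequency bands whose sum reproduces x.

    logits has shape (num_bands, num_freq_bins). A softmax over the band
    axis makes the masks sum to 1 at every frequency bin (partition of unity).
    """
    spec = np.fft.rfft(x)
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    masks = np.exp(shifted)
    masks /= masks.sum(axis=0, keepdims=True)
    # Inverse transform each masked spectrum: one time-domain signal per band.
    return np.fft.irfft(masks * spec, n=len(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(336)                      # lookback L=336, as in the abstract
logits = rng.standard_normal((3, 336 // 2 + 1))   # 3 bands: an arbitrary, hypothetical choice
bands = split_bands(x, logits)
print(np.allclose(bands.sum(axis=0), x))          # the bands reconstruct the input
```

In a trained model the logits would be learnable parameters and each band would feed its own linear head; both of those parts are omitted here.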

2606.01334 2026-06-02 cs.CV 版本更新

HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition

HOLA: 面向开放集3D识别的全息多模态对齐

Koby Aharonov, Oren Shrout, Ayellet Tal

发表机构 * Technion – Israel Institute of Technology(以色列理工学院)

AI总结 提出HOLA方法,通过解耦多正例对比损失和对齐点云与多视图图像及文本描述,实现开放集3D识别中的全息多模态对齐,在长尾基准上取得最先进零样本性能。

详情
AI中文摘要

开放集3D识别需要模型能够泛化到罕见或未见类别。最近的方法通过将语言-视觉知识蒸馏到3D编码器来解决这一问题,通常依赖重型2D ViT,并将每个点云与单张图像或标题对齐,从而将表示锚定到局部视图。我们提出将每个点云与多张图像和文本描述对齐,以捕获对3D对象的更全面理解。为实现这一想法,必须设计一个损失函数,能够联合对齐一个3D实例与多个匹配信号(多视图图像和多个文本),同时将正例聚合与负例竞争分离。我们引入了这样的函数,称为解耦多正例对比损失。我们的公式增强了损失对困难负例的难度感知关注,避免了当多个正例与所有负例共享同一个softmax时出现的“聚光灯拥挤”现象。作为补充,我们提出了一个轻量级文本适配器,仅应用于网络标题,减少了与精心标注之间的领域差距,并能够有效利用大规模无监督文本。我们的模型在长尾基准上展示了最先进的开放词汇性能,在保持高帧率的同时实现了显著的零样本改进。

英文摘要

Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling language-vision knowledge into 3D encoders, typically relying on heavy 2D ViTs and aligning each point cloud with a single image or caption, thus anchoring representations to partial views. We propose aligning each point cloud with multiple images and textual descriptions to capture a more holistic understanding of 3D objects. To realize this idea, it is essential to design a loss function capable of jointly aligning a 3D instance with multiple matched signals, multi-view images and multiple texts, while separating positive aggregation from negative competition. We introduce such a function, termed the decoupled multi-positive contrastive loss. Our formulation enhances the loss's hardness-aware focus on challenging negatives, avoiding the "spotlight crowding" that occurs when many positives share the same softmax with all the negatives. Complementing this, we present a lightweight text adapter applied only to web captions, reducing the domain gap to curated annotations and enabling effective use of large-scale unsupervised text. Our model demonstrates state-of-the-art open-vocabulary performance on long-tail benchmarks, yielding substantial zero-shot improvements while sustaining high frame rates.
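The abstract does not give the formula for the decoupled multi-positive contrastive loss, so the following is only one possible reading of "separating positive aggregation from negative competition", not the paper's definition. In this sketch the positives are first averaged into a single score, and that aggregate alone competes with the negatives in the softmax, so positives never crowd each other out. The function name, the mean aggregation, and the temperature value are all assumptions.

```python
import math

def decoupled_multi_positive_loss(pos_sims, neg_sims, tau=0.07):
    """Aggregate positives first, then contrast the aggregate with negatives only.

    pos_sims: similarities between a 3D instance and its matched images/texts.
    neg_sims: similarities between the same instance and non-matching samples.
    """
    pos = sum(pos_sims) / len(pos_sims)            # positive aggregation (mean, assumed)
    logits = [pos / tau] + [s / tau for s in neg_sims]
    m = max(logits)                                # log-sum-exp stabilisation
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos / tau             # -log softmax of the aggregate

loss_aligned = decoupled_multi_positive_loss([0.9, 0.8], [0.1, 0.0])
loss_misaligned = decoupled_multi_positive_loss([0.2, 0.1], [0.1, 0.0])
print(loss_aligned < loss_misaligned)
```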

2606.01315 2026-06-02 cs.CV 版本更新

DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images

DeblurNVS:基于几何潜在扩散的稀疏运动模糊图像新视角合成

Changyue Shi, Wangbo Yu, Chaoran Feng, Li Yuan

发表机构 * School of AI for Science, Peking University Shenzhen Graduate School(人工智能科学学院,北京大学深圳研究生院) School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School(电子与计算机工程学院,北京大学深圳研究生院)

AI总结 提出DeblurNVS框架,利用几何潜在扩散从稀疏运动模糊图像中直接合成高保真新视角,无需逐场景优化。

详情
AI中文摘要

新视角合成(NVS)是计算机视觉和图形学中的一个基本问题。神经辐射场(NeRF)、3D高斯泼溅(3DGS)和生成式视角合成的最新进展显著提高了其质量。然而,大多数方法仍然依赖于清晰观测,其中图像结构和跨视角几何线索得以良好保留。运动模糊通过破坏局部细节和削弱多视角对应关系打破了这一假设。这种模糊通常由实际拍摄中的相机抖动、场景运动或有限曝光引起。模糊感知的NVS方法通过建模图像形成来解决这一退化问题,但它们依赖于昂贵的逐场景优化,限制了高效且可泛化的稀疏视角合成。为了解决这个问题,我们提出了DeblurNVS,一种新颖的框架,可以直接从稀疏运动模糊图像中合成高保真新视角,无需逐场景优化。DeblurNVS恢复了多视角推理所需的中间几何表示,使模糊输入能够恢复可靠的结构和对应线索。然后,将恢复的表示与目标相机信息结合,合成目标视角表示并重建清晰的RGB新视角。为了实现大规模训练,我们使用基于插值的有限曝光模糊合成方法,从DL3DV-10K构建了一个运动模糊NVS数据集。大量实验表明,DeblurNVS在合成运动模糊基准上优于现有基线,并能泛化到真实运动模糊场景,生成感知上更清晰、结构上更稳定的新视角,同时避免了昂贵的逐场景优化。项目页面:https://github.com/PKU-YuanGroup/DeblurNVS。

英文摘要

Novel view synthesis (NVS) is a fundamental problem in computer vision and graphics. Recent advances in neural radiance fields (NeRF), 3D Gaussian Splatting (3DGS), and generative view synthesis have substantially improved its quality. Yet most methods still rely on clean observations, where image structures and cross-view geometric cues are well preserved. Motion blur breaks this assumption by corrupting local details and weakening multi-view correspondences. Such blur commonly arises from camera shake, scene motion, or finite exposure in practical capture. Blur-aware NVS methods address this degradation by modeling image formation, but their reliance on costly per-scene optimization limits efficient and generalizable sparse-view synthesis. To address this, we propose DeblurNVS, a novel framework for synthesizing high-fidelity novel views directly from sparse motion-blurred images, without requiring per-scene optimization. DeblurNVS restores the intermediate geometric representations needed for multi-view reasoning, enabling blurred inputs to recover reliable structure and correspondence cues. The restored representations are then combined with target camera information to synthesize the target-view representation and reconstruct a sharp RGB novel view. To enable large-scale training, we construct a motion-blurred NVS dataset from DL3DV-10K using interpolation-based finite-exposure blur synthesis. Extensive experiments demonstrate that DeblurNVS outperforms existing baselines on synthetic motion-blur benchmarks and generalizes to real motion-blurred scenes, producing perceptually sharper and structurally more stable novel views while avoiding costly per-scene optimization. Project page: https://github.com/PKU-YuanGroup/DeblurNVS.

2606.01293 2026-06-02 eess.IV cs.AI cs.CV 版本更新

ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI

ResNet-34与轻量级解码器用于胎儿脑部MRI的准确高效分割

Ashiqur Rahman, Muhammad E. H. Chowdhury, Md. Abu Sayed, Md. Sharjis Ibne Wadud, Abu Naser Md. Arafat, Mehedi Hasan Prince

发表机构 * Department of Biomedical Physics and Technology, University of Dhaka(达卡大学生物医学物理与技术系) Department of Electrical Engineering, College of Engineering, Qatar University(卡塔尔大学工程学院电气工程系) Department of Biomedical Engineering, Jashore University of Science and Technology(杰索尔科技大学生物医学工程系)

AI总结 提出一种结合ResNet-34编码器和基于MLP的轻量级解码器的深度学习模型,以解决胎儿脑MRI分割中的运动伪影和强度不均匀问题,在FeTA 2021数据集上达到97.37%准确率和90.33%平均DSC。

详情
AI中文摘要

在磁共振成像(MRI)中准确分割胎儿脑组织对于先天性异常的早期诊断和改善产前护理至关重要。然而,由于胎儿运动、组织对比度低以及整个孕龄期解剖结构变异大,特别是分割白质、灰质、侧脑室、深部灰质、脑外脑脊液、小脑和脑干等复杂结构时,该任务仍然困难。针对这些难题,本研究引入了一种新颖的深度学习模型,该模型将ResNet-34编码器与利用多层感知器(MLP)模块进行自适应特征细化的轻量级解码器相结合。这种设计特别增强了模型保留解剖边界并减轻由运动伪影和强度不均匀引起的分割误差的能力。通过减少参数数量、采用双线性上采样代替转置卷积以及优化解码器以提高速度而不牺牲精度,实现了计算效率。在FeTA 2021数据集上使用5折交叉验证进行训练和验证,所提出的模型优于UNet、UNet++、DeepLabV3和DeepLabV3+等基线架构,平均准确率达到97.37%,平均Dice相似系数(DSC)为90.33%,平均交并比(IoU)为86.93%,精确率为90.83%。此外,其快速的推理时间和减少的计算负载使其非常适合集成到实时临床工作流程中。

英文摘要

Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalities and improving prenatal care. However, the task remains difficult because of fetal motion, low tissue contrast, and major anatomical variability throughout gestational ages, particularly in segmenting complex structures such as white matter, gray matter, lateral ventricles, deep gray matter, extra-cerebrospinal fluid, cerebellum, and brainstem. As a solution to these difficulties, this research introduces a novel deep learning model that combines a ResNet-34 encoder with a lightweight decoder leveraging multi-layer perceptron (MLP) modules for adaptive feature refinement. This design specifically enhances the model's ability to preserve anatomical boundaries and mitigate segmentation errors caused by motion artifacts and intensity inhomogeneities. Computational efficiency is achieved by reducing parameter count, employing bilinear upsampling instead of transposed convolutions, and optimizing the decoder for speed without sacrificing accuracy. Trained and validated on the FeTA 2021 dataset using 5-fold cross-validation, the proposed model outperforms baseline architectures such as UNet, UNet++, DeepLabV3, and DeepLabV3+, achieving an average Accuracy of 97.37% with a mean Dice Similarity Coefficient (DSC) of 90.33%, mean Intersection over Union (IoU) of 86.93%, and Precision of 90.83%. Additionally, its fast inference time and reduced computational load make it well-suited for integration into real-time clinical workflows.
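For readers unfamiliar with the two overlap metrics reported above, here is a minimal sketch of how the Dice Similarity Coefficient and Intersection over Union are computed for a single tissue class. The toy label sequences and the function name `dice_and_iou` are illustrative only; the paper's evaluation code is not shown in the abstract.

```python
def dice_and_iou(pred, target, label):
    """Dice and IoU for one class over two equal-length label sequences."""
    p = [v == label for v in pred]
    t = [v == label for v in target]
    inter = sum(a and b for a, b in zip(p, t))   # voxels labelled `label` in both
    total = sum(p) + sum(t)
    union = total - inter
    dice = 2 * inter / total if total else 1.0   # empty class counts as a perfect match
    iou = inter / union if union else 1.0
    return dice, iou

pred   = [0, 1, 1, 2, 2, 2]   # flattened predicted labels (toy data)
target = [0, 1, 2, 2, 2, 0]   # flattened ground-truth labels (toy data)
dice, iou = dice_and_iou(pred, target, label=2)
print(dice, iou)
```

A mean DSC or mean IoU, as quoted in the abstract, would average these per-class values over the tissue classes and test cases.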

2606.01287 2026-06-02 cs.CV cs.AI 版本更新

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

超越视觉记忆:潜在视觉推理的机制诊断

Garvin Guo, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Shuai Dong

发表机构 * Amap, Alibaba Group(阿里巴巴集团高德地图) Shanghai Innovation Institute(上海创新研究院)

AI总结 通过分解潜在令牌为三个可测试组件,发现边界标记和格式而非潜在槽贡献了主要性能提升,揭示了潜在视觉推理的真正机制。

详情
AI中文摘要

最近的潜在视觉推理方法通过在多模态语言模型中插入连续潜在令牌取得了显著提升。这些提升通常归因于令牌编码了视觉证据;然而,最近的分析揭示了一个悖论:令牌与图像关联松散,对答案贡献甚微。关键的是,这些分析将潜在令牌视为一个整体,掩盖了提升的真正来源。因此,我们将潜在令牌分解为三个可测试组件:潜在槽、边界标记和格式,并在有利条件下开发了一种最先进的方法作为探针。在六个方法-阶段设置和四个感知密集型基准测试中,潜在槽未能通过视觉记忆解释的所有预测。引人注目的是,在几种设置中,仅保留边界标记即可保留78%至100%的提升,而模型在潜在位置比在答案位置更窄地关注图像。因此,提升来自边界标记、格式以及这种注意力模式,而非潜在槽。每种方法如何利用这一机制取决于其训练监督:在匹配的准确率下,机制仍可能显著不同。因此,潜在视觉推理不仅需要根据准确率评估,还需要根据模型实际依赖的内容进行评估。

英文摘要

Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.

2606.01285 2026-06-02 cs.CV cs.AI 版本更新

Knowledge-Intensive Video Generation

知识密集型视频生成

Chenxu Wang, Mingda Chen

发表机构 * Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 针对文本到视频生成在事实性和实用性方面的不足,提出知识密集型视频生成(KIVI)任务,构建KIVI-Bench基准和自动评估指标,实验表明现有模型在视觉属性、操作过程和信息呈现上落后于人类。

详情
AI中文摘要

文本到视频生成在视觉质量上取得了快速进步,但在事实性和实际有用性方面仍缺乏评估。我们引入了知识密集型视频生成(KIVI),其中模型根据简短的信息寻求提示生成视频,这些提示要求解释、步骤或演示。为了评估这一设置,我们构建了KIVI-Bench,一个包含1,080个提示的基准,并提出了用于事实性和有用性的自动指标。人类评估表明,我们的指标比现有替代方案更符合人类标注。对七个最先进的视频生成模型的实验表明,当前系统仍落后于人类表现,尤其是在视觉属性、程序性操作和清晰的信息呈现方面。这些结果凸显了KIVI作为事实性和教学性视频生成的一个具有挑战性的方向。

英文摘要

Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We introduce knowledge-intensive video generation (KIVI), where models generate videos from short information-seeking prompts that ask for explanations, procedures, or demonstrations. To evaluate this setting, we construct KIVI-Bench, a benchmark of 1,080 prompts, and propose automatic metrics for factuality and helpfulness. Human evaluation shows that our metrics significantly better align with human annotations than existing alternatives. Experiments on seven state-of-the-art video generation models show that current systems still lag behind human performance, especially on visual properties, procedural operations, and clear information presentation. These results highlight KIVI as a challenging direction for factual and instructionally useful video generation.

2606.01282 2026-06-02 cs.CV cs.CY cs.LG 版本更新

KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation

KG-FairDiff: 知识图谱引导的提示词精炼用于人口统计公平的文本到图像生成

Farbod Davoodi, Seyed Reza Tavakoli Shiyadeh, Pooria Safaei, Sana Harighi, Parsa Gholami, Amirali Amini, Kimia Vanaei, Emad Firoozi, Parham Abed Azad, Babak Khalaj, Siavash Ahmadi, Amir Hossein Payberah, Mohammad Hossein Rohban, Soheil Kolouri, Ali Diba

发表机构 * University of Science and Technology of China(中国科学技术大学) Sharif University of Technology(谢里夫理工大学) Iran University of Science and Technology(伊朗科学技术大学)

AI总结 提出KG-FairDiff框架,通过知识图谱引导的提示词精炼,在推理时优化公平性损失,减少文本到图像生成中的性别、种族、年龄等人口统计偏差,同时保持语义保真度。

详情
AI中文摘要

文本到图像(TTI)系统现已成为新闻、教育、广告和公共传播的日常基础设施,它们从训练数据中继承的人口统计和文化刻板印象(将女性、有色人种、老年人和非西方文化描绘为代表性不足或漫画化)在部署规模上成为人口层面的危害。现有的缓解措施要么需要昂贵的重新训练,这对于主导消费产品的闭源骨干网络不可行,要么依赖于忽略文化背景的固定人口统计模板。我们提出了KG-FairDiff,一个模型无关、推理时框架,将公平感知的提示词精炼形式化为一个约束优化问题,并将其实现为一个闭环流水线:一个包含约1200个文化和偏见相关三元组的知识图谱检索结构化上下文,一个LLM改写器提出精炼,一个验证器仅接受那些减少基于散度的公平性损失同时保持用户原始意图语义保真度的提示词。我们证明了精炼循环的有限终止界限,贡献了一个数学上一致的评估套件,将Bias-P/Bias-W与目标分布的散度以及ENS与KL散度联系起来,并审计了八个广泛部署的骨干生成器。KG-FairDiff显著减少了性别、种族、年龄和交叉差异,同时保持了提示词语义,为更公平的生成式AI提供了一条实用、可部署的路径。

英文摘要

Text-to-Image (TTI) systems are now everyday infrastructure for journalism, education, advertising, and public communication, and the demographic and cultural stereotypes they inherit from training data (rendering women, people of colour, older adults, and non-Western cultures as under-represented or caricatured) become a population-level harm at deployment scale. Existing mitigations either require costly retraining, infeasible for the closed-source backbones that dominate consumer products, or rely on fixed demographic templates that ignore cultural context. We present KG-FairDiff, a model-agnostic, inference-time framework that formalises fairness-aware prompt refinement as a constrained optimisation problem and operationalises it as a closed-loop pipeline: a knowledge graph of ~1,200 culture- and bias-related triples retrieves structured context, an LLM rewriter proposes refinements, and a validator accepts only prompts that reduce a divergence-based fairness loss while preserving semantic fidelity to the user's original intent. We prove a finite-termination bound for the refinement loop, contribute a mathematically consistent evaluation suite linking Bias-P/Bias-W to divergence from target distributions and ENS to KL divergence, and audit eight widely-deployed backbone generators. KG-FairDiff substantially reduces gender, race, age, and intersectional disparities while preserving prompt semantics, offering a practical, deployment-ready route to more equitable generative AI.
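The validator step described above (accept a rewritten prompt only if it lowers a divergence-based fairness loss while staying semantically close to the original) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the KL direction, the uniform target, the `min_similarity` threshold, and the function names are hypothetical, and the similarity score is assumed to come from some external embedding model.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def accept(old_dist, new_dist, target, similarity, min_similarity=0.8):
    """Accept a refined prompt only if fairness loss drops and meaning is kept."""
    improves = kl_divergence(new_dist, target) < kl_divergence(old_dist, target)
    return improves and similarity >= min_similarity

target = [0.5, 0.5]    # assumed target distribution over two demographic groups
before = [0.9, 0.1]    # hypothetical distribution observed for the original prompt
after  = [0.6, 0.4]    # hypothetical distribution observed for the refined prompt
print(accept(before, after, target, similarity=0.9))
```

Because each accepted refinement must strictly reduce the loss, a loop built on this rule cannot cycle, which is consistent with the finite-termination claim in the abstract.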

2606.01280 2026-06-02 cs.CV 版本更新

Event-Based Vision in Space: Applications, Trends, and Future Directions

太空中的事件视觉:应用、趋势与未来方向

Luigi Capogrosso, Pietro Bonazzi, Michele Magno

发表机构 * Interdisciplinary Transformation University of Austria(交叉学科转型奥地利大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文综述了事件视觉传感器在太空领域的应用,通过分类四个主要领域(大气与高速观测、环境监测与变化检测、操作支持与星上处理、地理空间建模与预测分析),指出神经形态工程是解决现代遥感与可持续太空探索关键瓶颈的范式转变。

Comments Accepted at the XXIV Annual Conference on Sensors and Microsystems (AISEM) 2026

详情
AI中文摘要

地球观测(EO)正经历由新型传感技术部署驱动的重大变革。传统的基于帧的光学传感器在具有挑战性的轨道环境中常受运动模糊、高功耗和极端数据冗余的困扰。相比之下,事件传感器(也称为神经形态相机)提供了一种仿生异步方法。通过仅捕获局部光照变化,它们提供微秒级时间分辨率、极高动态范围和卓越能效。尽管这些传感器的使用正从地面系统迅速扩展到轨道平台,但围绕其太空应用的科学文献仍然高度分散。为弥合这一差距,本文对太空领域事件视觉的最新技术进行了全面综述。基于检索到的文献,我们引入了一个围绕四个主要领域构建的分类体系:1)大气与高速观测;2)环境监测与变化检测;3)操作支持与星上处理;4)地理空间建模与预测分析。因此,本综述强调,神经形态工程远不止是一种补充成像技术;它是一种范式转变,可直接用于解决现代遥感和可持续太空探索中的关键瓶颈。

英文摘要

Earth Observation (EO) is undergoing a significant transformation driven by the deployment of novel sensing technologies. Traditional frame-based optical sensors often struggle with motion blur, high power consumption, and extreme data redundancy in challenging orbital environments. In contrast, event-based sensors, also known as neuromorphic cameras, offer a bio-inspired asynchronous approach. By capturing only local illumination changes, they provide microsecond temporal resolution, an extremely high dynamic range, and exceptional energy efficiency. Although the use of these sensors is rapidly expanding from terrestrial systems to orbital platforms, the scientific literature surrounding their space-based applications remains heavily fragmented. To bridge this gap, this article presents a comprehensive review of the state-of-the-art in event-based vision in the space domain. Based on the retrieved literature, we introduce a taxonomy structured around four primary domains: 1) atmospheric and high-speed observation; 2) environmental monitoring and change detection; 3) operational support and onboard processing; and 4) geospatial modeling and predictive analysis. As a result, this survey highlights that neuromorphic engineering is far more than a supplementary imaging technique; it is a paradigm shift that can be used to directly address critical bottlenecks in modern remote sensing and sustainable space exploration.
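The phrase "capturing only local illumination changes" refers to the usual event-camera model, in which a pixel emits an event whenever its log intensity moves by more than a contrast threshold since the last event. The sketch below illustrates that model for a single pixel. The threshold value and function name are assumptions, and real sensors add noise, refractory periods, and timestamps that are not modelled here.

```python
import math

def events_for_pixel(intensities, threshold=0.2):
    """Return (sample_index, polarity) events for one pixel's intensity samples."""
    events = []
    ref = math.log(intensities[0])           # log intensity at the last event
    for i, value in enumerate(intensities[1:], start=1):
        diff = math.log(value) - ref
        while abs(diff) >= threshold:         # a large change fires several events
            polarity = 1 if diff > 0 else -1
            events.append((i, polarity))
            ref += polarity * threshold
            diff = math.log(value) - ref
    return events

print(events_for_pixel([1.0, 1.0, 1.0]))     # static pixel: no output at all
print(events_for_pixel([1.0, 1.0, 2.0]))     # brightening pixel: positive events
```

The static pixel producing no data is the source of the redundancy and energy savings that the survey highlights.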

2606.01277 2026-06-02 cs.RO cs.AI cs.CV cs.SY eess.IV eess.SY 版本更新

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

DeepIPCv3: 面向突发行人穿越避让的事件感知多模态传感器融合

Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky, Jazi Eko Istiyanto, Jun Miura

发表机构 * Department of Computer Science and Electronics, Universitas Gadjah Mada(计算机科学与电子系,加查马达大学) Department of Computer Science and Engineering, Toyohashi University of Technology(计算机科学与工程系,丰桥技术科学大学)

AI总结 提出DeepIPCv3框架,通过Transformer交叉模态注意力融合LiDAR点云与DVS事件流,实现突发行人穿越场景下的高反应性避让,在自定义多模态数据集上达到最优轨迹与控制精度。

详情
AI中文摘要

当前的端到端自动驾驶系统主要依赖基于帧的传感器,这类传感器在高度动态的突发行人穿越场景中存在固有的感知延迟和运动模糊问题。为解决这一关键安全漏洞,我们提出DeepIPCv3,一种新颖的多模态自主导航框架,它将LiDAR点云的密集3D空间几何与动态视觉传感器(DVS)的微秒级异步事件流协同融合。我们引入了一种受Transformer启发的交叉模态注意力机制,以动态关联这些不同模态,使网络能够即时优先处理高速动态更新,同时不牺牲场景结构感知。融合后的潜在表示通过一个混合策略网络映射到安全的局部路径点和可执行控制命令,该网络结合了启发式轨迹跟踪与直接神经预测。由于在真实场景中测试这些突发穿越场景存在严重物理风险,该框架使用在光照良好的正午和具有挑战性的傍晚条件下收集的自定义多模态数据集进行严格离线评估。广泛的对比和消融研究表明,DeepIPCv3达到了最先进的预测性能。通过有效消除曝光失败和运动模糊,所提出的LiDAR与DVS融合实现了最低的轨迹和控制命令误差,使得无论环境光照如何,都能实现高反应性、数学上有界的规避机动。为支持未来研究,我们将代码发布到GitHub仓库:https://github.com/oskarnatan/DeepIPCv3。

英文摘要

Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the code in our GitHub repository at https://github.com/oskarnatan/DeepIPCv3.

2606.01271 2026-06-02 cs.CV 版本更新

Exploiting In-Sensor Computing for Energy-Efficient Earth Observation

利用传感器内计算实现节能地球观测

Luigi Capogrosso, Pietro Bonazzi, Loris Hoxhaj, Michele Magno

发表机构 * Interdisciplinary Transformation University of Austria(跨学科转型奥地利大学) ETH Zurich(苏黎世联邦理工学院) University of Verona(维罗纳大学)

AI总结 针对卫星数据下行带宽瓶颈,提出基于TinyML和索尼IMX500传感器的传感器内计算框架,在8MB约束下达到96.68%精度和42.26 GMAC/J能效。

Comments Accepted at the XXIV Annual Conference on Sensors and Microsystems (AISEM) 2026

详情
AI中文摘要

卫星产业的快速增长推动了地理空间数据获取的大幅增加,凸显了一个关键瓶颈:收集的传感器数据量与地面站有限的可用下行带宽之间的严重不匹配。虽然星载计算通过在轨预处理数据帮助解决了这一问题,但本文通过引入传感器内计算框架进一步推进了这一范式。我们通过将TinyML技术与索尼IMX500智能视觉传感器集成,提出了一种针对严格计算约束优化的端到端地球观测流水线。具体来说,我们的方法将处理直接转移到传感器级别,减轻了主嵌入式设备的计算负担,并有效减少了噪声或无关数据的下行传输。我们在EuroSAT数据集上评估了几种高效的卷积神经网络,即SqueezeNet、ShuffleNetV2和MCUNetV1。实验结果表明,尽管部署在IMX500平台上需要优化,我们的模型在其8 MB约束内保持了具有竞争力的96.68%准确率。具体来说,模型达到平均处理吞吐量17.40 FPS,延迟27.43 ms。此外,我们的系统配置文件表现出高能效,每次推理的低能耗为14.19 mJ,能效评级为42.26 GMAC/J,证明了其在传感器内部署的可行性。

英文摘要

The rapid growth of the satellite industry has driven a significant increase in geospatial data acquisition, highlighting a critical bottleneck: the severe disparity between the volume of collected sensor data and the limited downlink bandwidth available to ground stations. While On-Board Computing (OBC) has helped address this by pre-processing data in orbit, this article further advances the paradigm by introducing an in-sensor computing framework. We present an optimized end-to-end Earth Observation (EO) pipeline tailored for strict computational constraints by integrating TinyML techniques with the Sony IMX500 Intelligent Vision Sensor. Specifically, our approach shifts processing directly to the sensor level, offloading the computation from the primary embedded device, and effectively mitigating the downlink transmission of noisy or irrelevant data. We evaluated several efficient Convolutional Neural Networks (ConvNets), i.e., SqueezeNet, ShuffleNetV2, and MCUNetV1, on the EuroSAT dataset. Experimental results show that, despite the optimizations required for deployment on the IMX500 platform, our models maintain a competitive 96.68% accuracy while operating within its 8 MB constraints. Specifically, the models reach an average processing throughput of 17.40 FPS with a latency of 27.43 ms. Furthermore, our system profile exhibits high energy efficiency, with a low energy footprint of 14.19 mJ per inference and an efficiency rating of 42.26 GMAC/J, demonstrating its viability for in-sensor deployment.
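The energy figures above can be related to each other with a short back-of-the-envelope calculation. The sketch assumes that the efficiency rating is defined as MACs per inference divided by energy per inference, and that average inference power is simply energy per inference times throughput (ignoring idle and sensor readout power). Neither definition is stated in the abstract, so the derived values are indicative only.

```python
energy_per_inference_j = 14.19e-3   # 14.19 mJ, from the abstract
efficiency_mac_per_j = 42.26e9      # 42.26 GMAC/J, from the abstract
throughput_fps = 17.40              # from the abstract

# Implied compute per inference, assuming efficiency = MACs / energy.
macs_per_inference = efficiency_mac_per_j * energy_per_inference_j

# Implied average power while running at the reported throughput (assumed model).
avg_inference_power_w = energy_per_inference_j * throughput_fps

print(f"{macs_per_inference / 1e9:.2f} GMAC per inference")
print(f"{avg_inference_power_w * 1e3:.0f} mW at {throughput_fps} FPS")
```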

2606.01247 2026-06-02 cs.CV 版本更新

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

看向哪里:基础模型能否通过主动探索达到目标视角?

Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen

发表机构 * Zhejiang University(浙江大学)

AI总结 提出目标视角复现(TVR)任务及TVRBench基准,通过分析现有模型瓶颈并构建统一后训练框架,将9B开源模型成功率提升至50%以上。

Comments Project page: https://github.com/aim-uofa/TVRBench

详情
AI中文摘要

人类可以通过主动的头部和身体运动复现目标图像指定的视角,然而基础模型中的空间智能主要被研究为对预先收集的观测的被动理解。我们引入了目标视角复现(TVR)——一个主动任务,其中智能体在3D环境中调整其视角,直到其观测与给定的目标图像匹配——以及TVRBench,一个涵盖场景尺度和目标视角视觉丰富度的室内模拟基准。TVR远未解决:在评估集上,最强的开源和闭源模型仅达到7.8%和12.0%的成功率。细粒度分析识别出两个一致的瓶颈:现成模型难以处理多轮视觉历史,并且当视角复现需要身体平移而非原地旋转时,性能急剧下降,暴露了将空间差异映射到具身运动方面的差距。为了研究缩小这一差距,我们构建了一个统一的TVR后训练框架,涵盖专家轨迹SFT、理由监督的CoT-SFT、离线单轮GRPO以及来自实时模拟器rollout的在线多轮GRPO。视觉-动作SFT提供了主要增益,将9B开源模型提升至50.8%的成功率;多轮GRPO提供了针对性的多房间细化,总体达到51.4%,而CoT监督和单轮GRPO降低了闭环性能。这些结果将TVRBench确立为衡量和训练主动感知并在3D环境中行动的基础模型的测试平台。我们的代码、数据和模型可在https://github.com/aim-uofa/TVRBench获取。

英文摘要

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study reducing this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.

2605.03403 2026-06-02 cs.CV cs.LG 版本更新

GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

GRPO-TTA:基于GRPO驱动的强化学习进行视觉语言模型的测试时视觉调优

Yujun Li, Hongyuan Zhang, Yuan Yuan

发表机构 * School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University(人工智能、光学与电子学院(iOPEN),西北工业大学)

AI总结 提出GRPO-TTA方法,将GRPO应用于测试时适应,通过将类特定提示预测重构为组策略优化问题,并设计对齐奖励和分散奖励,在多种基准上优于现有方法。

详情
AI中文摘要

组相对策略优化(GRPO)最近在大型语言模型和视觉语言模型的后训练中展现出强大性能。这引发了一个问题:GRPO是否也能显著促进视觉语言模型的测试时适应(TTA)。在本文中,我们提出了用于测试时适应的组相对策略优化(GRPO-TTA),通过将类特定提示预测重构为组策略优化问题,将GRPO适应到TTA设置。具体来说,我们通过从CLIP相似度分布中采样top-K类候选来构建输出组,从而在无需真实标签的情况下实现概率驱动的优化。此外,我们设计了针对测试时适应的奖励函数,包括对齐奖励和分散奖励,以指导有效的视觉编码器调优。在多种基准上的大量实验表明,GRPO-TTA一致优于现有的测试时适应方法,在自然分布偏移下性能提升尤为显著。

英文摘要

Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also substantially benefit the test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
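The two generic ingredients named above, picking top-K class candidates from a similarity distribution and scoring each candidate relative to its group, can be sketched in a few lines. The similarity values and rewards below are made-up placeholders, and the paper's alignment and dispersion rewards are not reproduced; only the standard GRPO-style normalisation (reward minus group mean, divided by group standard deviation) is shown.

```python
import math

def top_k_candidates(similarities, k):
    """Indices of the k classes with the highest similarity to the image."""
    order = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return order[:k]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

similarities = [0.31, 0.27, 0.12, 0.29, 0.05]    # hypothetical image-text similarities
group = top_k_candidates(similarities, k=3)
rewards = [1.0, 0.0, 0.5]                         # hypothetical per-candidate rewards
advantages = group_relative_advantages(rewards)
print(group, advantages)
```

Candidates with above-average reward receive positive advantages and the rest negative ones, so no ground-truth label is needed to produce a learning signal.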

2606.01234 2026-06-02 econ.GN cs.CE cs.CV cs.GT cs.LG physics.soc-ph q-fin.EC 版本更新

Differing Roles of Leisure and Productivity in GDP - A Machine Learning based comparative analysis of Germany and USA

休闲与生产力在GDP中的不同作用——基于机器学习的德国与美国比较分析

Achintya Ranjan, Uma Ranjan

发表机构 * Achintya Ranjan(阿金蒂亚·兰詹) Uma Ranjan(乌玛·兰詹)

AI总结 本研究通过随机森林模型分析工作时间和全要素生产率对GDP的影响,并利用Gini重要性、SHAP图和部分依赖图揭示德国与美国社会结构差异在GDP贡献中的体现。

Comments International Conference on Emerging Techniques in Computational Intelligence 2025

详情
AI中文摘要

一个国家的GDP被建模为两个因素之间的相对相互作用——工作时间,反映人口的社会选择,以及全要素生产率,反映对生产力提升因素的集体投资。研究表明,随机森林模型可以从这两个因素准确预测GDP。通过Gini重要性、SHAP图和部分依赖图分析了德国和美国所做的选择差异。结果表明,国家社会结构的差异反映在工作时间和生产率对GDP的相对贡献中。

英文摘要

The GDP of a country is modelled as the relative interaction between two factors: working hours, reflecting the social choice of a population, and Total Factor Productivity, reflecting the collective investment in productivity enhancers. It is shown that a Random Forest model can accurately predict the GDP from these two factors. The differences in the choices made by Germany and the USA are analysed through Gini importance, SHAP plots and partial dependence plots. It is shown that the differences in the social structure of the countries are reflected in the relative contribution of working hours and productivity to the GDP.

2606.01217 2026-06-02 cs.CV cs.LG stat.AP 版本更新

Analysis of Ethnic Disparities in Autism Spectrum Disorder among Toddlers

幼儿自闭症谱系障碍中的种族差异分析

Aadithya Prabha Ramaharsha, Deevna Reddy, Uma Ranjan

发表机构 * Sri Ramachandra Institute of Higher Education and Research(Sri Ramachandra高等教育与研究学院)

AI总结 通过逻辑回归分析,研究种族、行为评分、性别和新生儿黄疸对幼儿自闭症谱系障碍(ASD)的影响,发现白种人ASD风险比亚洲人高81%,中东人低79%,并确认新生儿黄疸和男性为显著风险因素。

Comments Third International Conference on Biomedical Engineering Science and Technology

详情
AI中文摘要

自闭症谱系障碍(ASD)是一种以沟通和行为挑战为特征的神经发育障碍。本研究考察了种族与ASD特征之间的关系,以及行为评分、性别和新生儿黄疸在三个种族群体(欧洲白人、亚洲人和中东人)中的差异。我们进行了逻辑回归分析,表明种族对ASD发病率有显著影响。与亚洲人相比,欧洲白人患ASD的风险增加81%,而中东人患ASD的风险降低79%。我们还证实了早期研究的结论,即新生儿黄疸是ASD的重要预测因子,而男性儿童患ASD的风险远高于女性儿童。这些结果表明,需要建立在ASD特征的表现和评估中考虑种族差异的诊断框架和干预措施。

英文摘要

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by challenges in communication and behavior. This study examines the relationship between ethnicity and ASD traits, along with behavioural scores, sex and neonatal jaundice across three ethnic groups: White Europeans, Asians, and Middle Eastern individuals. We perform a logistic regression and show that ethnicity has a significant effect on the incidence of ASD. White Europeans are at 81% increased risk of ASD and Middle Easterners are at 79% reduced risk of ASD compared to Asians. We also confirm earlier studies which show that neonatal jaundice is a significant predictor of ASD, while male children are at much higher risk of ASD compared to female children. These results suggest the need for diagnostic frameworks and interventions that account for ethnic differences in the presentation and assessment of ASD traits.
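
As a worked reading of the percentages above: in a logistic regression, the exponential of a coefficient is the odds ratio relative to the reference group. The sketch below assumes, which the abstract does not state explicitly, that "81% increased risk" and "79% reduced risk" are odds ratios of 1.81 and 0.21 against the Asian reference group, and recovers the coefficients those ratios would imply. The function name is hypothetical.

```python
import math

def coefficient_from_percent_change(percent_change):
    """Logistic-regression coefficient implied by a percent change in odds.

    Assumes the percentage is an odds ratio relative to the reference group:
    odds_ratio = 1 + percent_change / 100, and beta = ln(odds_ratio).
    """
    odds_ratio = 1.0 + percent_change / 100.0
    return math.log(odds_ratio)

# Assumed interpretation of the abstract's figures (reference group: Asians).
beta_white_european = coefficient_from_percent_change(81)    # odds ratio 1.81
beta_middle_eastern = coefficient_from_percent_change(-79)   # odds ratio 0.21
```

A positive coefficient corresponds to higher odds than the reference group and a negative one to lower odds, so the two groups sit on opposite sides of the baseline.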

2606.01215 2026-06-02 cs.CV cs.AI cs.CL cs.MM 版本更新

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

将神经符号程序蒸馏到3D多模态大语言模型中

Wentao Mo, Yang Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出APEIRIA,通过三阶段课程学习将符号推理模式蒸馏到3D多模态大语言模型中,实现透明推理与开放词汇空间推理的统一。

Comments To appear in ICML 2026

详情
AI中文摘要

当前的3D空间推理方法面临根本性权衡:神经符号3D(NS3D)概念学习器通过组合程序实现可解释推理,但受限于封闭集概念词汇和简单程序;端到端3D多模态大语言模型(3D MLLMs)能处理复杂自然语言和开放词汇概念,但缺乏显式空间验证的黑箱推理。我们提出APEIRIA,一种神经符号3D MLLM,通过将符号推理模式以自然语言思维链形式蒸馏到MLLMs中,桥接两种范式。我们的三阶段课程逐步构建推理能力:a) 3D感知对齐将物体视觉-几何特征接地到LLM,b) CoT-SFT从符号程序轨迹中教授查询分解和逐步验证,c) CoT-RL将推理模式扩展到开放集概念和深度嵌套指令。通过迁移推理模式而非概念特定知识,APEIRIA保留了NS3D的关键优点:透明推理以及规划和感知组件的模块化可互换性。在接地、问答和描述任务上的评估表明,APEIRIA超越了先前的NS3D方法,并在3D空间推理数据集上匹配最先进的3D MLLMs,统一了符号方法的系统推理与MLLMs的灵活性。代码见https://github.com/oceanflowlab/APEIRIA。

英文摘要

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) can handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM that bridges the two paradigms by distilling symbolic reasoning patterns into MLLMs as natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.

2606.01213 2026-06-02 cs.CV cs.AI cs.CL 版本更新

TECCI: Tricky Edits of Collected and Curated Images

TECCI:收集与策划图像的棘手编辑

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

发表机构 * Google Research(谷歌研究院) Google DeepMind(谷歌DeepMind)

AI总结 提出TECCI基准,包含7550对图像与编辑指令,通过人工与自动评估揭示现有图像编辑模型在指令遵循、最小编辑和视觉质量方面的不足。

详情
AI中文摘要

尽管近期取得了巨大进展,但当前的文本引导图像编辑方法在涉及指令遵循、最小化编辑源图像以及确保高视觉质量等多个方面仍面临困难。当请求的编辑具有挑战性时,例如涉及位置、运动、视角、比例和创意编辑,这些问题尤为明显。为了系统性地测试生成式图像编辑器,我们提出了一个新的图像编辑基准——TECCI:收集与策划图像的棘手编辑。TECCI包含我们发布的全新图像集。TECCI中的图像涵盖7个图像类别。这些图像和类别经过有意策划,以针对现有方法的弱点。TECCI中的编辑指令由Gemini自动生成,每个源图像覆盖5种编辑类型。我们还策划了一组530张图像,为其创建了具有挑战性的人工编写编辑指令。总体而言,TECCI包含7550对图像和编辑指令。我们对TECCI上的五个领先图像编辑模型进行了人工评估。人类从三个维度判断输出:1)指令遵循,2)编辑的最小性,以及3)视觉质量。为了扩大评估规模,我们还使用Gemini构建了一个自动评分器,在匹配人类评估方面达到了74.7%的准确率。我们的评估揭示:1)没有一个模型的总体成功率超过22%,这显示了TECCI的挑战性;2)Nano Banana Pro是整体表现最好的模型;3)模型在指令遵循方面表现显著优于最小编辑和视觉质量;4)模型在编辑建筑和自然图像方面存在困难,这些需要较强的空间布局和复杂视觉细节理解能力;5)推理和创意编辑是最困难的,而颜色和外观编辑是最容易的。

英文摘要

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceeds a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images, which require strong understanding of spatial layout and intricate visual details, and 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

2606.01207 2026-06-02 cs.CV cs.LG 版本更新

Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning

特征对齐决定融合策略:多模态学习中交叉注意力与拼接的比较研究

Zhiqiang Zhou, Xuezhen Xie

发表机构 * Hunan Chemical Industry Vocational and Technical College(湖南化学工业职业技术学院)

AI总结 通过实验和理论分析,证明特征对齐质量而非数据规模是决定多模态融合策略优劣的关键因素,当特征预对齐时拼接优于交叉注意力。

Comments 8 pages, 6 figures, 4 tables

详情
AI中文摘要

在多模态融合中,交叉注意力与拼接的选择仍由实践者直觉而非原理性理解主导。本文通过使用两个特征提取骨干(ResNet18和CLIP ViT-B/32)在Flickr8k上的控制实验,证明特征对齐质量(而非仅数据规模)是决定哪种融合策略更优的主要因素。当特征通过视觉语言预训练目标预对齐时,在所有测试规模(2048-16384样本)下,拼接比交叉注意力高出4.1-5.1个百分点。我们提供了基于样本复杂度分析的理论解释:拼接需要O(d_v + d_t)个样本来学习其融合投影,而交叉注意力需要O(d_v * d_t)个样本来学习双线性注意力权重,对于512维CLIP特征,后者是前者的256倍以上。当特征已经对齐时,两种方法的近似误差差距消失,拼接的样本效率在所有实际数据集规模上占优。对齐退化研究证实了单调趋势:随着特征对齐退化,拼接的优势从1.3%增长到2.8%。这些发现为多模态系统中的融合方法选择提供了原理性决策框架,对多模态大语言模型的设计具有直接影响。

英文摘要

The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation's sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation's advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
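
The sample-complexity comparison in this abstract can be checked with a short calculation. The sketch below only compares the leading-order terms O(d_v + d_t) and O(d_v * d_t) and ignores the constants hidden by the big-O notation; the function names are hypothetical.

```python
def concat_sample_order(d_v, d_t):
    """Leading-order sample requirement for concatenation: O(d_v + d_t)."""
    return d_v + d_t

def cross_attention_sample_order(d_v, d_t):
    """Leading-order sample requirement for bilinear cross-attention: O(d_v * d_t)."""
    return d_v * d_t

# 512-dimensional CLIP features for both modalities, as in the abstract.
d_v = d_t = 512
ratio = cross_attention_sample_order(d_v, d_t) / concat_sample_order(d_v, d_t)
```

With d_v = d_t = 512 the two orders are 262,144 and 1,024, which gives the factor of 256 that the abstract quotes as a lower bound.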

2606.01192 2026-06-02 cs.CV 版本更新

PairedGTA: Generating Driving Datasets for Controlled Photometric Shift Analysis

PairedGTA:用于受控光度偏移分析的驾驶数据集生成

Andrea Chianese, Giulio Rossolini, Alessandro Biondi, Marco Cococcioni, Giorgio Buttazzo

发表机构 * Scuola Superiore Sant’Anna(圣安娜高等学院) Department of Excellence in Robotics & AI(机器人与人工智能卓越部门) University of Pisa(比萨大学)

AI总结 提出基于高保真游戏引擎的PairedGTA框架,通过生成完美配对的图像,实现独立于几何和语义变化的光度偏移分析,并用于评估语义分割模型在恶劣条件下的性能退化。

Comments Under review

详情
AI中文摘要

评估自动驾驶视觉感知系统的性能对于确保在不同环境场景下的可靠运行至关重要。理想情况下,要在不同恶劣条件下进行平衡和公平的分析,需要同一场景在不同天气或光照变化下的完美配对图像。这将允许独立于几何和语义变化来评估光度偏移的影响。不幸的是,真实世界数据集很少提供同一场景在不同环境条件下的图像,因为通常相机姿态、交通和动态物体(车辆、行人等)的位置随时间变化,因此只能提供粗略配对的数据。为了解决这一挑战,本工作引入了一种基于高保真游戏引擎的数据生成框架,用于提取完美配对的图像。通过利用与GTA游戏引擎通信的软件API,该框架在保持场景几何、相机姿态以及动态物体的身份和位置的同时,修改光照和天气条件。对于每个采样位置,它程序化地实例化动态实体,并在各种恶劣条件下渲染像素对齐的图像。通过在语义分割模型上的系统分析,展示了所提出的生成框架在驾驶场景中的优势,其输出退化可以更直接地归因于光度偏移,而不是不受控制的语义或几何因素。

英文摘要

Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometric and semantic changes. Unfortunately, real-world datasets rarely provide images of the same scene under different environmental conditions, because camera pose, traffic, and the locations of dynamic objects (vehicles, pedestrians, etc.) typically vary over time, yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high-fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.

2606.01173 2026-06-02 cs.CV 版本更新

Reusing Fusion-Time Spectral Reliability for Adaptive Fusion and Expert Routing in RGB-Infrared Object Detection

复用融合时频谱可靠性用于RGB-红外目标检测的自适应融合与专家路由

Yefeng Wu

发表机构 * Tsinghua University(清华大学)

AI总结 提出一种无参数的7维频谱可靠性描述符,通过频谱可靠性融合和可靠性条件专家路由,提升RGB-红外目标检测在退化条件下的性能。

详情
AI中文摘要

RGB-红外检测器通常会丢弃跨模态融合过程中产生的统计信息,使得下游模块无法知晓当前交互是否可靠。我们提出提取一个无参数的7维频谱可靠性描述符——汇总频带能量、幅度比、相位一致性和跨模态相关性——并在融合阶段之外复用该描述符。该描述符驱动频谱可靠性融合(SRF),它将频谱残差与保守的空间基进行门控,以及可靠性条件专家路由(RCER),它将描述符与池化内容结合以引导稀疏的后融合专家。在匹配消融实验下,描述符感知门控相比仅内容自适应门控提高了mAP50;一个2×2因子分析进一步表明,在参数数量几乎相等的情况下,描述符条件路由相比仅专家架构提供了更大的边际增益。在DroneVehicle上的六种合成退化条件下,平均保留率提升至95.0%,而仅内容MoE为92.0%,拼接为87.9%,在模态缺失下增益最大;同一模型在自然白天/黑夜分割上也分别提高了+5.2/+5.3的mAP50。这些结果表明,将融合时可靠性作为显式信号保留有利于自适应融合和融合后条件计算。

英文摘要

RGB-infrared detectors typically discard the statistics generated during cross-modal fusion, leaving downstream modules unaware of whether the current interaction is reliable. We propose to extract a parameter-free, 7-dimensional spectral reliability descriptor -- summarizing band energy, amplitude ratio, phase consistency, and cross-modal correlation -- and to reuse it beyond the fusion stage. The descriptor drives both Spectral Reliability Fusion (SRF), which gates a spectral residual against a conservative spatial base, and Reliability-Conditioned Expert Routing (RCER), which combines the descriptor with pooled content to steer sparse post-fusion experts. Under matched ablations, descriptor-aware gating improves mAP50 over content-only adaptive gating; a 2×2 factorial analysis further shows that descriptor-conditioned routing provides the larger marginal gain over expert architecture alone at near-equal parameter count. Under six synthetic degradations on DroneVehicle, average retention rises to 95.0%, versus 92.0% for content-only MoE and 87.9% for concatenation, with the largest gain under modality drop; the same model also improves mAP50 by +5.2/+5.3 on the natural day/night split. These results suggest that preserving fusion-time reliability as an explicit signal benefits both adaptive fusion and post-fusion conditional computation.

2606.01164 2026-06-02 cs.CV 版本更新

Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

迈向交互式视频世界建模:前沿、挑战、基准与未来趋势

Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson

发表机构 * Department of Engineering, University of Cambridge, U.K.(剑桥大学工程系) Peking University(北京大学) University of Twente(特文特大学) Mechanical Systems Control Laboratory, University of California, Berkeley, USA(加州大学伯克利分校机械系统控制实验室) ETH Zurich(苏黎世联邦理工学院) Microsoft(微软公司) University of Oxford(牛津大学)

AI总结 本文系统综述了交互式世界建模的研究趋势、技术挑战、评估基准,并提出了未来方向,重点在于动作条件可控性、长程交互与记忆以及实时响应性。

Comments Under review. The GitHub repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model

详情
AI中文摘要

随着大语言模型和基于扩散的内容生成的快速发展,世界建模引起了越来越多的研究关注,惠及游戏引擎、具身人工智能、自动驾驶等多个下游领域。通过将用户动作明确纳入世界状态转换,最近的文献在动作条件视频或3D生成范式中赋予了世界建模交互性,进一步增强了世界演化的可控性,并促进用户自由遍历、操纵、导航和个性化状态演化。本文旨在系统回顾交互式世界建模的最新研究趋势、技术发展、评估基准,并提出未来潜在方向。具体而言,我们首先总结了在应用场景、世界状态演化和场景模态方面的近期工作和趋势。随后,我们深入探讨三个关键的技术挑战,包括动作条件可控性、长程交互与记忆,以及实时交互的动作跟随响应性。此外,我们还全面比较了四个特定应用领域(开放世界探索、游戏引擎、自动驾驶和机器人)中的现有基准和指标。最后,我们讨论了实现下一代交互式世界建模的几个有前景的未来方向。相应的代码库已公开在:https://github.com/liujiuming123/Awesome-Interactive-World-Model。

英文摘要

With the rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, and autonomous driving. By explicitly incorporating user actions into world state transitions, recent literature empowers world modeling with interactivity in an action-conditioned video or 3D generation paradigm, further enhancing controllability over world evolution and allowing users to freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we aim to systematically review recent research trends, technical developments, and evaluation benchmarks, and to propose potential future directions in interactive world modeling. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges: action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity. Furthermore, we thoroughly compare existing benchmarks and metrics in four specific application fields: open-world exploration, game engines, autonomous driving, and robotics. Finally, we discuss several promising future directions in achieving next-generation interactive world modeling. The corresponding repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model.

2606.01157 2026-06-02 cs.CV 版本更新

HiTokSR: A Coarse-to-Fine Tokenizer with Hierarchical Codebooks for High-Fidelity Real-World Image Super-Resolution

HiTokSR: 一种用于高保真真实世界图像超分辨率的具有层次化码本的从粗到细分词器

Mingxi Li

AI总结 提出HiTokSR层次化标记预测框架,通过将潜在空间沿通道维度划分为频率感知组并独立量化,解耦全局结构与细节,结合视觉基础模型先验和索引级扰动策略,实现真实世界图像超分辨率的最优感知质量和重建保真度。

详情
AI中文摘要

向量量化(VQ)生成模型在真实世界图像超分辨率(Real-ISR)中显示出有希望的结果。然而,现有方法通常依赖于一个将低频结构与高频纹理纠缠在一起的单一潜在空间。这种纠缠迫使单个码本捕获组合上复杂的结构-纹理配对集合,这限制了表示能力并降低了码本利用率。为了解决这个问题,我们提出了HiTokSR,一个层次化的标记预测框架。HiTokSR不使用单一码本,而是将潜在空间沿通道维度划分为频率感知组,并用独立的子码本对每组进行量化。这种从粗到细的设计将全局结构与精细细节解耦,增强了组合表达能力,同时避免了高维最近邻查找的优化不稳定性。为了进一步提高语义一致性,我们的生成器通过自适应特征调制、多尺度类别标记和表示对齐损失,整合了来自视觉基础模型的先验。此外,我们在解码器微调过程中引入了一种索引级扰动策略,以弥合离散标记预测中的训练-测试差异。在真实世界基准上的大量实验表明,HiTokSR在感知质量和重建保真度方面均达到了最先进的性能。

英文摘要

Vector-quantized (VQ) generative models have shown promising results in real-world image super-resolution (Real-ISR). However, existing methods typically rely on a monolithic latent space that entangles low-frequency structures with high-frequency textures. This entanglement forces a single codebook to capture a combinatorially complex set of structure-texture pairings, which constrains representational capacity and limits codebook utilization. To address this issue, we present HiTokSR, a hierarchical token prediction framework. Instead of using a single codebook, HiTokSR partitions the latent space along the channel dimension into frequency-aware groups, quantizing each with an independent sub-codebook. This coarse-to-fine design disentangles global structures from fine details, enhancing combinatorial expressiveness while circumventing the optimization instability of high-dimensional nearest-neighbor lookups. To further improve semantic consistency, our generator integrates priors from a vision foundation model via adaptive feature modulation, multi-scale class tokens, and a representation alignment loss. Additionally, we introduce an index-level perturbation strategy during decoder fine-tuning to bridge the train-test discrepancy in discrete token prediction. Extensive experiments on real-world benchmarks demonstrate that HiTokSR achieves state-of-the-art performance in both perceptual quality and reconstruction fidelity.
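
The channel-grouped quantization described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not HiTokSR itself: the latent is split into equal channel chunks (the paper's frequency-aware grouping is not modeled), the codebooks are hand-written, and all function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def grouped_quantize(latent, codebooks):
    """Split a latent vector along channels and quantize each group independently."""
    assert len(latent) % len(codebooks) == 0, "channels must divide evenly into groups"
    group_size = len(latent) // len(codebooks)
    indices, quantized = [], []
    for g, codebook in enumerate(codebooks):
        chunk = latent[g * group_size:(g + 1) * group_size]
        idx = nearest_code(chunk, codebook)  # low-dimensional lookup per group
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized

# Toy 4-channel latent, two groups, two codes per sub-codebook (all values hypothetical).
latent = [0.9, 0.1, -0.2, 0.8]
codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],  # sub-codebook for channels 0-1
    [[0.0, 1.0], [1.0, 1.0]],  # sub-codebook for channels 2-3
]
indices, quantized = grouped_quantize(latent, codebooks)
```

Two sub-codebooks of two entries each can express four distinct combined codes while every nearest-neighbor lookup stays in two dimensions, which is the combinatorial-expressiveness argument the abstract makes.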

2606.01149 2026-06-02 cs.CV 版本更新

CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

CoSTL:面向时刻检索与高亮检测的综合时空表征学习

Xin Dong, Wenjia Geng, Wenfeng Deng, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Pengcheng Laboratory(鹏城实验室)

AI总结 提出综合时空表征学习框架CoSTL,通过文本驱动的渐进细粒度图像编码器和多尺度时间感知模块,联合学习空间细节与时间动态,在时刻检索和高亮检测任务上达到最优性能。

Comments 14 pages, 3 figures

详情
AI中文摘要

视频时刻检索(MR)和高亮检测(HD)是视频分析中的关键任务,旨在根据给定的文本查询定位特定时刻并估计片段级相关性。最近的方法将它们视为类似的视频定位任务,并使用相同的架构来解决。这些任务需要在图像级别进行细粒度理解,以及在整个视频中进行高级时间理解。现有方法主要关注使用帧级特征的时间建模,通常忽略了单个帧内与文本查询相关的丰富视觉信息。这种疏忽导致定位结果不准确。为了解决这一局限性,我们提出了一个综合时空表征学习框架(CoSTL),该框架捕获了细粒度的图像级信息和时间动态。具体来说,CoSTL包含一个文本驱动的渐进细粒度图像编码器,执行两步文本驱动的知识提取过程以学习细粒度空间表征。此外,一个多尺度时间感知模块捕获综合的时空表征,增强了模型处理时间动态的能力。我们在四个公开基准数据集上展示了最先进的性能:QVHighlights、Charades-STA、TACoS和TVSum。

英文摘要

Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video. Existing approaches have primarily focused on temporal modeling using frame-level features, often neglecting the rich visual information related to the text query within individual frames. This oversight leads to inaccurate grounding results. To address this limitation, we propose a Comprehensive Spatial-Temporal Representation Learning Framework (CoSTL), which captures both fine-grained image-level information and temporal dynamics. Specifically, CoSTL incorporates a text-driven progressive fine-grained image encoder, performing a two-step text-driven knowledge extraction process to learn fine-grained spatial representations. Furthermore, a multi-scale temporal perception module captures comprehensive spatial-temporal representations, enhancing the model's ability to process temporal dynamics. We demonstrate state-of-the-art performance on four public benchmarks: QVHighlights, Charades-STA, TACoS, and TVSum.

2606.01132 2026-06-02 cs.CV 版本更新

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

HakushoBench:来自政府白皮书的日语图表VQA基准

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Naoaki Okazaki

发表机构 * Institute of Science Tokyo(东京科学大学) NII(日本国立情报学研究所) NII LLMC(日本国立情报学研究所LLMC)

AI总结 利用政府白皮书构建日语图表VQA基准HakushoBench,包含2053张图像和人工标注问答对,评估视觉语言模型对图表的深度理解。

Comments 16 pages, 17 figures

详情
AI中文摘要

理解图表和表格图像对于将视觉语言模型(VLM)应用于现实世界的文档理解至关重要。虽然英语基准已经快速发展,但非英语基准仍然稀缺,这使得人们不清楚这种进展是否跨语言泛化。一个关键障碍是难以大规模收集真实且多样化的非英语图表和表格图像。为了解决这个问题,我们利用政府白皮书作为英语之外基准构建的可扩展来源,因为它们包含跨多种格式和领域的自然出现的图表和表格,并且在许多国家免费提供。作为首次实例,我们介绍了HakushoBench,这是一个基于33份政府白皮书构建的具有挑战性的日语图表和表格VQA基准。HakushoBench包含2053张图像,涵盖超过10种图像类型,并带有人工标注的问答对,旨在评估对图表和表格的深入和整体理解,而不仅仅是局部视觉线索。跨广泛VLM的实验表明,HakushoBench对开放权重模型仍然具有挑战性:最佳开放权重模型仅达到58.6%的准确率,开放权重与专有模型之间34.9个百分点的差距凸显了在复杂图表和表格理解方面仍有很大的改进空间。我们发布了数据集和代码。

英文摘要

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.

2606.01126 2026-06-02 cs.LG cs.AI cs.CV 版本更新

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

STARFISH: 从内部状态修复中实现剪枝网络的快速精度恢复

Shir Maon, Odelia Melamed, Adi Shamir

发表机构 * Weizmann Institute of Science(魏茨曼科学研究所)

AI总结 提出STARFISH方法,通过少量无标签校准集优化剪枝网络与原始网络内部状态对齐,高效恢复精度,在ViT网络上优于现有方法。

详情
AI中文摘要

剪枝是一种旨在减少大型神经网络中权重数量的过程。这可以显著加快推理速度,但可能导致模型精度大幅下降,因此通常随后会进行修复过程以恢复部分丢失的精度。在本文中,我们提出了一种新的修复方法STARFISH,它可以高效地恢复任何剪枝网络的(大部分)精度。STARFISH的主要思想是使用少量无标签示例的校准集,优化剪枝网络以与原始网络的内部状态表示对齐。对于去除50%权重的常见情况,在基于ViT的网络中,STARFISH修复相比最先进方法将恢复精度提高了高达22%。在激进剪枝下其优势更为显著。例如,在ImageNet的DeiT-B网络中去除75%权重后,STARFISH仅使用训练图像数量的0.4%作为校准集,恢复了原始稠密模型精度的82%,而竞争恢复技术仅达到稠密模型精度的40%。

英文摘要

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model's accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network's internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.

2606.01118 2026-06-02 cs.CV 版本更新

Rank-Aware Quantile Activation for Motion-Robust Crop Segmentation in UAV Imagery

面向无人机影像中运动鲁棒作物分割的秩感知分位数激活

Abinav Kiran, Sravan Danda, Aditya Challa, Sougata Sen, Daya Sagar B S

发表机构 * Senior Member, IEEE(IEEE高级会员)

AI总结 针对高速无人机影像中的运动模糊导致语义分割退化的问题,提出秩感知的双分位数激活(QAct)模块,通过实例级秩归一化替代幅度门控,在零样本和模糊监督两种设置下均显著提升mIoU,尤其在稀有纹理依赖类上表现突出,且与模糊域训练互补。

详情
AI中文摘要

高速无人机采集的运动模糊会降低对具有高农业价值的稀有纹理依赖类别的语义分割性能。标准CNN依赖于高频幅度特征,而模糊会破坏这些特征,导致少数信号被统计性擦除。我们提出双分位数激活(QAct),一种秩感知模块,用实例级秩归一化替代幅度门控。在Agriculture-Vision 2021数据集上,在零样本和模糊监督两种设置下、多种严重程度上进行评估,QAct是主导架构因素:它在两种设置和所有严重程度上都比ReLU带来一致的mIoU提升,在稀有结构和纹理依赖类别上增益最强。一些主导类别(水、播种机跳过)在蒸馏下表现出混合的每类性能。在中等模糊下,零样本QAct优于蒸馏训练的ReLU;在所有严重程度上,Distill-QAct达到最佳性能,证实了秩感知激活和模糊域训练是互补的鲁棒性来源。

英文摘要

Motion blur from high-speed UAV acquisition degrades semantic segmentation on rare texture-dependent classes with high agronomic value. Standard CNNs rely on high-frequency magnitude features that blur destroys, causing statistical erasure of minority signals. We propose Dual Quantile Activation (QAct), a rank-aware block replacing magnitude gating with instance-level rank normalization. Evaluated on Agriculture-Vision 2021 across zero-shot and blur-supervised regimes at multiple severities, QAct is the dominant architectural factor: it delivers consistent mIoU gains over ReLU across both regimes and all severities, with the strongest gains on rare structural and texture-dependent classes. Some dominant classes (water, planter skip) show mixed per-class performance under distillation. At moderate blur, zero-shot QAct outperforms distillation-trained ReLU; across all severities, Distill-QAct achieves the best performance, confirming that rank-aware activation and blur-domain training are complementary sources of robustness.
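
The intuition behind replacing magnitude gating with rank normalization can be illustrated with a small sketch. This is not the paper's QAct block, whose exact form the abstract does not give; it only shows, under the assumption that blur compresses activation magnitudes while preserving their ordering, why a rank-based output is unchanged. The function name and values are hypothetical.

```python
def rank_normalize(values):
    """Map each activation to its rank within the instance, scaled to [0, 1].

    Ties are broken by position because Python's sort is stable.
    """
    n = len(values)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for rank, i in enumerate(order):
        ranks[i] = rank / (n - 1)
    return ranks

# Hypothetical activations for one instance, before and after blur.
sharp = [0.1, 5.0, 0.3, 2.0]
blurred = [0.05, 0.5, 0.15, 0.2]  # same ordering, compressed magnitudes

sharp_ranks = rank_normalize(sharp)
blurred_ranks = rank_normalize(blurred)
```

A magnitude threshold tuned on the sharp activations would treat the blurred ones very differently, whereas the rank-normalized outputs are identical for any order-preserving distortion.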

2606.01106 2026-06-02 cs.CV 版本更新

Temporal Evidence Routing with Structured Visual Evidence for TimeLogicQA

基于结构化视觉证据的时间证据路由用于TimeLogicQA

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

发表机构 * Southeast University(东南大学) National University of Singapore(新加坡国立大学) Independent Researcher(独立研究员) Opus AI Research(Opus AI研究院) University of Science and Technology of China(中国科学技术大学)

AI总结 提出视觉证据路由流水线,分离感知与符号时间推理,通过结构化视觉证据和确定性时间规则在TimeLogicQA上达到81.8 AvgAcc。

详情
AI中文摘要

TimeLogicQA评估视频问答系统是否能推理事件存在、顺序、持续性、边界条件和重叠等时间关系。我们通过一个视觉证据路由流水线来处理此任务,该流水线将感知与符号时间推理分离。系统首先将每个问题解析为事件目标、答案模式、候选选项和时间算子。然后,根据持续时间和算子难度对视频进行路由,对短片段使用有序的全帧证据,对长视频使用以事件为中心的候选窗口。多模态大语言模型为相关事件生成结构化视觉证据,而程序化验证器恢复密集的动作区间,确定性归约器应用算子特定的时间规则产生最终答案。保守融合仅在视觉证据、时间程序和置信度检查一致时接受答案,减少噪声答案翻转。在官方测试评估中,我们的最终系统实现了81.8的平均准确率。

英文摘要

TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. We address this task with a visual evidence routing pipeline that separates perception from symbolic temporal reasoning. The system first parses each question into event targets, answer mode, candidate options, and temporal operators. It then routes videos according to duration and operator difficulty, using ordered full-frame evidence for short clips and event-focused candidate windows for long videos. A multimodal large language model produces structured visual evidence for the relevant events, while programmatic verifiers recover dense action intervals and a deterministic reducer applies operator-specific temporal rules to produce the final answer. Conservative fusion accepts an answer only when the visual evidence, temporal program, and confidence checks agree, reducing noisy answer flips. On the official test evaluation, our final system achieves an AvgAcc of 81.8.
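
The deterministic reducer described in this abstract applies temporal rules to event intervals recovered by the perception stage. A minimal sketch of that idea follows, assuming events are given as (start, end) intervals in seconds; the operator names, the interval values, and the "any pair satisfies the rule" semantics are illustrative assumptions rather than the authors' rule set.

```python
def before(a, b):
    """Interval a ends no later than interval b starts."""
    return a[1] <= b[0]

def overlaps(a, b):
    """Intervals a and b share some span of time."""
    return a[0] < b[1] and b[0] < a[1]

def reduce_answer(operator, intervals_a, intervals_b):
    """Apply an operator-specific rule to two events' detected intervals."""
    rules = {"before": before, "overlap": overlaps}
    rule = rules[operator]
    return any(rule(a, b) for a in intervals_a for b in intervals_b)

# Hypothetical intervals produced by the visual evidence stage.
pour = [(2.0, 5.5)]
stir = [(5.0, 9.0)]

pour_overlaps_stir = reduce_answer("overlap", pour, stir)
pour_before_stir = reduce_answer("before", pour, stir)
```

Keeping this step as plain code rather than a model call is what makes the final answer reproducible once the intervals are fixed.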

2606.01104 2026-06-02 cs.CV 版本更新

Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

自适应密集证据精炼用于视频关系推理:VRR-QA挑战

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

发表机构 * Southeast University(东南大学) National University of Singapore(新加坡国立大学) Independent Researcher(独立研究员) Opus AI Research(Opus AI研究院) University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种自适应测试时计算系统,通过轻量视图识别不稳定问题并路由到高预算密集证据模块,在VRR-QA测试集上达到90.07%平均准确率。

详情
AI中文摘要

VRR-QA评估视频语言系统能否推断空间、时间、视角、深度和可见性关系,这些关系通常无法通过单帧解决。我们提出一个仅推理的系统,基于自适应测试时计算。系统首先通过直接视频语言模型传递回答每个问题,然后使用多个轻量视图发现不稳定问题。只有这些困难问题被路由到高预算密集证据模块,该模块构建带时间戳的帧观察、关系特定探针、候选验证和保守的时间聚合。这种设计分离了视频问答中常混淆的两个问题:寻找合理的替代答案以及决定何时应更改当前答案。在测试集上,最终系统获得90.07平均准确率和87.81宏平均准确率。报告重点介绍最终测试系统和复现自适应密集验证器所需的实现设置。

英文摘要

VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
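
The routing logic in this abstract, where cheap views flag unstable questions and only those reach the expensive verifier, can be sketched as below. The agreement threshold, the function names, and the stand-in verifier are hypothetical; the actual dense evidence module is far richer than a single callable.

```python
from collections import Counter

def is_unstable(answers, min_agreement=1.0):
    """A question is unstable when the most common answer falls short of the agreement threshold."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers) < min_agreement

def answer_question(direct_answer, view_answers, dense_verifier):
    """Keep the direct answer unless the lightweight views disagree with it."""
    if not is_unstable([direct_answer] + list(view_answers)):
        return direct_answer
    # Only unstable questions pay for the high-budget module.
    return dense_verifier(direct_answer)

calls = []

def toy_dense_verifier(current_answer):
    """Stand-in for the dense evidence module; records that it was invoked."""
    calls.append(current_answer)
    return "B"

stable = answer_question("A", ["A", "A"], toy_dense_verifier)
unstable = answer_question("A", ["B", "A"], toy_dense_verifier)
```

In this sketch the verifier runs once for two questions, which mirrors the abstract's point that test-time compute is spent only where the cheap passes disagree.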

2606.01097 2026-06-02 cs.CV 版本更新

Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R

双路Top-K检索与1v1 VLM重排序用于CoVR-R

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

发表机构 * Southeast University(东南大学) National University of Singapore(新加坡国立大学) Independent Researcher(独立研究员) Opus AI Research(Opus AI研究院) University of Science and Technology of China(中国科学技术大学)

AI总结 提出双路Top-K检索与1v1 VLM重排序方法,通过解耦召回与选择,在CoVR-R挑战中达到95.28% R@1。

详情
AI中文摘要

我们描述了用于CoVR-R挑战的双路Top-K检索与1v1 VLM重排序方法。该方法将组合视频检索视为两个耦合问题:找到一个足够完整的Top-K候选集,然后安全地决定是否有任何候选应替换当前强Top-1。我们首先通过VLM槽选择器对现有候选进行推理/文本种子改进,而不引入DFN视觉检索。然后,我们使用DFN-H/DFN-L从联系表嵌入中添加视觉路径。这些路径合并为一个Top-10候选集,之后VLM最终重排序器在当前Top-1和每个挑战者之间进行保守的1v1比较。在隐藏测试集上,最终系统达到95.28 R@1、97.47 R@5、98.48 R@10和99.66 R@50。主要经验是CoVR-R从召回-选择解耦中获益更多,而非广泛的文本重排序或直接的多候选VLM分类。

英文摘要

We describe Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R challenge. The method treats composed video retrieval as two coupled problems: finding a sufficiently complete top-k candidate set, and then safely deciding whether any candidate should replace a strong current top-1. We first improve the reasoning/text seed with a VLM slot selector over existing candidates, without introducing DFN visual retrieval. We then add a visual route from contact-sheet embeddings using DFN-H/DFN-L. The routes are merged into a top-10 candidate set, after which a VLM final reranker performs conservative 1v1 comparisons between the current top-1 and each challenger. On the hidden test split, the final system reaches 95.28 R@1, 97.47 R@5, 98.48 R@10, and 99.66 R@50. The main lesson is that CoVR-R benefits more from recall-selection decoupling than from broad text reranking or direct multi-candidate VLM classification.
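
The two stages in this abstract, merging routes into one candidate set and then running conservative 1v1 comparisons against the current top-1, can be sketched as follows. The interleaving merge, the margin-based judge, and all names and scores are assumptions for illustration; in the described system the pairwise judge is a VLM, not a score lookup.

```python
from itertools import zip_longest

def merge_routes(text_route, visual_route, k=10):
    """Interleave two ranked candidate lists, drop duplicates, keep the top k."""
    merged = []
    for pair in zip_longest(text_route, visual_route):
        for candidate in pair:
            if candidate is not None and candidate not in merged:
                merged.append(candidate)
    return merged[:k]

def rerank_1v1(candidates, challenger_wins):
    """Replace the incumbent top-1 only when a challenger wins its pairwise comparison."""
    top1 = candidates[0]
    for challenger in candidates[1:]:
        if challenger_wins(top1, challenger):
            top1 = challenger
    return top1

# Hypothetical candidate IDs and judge scores standing in for VLM decisions.
scores = {"v3": 0.60, "v1": 0.62, "v9": 0.90, "v7": 0.10}
margin = 0.1  # conservative: small score differences never flip the top-1

def toy_judge(incumbent, challenger):
    return scores[challenger] - scores[incumbent] > margin

merged = merge_routes(["v3", "v1", "v7"], ["v1", "v9"])
final_top1 = rerank_1v1(merged, toy_judge)
```

The margin is what makes the comparison conservative: "v1" scores slightly higher than the incumbent "v3" but does not displace it, while "v9", which only the visual route recalled, does.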

2606.01079 2026-06-02 cs.CV 版本更新

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

Chameleon: 面向跨域对象合成的风格-内容解耦框架

Sukhun Ko, Soo Ye Kim, Jihyong Oh

发表机构 * CMLab, Chung-Ang University(中央大学CMLab) Adobe Research(Adobe研究院)

AI总结 提出基于大规模数据集ChameleonDataset的两阶段训练框架Chameleon,通过联合硬对比学习和时空注意力门控实现跨域对象合成的风格-内容解耦与自适应风格化。

Comments The last two authors are co-corresponding authors. Please visit our project page at https://cmlab-korea.github.io/Chameleon/

详情
AI中文摘要

图像合成旨在将前景对象无缝插入背景图像中,扩散模型的最新进展显著提升了合成质量,尤其是当前景和背景图像来自同一域(例如自然图像)时。然而,当前景和背景来自不同域时,跨域合成相对未被充分探索且仍具挑战性,因为模型必须保留前景对象的身份,同时对其进行风格化以匹配背景域。现有的跨域合成方法主要依赖无训练的混合和细化策略,部分原因是缺乏大规模配对数据集用于跨域合成,限制了基于训练的方法的发展。因此,它们局限于色调级对齐,常常产生风格不一致或过度风格化的结果。为克服这些限制,我们构建了ChameleonDataset,这是首个用于跨域合成的大规模训练数据集,并配有全面的评估基准,通过可扩展的数据构建流水线实现。在此基础上,我们提出了Chameleon,一种新颖的两阶段基于训练的跨域合成框架。在第一阶段,我们提出联合硬对比学习(JHCL)来训练ChameleonEncoder,有效解耦风格和内容表示。在第二阶段,我们将时空注意力门控(STAG)引入扩散变换器以实现有效的风格化,自适应地调节来自第一阶段编码器的风格标记如何在空间和时间维度上注入。我们的方法优于最先进的域内和跨域合成模型、顺序流水线和商业模型,在合成合理性和风格保真度方面均取得了改进。

英文摘要

Image compositing aims to seamlessly insert a foreground object into a background image, and recent advances in diffusion models have significantly enhanced the quality, especially when the foreground and background images come from the same domain (e.g., natural images). However, cross-domain compositing, where the foreground and background come from different domains, is relatively underexplored and remains challenging because the model must preserve the foreground object's identity while stylizing it to match the background domain. Existing cross-domain compositing approaches largely rely on training-free blending and refinement strategies. This is partly due to the lack of large-scale paired datasets for cross-domain compositing, limiting the development of training-based solutions. As a result, they are limited to tone-level alignment and often produce style-inconsistent or overstylized results. To overcome such limitations, we construct ChameleonDataset, the first large-scale training dataset for cross-domain compositing, with a comprehensive evaluation benchmark, built through a scalable data construction pipeline. Building on this, we propose Chameleon, a novel two-stage training-based cross-domain compositing framework. In the first stage, we propose Joint Hard Contrastive Learning (JHCL) to train ChameleonEncoder, which effectively disentangles style and content representations. In the second stage, we introduce Spatio-Temporal Attention Gating (STAG) into a diffusion transformer for effective stylization, adaptively regulating how style tokens from the first-stage encoder are injected across spatial and temporal dimensions. Our method outperforms state-of-the-art in-domain and cross-domain compositing models, sequential pipelines and commercial models, achieving improvements in both compositional plausibility and stylistic fidelity.

2606.01069 2026-06-02 cs.CV 版本更新

A Multiscale Network with Supervised Contrastive Learning for Real-Time Facial Emotion Recognition

基于监督对比学习的多尺度网络用于实时面部情感识别

Rejoy Chakraborty, Archisman Adhikary, Chayan Halder, Payel Rakshit, Sanchita Ghosh, Kaushik Roy

发表机构 * Indian Statistical Institute(印度统计研究所) Department of Biological Sciences, Bose Institute(生物科学系, Bose 院) Ramakrishna Mission Vivekananda Centenary College(拉马克里希纳使命 Vivekananda 百年学院) Maheshtala College(Maheshtala 学院) West Bengal State University(西孟加拉州大学)

AI总结 提出一种结合监督对比学习的多尺度深度学习网络,用于实时视频中面部表情变化的情感识别,在标准数据集上取得满意效果。

Comments 13 pages

详情
AI中文摘要

从面部表情进行实时情感识别是一项具有挑战性的任务,特别是在视频场景中,多个情感状态可能随时间出现。由于每个情感状态对应的面部表情在不同个体间差异显著,难度进一步增加。描绘情感状态的面部表情变化不是离散的,而是连续的,这通过计算手段来表征非常困难。能够检测面部表情变化的系统对于确定个体的情感状态具有重要影响。这样的系统在心理咨询中可以为治疗师提供关于受试者情感状态的额外见解,从而非常有益。本文提出了一种基于深度学习的系统,通过建模面部表情的变化来检测个体实时视频中的情感变化。本研究在标准数据集上进行深度学习系统的训练,并在此方面取得了非常满意的结果。

英文摘要

Real-time emotion recognition from facial expressions is a challenging task, particularly in video-based scenarios where multiple emotional states may occur over time. The difficulty increases further due to the fact that each emotional state is associated with facial expressions that vary significantly across individuals. The change of facial expressions portraying emotional state is not discrete, but rather continuous, which is very challenging to represent through computational aids. A system with the ability to detect variations in facial expressions can have a significant impact on determining the emotional state of an individual. Such a system can be very beneficial for psychologists during counseling by providing additional insights into the emotional state of a subject. In this paper, a deep learning-based system is presented to detect emotional changes in real-time video of a person by modeling the change in facial expressions. The current study is conducted on a standard dataset for training of the deep learning system and the system has provided very satisfactory outcomes in this respect.

2606.01057 2026-06-02 cs.CV cs.AI cs.GR cs.LG 版本更新

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

3DCodeBench:通过代码进行智能体程序化3D建模的基准测试

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong, Ameesh Makadia, Meiqi Guo, Laurent Itti, Jindong Chen

发表机构 * Google DeepMind(谷歌DeepMind) University of Southern California(南加州大学) Google Research(谷歌研究)

AI总结 提出3DCodeBench基准,评估12种视觉语言模型将文本和图像参考转换为程序化3D建模代码的能力,并构建基于人类偏好的3DCodeArena排名平台。

Comments Project Page: https://www.3dcodebench.com/; 11 pages (main), with appendix

详情
AI中文摘要

通过代码进行程序化3D建模正成为一种通用的范式,提供确定性、引擎就绪且可精确编辑的资产,而神经3D生成器天生缺乏这些特性。然而,编写此类程序化内容需要深厚的3D软件API、参数化设计和代码级几何推理专业知识。在本文中,我们提出了3DCodeBench,一个系统性的基准,用于评估3D建模软件中用于程序化3D生成的视觉语言模型(VLM)智能体。具体来说,3DCodeBench评估了12种先进VLM如何有效地充当程序化3D建模器,将文本和图像参考转换为3D建模软件的程序化代码。认识到自动度量可能无法完全捕捉3D形状的感知质量,我们构建了3DCodeArena,一个基于成对人类偏好对生成的3D输出进行排名的平台。通过广泛的评估和结果,我们观察到:(1)失败主要源于API不匹配,而成功渲染的模型仍然存在断开或浮动的3D几何组件。(2)测试时扩展,如更高的思考预算和多轮细化,总体上提高了性能。我们的发现突显了对高质量程序化编码数据以推进商业VLM的迫切需求。此外,有效的程序化3D建模需要一个强大的执行环境,为迭代细化提供高保真反馈。我们发布了3DCodeBench,包括精心策划的大规模多模态(文本/图像)提示数据集、程序化代码、3D对象三元组、评估协议以及公共3DCodeArena平台,作为探索基于VLM的程序化3D建模器的基础工具包。

英文摘要

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.

2606.01050 2026-06-02 cs.CV 版本更新

TextFake: Benchmarking AI-Generated Image Detection on Text-Rich Images

TextFake: 对富含文本图像中AI生成图像检测的基准测试

Yuning Zhang, Changtao Miao, Mingyu Liao, Tingyu Liu, Xinghao Wang, Tao Gong, Qi Chu, Nenghai Yu

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络科学与技术学院) Anhui Province Key Laboratory of Digital Security(安徽省数字安全重点实验室) Individual Researcher(独立研究者)

AI总结 针对AI生成图像检测在富含文本图像上的空白,构建包含28种语言、2万图像的TextFake基准,评估14种检测器和3种VLM API,发现系统性能差距并诊断三种失败模式。

详情
AI中文摘要

最近的AI生成图像(AIGI)检测器在自然图像基准上表现良好,但它们在富含文本的伪造图像(如虚假截图、文档和新闻页面)上的行为尚未得到测试,这些伪造图像在虚假信息中普遍存在。我们引入了TextFake,一个包含20,000张图像的富含文本AIGI检测基准,涵盖28种语言、4个主题类别和2种场景模态。伪造图像通过一个四阶段流水线合成,该流水线沿三个受控维度注释真实图像,并通过分布对齐的结构化提示生成对应图像,排除了协变量捷径。对14个专用检测器和3个前沿VLM API的零样本评估揭示了巨大的系统性差距:没有方法超过80%的准确率,有些方法相比自然图像基准下降了60%以上。诊断评估识别出三种失败模式:文本密度诅咒,即密集字形压倒低级检测器;通过渲染保真度进行伪装,即更强的文本渲染抑制生成伪影;以及阈值崩溃,即常规扰动将检测器推向随机水平。

英文摘要

Recent AI-generated image (AIGI) detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries, such as fabricated screenshots, documents, and news pages prevalent in misinformation, remains untested. We introduce TextFake, a 20,000-image benchmark for text-rich AIGI detection spanning 28 languages, 4 topic categories, and 2 scene modalities. Fake images are synthesized via a four-stage pipeline that annotates real images along three controlled dimensions and generates counterparts through distribution-aligned structured prompting, ruling out covariate shortcuts. Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs reveals a large systematic gap: no method exceeds 80% accuracy, with some dropping over 60% from natural-image benchmarks. Diagnostic evaluations identify three failure modes: the Text Density Curse, where dense glyphs overwhelm low-level detectors; Cloaking via Rendering Fidelity, where stronger text rendering suppresses generative artifacts; and Threshold Collapse, where routine perturbations drive detectors toward chance-level performance.

2606.01048 2026-06-02 cs.CV 版本更新

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

解耦残差去噪扩散模型用于统一且数据高效的图像到图像翻译

Ziyue Lin, Jiahe Hou, Hongyu Xia, Xinrui Xie, Feifei Wang, Yuyin Zhou, Wei Wang, Jiawei Liu, Liangqiong Qu

发表机构 * The University of Hong Kong(香港大学) Shenyang Institute of Automation, Chinese Academy of Sciences(中国科学院沈阳自动化研究所) The Chinese University of Hong Kong(香港中文大学) University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 提出解耦残差去噪扩散模型(DRDD),通过将扩散过程解耦为随机噪声扩散和确定性残差扩散两个独立阶段,实现统一且数据高效的图像到图像翻译。

Comments CVPR 2026

详情
AI中文摘要

我们提出解耦残差去噪扩散模型(DRDD),用于统一且数据高效的图像到图像(I2I)翻译。尽管扩散模型在质量和多样性方面推动了I2I翻译的发展,但我们揭示了扩散模型中一个先前未被充分探索的性质。关键在于,除了其传统的流形提升作用(即将数据移出低维流形),注入高斯噪声通过隐式对齐跨域的特征分布促进了域协调,这一性质对于统一的I2I翻译尤其有利。然而,现有的扩散模型过早地削弱了这种协调效果,因为噪声和残差在单个耦合的扩散过程中被同时移除。为解决这一问题,DRDD将扩散过程解耦为两个顺序且独立的扩散阶段:(1)用于域协调和流形提升的随机噪声扩散,以及(2)完全在固定噪声域内学习核心语义映射的确定性残差扩散。这种解耦在整个变换过程中保留了协调和流形提升效果,极大地简化了跨不同任务和域的统一映射学习。值得注意的是,噪声扩散阶段仅在丰富的、无配对的目标域图像上训练,大大提高了数据效率。全面的理论和实证分析表明,DRDD与主流扩散模型广泛兼容,即使在有限配对数据下也能持续提供稳健、统一的I2I翻译。我们的代码可在 https://github.com/HKU-HealthAI/DRDD 获取。

英文摘要

We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain. This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Our code is available at https://github.com/HKU-HealthAI/DRDD.

2606.01044 2026-06-02 cs.CV 版本更新

Ask4VG: Risk-Aware Question Selection for Reducing Prior-Driven Answers in Medical VQA

Ask4VG: 用于减少医学VQA中先验驱动答案的风险感知问题选择

Xiaorong Zhu, Qiang Li, Zibo Xu, Weijie Wang, Weizhi Nie

发表机构 * School of Microelectronics, Tianjin University, Tianjin 300072, China(天津大学微电子学院,天津 300072,中国) DISI, University of Trento, Trento, Italy(特伦托大学DISI系,意大利特伦托) School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China(天津大学电气与信息工程学院,天津 300072,中国)

AI总结 提出Ask4VG框架,通过反事实视觉探测估计问题引发的幻觉风险,并重排问题改写以选择更依赖图像证据的问题,从而减少医学VQA中的先验驱动答案。

详情
AI中文摘要

医学视觉问答要求模型将回答建立在图像证据上,因为缺乏视觉支持的答案可能误导下游解读。然而,许多医学VQA问题是通用的、模板化的或形式高度相似,这可能鼓励模型学习问答捷径而非依赖图像的推理,从而增加幻觉回答的风险。我们提出Ask4VG,一个无标签的试点框架,用于风险感知的问题选择。Ask4VG通过反事实视觉探测估计问题引发的幻觉风险:在原始图像、扰动图像、空白图像和错配图像下提出相同问题,并将得到的答案关系转换为反事实风险估计器的弱监督信号。然后,学习到的估计器对候选问题改写进行重排,以优先选择那些对缺失或错配视觉证据更不具不变性的、保留意图的问题,再进行最终答案生成。在VQA-RAD上使用Qwen2-VL-2B-Instruct,仅提示改写增加了反事实风险,而基于预测风险的重排将留出风险从0.658降至0.623,并将精确准确率从0.337提升至0.356。一个300样本的PMC-VQA外部检查显示了相同的风险降低方向,并伴有小幅准确率提升。这些结果表明,问题选择是响应级幻觉缓解的一个有前景的补充,有助于实现可靠的医学VQA。

英文摘要

Medical visual question answering requires models to ground their responses in image evidence, because visually unsupported answers can mislead downstream interpretation. However, many medical VQA questions are generic, template-like, or highly similar in form, which can encourage models to learn question-answer shortcuts instead of image-dependent reasoning and thereby increase the risk of hallucinated responses. We propose Ask4VG, a label-free pilot framework for risk-aware question selection. Ask4VG estimates question-induced hallucination risk through counterfactual visual probing: the same question is asked under the original image, a perturbed image, a blank image, and a mismatched image, and the resulting answer relations are converted into weak supervision for a counterfactual risk estimator. The learned estimator then reranks candidate question rewrites to favor intent-preserving questions that are less invariant to missing or mismatched visual evidence before final answer generation. On VQA-RAD with Qwen2-VL-2B-Instruct, prompt-only rewriting increases counterfactual risk, whereas predicted-risk reranking reduces held-out risk from 0.658 to 0.623 and improves exact accuracy from 0.337 to 0.356. A 300-sample PMC-VQA external check shows the same direction of risk reduction with a small accuracy gain. These results suggest that question selection is a promising complement to response-level hallucination mitigation for reliable medical VQA.
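As a rough illustration of counterfactual visual probing, the sketch below scores a question by how often its answer stays the same when the image evidence is removed or swapped. This is a simplified proxy under stated assumptions, not the Ask4VG estimator: the paper trains a risk estimator from such answer relations, whereas this sketch scores the probes directly, ignores the perturbed-image probe, and uses hypothetical names (`counterfactual_risk`, `pick_lowest_risk`) and exact string matching.

```python
from typing import Dict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(answer.lower().split())


def counterfactual_risk(answers: Dict[str, str]) -> float:
    """Fraction of evidence-free probes whose answer matches the original.

    `answers` maps a probe name ("original", "blank", "mismatched") to the
    model's answer for the same question under that image condition.
    A high value suggests the answer is driven by the question, not the image.
    """
    reference = normalize(answers["original"])
    probes = ("blank", "mismatched")  # assumption: only these two probes count
    unchanged = sum(normalize(answers[p]) == reference for p in probes)
    return unchanged / len(probes)


def pick_lowest_risk(rewrites: Dict[str, Dict[str, str]]) -> str:
    """Choose the candidate rewrite whose answers depend most on the image."""
    return min(rewrites, key=lambda question: counterfactual_risk(rewrites[question]))
```

In practice the candidate rewrites would also need an intent-preservation check, which the abstract mentions but this sketch leaves out.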

2606.01031 2026-06-02 cs.GR cs.AI cs.CV cs.LG cs.MM 版本更新

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

音频驱动说话头生成的时序对齐评估

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)(新南威尔士大学商学院) School of Engineering and Built Environment, Griffith University(格里菲斯大学工程与建筑环境学院) Data61/CSIRO(Data61/澳大利亚联邦科学与工业研究组织)

AI总结 针对现有帧级评估指标对时序偏差敏感的问题,提出基于软动态时间规整的序列级对齐评估框架,提升评估鲁棒性并揭示不同建模范式间的系统权衡。

Comments Research report

详情
AI中文摘要

音频驱动的说话头生成技术发展迅速,但现有评估协议主要依赖帧级指标,假设生成视频与参考视频之间存在严格的时间对应关系。这一假设与语音驱动的面部运动不符,后者自然包含轻微的时间偏移、不同的说话速度和风格变化。因此,传统指标可能将无害的时间差异视为质量错误,使得公平比较方法并理解其权衡变得更加困难。在这项工作中,我们认为动态生成模型的评估应被表述为序列对齐问题,而非独立的帧比较。我们引入了一种统一的序列级重新表述,将软动态时间规整集成到已有的评估流程中。通过在对齐特征轨迹的同时保持时间顺序,所提出的框架对有限的时间错位具有鲁棒性,且不改变底层的感知、身份或同步编码器。我们表明,在刚性对齐下,帧级评估可被视为一个特例,而序列级对齐提供了更好的稳定性、对时间差异的更低敏感性以及建模范式之间更清晰的区分。基于这一原则性表述,我们在标准化协议下,对涵盖规范、野外和风格多样场景的七个数据集上的20种方法进行了大规模基准测试。大量实验表明,时序对齐的指标对时间差异更鲁棒,跨数据集提供更一致的结果,并能更好地揭示建模范式之间的系统权衡,例如同步性与真实性、表现力与稳定性之间的权衡。

英文摘要

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.

2606.01022 2026-06-02 cs.CV cs.AI 版本更新

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

ProductWebGen: 多模态产品网页生成基准测试

Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng

发表机构 * School of Computer Science & Zhiyuan College(计算机科学学院及致远学院) Shanghai Jiao Tong University(上海交通大学) Kuaishou Technology(快手科技)

AI总结 提出ProductWebGen基准,用于评估多模态生成模型从产品图像和指令生成一致产品展示网页的能力,并比较了基于编辑和基于统一模型两种工作流。

Comments Accepted by KDD 2026

详情
AI中文摘要

从源产品图像以及布局和视觉内容指令中制作产品展示网页,对于营销、广告和电子商务等领域具有重要的实用价值。直观上,该任务要求产品展示之间严格的视觉一致性以及高保真度的指令遵循,以联合生成可渲染的HTML代码。这些对可控性和指令遵循的要求与先进多模态生成模型(如图像编辑模型和统一模型)的核心特征紧密一致。为此,本文引入ProductWebGen来系统性地基准测试这些模型的产品网页生成能力。我们组织了包含500个测试样本的ProductWebGen,涵盖13个产品类别;每个样本由源图像、视觉内容指令和网页指令组成。任务是根据源图像和指令生成包含多个一致图像的产品展示网页。鉴于任务的混合模态输入输出性质,我们设计并系统比较了两种评估工作流——一种使用大语言模型和图像编辑模型分别生成HTML代码和图像(基于编辑),另一种依赖单个统一模型生成两者,其中图像生成依赖于先前的多模态上下文(基于统一模型)。实验结果表明,基于编辑的方法在网页指令遵循和内容吸引力方面取得领先结果,而基于统一模型的方法在满足视觉内容指令方面可能展现出更多优势。我们还构建了一个监督微调数据集ProductWebGen-1k,包含1000组真实产品图像和LLM生成的HTML代码。我们在开源统一模型BAGEL上验证了其有效性。数据和代码可在https://github.com/SJTU-DENG-Lab/ProductWebGen获取。

英文摘要

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and e-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction-following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models (UMs). To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation -- one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.

2606.01021 2026-06-02 cs.CV 版本更新

Learning Neural Deformation Representation for 4D Dynamic Shape Generation

学习神经变形表示用于4D动态形状生成

Gyojin Han, Jiwan Hur, Jaehyun Choi, Junmo Kim

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出一种新的神经变形表示,结合条件神经符号距离场,设计解耦运动与形状潜在空间的4D表示架构,通过扩散模型生成高质量、高时间一致性的4D动态形状。

Comments ECCV 2024

详情
AI中文摘要

近期3D形状表示的发展为生成精细3D形状开辟了新可能性。尽管取得了这些进展,但关于生成随时间变形的3D对象形式的4D动态形状的研究仍然很少。为弥补这一差距,本文聚焦于生成4D动态形状,同时强调生成质量和效率。先前关于4D生成的工作HyperDiffusion提出了一种直接生成4D占用场权重参数的方法,但由于运动表示未与4D占用场的形状表示分离,导致时间一致性差且渲染速度慢。因此,我们提出一种新的神经变形表示,并将其与条件神经符号距离场结合,设计了一种4D表示架构,其中运动潜在空间与形状潜在空间解耦。所提出的变形表示通过预测多个部分的蒙皮权重和刚体变换来工作,在理解形状结构方面也优于现有4D表示的变形模块。此外,我们设计了一种扩散模型的训练过程,利用由我们的4D表示提取的形状和运动特征作为数据点。无条件生成、条件生成和运动重定向实验结果表明,我们的方法不仅在4D动态形状生成方面表现出优于先前工作的性能,而且具有多种潜在应用。

英文摘要

Recent developments in 3D shape representation opened new possibilities for generating detailed 3D shapes. Despite these advances, there are few studies dealing with the generation of 4D dynamic shapes that have the form of 3D objects deforming over time. To bridge this gap, we focus on generating 4D dynamic shapes with an emphasis on both generation quality and efficiency in this paper. HyperDiffusion, a previous work on 4D generation, proposed a method of directly generating the weight parameters of 4D occupancy fields but suffered from low temporal consistency and slow rendering speed due to motion representation that is not separated from the shape representation of 4D occupancy fields. Therefore, we propose a new neural deformation representation and combine it with conditional neural signed distance fields to design a 4D representation architecture in which the motion latent space is disentangled from the shape latent space. The proposed deformation representation, which works by predicting skinning weights and rigid transformations for multiple parts, also has advantages over the deformation modules of existing 4D representations in understanding the structure of shapes. In addition, we design a training process of a diffusion model that utilizes the shape and motion features that are extracted by our 4D representation as data points. The results of unconditional generation, conditional generation, and motion retargeting experiments demonstrate that our method not only shows better performance than previous works in 4D dynamic shape generation but also has various potential applications.

2606.01014 2026-06-02 cs.CV cs.AI 版本更新

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

基于文本的三维人体运动编辑中的跨轴特征融合与关节运动差异预测

Gyojin Han, Junmo Kim

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电子工程学院)

AI总结 提出一种跨轴特征融合架构和辅助任务,通过关节锚定变换器回归源与目标关节旋转之间的Soft-DTW距离,实现文本驱动的三维人体运动编辑,在MotionFix数据集上达到最优性能。

Comments CVPR 2026

详情
AI中文摘要

我们研究基于文本的三维人体运动编辑,目标是保留源运动的风格和结构,同时应用自然语言描述的编辑。MotionFix数据集的发布推动了基于训练扩散模型的直接生成编辑运动的研究,这些模型从源运动和文本指令生成编辑运动。虽然先前的工作主要关注学习编辑在时间上何时发生,但我们的目标是创建一个不仅理解时间方面,还理解哪些特定关节负责变化的模型。为此,我们提出了一种新颖的架构和一个互补的辅助任务来辅助其训练。我们的架构由两个轴锚定变换器组成,分别沿关节和时间维度提取不同特征,以及一个跨轴融合块来整合这些表示。我们进一步引入一个辅助任务,训练关节锚定变换器回归源和目标关节旋转之间的Soft-DTW距离。该目标教会模块理解哪些关节需要修改,哪些需要保留。通过在MotionFix数据集上的全面实验,我们证明我们的方法显著提高了与文本指令和源运动的语义对齐,以及生成运动的整体保真度,达到了最先进的结果。

英文摘要

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.

2606.01006 2026-06-02 cs.CV 版本更新

Automated Erythrocyte Detection and Tracking for Retinal Blood Flow Quantification in Erythrocyte-Mediated Angiography

自动红细胞检测与追踪用于红细胞介导血管造影中的视网膜血流定量

Chiao-Yi Wang, Havish S Gadde, Yi-Ting Shen, Saige M. Oechsli, Osamah Saeedi, Yang Tao

发表机构 * Department of Bioengineering, University of Maryland, College Park, MD 20742, USA(生物工程系,马里兰大学,学院公园,MD 20742,美国) Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA(眼科学与视觉科学系,马里兰大学医学院,巴尔的摩,MD 21201,美国) Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA(电气与计算机工程系,马里兰大学,学院公园,MD 20742,美国)

AI总结 提出EMTrack框架,通过流上下文模块和拓扑感知追踪策略实现红细胞自动检测与追踪,用于视网膜血流定量,并在新数据集RBF-EMA上优于基线方法。

详情
AI中文摘要

毛细血管水平的视网膜血流(RBF)作为多种眼病的生物标志物具有巨大潜力。然而,测量毛细血管水平RBF的方法仍然有限。红细胞介导血管造影(EMA)是一种新兴成像技术,通过可视化单个红细胞实现毛细血管水平RBF测量,但自动红细胞检测与追踪(量化血流所必需)仍鲜有探索。为填补这一空白,我们提出EMTrack,一种新颖框架,包含用于区分运动与静止细胞的红细胞检测流上下文模块,以及能够在帧间大位移和显著运动变化下进行追踪的拓扑感知追踪策略。此外,我们建立了RBF-EMA,一个包含全面红细胞检测与追踪标注的新EMA数据集。实验结果表明,我们的方法在RBF-EMA数据集上的检测与追踪任务中,在定量和定性上均优于基线方法。此外,RBF量化结果凸显了我们的框架在自动化视网膜血流测量中的巨大潜力。

英文摘要

Capillary-level retinal blood flow (RBF) has strong potential as a biomarker for various ocular diseases. However, modalities for measuring capillary-level RBF remain limited. Erythrocyte-mediated angiography (EMA), an emerging imaging technique, enables capillary-level RBF measurement by visualizing individual erythrocytes, yet automated erythrocyte detection and tracking, which are essential for quantifying blood flow, remain largely unexplored. To address this gap, we propose EMTrack, a novel framework featuring a flow-context module for erythrocyte detection that distinguishes moving from paused cells and a topology-aware tracking strategy that enables tracking under large inter-frame displacements and substantial motion variations. In addition, we establish RBF-EMA, a new EMA dataset with comprehensive erythrocyte detection and tracking annotations. Experimental results demonstrate that our method outperforms baseline methods both quantitatively and qualitatively on detection and tracking tasks in the RBF-EMA dataset. Moreover, RBF quantification results highlight the strong potential of our framework for automated retinal blood flow measurement.

2606.00999 2026-06-02 cs.CV 版本更新

SWARD: Stochastic Window-Attention-Based Relational Distillation for Cross-Architectural Semantic Segmentation

SWARD:基于随机窗口注意力的关系蒸馏用于跨架构语义分割

Aditya Makineni, Qing Tian

发表机构 * Department of Computer Science, University of Alabama at Birmingham(阿拉巴马大学伯明翰分校计算机科学系)

AI总结 提出SWARD框架,通过多尺度窗口注意力蒸馏和原型判别正则化,弥合Transformer教师与CNN学生之间的表征差距,实现跨架构语义分割的知识蒸馏。

详情
AI中文摘要

大规模视觉基础模型在语义分割等密集预测任务上取得了显著进展,但其规模使得在资源受限环境中部署不切实际,因此知识蒸馏成为将其能力迁移至轻量级学生网络的一种手段。然而,现代基础教师模型主要基于Transformer,编码全局上下文,而高效学生模型通常是具有局部偏置感受野的卷积网络。现有蒸馏方法大多假设架构同质性,并依赖直接特征模仿,这未能弥合这种表征差距,且忽略了准确语义分割所需的结构化空间依赖和判别性组织。在本文中,我们提出SWARD,一种通过两种互补机制解决这一差距的知识蒸馏框架。首先,我们引入多尺度窗口注意力蒸馏(MWAD)模块,该模块在随机移位窗口分区中对齐师生基于注意力的关系,窗口偏移在每次训练迭代中随机重新采样。这消除了窗口边界偏差,并结合多尺度设计,捕获了短程和长程空间依赖。其次,我们引入原型判别正则化(PDR),一种通过强制类间分离和类内紧凑性来塑造学生特征分布的损失,进一步锐化判别结构,超越仅靠特征模仿在学生容量减少下所能产生的效果。在不同视觉应用(即城市场景解析和医学图像分割)上的实验表明,SWARD达到了最先进的性能。

英文摘要

Large-scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource-constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer-based and encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi-Scale Windowed Attention Distillation (MWAD) module that aligns teacher-student attention-based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window boundary bias, and, combined with the multi-scale design, captures both short- and long-range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student's feature distribution by enforcing inter-class separation and intra-class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student's reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state-of-the-art performance.
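The idea of matching relations inside randomly shifted windows can be sketched in NumPy as below. This is an illustrative stand-in rather than the MWAD module itself: it assumes a cyclic shift (in the style of shifted-window attention), uses cosine affinity between positions in place of the paper's attention-based relations, and all function names are hypothetical.

```python
import numpy as np


def partition_windows(feat: np.ndarray, window: int, offset) -> np.ndarray:
    """Cyclically shift an (H, W, C) map, then cut it into non-overlapping windows.

    Returns an array of shape (num_windows, window * window, C).
    Assumes H and W are divisible by `window`.
    """
    h, w, c = feat.shape
    dy, dx = offset
    shifted = np.roll(feat, shift=(-dy, -dx), axis=(0, 1))
    blocks = shifted.reshape(h // window, window, w // window, window, c)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)


def window_affinity(windows: np.ndarray) -> np.ndarray:
    """Cosine affinity between all positions inside each window: (num_windows, N, N)."""
    unit = windows / (np.linalg.norm(windows, axis=-1, keepdims=True) + 1e-8)
    return unit @ unit.transpose(0, 2, 1)


def relation_distillation_loss(teacher: np.ndarray, student: np.ndarray,
                               window: int, rng: np.random.Generator) -> float:
    """Mean squared gap between teacher and student relations in shifted windows.

    A fresh offset is drawn on every call, and the same offset is applied to
    both feature maps so that windows cover the same spatial positions.
    """
    dy, dx = (int(v) for v in rng.integers(0, window, size=2))
    t_rel = window_affinity(partition_windows(teacher, window, (dy, dx)))
    s_rel = window_affinity(partition_windows(student, window, (dy, dx)))
    return float(np.mean((t_rel - s_rel) ** 2))
```

Because the loss compares position-by-position affinity matrices rather than raw features, the teacher and student may have different channel widths, which suits a transformer-to-CNN setting. A multi-scale variant would sum this loss over several `window` sizes.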

2606.00987 2026-06-02 cs.CV cs.AI 版本更新

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

多时相指代分割的开源基准与基线

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence (TeleAI)(人工智能研究所) China Telecom(中国电信) School of Artificial Intelligence, Optics and Electronics (iOPEN)(人工智能、光学与电子学院) Northwestern Polytechnical University(西北工业大学)

AI总结 提出多时相指代分割任务,通过自动化数据构建管道CRAFT-Agent生成首个基准MTRefSeg-21K,并设计两阶段训练的变化感知LVLM框架MTRefSeg-R1,实现优于现有基线的性能。

详情
AI中文摘要

大型视觉语言模型(LVLMs)展现了强大的视觉理解和语言引导定位能力,但其多时相视觉推理能力仍未充分探索。为填补这一空白,我们引入了多时相指代分割(MTRS),这是一个新任务,旨在从多时相图像中分割语言描述的时间变化。MTRS通过联合要求时相对应推理、语言定位和像素级掩码预测,扩展了传统的指代分割和变化检测。我们提出了CRAFT-Agent,一个带有人工审核的自动化数据构建管道,并构建了MTRefSeg-21K,这是第一个MTRS基准,包含21K个高质量的多时相图像-文本-掩码三元组,覆盖多样化的场景、视角和领域。对一系列基于VLM和LVLM的模型进行基准测试表明,直接推理表现较差,而任务特定的微调仍然有限。为解决这一问题,我们提出了MTRefSeg-R1,一个采用两阶段策略训练的变化感知LVLM框架。它首先从20K个仅视觉的双时相样本中学习通用时间变化感知,然后在MTRefSeg-21K上进行微调,以实现细粒度的语言引导时间定位。MTRefSeg-R1显式建模跨时相视觉差异,将语言指令与时间变化对齐,并预测所指变化掩码。大量实验表明,与现有的LVLM基线相比,MTRefSeg-R1实现了强大且通常更优的性能,展示了MTRS的挑战和潜力。

英文摘要

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.

2606.00963 2026-06-02 cs.CV cs.CL 版本更新

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Reasmory: 3D重建作为VLMs空间推理的显式记忆

Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

发表机构 * Cornell Tech, Cornell University(康奈尔科技学院、康奈尔大学) NVIDIA(英伟达) illoca AI(illoca人工智能) The University of California, Merced(加州大学默塞德分校)

AI总结 提出Reasmory框架,通过结构化程序执行重建的3D显式记忆,并引入轻量级领域特定语言约束VLM查询和操作,在空间推理任务上提升6-18%。

详情
AI中文摘要

视觉语言模型(VLM)展现出新兴的空间推理能力,但在需要精确空间理解的任务(如视角推理、方向比较和距离估计)上仍不可靠。在多视图图像和单目视频中,相关空间线索通常稀疏且分布在冗余观测中,难以组织和利用。基于重建的视觉基础模型(VFM)提供了一种自然的方式将这些观测聚合为显式空间记忆,例如点云。然而,简单地将重建模型作为自由形式工具使用是脆弱的,VLM可能错误调用工具、跳过所需的空间变换或误用中间结果。我们提出Reasmory,一个将空间推理形式化为对重建空间记忆的结构化程序执行的框架。Reasmory构建显式3D记忆,用语义锚定的3D对象实例增强它,并引入轻量级领域特定语言(DSL),约束VLM在推理过程中如何查询对象和相机、变换视角以及渲染观测。生成的程序在执行前被解析和验证,从而比无约束的工具使用更可靠地与空间记忆交互。在多视图图像和视频空间推理基准上的实验表明,与强基线(包括GPT-5-mini和Gemini-3-flash)相比,一致提升6-18%,表明显式3D记忆在通过约束、验证的操作而非自由形式的工具调用访问时最为有用。

英文摘要

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6-18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.
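
The parse-and-validate step mentioned in the abstract can be illustrated with a toy example. This is only a sketch under stated assumptions: the operation names (`query_object`, `query_camera`, `transform_view`, `render`), their arities, and the one-assignment-per-line syntax are hypothetical and are not taken from Reasmory's actual DSL.

```python
# Toy illustration: validate a generated program before executing anything.
# The operation names and arities are hypothetical, not Reasmory's actual DSL.
import re

ALLOWED_OPS = {
    "query_object": 1,    # query_object(name)
    "query_camera": 1,    # query_camera(frame_id)
    "transform_view": 2,  # transform_view(entity, camera)
    "render": 1,          # render(camera)
}

# One statement per line, of the form: target = op(arg, ...)
LINE_RE = re.compile(r"^(\w+)\s*=\s*(\w+)\((.*)\)$")


def validate(program: str) -> list[str]:
    """Return a list of error messages; an empty list means the program is valid."""
    errors = []
    defined = set()
    for lineno, raw in enumerate(program.strip().splitlines(), start=1):
        line = raw.strip()
        match = LINE_RE.match(line)
        if not match:
            errors.append(f"line {lineno}: cannot parse {line!r}")
            continue
        target, op, arg_text = match.groups()
        # Naive split: string literals containing commas are not supported here.
        args = [a.strip() for a in arg_text.split(",") if a.strip()]
        if op not in ALLOWED_OPS:
            errors.append(f"line {lineno}: unknown operation {op!r}")
        elif len(args) != ALLOWED_OPS[op]:
            errors.append(f"line {lineno}: {op} expects {ALLOWED_OPS[op]} argument(s)")
        # Unquoted, non-numeric arguments must be variables defined on earlier lines.
        for arg in args:
            is_literal = arg.startswith('"') or arg.isdigit()
            if not is_literal and arg not in defined:
                errors.append(f"line {lineno}: undefined variable {arg!r}")
        defined.add(target)
    return errors


good = 'chair = query_object("chair")\ncam = query_camera(0)\nview = transform_view(chair, cam)'
bad = 'view = transform_view(chair)\nimg = teleport(view)'
```

Here `validate(good)` returns an empty list, while `validate(bad)` reports a wrong argument count, an undefined variable, and an unknown operation. Rejecting such programs before execution is the kind of guard that free-form tool calls lack.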

2606.00957 2026-06-02 cs.CV 版本更新

Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers

面向大规模文生视频扩散Transformer的边界保护W8A8 HiFloat8量化

Yiming Zhao

发表机构 * Yiming Zhao(赵毅铭)

AI总结 针对Wan2.1-T2V-14B模型,提出一种边界保护策略的W8A8 HiF8后训练量化方法,通过保留首尾边界块为BF16而量化中间块,在VBench五个维度上匹配或略优于BF16基线。

Comments 6 pages, 5 figures. Accepted to ICME 2026 Grand Challenge

详情
AI中文摘要

我们提出了一种针对Wan2.1-T2V-14B(一个140亿参数文生视频扩散Transformer)的后训练量化方法,目标是在Ascend 910B NPU上实现W8A8 HiFloat8(HiF8)格式。量化视频DiT模型的一个核心挑战是跨Transformer块的异构激活分布:边界块(前几个和后几个块)表现出与中间块根本不同的统计特性,使得均匀量化无效。我们对所有40个WanAttentionBlock进行了系统的逐块激活分析,并利用这些发现提出了一种边界保护策略,该策略保留前两个和后三个块为BF16,同时用W8A8 HiF8量化剩余的35个块。所提出的PTQ方法在评估的所有五个VBench维度上匹配或略优于BF16基线,表明在5提示评估集内没有可测量的精度损失。对四种保护配置的消融研究证实,完全边界保护产生最高的平均VBench分数,验证了数据驱动的块选择。我们还研究了量化感知训练作为补充微调阶段,并分析了在单卡硬件上它无法优于普通PTQ的条件。

英文摘要

We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing video DiT models is the heterogeneous activation distribution across transformer blocks: boundary blocks (the first and last few blocks) exhibit fundamentally different statistical properties from middle blocks, making uniform quantization ineffective. We conduct a systematic per-block activation analysis across all 40 WanAttentionBlocks and use the findings to motivate a boundary-protection strategy that retains the first two and last three blocks in BF16 while quantizing the remaining 35 blocks with W8A8 HiF8. The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation study over four protection configurations confirms that full boundary protection yields the highest average VBench score, validating the data-driven block selection. We additionally investigate quantization-aware training (QAT) as a complementary fine-tuning stage and analyze the conditions under which it fails to outperform plain PTQ on single-card hardware.
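
The block split stated in the abstract (40 blocks, the first two and last three kept in BF16, the remaining 35 quantized) can be written down as a small precision plan. This is a minimal sketch; the function name and the string labels are illustrative assumptions, and no real quantization is performed.

```python
# Sketch of the block-level precision plan described in the abstract:
# 40 blocks, first 2 and last 3 kept in BF16, the rest marked for W8A8 HiF8.
# The function name and string labels are illustrative assumptions.

def precision_plan(num_blocks: int = 40, keep_first: int = 2, keep_last: int = 3) -> list[str]:
    plan = []
    for idx in range(num_blocks):
        # Boundary blocks are the first `keep_first` and the last `keep_last` blocks.
        is_boundary = idx < keep_first or idx >= num_blocks - keep_last
        plan.append("bf16" if is_boundary else "w8a8_hif8")
    return plan


plan = precision_plan()
quantized = [i for i, p in enumerate(plan) if p == "w8a8_hif8"]
print(len(quantized), quantized[0], quantized[-1])  # prints "35 2 36"
```

With zero-based indexing, blocks 0-1 and 37-39 stay in BF16 and blocks 2-36 are quantized, which matches the count of 35 given in the abstract.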

2606.00954 2026-06-02 cs.CV 版本更新

COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation

COLLAR: 级联对象级潜在精化用于高保真条件生成

Xinlong Zhang, Jia Wei, Xiaoyu Zhang, Teng Zhou, Chengyu Lin, Yongchuan Tang

发表机构 * College of Computer Science, Zhejiang University(浙江大学计算机科学学院)

AI总结 提出COLLAR框架,通过视场扩展和级联对象级潜在精化,在扩散Transformer中实现无训练的高保真对象级控制,优于现有方法。

详情
AI中文摘要

尽管引入了深度和Canny图等结构先验,在扩散Transformer中实现高保真对象级控制仍然是一个重大挑战。当前的对象级条件生成方法经常出现视觉伪影,并且难以在小的局部区域内保持对对象的精确控制。为了解决这些限制,我们提出了级联对象级潜在精化(COLLAR),这是一个无训练框架,通过视场(FoV)扩展逐步优化对象级特征。首先,我们提出了跨尺度语义对齐(CSSA)模块,通过注意力机制将对象级特征注入到扩展FoV分支中,以解决空间语义差距。为了进一步优化这些特征,循环特征注入(CFI)模块引入了一个互逆的背景反馈机制。它利用基于频率的自适应策略,用上下文对齐的局部信息选择性更新全局主干。最后,扩展FoV分支作为特征优化的枢纽,确保对象级特征被集成到全局生成过程中,而不损害最终图像质量。在COCO-MIG和COCO-POS基准上的大量实验表明,我们的方法在语义对齐、图像质量和空间保真度方面始终优于最先进的方法。

英文摘要

Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors like depth and Canny maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via the Field-of-View (FoV) expansion. First, we propose the Cross-Scale Semantic Alignment (CSSA) module to address spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism. It leverages a frequency-based adaptive strategy to selectively update the global backbone with context-aligned local information. Finally, the extended-FoV branch serves as a hub for feature optimization, ensuring that object-level features are integrated into the global generation process without compromising final image quality. Extensive experiments on the COCO-MIG and COCO-POS benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods across semantic alignment, image quality, and spatial fidelity.

2606.00936 2026-06-02 cs.CV 版本更新

One Channel to Rule Them All: Rethinking Input Representation for Visual Place Recognition

一个通道统治一切:重新思考视觉地点识别的输入表示

Timur Ismagilov, Shakaiba Majeed, Michael Milford, Tan Viet Tuyen Nguyen, Sarvapali D. Ramchurn, Shoaib Ehsan

发表机构 * School of Computer Science and Electronic Engineering, University of Essex(埃塞克斯大学计算机科学与电子工程学院)

AI总结 通过实验证明灰度图像在视觉地点识别中与RGB性能相当甚至更优,尤其在严重外观变化下,灰度更具鲁棒性,且能减少参数和资源消耗。

Comments 8 pages

详情
AI中文摘要

视觉地点识别(VPR)是长期机器人定位和SLAM的基础,但当前系统主要依赖RGB输入,隐含地假设颜色对于全局地点识别是必要的。我们挑战这一假设,研究在真实世界外观变化下,颜色信息在不同训练机制、模型架构和标准基准中的作用。我们发现灰度通常与RGB性能相当,在颜色不变性学习不足的严重外观变化下甚至优于RGB,而颜色仅在存在持久且可区分的颜色线索时提供有意义的增益。在选定的基准上,完全灰度训练的MixVPR模型平均Recall@1为82.4%,而RGB对应模型为81.2%。在某些情况下,参数减少60%的轻量级灰度变体可以超越更重的RGB模型。灰度还在存储、带宽和与资源受限系统的兼容性方面提供实际优势。我们得出结论,对于场景随光照、天气、季节和环境变化的全局VPR,颜色贡献极小,仅灰度足以实现可靠的地点识别。

英文摘要

Visual Place Recognition (VPR) is fundamental to long-term robot localization and SLAM, yet current systems overwhelmingly rely on RGB input, implicitly assuming color is necessary for global place recognition. We challenge this assumption, investigating the role of chromatic information across training regimes, model architectures and standard benchmarks under real-world appearance variation. We find that grayscale matches RGB performance generally and outperforms it under severe appearance shifts where color invariance is insufficiently learned, while color provides meaningful gains only where persistent and discriminative chromatic cues are present. Across selected benchmarks, a fully gray-trained MixVPR model achieves an average 82.4% Recall@1 compared to 81.2% for its RGB counterpart. In some cases, lightweight grayscale variants with 60% fewer parameters can outperform heavier RGB models. Grayscale further offers practical advantages in storage, bandwidth and alignment with resource-constrained systems. We conclude that for global VPR where scenes vary across illumination, weather, season and setting, color contributes minimally, and grayscale alone is sufficient for reliable place recognition.

2606.00931 2026-06-02 cs.CV cs.AI 版本更新

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

CV-Arena: 面向教学计算机视觉问题求解的开放基准与人类-AI协作偏好

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge, Shuo Xing, Mingyang Wu, Xiangbo Gao, Siyuan Yang, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhen Dong, Ming-Hsuan Yang, Zhengzhong Tu

发表机构 * Texas A&M University(德克萨斯A&M大学) Worcester Polytechnic Institute(沃斯特理工大学) Tohoku University(东北大学) Georgia Institute of Technology(佐治亚理工学院) NVIDIA(英伟达) UCSB(加州大学圣塔芭芭拉分校) UC Merced(加州大学默塞德分校)

AI总结 提出CV-Arena基准,包含12K高分辨率真实图像指令对,覆盖16种任务类型,并采用Active Elo协议结合人类与AI偏好评估21个系统,揭示指令遵循、物理推理等方面的差距,同时开发CV-Agent代理模型展示闭环推理的潜力。

Comments 26 pages, 7 figures, 11 tables

详情
AI中文摘要

指令引导的图像编辑正成为视觉工作的通用接口,然而现有基准仍主要聚焦于狭窄的外观编辑,未能充分捕捉专业工作流程中真实图像任务的多样性。在此,我们将教学计算机视觉问题求解定义为图像编辑的更广泛形式:给定真实输入图像和自然语言指令,系统必须生成编辑后的输出,实现所要求的变换,同时满足明确的保持性、几何、物理和可用性约束。我们引入了CV-Arena,一个旨在以专业规模评估此能力的开放基准。CV-Arena包含12K高分辨率真实图像指令对,涵盖16种基于指令的视觉任务类型,通过CogRetriever构建,这是一个结合目标网络搜索、代理查询精化、验证和可追溯性的双轨检索与筛选流水线。为了在保持人类保真度的同时大规模评估模型,我们提出了Active Elo,一种人类-AI协作偏好协议,利用CV-Judge(一个逻辑门控、多维度VLM评估器)拒绝明显失败并解决高置信度比较,并将接近的高质量比较路由给专家评分者。然后通过可靠性加权的Elo更新聚合混合的人类和AI监督。我们对21个系统(包括专有、开源和代理模型)在CV-Arena上的全面评估揭示了指令遵循、物理推理、结构控制和细粒度细节保持方面的持续差距。我们进一步开发了CV-Agent,一个轻量级代理模型,结合了规划、编辑和验证,并证明了闭环推理是专业级指令遵循视觉编辑的一个有前景的方向。

英文摘要

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scale. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons, and routes close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
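
A reliability-weighted Elo update can be sketched as follows. The expected-score formula is the standard Elo one. How Active Elo actually computes or applies the reliability weight is not given in the abstract, so scaling the K-factor by a weight in [0, 1] is an assumption made here for illustration.

```python
# Standard Elo update with a per-comparison reliability weight.
# Scaling the K-factor by the weight is an assumption, not the paper's rule.

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of system A against system B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, outcome_a, reliability=1.0, k=32.0):
    """outcome_a: 1.0 win, 0.5 tie, 0.0 loss for A; reliability in [0, 1]."""
    delta = k * reliability * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Same comparison, judged once by a fully trusted rater and once by a
# hypothetical automatic judge that is trusted half as much.
human = elo_update(1500.0, 1500.0, 1.0, reliability=1.0)
judge = elo_update(1500.0, 1500.0, 1.0, reliability=0.5)
print(human, judge)  # (1516.0, 1484.0) (1508.0, 1492.0)
```

Under this assumption a less reliable vote moves the ratings less, and a weight of zero leaves them unchanged.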

2606.00928 2026-06-02 cs.CV cs.LG 版本更新

Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models

基于基础模型跨模态蒸馏的单通道组织分割

Sakib Mohammad, Jarin Ritu, Md Sakhawat Hossain

发表机构 * Department of Engineering Technology(工程技术系) Department of Electrical and Computer Engineering(电气与计算机工程系) Department of Mechanical Engineering(机械工程系)

AI总结 提出跨模态知识蒸馏框架,将多通道输入的基础模型教师知识迁移到仅使用核通道的轻量级学生网络,实现单通道组织分割性能大幅提升。

Comments 6 pages, 3 figures

详情
AI中文摘要

多重荧光显微镜通过提供互补通道(包括核(DAPI)和膜(E-cadherin))改善组织分割,这些通道共同编码比单通道成像更丰富的空间上下文。然而,多重模型在推理时需要所有通道,限制了在仅部分通道可用时的部署。本文提出一个跨模态知识蒸馏框架,将处理多重输入的基础模型教师的语义信息迁移到仅使用核通道的轻量级学生网络。蒸馏目标结合了基于MSE的概率匹配、边界感知监督和可学习的不确定性加权。在TissueNet和BBBC038上,评估了SAM ViT-H和CellSAM作为教师,四个U-Net学生:Swin-Tiny(27M)、ResNet18(11M)、EfficientNet-B0(5.3M)和MobileNetV3(1.5M)。在TissueNet上,SAM蒸馏的Swin-Tiny学生达到Dice 78.36(±1.44),比无KD基线(65.31±1.35)提高13.05分,并以23倍参数缩减恢复了教师oracle性能(89.12±1.21)的87.9%。KD一致地使所有四个学生提高约12个Dice点,确认了架构无关的蒸馏。在所有设置中,SAM ViT-H作为教师优于CellSAM。在BBBC038上的跨数据集评估显示,无需教师重新训练即可获得一致增益。

英文摘要

Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels, including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation objective combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. SAM ViT-H and CellSAM are evaluated as teachers across four U-Net students: Swin-Tiny (27M), ResNet18 (11M), EfficientNet-B0 (5.3M), and MobileNetV3 (1.5M), on TissueNet and BBBC038. On TissueNet, the SAM-distilled Swin-Tiny student achieves Dice 78.36 (±1.44), a 13.05-point improvement over the no-KD baseline (65.31 ± 1.35) and 87.9% recovery of teacher oracle performance (89.12 ± 1.21) at a 23x parameter reduction. KD consistently improves all four students by approximately 12 Dice points, confirming architecture-agnostic distillation. SAM ViT-H outperforms CellSAM as teacher across all settings. Cross-dataset evaluation on BBBC038 shows consistent gains without teacher retraining.
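
The headline TissueNet numbers can be cross-checked with simple arithmetic. The sketch below assumes that "recovery" means the student's Dice divided by the teacher's Dice; the abstract does not define the term, but this reading reproduces the reported figure.

```python
# Dice scores quoted in the abstract (TissueNet, Swin-Tiny student, SAM teacher).
student, baseline, teacher = 78.36, 65.31, 89.12

gain = student - baseline               # improvement over the no-KD baseline
recovery = 100.0 * student / teacher    # assumed definition: student / teacher

print(round(gain, 2), round(recovery, 1))  # prints "13.05 87.9"
```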

2606.00927 2026-06-02 cs.CV 版本更新

Bridging Topology and Deep Representation Learning: A TDA-ViT Fusion Model for Four-Class Brain Tumor Classification

桥接拓扑与深度表示学习:用于四类脑肿瘤分类的TDA-ViT融合模型

Faisal Ahmed

发表机构 * Department of Data Science and Mathematics(数据科学与数学系)

AI总结 提出一种将拓扑数据分析(TDA)特征与预训练Vision Transformer(ViT)表示相融合的框架,用于四类脑肿瘤分类,在BRISC2025数据集上达到99.10%的准确率。

Comments 21 pages, 4 figures

详情
AI中文摘要

从磁共振成像(MRI)中准确分类脑肿瘤是早期诊断和临床决策的关键要求。Vision Transformers(ViTs)通过学习全局上下文表示在医学图像分析中表现出强大性能,但它们通常无法捕捉肿瘤区域中存在的内在结构和拓扑模式。为了解决这一局限性,我们提出了一种融合框架,将拓扑数据分析(TDA)特征与预训练的Vision Transformer表示相结合,用于四类脑肿瘤分类。在所提出的方法中,TDA用于提取补充的拓扑描述符,这些描述符从MRI图像中捕捉几何结构、连通性和形状信息。同时,预训练的ViT模型从相同图像中学习高级语义表示。然后将这两个特征空间融合,形成统一且更具判别性的表示用于分类。该模型在BRISC2025数据集上进行评估,该数据集包含四类脑肿瘤:胶质瘤、脑膜瘤、垂体瘤和非肿瘤病例。实验结果表明,与单独使用任一方法相比,结合拓扑和基于Transformer的特征显著提高了性能。所提出的TDA-ViT融合模型实现了99.10%的准确率、99.27%的精确率、99.15%的召回率、99.21%的F1分数和99.98%的AUC。它还优于几种最先进的模型,包括ResNet50、ResNet101、EfficientNetB2和独立的Vision Transformers。这些结果表明,拓扑特征提供了有价值的补充信息,增强了深度表示学习,从而为自动脑肿瘤分类提供了一个稳健且高精度的框架。

英文摘要

Accurate brain tumor classification from magnetic resonance imaging (MRI) is a key requirement for early diagnosis and clinical decision-making. Vision Transformers (ViTs) have shown strong performance in medical image analysis by learning global contextual representations, but they often fail to capture intrinsic structural and topological patterns present in tumor regions. To address this limitation, we propose a fusion framework that combines Topological Data Analysis (TDA) features with pretrained Vision Transformer representations for four-class brain tumor classification. In the proposed method, TDA is used to extract complementary topological descriptors that capture geometric structure, connectivity, and shape information from MRI images. In parallel, a pretrained ViT model learns high-level semantic representations from the same images. These two feature spaces are then fused to form a unified and more discriminative representation for classification. The model is evaluated on the BRISC2025 dataset, which contains four brain tumor classes: glioma, meningioma, pituitary tumor, and non-tumor cases. Experimental results show that combining topological and transformer-based features significantly improves performance compared to using either approach alone. The proposed TDA-ViT fusion model achieves an accuracy of 99.10%, precision of 99.27%, recall of 99.15%, F1-score of 99.21%, and an AUC of 99.98%. It also outperforms several state-of-the-art models, including ResNet50, ResNet101, EfficientNetB2, and standalone Vision Transformers. These results demonstrate that topological features provide valuable complementary information that enhances deep representation learning, leading to a robust and highly accurate framework for automated brain tumor classification.

2606.00910 2026-06-02 cs.CV cs.LG 版本更新

Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval

推理、检索、重排序:一种用于组合视频检索的零样本推理感知框架

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出R3-CoVR零样本管道,通过多模态大模型推理编辑后状态、对比编码检索和约束感知重排序,在CVPR 2026 VidLLMs挑战赛上达到91.9% R@1和98.2% R@10。

详情
AI中文摘要

组合视频检索(CoVR)旨在通过对参考视频应用自由形式的文本修改来寻找目标视频。我们应对CVPR 2026 VidLLMs研讨会上的推理感知CoVR(CoVR-R)挑战,其中检索严格为零样本。我们提出R3-CoVR(推理、检索、重排序),一个完全由冻结基础模型构建的无训练管道。多模态大语言模型(Qwen3-VL-8B)推理编辑所隐含的“后效”——状态转换、动作阶段、场景、镜头和节奏——并生成简洁的编辑后描述;对比视频-文本编码器(SigLIP-2)对该描述和图库进行嵌入以进行第一阶段检索;最后,一个约束感知重排序阶段使用相同的多模态模型作为评判者,对每个候选视频针对预期的编辑结果进行评分。在挑战测试集上,R3-CoVR达到了91.9%的R@1和98.2%的R@10。两个发现推动了这些结果:(i)将描述长度匹配到对比编码器的文本窗口使R@1从67.5提升到72.7;(ii)仅对候选列表进行重排序的约束感知重排序器将R@1从72.7提升到91.9——这是最大的单一增益。我们分析了重排序器的行为、检索/重排序混合以及候选列表深度,并发布了一个干净的三层实现。

英文摘要

Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present R3-CoVR (Reason, Retrieve, Re-rank), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the after-effects an edit implies (state transitions, action phases, scene, camera and tempo) and verbalises a concise post-edit description; a contrastive video-text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains 91.9% R@1 and 98.2% R@10. Two findings drive these results: (i) matching the description length to the contrastive encoder's text window lifts R@1 from 67.5 to 72.7; and (ii) the constraint-aware re-ranker, which reorders only the shortlist, lifts R@1 from 72.7 to 91.9, the single largest gain. We analyse the re-ranker's behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.
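
The retrieve-then-re-rank structure can be sketched in a few lines. All numbers below are made-up toy values: in the paper the embeddings come from SigLIP-2 and the judge scores from Qwen3-VL-8B. The linear blend and its weight of 0.5 are assumptions, since the abstract only says that a retrieve/re-rank blend is analysed.

```python
# Two-stage ranking sketch: embedding retrieval, then re-rank only the shortlist.
# Embeddings, judge scores, and the blend weight are toy assumptions.
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve_then_rerank(query, gallery, judge, shortlist_k=3, blend=0.5):
    # Stage 1: rank the whole gallery by cosine similarity to the query embedding.
    sims = {vid: cosine(query, emb) for vid, emb in gallery.items()}
    ranked = sorted(sims, key=sims.get, reverse=True)
    shortlist, tail = ranked[:shortlist_k], ranked[shortlist_k:]
    # Stage 2: blend similarity with the judge score for shortlisted items only.
    blended = {vid: blend * sims[vid] + (1 - blend) * judge(vid) for vid in shortlist}
    # Items outside the shortlist keep their first-stage order.
    return sorted(shortlist, key=blended.get, reverse=True) + tail


gallery = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.6, 0.8], "d": [0.0, 1.0]}
judge_scores = {"a": 0.1, "b": 0.9, "c": 0.2}  # stand-in for the MLLM judge
result = retrieve_then_rerank([1.0, 0.0], gallery, judge_scores.get)
print(result)  # ['b', 'a', 'c', 'd']
```

In this toy case first-stage retrieval ranks `a` first, and the judge promotes `b` to the top without touching anything outside the shortlist. Because the judge only sees the shortlist, its cost stays bounded by the shortlist depth.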

2606.00906 2026-06-02 cs.CV 版本更新

hZACH-ViT: Curved Latent Geometry for Compact Vision Transformers in Low-Data Medical Imaging

hZACH-ViT:用于低数据医学成像中紧凑视觉Transformer的曲率潜在几何

Athanasios Angelakis

发表机构 * BioML Lab, Research Institute CODE, UniBw, Munich, Germany(BioML实验室,CODE研究机构,UniBw,慕尼黑,德国) Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, Netherlands(流行病学与数据科学系,阿姆斯特丹大学医学中心,阿姆斯特丹,荷兰)

AI总结 提出hZACH-ViT,通过扩展ZACH-ViT的潜在空间为双曲或球形几何,在低数据医学成像中提升紧凑视觉Transformer的性能,并在MedMNIST数据集上平均提升+0.021。

Comments 17 pages, 2 figures, 4 tables. Code, execution notebooks, and aggregated result summaries will be released at https://github.com/Bluesman79/hZACH-ViT upon publication

详情
AI中文摘要

紧凑视觉Transformer在低数据和资源受限的医学成像场景中具有吸引力,但大多数现有变体假设欧几里得潜在几何足以组织图像表示。我们引入了hZACH-ViT,这是ZACH-ViT的曲率几何扩展家族,ZACH-ViT是一种紧凑的零令牌视觉Transformer,它去除了位置嵌入和类别令牌,并依赖于补丁表示的全局平均池化。为了隔离几何的作用,我们保留了经过验证的ZACH-ViT骨干网络,仅修改了最终表示空间和基于原型的分类器头部,从而实现了欧几里得、双曲和球形潜在几何之间的受控比较。我们在七个MedMNIST数据集上评估了庞加莱、克莱因和球形hZACH-ViT头部,采用相同的少样本协议,每个类别50个样本和五个随机种子。完整的基准测试包含770次训练运行,涵盖七个数据集、三种非欧几里得几何、七个曲率幅度以及一个欧几里得基线。在所有七个数据集中,最佳非欧几里得hZACH-ViT配置优于欧几里得ZACH-ViT,在数据集特定的主要指标上平均提升+0.021,在OCTMNIST上提升最大(+0.055 MacroF1)。固定的低曲率配置在大多数数据集上保持正向增益,低曲率值(c = 0.1或0.2)占据了七个数据集级别优胜者中的六个。我们的结果并未确定一个普遍最优的流形,而是将几何和曲率确立为数据集依赖的模型选择变量,固定的低曲率分析证实了增益在详尽的逐数据集调优之外仍然存在。

英文摘要

Compact Vision Transformers are attractive for medical imaging in low-data and resource-constrained settings, but most existing variants assume that Euclidean latent geometry is sufficient for organizing image representations. We introduce hZACH-ViT, a family of curved-geometry extensions of ZACH-ViT, a compact zero-token Vision Transformer that removes positional embeddings and the class token and relies on global average pooling over patch representations. To isolate the role of geometry, we preserve the verified ZACH-ViT backbone and modify only the final representation space and prototype-based classifier head, enabling a controlled comparison between Euclidean, hyperbolic, and spherical latent geometries. We evaluate Poincaré, Klein, and spherical hZACH-ViT heads on seven MedMNIST datasets under an identical few-shot protocol with 50 samples per class and five random seeds. The completed benchmark contains 770 training runs spanning seven datasets, three non-Euclidean geometries, seven curvature magnitudes, and a Euclidean baseline. Across all seven datasets, the best non-Euclidean hZACH-ViT configuration improves over Euclidean ZACH-ViT, with an average gain of +0.021 in the dataset-specific primary metric and the largest improvement on OCTMNIST (+0.055 MacroF1). Fixed low-curvature configurations retain positive gains on the majority of datasets, and low curvature values (c = 0.1 or 0.2) account for six of the seven dataset-level winners. Rather than identifying a universally optimal manifold, our results establish geometry and curvature as dataset-dependent model-selection variables, with fixed low-curvature analyses confirming that gains persist beyond exhaustive per-dataset tuning.
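
Two quantities from the abstract can be made concrete. First, one decomposition consistent with the reported 770 runs is seven datasets times (three geometries times seven curvatures plus one Euclidean baseline) times five seeds. Second, a hyperbolic prototype head needs a geodesic distance; the sketch below uses the standard Poincaré-ball distance for curvature -c. Whether hZACH-ViT's classifier head uses exactly this form is an assumption.

```python
import math

# Run count implied by the abstract: per dataset, 3 geometries x 7 curvatures
# plus 1 Euclidean baseline, each repeated over 5 seeds.
datasets, geometries, curvatures, seeds = 7, 3, 7, 5
total_runs = datasets * (geometries * curvatures + 1) * seeds


def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0).

    Points must lie strictly inside the ball, i.e. c * ||x||^2 < 1.
    """
    sq_norm = lambda v: sum(a * a for a in v)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


origin, p = [0.0, 0.0], [0.5, 0.0]
d = poincare_distance(origin, p, c=1.0)
print(total_runs, round(d, 4))  # prints "770 1.0986"
```

For c = 1, the distance from the origin to a point at Euclidean radius 0.5 is 2·artanh(0.5) = ln 3, which is larger than 0.5. Distances grow without bound near the boundary of the ball, and the curvature magnitude c controls how quickly.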

2606.00891 2026-06-02 cs.CV 版本更新

MMDG-Bench: A Benchmark for Multimodal Domain Generalization

MMDG-Bench:多模态领域泛化基准

Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu

发表机构 * University of Manchester(曼彻斯特大学) Jiyue AI(极越AI) Samsung AI Centre Cambridge(三星AI中心剑桥) University of Surrey(萨里大学)

AI总结 提出MMDG-Bench基准,通过D2M和M2D两种框架统一多模态学习与领域泛化,在动作识别和活体检测等任务上验证了结构化组合优于现有方法,并给出关键设计指南。

详情
AI中文摘要

多模态领域泛化(MMDG)旨在利用互补模态增强模型在未见领域上的鲁棒性。尽管多模态学习(MML)和领域泛化(DG)作为独立领域取得了广泛进展,但它们的系统集成仍未被充分探索。当前的MMDG研究主要局限于动作识别,且缺乏标准化的评估协议。为此,我们引入了MMDG-Bench,一个全面的基准,包含两个基础框架:先DG后MML(D2M)和先MML后DG(M2D)。我们在多种任务上提供了统一的实验协议,包括视频-音频-光流动作识别和RGB-深度-红外人脸活体检测。通过将统一的MML配置与五种DG技术配对,在D2M和M2D两种顺序下实例化十个MMDG基线,我们证明这些结构化组合通常优于现有最先进方法,强调了统一基准工作的必要性。我们的分析得出三个关键见解:(1)集成DG技术在各种骨干网络上提供一致的泛化增益,而非DG方法对骨干网络变化高度敏感;(2)最优框架选择取决于模态间稳定性:当模态关系在领域间稳定时D2M表现更好,而M2D对跨领域关系变化更鲁棒;(3)更强的骨干网络在集成到我们的结构化框架中时会产生放大的性能收益。MMDG-Bench为未来多模态鲁棒性研究提供了原则性基础和可操作的设计指南。代码已发布在 https://github.com/qszhan/MMDG-Bench。

英文摘要

Multi-modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi-modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under-explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG-Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG-Bench provides a principled foundation and actionable design guidelines for future research in multi-modal robustness. Code is released at https://github.com/qszhan/MMDG-Bench.

2606.00890 2026-06-02 cs.CV 版本更新

Cohort-Scale Neural Atlases of Ultrasound Video

超声视频的队列级神经图谱

Zhuorui Zhang, Roger Pallarès-López, Xuan Wu, Praneeth Namburi, Brian W. Anthony

发表机构 * Department of Mechanical Engineering, MIT(麻省理工学院机械工程系) Institute for Medical Engineering and Science, MIT(麻省理工学院医学工程与科学研究所) MIT.nano Immersion Lab, MIT(麻省理工学院MIT.nano沉浸实验室)

AI总结 提出一种基于DINOv3特征空间、联合训练数千帧的队列级神经图谱方法,通过每视频生成潜在优化嵌入实现准确注释迁移,在五个心脏和肌肉骨骼数据集上达到与强基线相当的精度。

详情
AI中文摘要

超声是临床实践中应用最广泛的实时成像模态,然而每帧视频注释仍然是一个主要瓶颈:专家标签稀缺且昂贵,图像外观随散斑、阴影、衰减和操作者依赖的探头姿态而变化。这尤其具有局限性,因为临床相关信息通常是动态的,从超声心动图中的左心室运动到肌肉骨骼成像中的肌肉和骨骼运动学。群体图谱可以通过将观测注册到共享的规范坐标系来分摊注释成本,但现有的神经图谱方法主要针对单个视频、小型测试时图像集或物体中心的图像集合。我们引入了一种用于超声视频的队列级神经图谱:一个单一的规范图表,带有每视频生成潜在优化嵌入,在DINOv3特征空间中联合训练数千帧。在五个带有地标点和分割掩膜的心脏和肌肉骨骼数据集上,我们的方法学习了连贯的规范模板,并实现了准确的图谱空间注释迁移。在EchoNet-Dynamic和MSK-Bone上,它支持单次和少样本迁移,其精度与强密集对应基线相当,同时在单个消费级GPU上训练只需几分钟。学习到的嵌入是可解释的:线性投影揭示了结构化的队列变异,图像解码器插值产生解剖学上合理的中间帧,测试时潜在反演通过图谱重建保留帧。这些结果表明,队列级神经图谱为减少超声视频分析中的专家注释负担提供了一种实用、可解释的表示。

英文摘要

Ultrasound is the most widely used real-time imaging modality in clinical practice, yet per-frame video annotation remains a major bottleneck: expert labels are scarce and costly, and image appearance varies with speckle, shadowing, attenuation, and operator-dependent probe pose. This is especially limiting because clinically relevant information is often dynamic, from left-ventricular motion in echocardiography to muscle and bone kinematics in musculoskeletal imaging. Population atlases can amortize annotation cost by registering observations to a shared canonical coordinate system, but existing neural atlas methods mainly target single videos, small test-time image sets, or object-centric image collections. We introduce a cohort-scale neural atlas for ultrasound video: a single canonical chart with per-video Generative Latent Optimization embeddings, trained jointly over thousands of frames in DINOv3 feature space. Across five cardiac and musculoskeletal datasets with point landmarks and segmentation masks, our method learns coherent canonical templates and enables accurate atlas-space annotation transfer. On EchoNet-Dynamic and MSK-Bone, it supports single- and few-shot transfer with accuracy competitive with strong dense-correspondence baselines, while training in minutes on a single consumer GPU. The learned embeddings are interpretable: linear projections reveal structured cohort variation, image-decoder interpolation produces anatomically plausible intermediate frames, and test-time latent inversion reconstructs held-out frames through the atlas. These results suggest that cohort-scale neural atlases offer a practical, interpretable representation for reducing expert annotation burden in ultrasound video analysis.

2606.00886 2026-06-02 cs.CV cs.RO 版本更新

GABI: Geometry-Aware Boundary Integration for Spacecraft Segmentation

GABI: 用于航天器分割的几何感知边界集成

Iason Georgios Velentzas, Dhruv Ahuja, Panagiotis Tsiotras

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出一种轻量级边界感知多任务分割架构GABI,通过辅助距离场预测头增强卷积骨干网络,在保持低模型复杂度的同时提升航天器分割精度,在SPARK基准上平均精度提升5%,跨域泛化提升50%。

Comments Accepted to AI4Space at CVPR 2026

详情
AI中文摘要

精确分割对于自主航天器至关重要,因为它直接影响与3D态势感知相关的下游任务。然而,太空恶劣的照明条件会产生外观高度变化的图像,阻碍分割方法在不同航天器和环境中的泛化。在这项工作中,我们提出了GABI,一种轻量级的边界感知多任务分割架构,它通过一个辅助的距离场预测头增强卷积骨干网络。距离场在物体边界周围提供密集的几何监督,鼓励网络学习航天器结构的空间一致表示,同时保持适合机载感知系统的低模型复杂度。我们在一个既定的卷积基线和更重的基于Transformer的架构上评估了GABI。在SPARK基准上,距离场监督使基线在平均精度上提高了5%,同时实现了与Transformer模型相当的性能。在泛化实验中,GABI的平均精度比基线提高了50%以上。在跨域评估中,轻量级GABI变体在IoU和F1分数上与更重的Transformer模型相差5%以内,而体积大约小十倍。同时,更重的GABI变体在保持近三倍轻量的同时超越了Transformer架构。

英文摘要

Accurate segmentation is crucial for autonomous spacecraft, as it directly affects downstream tasks related to 3D situational awareness. The harsh illumination conditions of space, however, produce images with high variability in appearance, hindering the generalization of segmentation approaches across different spacecraft and environments. In this work, we propose GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head. The distance field provides dense geometric supervision around object boundaries, encouraging the network to learn spatially consistent representations of spacecraft structures while maintaining low model complexity suitable for onboard perception systems. We evaluated GABI against both an established convolutional baseline and a heavier transformer-based architecture. On the SPARK benchmark, distance-field supervision improves the baseline by up to 5% in Average Precision while achieving performance comparable to the transformer models. In generalization experiments, GABI improves Average Precision by more than 50% over the baseline. In cross-domain evaluation, the lightweight GABI variant performs within 5% in IoU and F1-score of the heavier transformer model while being approximately ten times smaller. At the same time, the heavier GABI variant surpasses the transformer architectures while remaining nearly three times lighter.
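
A distance-field target of the kind an auxiliary head could regress can be derived from a binary mask. The abstract does not say whether GABI's field is signed, truncated, or normalized, so the unsigned Euclidean distance to the nearest boundary pixel used below is an assumption, and the brute-force search is only suitable for tiny examples.

```python
import math


def boundary_pixels(mask):
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    out = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    out.append((r, c))
                    break
    return out


def distance_field(mask):
    """Unsigned Euclidean distance from every pixel to the nearest boundary pixel.

    Assumes the mask contains at least one foreground pixel.
    """
    edge = boundary_pixels(mask)
    return [[min(math.hypot(r - er, c - ec) for er, ec in edge)
             for c in range(len(mask[0]))] for r in range(len(mask))]


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
field = distance_field(mask)
print(field[2][2], field[1][1])  # prints "1.0 0.0"
```

Unlike a binary mask, this target varies smoothly with distance from the object outline, so every pixel carries some information about where the boundary lies.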

2606.00872 2026-06-02 cs.CV 版本更新

Images as Tables: In-Context Learning with TabPFN for Low-Data Detection of AI-Generated Images

图像作为表格:利用TabPFN进行上下文学习以实现低数据量下AI生成图像的检测

Jan Philip Walter, Shashank Agnihotri, Margret Keuper

发表机构 * Jan Philip Walter Shashank Agnihotri Margret Keuper

AI总结 提出将图像转换为表格形式,使用冻结的DINOv3骨干网络提取特征,并通过TabPFN进行上下文学习,在低数据量下有效检测AI生成图像,优于现有方法。

Comments Accepted as a Spotlight Oral at the ICML 2026 Workshop Foundation Models for Structured Data. *Equal Contribution

详情
AI中文摘要

AI生成图像检测是一个移动目标问题:针对一个生成器训练的检测器在出现新生成器时常常失效,且只有少量标记样本可用。我们研究了一种简单的图像到表格的公式化方法,其中每个图像由冻结的DINOv3骨干网络编码,其CLS特征通过PCA降维为500维的结构化行,TabPFN通过上下文表格推理而非特定任务分类器训练进行真实/伪造分类。这将伪造图像检测转化为对学习到的视觉特征的低数据量结构化预测,使检测器适应依赖于标记的上下文集而非基于梯度的微调。在GenImage上,LATTE(最新的最先进检测器)在拥有来自所有生成器的大量标记样本时仍然更强,在最大合并设置中高出7.4%,但在实际重要的低数据量情况下,DINOv3-PCA-TabPFN更强,最高超出LATTE 8.2%,并且在检测器必须从一个生成器泛化到另一个生成器的迁移设置中也是如此。这些结果将表格基础模型定位为图像取证中一种强大的互补适应机制,将适应从检测器重新训练转变为使用少量标记样本的轻量级上下文更新。代码URL:https://github.com/jpwalter30/Towards-Generalizable-Detection-of-AI-Generated-Images

英文摘要

AI-generated image detection is a moving-target problem: detectors trained on one generator often fail when a new generator appears, and only a few labeled examples are available. We study a simple image-to-table formulation for this regime, where each image is encoded by a frozen DINOv3 backbone, its CLS feature is reduced to a 500-dimensional structured row with PCA, and TabPFN performs real/fake classification by in-context tabular inference rather than task-specific classifier training. This turns fake-image detection into low-data structured prediction over learned visual features, making detector adaptation depend on the labeled context set instead of gradient-based fine-tuning. On GenImage, LATTE, a recent state-of-the-art detector, remains stronger when many labeled samples from all generators are available, by 7.4% in the largest pooled setting, but DINOv3-PCA-TabPFN is stronger in the practically important low-data regime, outperforming LATTE by up to 8.2%, and in transfer settings where the detector must generalize from one generator to another. These results position tabular foundation models as a strong complementary adaptation mechanism for image forensics, shifting adaptation from detector retraining to lightweight in-context updates with a small labeled set of examples. Code URL: https://github.com/jpwalter30/Towards-Generalizable-Detection-of-AI-Generated-Images

2606.00871 2026-06-02 cs.CV cs.AI 版本更新

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

城市感知中的视觉语言模型基准应具备可靠性意识且可协商

Rashid Mushkani

发表机构 * Rashid Mushkani

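AI summary: Argues that benchmarks for vision-language models in urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability, and treat the label space and scoring policy as negotiable artifacts.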

Comments To appear in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

视觉语言模型(VLM)越来越多地用于生成街景图像的结构化描述,用于街道景观审计、制图和公众咨询等任务。这些用途将可观察属性与评估类别相结合,而人类目标往往是带有分歧和明确不回答的判断分布。本文认为,为城市感知建立VLM基准应将分歧和弃权视为测量结果,报告标注者间信度以及模型对齐度,并在输出旨在为城市治理提供信息时,将标签空间和评分策略视为可协商的产物。我们基于一个由来自七个社区组织的12名参与者对100个蒙特利尔街景进行30个维度标注的基准,以及对七个VLM的确定性零样本评估来论证这一观点。在各个维度上,模型与人类共识的一致性随维度层面的人类信度共同变化,而对于评估维度“总体印象”,模型和标注者表现出分布不匹配,包括“不适用”的不同比率。最后,我们为基准创建者、模型开发者和机构提出了行动建议,以使不确定性和基准假设在评估报告中可见。

英文摘要

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression, models and annotators exhibit distributional mismatch, including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.
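As a rough illustration of reporting reliability and abstention side by side, the sketch below computes a per-dimension agreement score while keeping "Not applicable" as its own category and preserving the full label distribution per image. The abstract does not say which reliability statistic the paper uses; mean pairwise agreement is a simple stand-in chosen here (a chance-corrected statistic would be a natural alternative), and the function names and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

NA = "Not applicable"  # explicit non-response, kept as a category rather than dropped


def pairwise_agreement(labels):
    """Fraction of annotator pairs that chose the same label for one image."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def dimension_report(items):
    """items: one list of labels per image (one label per annotator)."""
    agreement = sum(pairwise_agreement(labels) for labels in items) / len(items)
    total = sum(len(labels) for labels in items)
    abstention = sum(labels.count(NA) for labels in items) / total
    # Keep the whole judgment distribution instead of collapsing to a majority vote.
    distributions = [dict(Counter(labels)) for labels in items]
    return {
        "agreement": agreement,
        "abstention_rate": abstention,
        "distributions": distributions,
    }


# Toy data (made up): 2 images, 4 annotators each, for a single dimension.
toy = [["good", "good", "good", NA], ["good", "bad", NA, NA]]
report = dimension_report(toy)
print(report["agreement"], report["abstention_rate"])
```

Reporting the abstention rate and the distributions next to the agreement score makes it visible when a model's "Not applicable" rate differs from the annotators', which a single accuracy figure would hide.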

2606.00852 2026-06-02 cs.CV cs.AI cs.LG 版本更新

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

RefDiffNet: 在检测前学习暴露细微PCB缺陷

Vinay Edula, Nilesh Badwe, Priyanka Bagade

发表机构 * Department of Computer Science and Engineering Indian Institute of Technology Kanpur(计算机科学与工程系印度理工学院坎浦尔) Department of Materials Science and Engineering Indian Institute of Technology Kanpur(材料科学与工程系印度理工学院坎浦尔)

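AI summary: Proposes RefDiffNet, a lightweight plug-and-play input enhancement module that uses a defect-free reference image to highlight defective regions, improving downstream detector performance on PCB defect detection.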

详情
AI中文摘要

印刷电路板(PCB)缺陷检测具有挑战性,因为许多缺陷很小且难以与复杂的背景图案区分。大多数基于深度学习的PCB检测方法仅依赖被检测的PCB图像进行缺陷检测,忽略了编码走线、焊盘和其他PCB结构预期布局的无缺陷参考图像。在这项工作中,我们提出了RefDiffNet,一种轻量级即插即用的输入增强模块,放置在检测器主干之前,用于在缺陷检测前增强图像。RefDiffNet将经典检测中的一个成熟思想带入深度学习时代,利用无缺陷参考图像来揭示缺陷。RefDiffNet比较缺陷图像与对齐的参考图像,捕获相对于参考图像的结构变化,并使用轻量级编码器输出缺陷区域被突出的原始图像,从而简化下游检测器的任务。在HRIPCB和DeepPCB上的结果表明,RefDiffNet在各类检测器上一致地提升了性能,包括从YOLOv8到YOLOv26的单阶段检测器、基于Transformer的RT-DETR以及两阶段Faster R-CNN。它实现了高达18%的相对mAP50:95增益,且开销可忽略,仅引入0.004-0.005M额外参数和0.7-0.8 GFLOPs,最多占任何评估检测器参数量的0.25%。结果确立了RefDiffNet作为一种轻量级、即插即用、检测器无关的输入增强模块,以最小的计算成本显著提升PCB缺陷检测性能。

英文摘要

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector's task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004-0.005M additional parameters and 0.7-0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. These results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.
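The classical reference-comparison idea that the abstract builds on can be illustrated with a plain difference map. This is only a hand-written sketch of that idea, assuming pre-aligned grayscale images stored as nested lists; RefDiffNet itself uses a learned lightweight encoder whose details are not given in the abstract, and the function name and `gain` parameter are hypothetical.

```python
def highlight_defects(inspected, reference, gain=1.0):
    """Brighten pixels of the inspected image that deviate from the reference.

    Both images are 2D lists of floats in [0, 1] and are assumed to be aligned.
    """
    enhanced = []
    for row_inspected, row_reference in zip(inspected, reference):
        out_row = []
        for pixel, expected in zip(row_inspected, row_reference):
            deviation = abs(pixel - expected)  # zero wherever the board matches
            out_row.append(min(1.0, pixel + gain * deviation))
        enhanced.append(out_row)
    return enhanced


# Toy 2x3 example: the pad at (1, 1) is much darker than in the reference.
reference = [[0.2, 0.2, 0.2], [0.2, 0.8, 0.2]]
inspected = [[0.2, 0.2, 0.2], [0.2, 0.3, 0.2]]
enhanced = highlight_defects(inspected, reference)
print(enhanced)
```

Pixels that agree with the reference pass through unchanged, so a downstream detector sees the original board with only the deviating region emphasized.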

2606.00844 2026-06-02 cs.CV cs.AI cs.LG 版本更新

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

MoEIoU:将边界框回归重新思考为混合专家模型

Vinay Edula, Priyanka Bagade

发表机构 * Indian Institute of Technology Kanpur(印度理工学院坎普尔分校)

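AI summary: Proposes the MoEIoU loss, which jointly optimizes overlap, center alignment, and aspect ratio through a mixture-of-experts formulation with a curriculum-based weighting schedule, and outperforms existing IoU losses across multiple datasets and YOLO architectures.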

详情
AI中文摘要

边界框回归是目标检测的基本组成部分,在精确目标定位中起着关键作用。现有的基于交并比(IoU)的损失函数通过引入几何惩罚项(如中心距离和长宽比不匹配)来扩展IoU目标,以改进边界框回归。然而,这些惩罚项通常在训练过程中保持不变,没有考虑优化动态:预测框在初始阶段表现出较大的中心距离和形状误差,而后期阶段则侧重于提高与真实框的重叠。为了解决这一局限性,我们引入了MoEIoU,一种基于混合专家的回归损失,它联合建模了重叠、中心对齐和长宽比不匹配。MoEIoU使用log-sum-exp函数聚合这些组件,该函数强调主要的定位误差,同时保持其他项的平滑贡献。此外,采用基于课程的权重调度,在早期训练阶段优先纠正框的位置和形状,在后期阶段提高重叠。我们在PASCAL VOC、HRIPCB和MS COCO上使用多种YOLO架构以及大规模模拟实验评估了所提出的MoEIoU。它始终优于标准和最新的最先进损失,表现出更快的收敛速度和更高的定位精度。我们进一步表明,这种自适应聚合改进了现有的基于IoU的损失,带来了一致的增益,并为目标检测框架中的边界框回归提供了更有效的优化指导。

英文摘要

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluated the proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.
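To make the log-sum-exp aggregation and the curriculum weighting concrete, here is a minimal sketch. It is not the paper's formulation: the three penalty terms are borrowed from the well-known IoU, DIoU and CIoU definitions, and the temperature `tau`, the linear schedule, and the normalisation that makes a perfect match score zero are assumptions made for illustration.

```python
import math


def iou_terms(pred, gt):
    """Boxes are (x1, y1, x2, y2) with positive area. Each penalty is 0 for a perfect match."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_g, h_g = gt[2] - gt[0], gt[3] - gt[1]
    overlap = 1.0 - inter / (w_p * h_p + w_g * h_g - inter)
    # Squared centre distance over the squared diagonal of the enclosing box (DIoU term).
    enc_w = max(pred[2], gt[2]) - min(pred[0], gt[0])
    enc_h = max(pred[3], gt[3]) - min(pred[1], gt[1])
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    center = (dx * dx + dy * dy) / (enc_w ** 2 + enc_h ** 2)
    # Aspect-ratio mismatch (the v term of CIoU).
    aspect = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    return overlap, center, aspect


def lse_loss(terms, weights, tau=0.1):
    """Smooth maximum of the weighted terms; shifted so that all-zero terms give 0."""
    scaled = [w * t / tau for w, t in zip(weights, terms)]
    peak = max(scaled)  # subtract the peak for numerical stability
    lse = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return tau * (lse - math.log(len(scaled)))


def curriculum_weights(progress):
    """progress in [0, 1]: early training favours centre/shape, late training favours overlap."""
    return (0.5 + 0.5 * progress, 1.0 - 0.5 * progress, 1.0 - 0.5 * progress)


gt_box = (0.0, 0.0, 2.0, 2.0)
shifted = (1.0, 0.0, 3.0, 2.0)
print(lse_loss(iou_terms(shifted, gt_box), curriculum_weights(0.0)))
print(lse_loss(iou_terms(shifted, gt_box), curriculum_weights(1.0)))
```

With a small `tau` the loss tracks the largest weighted penalty, while the other terms still contribute a smooth, non-zero gradient, which is the behaviour the abstract attributes to the log-sum-exp aggregation.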

2606.00829 2026-06-02 cs.CV 版本更新

The Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge

正确的推理策略即是一切:面向EgoCross挑战的近乎无需训练的领域感知推理

Leyi Wu, Yifan Zhao, Jinjie Zhang, Yinchuan Li, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州)) HKUST(香港科技大学) Knowin

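AI summary: Addresses the weak performance of multimodal large models on egocentric video question answering under severe domain shift in the source-limited track of the EgoCross challenge, proposing a domain-wise inference strategy that designs separate input, prompting, and answer-mapping procedures for each of the four target domains and substantially improves the baseline with almost no additional training.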

详情
AI中文摘要

EgoCross评估多模态大语言模型在显著领域偏移下的自我中心视频问答,其中测试视频来自手术、工业装配、极限运动和动物佩戴相机,而非日常场景。在源受限赛道中,基础模型固定为Qwen3-VL-4B,而官方任务特定支持集仅包含20个训练样本。这一设置使得挑战更侧重于向受限模型暴露正确的视觉、时序和答案选择线索,而非模型规模。我们的关键观察是,冻结的基线模型并非完全无法处理这些罕见场景;相反,它往往缺乏合适的接口来将其现有的视觉-语言知识迁移到新任务格式。因此,我们采用领域感知推理策略,将四个目标领域分开处理,并根据每个领域的任务特点设计不同的输入、提示和答案映射流程。这些策略通过强调每个领域重要的线索,使罕见自我中心场景对VLM更具可解释性。最终系统几乎无需训练:手术和动物问题使用基础Qwen3-VL-4B模型回答,而极限运动和工业问题仅使用在提供的20个训练样本上训练两个epoch的官方SFT检查点。在最终评估中,这一简单策略达到了66.98%的整体准确率,表明精心设计的领域感知推理可以弥补基础模型能力的不足,并恢复基线模型中已存在的大部分能力。

英文摘要

EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather than ordinary daily-life scenes. In the source-limited track, the base model is fixed to Qwen3-VL-4B, while the official task-specific support set contains only 20 training samples. This setting makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. Our key observation is that the frozen baseline model is not simply incapable of these rare scenarios; rather, it often fails to transfer its existing visual-language knowledge to the new task format without an appropriate interface. We therefore use a domain-wise inference strategy that treats the four target domains separately and designs different input, prompting, and answer-mapping procedures according to each domain's task characteristics. These strategies make the rare egocentric scenes more interpretable to the VLM by emphasizing the cues that matter for each domain. The resulting system is nearly training-free: surgery and animal questions are answered with the base Qwen3-VL-4B model, while XSports and industry use only the official SFT checkpoint trained for two epochs on the provided 20 training samples. On the final evaluation, this simple strategy reaches 66.98% overall accuracy, suggesting that careful domain-aware inference can compensate for limited base-model strength and recover much of the ability already present in the baseline model.

2606.00828 2026-06-02 cs.CV 版本更新

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

RoboStressBench: 在具身场景中基准测试VLM对物理视觉压力的鲁棒性

Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州))

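AI summary: Proposes RoboStressBench, which takes an inverse graphics perspective to decompose visual stress into four physical dimensions (material, viewpoint, lighting, and geometry), systematically evaluates VLM robustness under realistic physical stress, and introduces a stress-aware solver that improves performance in high-stress scenarios.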

详情
AI中文摘要

视觉语言模型(VLM)展现出强大的视觉理解能力,并越来越多地部署在具身AI系统中,这些系统需要在真实条件下进行可靠的感知。然而,现有的基准测试使用干净图像或孤立扰动来评估VLM,而非由物理场景形成引起的压力。这种设计有两个局限性:它仅覆盖了日常视觉压力的一小部分子集,并且某些扰动在现实具身场景中很少出现。这一差距引发了一个基本问题:我们如何以一种原则性的方式定义视觉压力,以捕捉物理环境中遇到的各种因素?为了解决这个问题,我们从逆图形学角度构建视觉感知,并引入RoboStressBench,这是一个用于评估VLM在具身场景中对物理视觉压力鲁棒性的基准测试。受物理渲染方程的启发,RoboStressBench将视觉压力分解为四个物理基础维度:材质(M)、视角(V)、光照(L)和几何(G)。这种设计使RoboStressBench能够覆盖现实世界环境中的广泛视觉压力,同时允许对其在VLM能力(如视觉识别、推理和规划)上的影响进行受控分析。通过对最先进的VLM进行全面评估,我们识别出特定于压力的失败模式,并揭示了不同的物理因素会降低不同的具身能力,而这些往往被总体准确率所掩盖。我们进一步引入了一种压力感知的智能求解器,它在推理前检测视觉压力源并调用视觉编辑技能,从而提高了高压力场景下的鲁棒性。总体而言,RoboStressBench提供了一个原则性的评估框架,用于诊断和改进VLM在真实物理压力下的感知能力,支持开发更可靠的具身AI系统。

英文摘要

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.

2606.00825 2026-06-02 cs.CV cs.ET cs.HC cs.MA 版本更新

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

SuperMemory-VQA:面向长期记忆的自我中心视觉问答基准

Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe, Hyo Jin Kim, Mi Zhang

发表机构 * The Ohio State University(俄亥俄州立大学) Meta Project(Meta项目)

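AI summary: Introduces the SuperMemory-VQA dataset, comprising 52.9 hours of everyday activities recorded with AI glasses and 4,853 multiple-choice question-answer pairs, for evaluating AI assistants on long-horizon memory tasks; existing systems are found to be insufficiently reliable.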

Comments 34 pages, 21 figures, 5 tables

详情
AI中文摘要

AI眼镜为AI代理作为个性化记忆助手提供了有吸引力的平台。要真正有用,此类系统必须超越短期视频理解,解决人类在纵向自我中心视频流中因实际、个人或社交目的而经历的记忆缺口。然而,现有的自我中心数据集主要关注动作识别或来自短片的通用问答,衡量的是感知能力而非现实的人类记忆需求。我们引入了SuperMemory-VQA,一个用于评估AI助手在实际长期记忆任务上的自我中心视觉问答(VQA)数据集。它包含52.9小时用AI眼镜记录的日常活动,包括同步的RGB视频、音频转录、眼动追踪、IMU和SLAM轨迹。通过人工验证的标注流程,我们构建了4,853个有依据的问答对,涵盖物体和位置记忆、意图回忆、视觉场景回忆、时间线重建、对话记忆和上下文检索。每个问题以多项选择形式提出,并包含明确的“不可回答”选项以测试幻觉鲁棒性。对领先的代理框架和LLM骨干的基准测试表明,现有系统在现实世界记忆任务上仍远不可靠,凸显了需要新的架构来实现有依据的AI记忆,使其仅在证据充分时才能回答。参与者调查进一步支持我们的问题具有现实性、实用性,并与日常记忆需求一致。

英文摘要

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.

2606.00817 2026-06-02 cs.GR cs.CV 版本更新

Directed Distance Fields for Constant-Time Ray Queries on Gaussian Splatting

定向距离场:用于高斯泼溅的恒定时间射线查询

Subhankar Mishra

发表机构 * School of Computer Sciences, National Institute of Science Education and Research (NISER)(计算机科学学院,国家科学教育与研究研究所(NISER))

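AI summary: Distills a Directed Distance Function (DDF) that turns a trained 3D Gaussian Splatting scene into a ray oracle, enabling constant-time ray queries for secondary-ray tasks such as global illumination.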

详情
AI中文摘要

3D高斯泼溅(3DGS)实时渲染场景的新视图。与所有光栅化器一样,它只回答主射线,即从相机穿过图像的射线。它无法追踪阴影、环境遮挡和全局光照所需的二次射线。我们通过蒸馏定向距离函数(DDF)将训练好的3DGS场景转化为射线预言机。DDF是一个小型神经场。它接受由原点和方向给出的射线,并返回到第一个表面的距离以及射线是否击中任何物体。每次查询是一次前向传递。该场大小为52 MB,其大小不依赖于高斯数量,因此其成本和内存随场景增长而保持不变。我们提出三点。首先,我们研究DDF需要什么样的监督。从高斯渲染的深度太模糊,无法学习薄部分,而清晰的距离监督可以恢复它们。其次,我们测量速度。DDF比球体追踪等效的有符号距离场快26到72倍,并且与在高斯上构建的包围体积层次结构不同,即使在专用的RT核心硬件上,其查询时间和内存也不会随场景增长。第三,我们展示了一个不需要网格的流程:图像生成3DGS场景,神经表面提供清晰的距离,DDF从中学习。我们将DDF用作全局光照的二次射线预言机。它在142个对象上以30.3 dB的PSNR再现参考光线追踪阴影,以21.3 dB的PSNR再现环境遮挡,并在真实捕获场景上有效。我们的代码可在https://github.com/smlab-niser/ddf-gs获取。

英文摘要

3D Gaussian Splatting (3DGS) renders new views of a scene in real time. Like every rasterizer, it answers only primary rays, the rays from the camera through the image. It cannot trace the secondary rays that shadows, ambient occlusion, and global illumination need. We turn a trained 3DGS scene into a ray oracle by distilling a Directed Distance Function (DDF). The DDF is a small neural field. It takes a ray, given by an origin and a direction, and returns the distance to the first surface and whether the ray hits anything. Each query is one forward pass. The field is 52 MB, and its size does not depend on the number of Gaussians, so its cost and memory stay flat as the scene grows. We make three points. First, we study what supervision a DDF needs. Depth rendered from the Gaussians is too blurry to teach thin parts, while clean distance supervision recovers them. Second, we measure speed. The DDF is 26 to 72 times faster than sphere tracing an equivalent signed distance field, and unlike a bounding volume hierarchy built over the Gaussians, even on dedicated RT-core hardware, its query time and memory do not grow with the scene. Third, we show a pipeline that needs no mesh: images give a 3DGS scene, a neural surface gives clean distances, and the DDF learns from them. We use the DDF as a secondary-ray oracle for global illumination. It reproduces reference ray-traced shadows at 30.3 dB and ambient occlusion at 21.3 dB across 142 objects, and it also works on real captured scenes. Our code is available at https://github.com/smlab-niser/ddf-gs.
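The difference between a directed distance query and sphere tracing a signed distance field can be seen on a toy scene. In the sketch below a closed-form ray-sphere intersection stands in for the learned DDF (the paper's field is a neural network, which is not reproduced here); it answers a ray in one evaluation, while the sphere-tracing baseline loops until it converges or gives up. All function names are hypothetical, and directions are assumed to be unit length.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Signed distance from a point to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


def sphere_trace(origin, direction, sdf, max_steps=128, eps=1e-4, far=100.0):
    """Iterative baseline: step along the ray by the SDF value until a surface is reached."""
    t, steps = 0.0, 0
    for steps in range(1, max_steps + 1):
        point = [o + t * d for o, d in zip(origin, direction)]
        dist = sdf(point)
        if dist < eps:
            return t, True, steps
        t += dist
        if t > far:
            break
    return far, False, steps


def directed_distance(origin, direction, radius=1.0):
    """Stand-in for a DDF: (distance to first surface, hit flag) in a single evaluation."""
    b = sum(o * d for o, d in zip(origin, direction))
    c = sum(o * o for o in origin) - radius * radius
    disc = b * b - c
    if disc < 0:
        return float("inf"), False  # the ray misses the sphere
    t = -b - math.sqrt(disc)
    return (t, True) if t >= 0 else (float("inf"), False)


ray_origin, ray_dir = (0.0, 0.0, -3.0), (0.0, 0.0, 1.0)
print(directed_distance(ray_origin, ray_dir))
print(sphere_trace(ray_origin, ray_dir, sphere_sdf))
```

A ray that misses the object is the expensive case for sphere tracing, since it keeps stepping until it passes `far`, whereas the directed query returns its miss flag immediately.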

2606.00803 2026-06-02 astro-ph.CO cs.CV cs.LG 版本更新

Generative Diffusion Priors for 3D Mapping of the Dark Universe

用于暗宇宙三维映射的生成扩散先验

Brandon Zhao, Diana Scognamiglio, Olivier Doré, Katherine L. Bouman

发表机构 * Department of Computing and Mathematical Sciences, California Institute of Technology(加州理工学院计算与数学科学系) Jet Propulsion Laboratory, California Institute of Technology(加州理工学院喷气推进实验室) Department of Physics, Duke University(杜克大学物理系) Cahill Center for Astronomy and Astrophysics, California Institute of Technology(加州理工学院卡希尔天文与天体物理中心)

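AI summary: Learns a diffusion-model prior from cosmological simulations and combines it with a physical forward model to solve the 3D dark-matter inverse problem in weak gravitational lensing, substantially improving reconstruction accuracy and producing statistically consistent posterior samples.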

Comments Accepted to CVPR 2026 (Highlight)

详情
AI中文摘要

从弱引力透镜观测重建暗物质的三维分布是宇宙学中一个核心但高度病态的反问题。与多视角标准三维重建不同,我们通过单一视线方向观测宇宙,通过星系不确定距离的噪声形状畸变,因此有意义的三维物质场恢复需要强先验假设。现有方法要么使用手工先验产生点估计,要么使用神经集成进行近似贝叶斯不确定性,难以捕捉宇宙网的非高斯、纤维状结构。随着新的高分辨率宇宙学模拟的出现,我们现在有了另一种先验知识来源,其捕捉结构形成的非线性统计的保真度远高于解析公式。我们利用这些模拟构建了一个新数据集Conicus3D,使我们能够学习一个数据驱动的扩散模型先验,捕捉暗物质结构在宇宙时间内的完整三维分布。基于最近的即插即用方法,我们将基于扩散的后验采样方案修改为三维弱引力透镜设置,将学习到的先验与可微分的物理正向模型相结合。在针对现代弱引力透镜巡天的逼真模拟上,我们的方法在二维和三维重建精度上显著优于基线方法。此外,它产生的后验样本的统计量紧密跟踪底层模拟,同时对宇宙学参数的适度偏移保持鲁棒性。

英文摘要

Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset Conicus3D, which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we modify a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak-lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.
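The way a learned prior and a differentiable forward model combine in posterior sampling follows from Bayes' rule applied to the score. The block below states that identity with symbols chosen here for illustration (x for the 3D matter field, y for the observed shear, A for the lensing forward operator); the Gaussian noise model with variance \sigma^2 is an assumption, not something stated in the abstract, and practical samplers additionally approximate the likelihood term at intermediate noise levels.

```latex
% Score decomposition used by plug-and-play diffusion posterior samplers.
% Symbols (x, y, A, \sigma) are introduced here for illustration only.
\nabla_x \log p(x \mid y)
  = \underbrace{\nabla_x \log p(x)}_{\text{learned diffusion prior}}
  + \underbrace{\nabla_x \log p(y \mid x)}_{\text{physical forward model}},
\qquad
\log p(y \mid x) = -\frac{1}{2\sigma^{2}} \,\lVert y - A(x) \rVert_2^{2} + \text{const}.
```

Because A is differentiable, the second term can be obtained by backpropagating the data misfit through the forward model, while the first term comes from the trained diffusion network.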

2606.00798 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

DASH: 用于引导校准紧凑扩散模型的双分支分数蒸馏

Abdullah Al Shafi, Kazi Saeed Alam, Sk Imran Hossain, Engelbert Mephu Nguifo

发表机构 * Khulna University of Engineering & Technology(Khulna 工程与技术大学) University Clermont Auvergne(克莱蒙特-奥弗涅大学)

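AI summary: Addresses guidance failure caused by the unsupervised unconditional score branch when compressing class-conditional diffusion models, proposing DASH, a dual-branch distillation framework that supervises both branches independently and adds anchor regularization and curriculum transfer, keeping FID and guidance fidelity close to the teacher at 5.9x compression.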

Comments 14 pages, 7 figures, 4 tables; appendix with additional ablations and qualitative results

详情
AI中文摘要

类条件扩散模型的参数压缩揭示了输出级蒸馏中一个未被充分探索的局限性:无条件分数分支保持无监督,导致学生模型中无分类器引导差距欠定。该差距在每个去噪步骤中被放大,允许两个分支都崩溃为相同预测的退化解,使得引导在低输出级训练损失下无效。本文介绍了DASH,一种双分支蒸馏框架,独立监督两个分数分支,通过独立分支约束为每个训练样本唯一指定目标分支输出,并引入锚点项将条件预测正则化到真实噪声。该框架进一步引入了TIRT迁移,将教师收敛的每时间步重要性课程复制到学生中作为冻结先验,消除了在有限蒸馏预算内重新学习它的需要。在CIFAR-10和CIFAR-100上的实验表明,5.9倍压缩在50步DDIM采样下将质量保持在教师模型4个FID点以内,显著优于从头训练,且引导保真度良好保持。消融研究证实无条件监督是主要贡献,占总蒸馏增益的60%以上。课程迁移和锚点正则化提供互补收益,共同验证了双分支约束对于引导保持压缩的经验必要性。

英文摘要

Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditional score branch remains unsupervised, leaving the classifier-free guidance gap underdetermined in the student. This gap, amplified at every denoising step, admits degenerate solutions where both branches collapse toward identical predictions, rendering guidance ineffective despite low output-level training loss. This paper introduces DASH, a dual-branch distillation framework that independently supervises both score branches, uniquely specifying target branch outputs for each training sample through independent branch constraints, with an anchor term regularising conditional predictions toward ground-truth noise. The framework further introduces TIRT Transfer, which copies the teacher's converged per-timestep importance curriculum into the student as a frozen prior, eliminating the need to relearn it within limited distillation budgets. Experiments on CIFAR-10 and CIFAR-100 demonstrate that 5.9x compression maintains quality within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming training from scratch with guidance fidelity well preserved. Ablation studies confirm that unconditional supervision is the dominant contribution, accounting for over 60% of total distillation gain. Curriculum transfer and anchor regularisation provide complementary benefit, together validating dual-branch constraints as empirically essential for guidance-preserving compression.
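The degenerate solution described in the abstract is easy to reproduce numerically. The sketch below uses one common convention for classifier-free guidance (unconditional prediction plus a scaled gap) and a plain sum of two mean-squared errors as a stand-in for dual-branch supervision; the actual DASH objective also includes an anchor term and TIRT Transfer, which are not reproduced, and all names and toy values here are hypothetical.

```python
def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: move from the unconditional prediction along the gap."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def dual_branch_loss(student_c, student_u, teacher_c, teacher_u):
    """Supervise both branches so the conditional/unconditional gap is pinned down."""
    return mse(student_c, teacher_c) + mse(student_u, teacher_u)


# Toy teacher predictions (made up): the two branches differ, so guidance has an effect.
teacher_c, teacher_u = [1.0, 0.0], [0.0, 0.0]
# Collapsed student: both branches output the conditional prediction.
student_c, student_u = [1.0, 0.0], [1.0, 0.0]

print(cfg(teacher_c, teacher_u, 3.0))   # guidance amplifies the gap
print(cfg(student_c, student_u, 3.0))   # no gap, so the guidance scale does nothing
print(mse(student_c, teacher_c))        # conditional-only loss cannot see the collapse
print(dual_branch_loss(student_c, student_u, teacher_c, teacher_u))
```

The collapsed student matches the teacher's conditional output exactly, so a loss on that output alone is zero, yet its guided prediction is independent of the guidance scale; only the term on the unconditional branch exposes the problem.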

2606.00784 2026-06-02 cs.CV 版本更新

DINO-GFSA: Geo-Localization via Semantic Gated Fusion and Mamba-based Sequential Aggregation

DINO-GFSA:基于语义门控融合和Mamba序列聚合的地理定位

Beier Hu, Yuanshen Guo, Jialu Cai, Chengwei Li, Yong Wang, Shunan Wu, Zhigang Wu

发表机构 * School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen, China(中山大学航空航天学院,深圳,中国)

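AI summary: Proposes the DINO-GFSA framework, which combines a LoRA-adapted DINOv3 backbone, a Semantic Gated Residual Fusion module, and a Mamba-based Sequential Aggregation Head to achieve state-of-the-art performance in UAV cross-view geo-localization.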

详情
AI中文摘要

跨视角地理定位(CVGL)对于无人机在无GNSS环境下的自定位和目标定位至关重要。然而,在保留细粒度空间细节的同时获取鲁棒语义仍然具有挑战性。为此,我们提出DINO-GFSA框架,利用LoRA(低秩适配)适配的DINOv3(ViTL)骨干网络实现参数高效、高容量的表示。关键地,我们引入了语义门控残差融合模块,利用高层语义选择性校准和整合低层空间线索,有效弥合语义鸿沟。此外,设计了基于Mamba的序列聚合头,以线性复杂度捕获长距离空间依赖。实验表明,在University-1652和DenseUAV基准上取得了最先进性能,特别是在DenseUAV上Recall@1比之前最佳方法高出3.48%。这些结果验证了DINO-GFSA作为无人机CVGL通用鲁棒解决方案的有效性。

英文摘要

Cross-view geo-localization (CVGL) is critical for Unmanned Aerial Vehicle (UAV) self-positioning and target localization in GNSS-denied environments. However, acquiring robust semantics while preserving fine-grained spatial details remains challenging. To address this, we propose DINO-GFSA, a framework leveraging a LoRA-adapted (Low-Rank Adaptation) DINOv3 (ViT-L) backbone for parameter-efficient, high-capacity representation. Crucially, we introduce a Semantic Gated Residual Fusion module, which utilizes high-level semantics to selectively calibrate and integrate low-level spatial cues, effectively bridging the semantic gap. Furthermore, a Mamba-based Sequential Aggregation Head is designed to capture long-range spatial dependencies with linear complexity. Experiments demonstrate state-of-the-art performance on the University-1652 and DenseUAV benchmarks, notably surpassing the previous best on DenseUAV by 3.48% on Recall@1. These results validate DINO-GFSA as a generalized, robust solution for UAV CVGL.

2606.00782 2026-06-02 cs.CV 版本更新

FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection

FlowOVD: 学习生成式潜在流用于零样本开放词汇检测

Yao Wei, Andrea Cavallaro, Changjae Oh

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) EPFL(洛桑联邦理工学院)

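AI summary: Proposes FlowOVD, a text-conditioned query generation framework based on rectified flow that performs open-vocabulary detection through continuous latent query dynamics, reaching 49.5 AP on COCO and 31.5 AP on LVIS and outperforming GroundingDINO.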

详情
AI中文摘要

开放词汇目标检测(OVD)通过大规模视觉-语言预训练取得了显著进展。然而,现有方法通常将OVD表述为判别性预测问题,其中解码器查询要么是静态的,要么从编码器特征初始化,从而限制了其多样性和灵活性。在本文中,我们引入生成视角,将解码器查询生成建模为潜在空间中的连续传输过程。我们提出FlowOVD,一种基于修正流的文本条件查询生成框架,逐步将文本无关的查询转换为文本引导的查询。通过将连续潜在查询动态引入基于视觉-语言模型(VLM)的检测器,我们的方法避免了启发式离散查询构建,并为开放词汇检测实现了更具表现力的语义对齐。无需额外训练数据,FlowOVD在COCO上达到49.5 AP,在LVIS上达到31.5 AP,分别比GroundingDINO高出+1.2 AP(+2.5%)和+4.1 AP(+15.0%)。在具有挑战性的长尾LVIS基准上的更大增益进一步凸显了连续查询生成对开放词汇泛化的有效性。

英文摘要

Open-vocabulary object detection (OVD) has achieved remarkable progress through large-scale vision-language pre-training. Existing methods, however, typically formulate OVD as a discriminative prediction problem, where decoder queries are either static or initialized from encoder features, thus limiting their diversity and flexibility. In this paper, we introduce a generative perspective by modeling decoder query generation as a continuous transport process in latent space. We propose FlowOVD, a text-conditioned query generation framework based on rectified flow that progressively transforms text-agnostic queries into text-guided queries. By introducing continuous latent query dynamics into a vision-language model (VLM)-based detector, our method avoids heuristic discrete query construction and enables more expressive semantic alignment for open-vocabulary detection. Without requiring additional training data, FlowOVD achieves 49.5 AP on COCO and 31.5 AP on LVIS, outperforming GroundingDINO by +1.2 AP (+2.5%) and +4.1 AP (+15.0%), respectively. The larger gain on the challenging long-tailed LVIS benchmark further highlights the effectiveness of continuous query generation for open-vocabulary generalization.
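The rectified-flow mechanics behind the "continuous transport" of queries can be sketched in a few lines: a straight-line interpolation between a source and a target vector, a constant velocity as the regression target, and Euler integration at inference. This is the generic rectified-flow recipe, not FlowOVD's implementation; in the paper the velocity is predicted by a learned, text-conditioned network, whereas here an oracle velocity function is used, and the two-dimensional "queries" are made-up toy values.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (source) to x1 (target)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Rectified-flow regression target: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_transport(x0, velocity_fn, steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy 2-D vectors standing in for a text-agnostic and a text-guided query.
agnostic_query, guided_query = [0.0, 1.0], [2.0, -1.0]


def oracle_velocity(x, t):
    # Stand-in for the learned network: returns the exact straight-line velocity.
    return target_velocity(agnostic_query, guided_query)


print(interpolate(agnostic_query, guided_query, 0.5))
print(euler_transport(agnostic_query, oracle_velocity))
```

Because the ideal path is a straight line, even a handful of Euler steps lands on the target when the velocity is accurate, which is the usual argument for rectified flow over curved diffusion trajectories.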

2606.00775 2026-06-02 cs.CV cs.AI 版本更新

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

GIRL-DETR: 梯度隔离强化学习用于视频时刻检索

Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang, Wei Ji

发表机构 * College of Electronics and Information Engineering, Sichuan University(四川大学电子信息工程学院) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院)

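AI summary: Addresses optimization stagnation in video moment retrieval caused by the mismatch between continuous surrogate losses and non-differentiable metrics, proposing GIRL-DETR, a gradient-isolated reinforcement learning framework that freezes the backbone and directly optimizes the tIoU metric with a three-stage progressive reinforcement learning strategy, improving localization accuracy in lightweight models.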

Comments 13 pages, 6 figures. Submitted to IEEE Transactions on Image Processing (TIP). Code is available at: https://github.com/Z-Shihang/GIRL-DETR

详情
AI中文摘要

视频时刻检索(VMR)任务要求精确定位与自然语言查询对齐的时间边界,但许多模型存在连续代理损失与非可微指标之间的不匹配,导致训练后期优化停滞,边界预测陷入次优解。尽管强化学习(RL)后训练成功优化了大模型的定位结果,但直接应用于轻量级网络容易破坏监督阶段建立的脆弱特征表示。为克服这一优化瓶颈,我们提出梯度隔离强化学习用于DETR(GIRL-DETR),首次将RL后训练引入轻量级时间定位框架。输入视频和文本特征首先通过跨模态交互(CMI)在进入Transformer编码器之前建立早期对齐。随后,文本引导门控(TGG)机制在Transformer解码器生成候选提案之前动态地将语义先验注入查询,为时间预测提供高信噪比输入。在监督训练达到收敛后,冻结骨干网络以保护特征流形,而检测头通过三阶段渐进强化学习(TPRL)策略直接优化非可微评估指标tIoU以提升定位精度。该方法实现了状态表示与指标优化的正交解耦。在Charades-STA、QVHighlights和TACoS上的实验表明,GIRL-DETR有效解决了代理损失退化问题,以最少的参数更新实现了显著的精度提升,为轻量级VMR模型中的RL应用提供了稳健的新途径。

英文摘要

The Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models suffer from a misalignment between continuous surrogate losses and non-differentiable metrics. This misalignment leads to optimization stagnation during the late stages of training and traps boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU to enhance localization accuracy through a Three-stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.
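For readers unfamiliar with the metric being optimized, the sketch below computes temporal IoU between two segments and a thresholded recall built on top of it. The thresholded recall is a step function of the predicted boundaries, which is why it cannot be optimized by gradient descent directly and why a reward-based method is a natural fit. How GIRL-DETR turns tIoU into a reward across its three stages is not described in the abstract, so only the metric is shown; function names are hypothetical.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_threshold(preds, gts, threshold=0.5):
    """Share of queries whose top prediction reaches the tIoU threshold."""
    hits = sum(tiou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Toy predictions and ground-truth segments (made up for illustration).
predictions = [(2.0, 8.0), (0.0, 1.0)]
ground_truth = [(4.0, 10.0), (5.0, 6.0)]
print([tiou(p, g) for p, g in zip(predictions, ground_truth)])
print(recall_at_threshold(predictions, ground_truth))
```

Nudging a boundary by a fraction of a second usually leaves the thresholded recall unchanged, so a surrogate regression loss can keep decreasing without the reported metric moving at all.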

2606.00751 2026-06-02 cs.CV 版本更新

Head-Pose-Aware Visual Speech Recognition with FiLM Modulation

基于FiLM调制的头部姿态感知视觉语音识别

Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

发表机构 * Department of Artificial Intelligence, Kyushu Institute of Technology(人工智能系,九州工业大学)

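AI summary: Proposes the HP-VSR-ResFiLM framework, which explicitly incorporates head-pose information through a pose-conditioned residual FiLM module, reaching word error rates of 25.0% on LRS2 and 33.2% on LRS3 and improving the robustness of visual speech recognition under non-frontal views.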

Comments 27 pages, 4 figures

详情
AI中文摘要

视觉语音识别(VSR)旨在从唇部运动等视觉线索中识别语音,但其性能从根本上受到音素模糊性和姿态引起的变化(引入几何畸变和遮挡)的限制。现有方法主要依赖语言上下文或隐式不变性,导致非正面视角下的视觉表示不够鲁棒。本文提出一个姿态感知的音素级框架HP-VSR-ResFiLM,显式地将头部姿态信息融入视觉特征提取。该框架采用两阶段流水线:阶段1为姿态条件视觉编码器,阶段2使用预训练NLLB语言模型进行音素到文本重建。具体地,阶段1在2D CNN前端后引入姿态条件残差特征线性调制(FiLM)块,利用头部姿态信息自适应地优化视觉表示。在LRS2和LRS3上的实验表明,HP-VSR-ResFiLM在可比训练条件下取得了竞争性性能,无需额外训练数据即分别达到25.0%和33.2%的词错误率(WER)。消融研究进一步显示,单个残差FiLM块持续改善整体WER,而第3层和第4层的更深层调制为偏航角大于30°的样本带来更大增益,且不降低小姿态变化样本的性能。这些发现表明,显式的姿态感知特征调制为在无约束场景下提升VSR鲁棒性提供了一种有效且计算高效的解决方案。

英文摘要

Visual Speech Recognition (VSR) aims to recognize speech from visual cues such as lip movements, but its performance is fundamentally limited by viseme ambiguity and pose-induced variations that introduce geometric distortions and occlusions. Existing approaches mainly rely on linguistic context or implicit invariance, leaving visual representations insufficiently robust under non-frontal views. In this work, we propose a pose-aware phoneme-level framework, termed HP-VSR-ResFiLM, that explicitly incorporates head-pose information into visual feature extraction. The proposed framework adopts a two-stage pipeline consisting of a pose-conditioned visual encoder in Stage 1 and a pretrained NLLB language model in Stage 2 for phoneme-to-text reconstruction. Specifically, Stage 1 incorporates a pose-conditioned residual Feature-wise Linear Modulation (FiLM) block after the 2D CNN frontend to adaptively refine visual representations using head-pose information. Experiments on LRS2 and LRS3 demonstrate that HP-VSR-ResFiLM achieves competitive performance under comparable training conditions, attaining word error rates (WER) of 25.0% and 33.2%, respectively, without relying on additional training data. Ablation studies further show that a single residual FiLM block consistently improves overall WER, while deeper modulation at Layers 3 and 4 provides larger gains for samples with yaw angles greater than 30° without degrading performance for smaller pose variations. These findings demonstrate that explicit pose-aware feature modulation offers an effective and computationally efficient solution for improving VSR robustness in unconstrained settings.
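Feature-wise Linear Modulation scales and shifts each feature channel with parameters predicted from a conditioning signal, here the head pose. The sketch below shows one plausible residual variant, output = x + gamma * x + beta, with gamma and beta produced by a linear map of (yaw, pitch, roll). The exact residual form, the linear pose encoder, and all parameter names are assumptions for illustration; the abstract only states that a pose-conditioned residual FiLM block follows the 2D CNN frontend.

```python
def residual_film(features, pose, w_gamma, b_gamma, w_beta, b_beta):
    """Apply pose-conditioned residual FiLM to a list of per-channel activations.

    pose is (yaw, pitch, roll); w_gamma[c] and w_beta[c] hold one weight per pose angle.
    """
    modulated = []
    for c, x in enumerate(features):
        gamma = sum(w * p for w, p in zip(w_gamma[c], pose)) + b_gamma[c]
        beta = sum(w * p for w, p in zip(w_beta[c], pose)) + b_beta[c]
        modulated.append(x + gamma * x + beta)  # residual path keeps x intact
    return modulated


features = [1.0, 2.0]
pose = (0.5, 0.0, 0.0)  # toy yaw-only pose, arbitrary units
zeros_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zeros_b = [0.0, 0.0]

# With all-zero modulation parameters the block is the identity.
print(residual_film(features, pose, zeros_w, zeros_b, zeros_w, zeros_b))

# Let yaw scale only the first channel.
yaw_gain = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(residual_film(features, pose, yaw_gain, zeros_b, zeros_w, zeros_b))
```

Initialising the modulation parameters at zero makes the block start as an identity mapping, so it can be inserted into a pretrained encoder without disturbing frontal-view behaviour before training adjusts it.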

2606.00746 2026-06-02 cs.CV cs.LG 版本更新

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

将并行序列模型扩展到基础规模的视觉编码器

Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

发表机构 * NVIDIA The Chinese University of Hong Kong(香港中文大学) The University of Hong Kong(香港大学) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出C-GSPN,一种基于2D空间传播的基础规模视觉编码器,通过快速CUDA内核、压缩潜在空间传播块和两阶段交叉算子蒸馏,在减少参数的同时提升性能并实现高效推理。

详情
AI中文摘要

视觉基础模型受限于自注意力的二次成本,这限制了可用分辨率并增加了大规模预训练的成本。次二次替代方案如线性注意力和状态空间模型降低了这一成本,但通常将图像序列化为1D令牌流,削弱了对视觉重要的2D空间结构。广义空间传播网络(GSPN)通过线扫描递归直接在2D网格上传播上下文,实现了接近线性的复杂度且无需位置嵌入,但很少用作基础规模的编码器。我们提出C-GSPN,一种基于2D空间传播的基础规模视觉编码器。C-GSPN通过三项改进使该算子实用化:(1)一个快速的GSPN CUDA内核,将每步启动融合为单个warp专用实现,采用共享内存分块、合并访问和紧凑的多通道传播,达到峰值内存带宽的90%以上,运行速度比原始GSPN实现快40-52倍;(2)一个带有融合归一化的压缩潜在空间传播块,将内核级速度转化为块级和模型级效率;(3)一个两阶段交叉算子蒸馏方案,从注意力教师训练新架构,无需从头开始进行基础规模训练的成本。使用6亿图像-文本对进行蒸馏,C-GSPN以少15%的参数匹配同构ViT基线,在ADE20K分割上提升+2.1%,以极少的数据迁移到高分辨率,并在2K分辨率下通过单次无分块推理实现4倍的端到端块加速。

英文摘要

Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-space models reduce this cost, but often serialize images into 1D token streams and weaken the 2D spatial structure important for vision. Generalized Spatial Propagation Networks (GSPN) instead propagate context directly on the 2D grid through line-scan recurrences, achieving near-linear complexity without positional embeddings, but have seen little use as foundation-scale encoders. We present C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation. C-GSPN makes the operator practical through three improvements: (1) a fast GSPN CUDA kernel that fuses per-step launches into a single warp-specialized implementation with shared-memory tiling, coalesced access, and a compact multi-channel propagation, reaching over 90% of peak memory bandwidth and running up to 40--52x faster than the original GSPN implementation; (2) a compressed latent-space propagation block with fused normalization, which turns kernel-level speed into block- and model-level efficiency; and (3) a two-stage cross-operator distillation recipe that trains the new architecture from an attention teacher without the cost of from-scratch foundation-scale training. Distilled with 600M image-text pairs, C-GSPN matches an isomorphic ViT baseline with 15% fewer parameters, improves ADE20K segmentation by +2.1%, transfers to high resolution with a fraction of the data needed from scratch, and delivers a 4x end-to-end block speedup at 2K with single-pass, tiling-free inference.
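
The idea of propagating context on the 2D grid through a line-scan recurrence can be illustrated with the simplified sketch below. The three-neighbour connectivity, the gate `lam`, and the function name are assumptions made for illustration; the fused CUDA kernel, normalization, and compressed latent block described in the abstract are not modelled here.

```python
def line_scan_top_to_bottom(x, w, lam):
    """Propagate context down the rows of a 2D grid, one row per step.

    x:   H x W input grid (list of lists of floats).
    w:   H x W x 3 weights for the (left, centre, right) neighbours in the
         previous row (assumed three-neighbour connectivity).
    lam: H x W input gates.
    Cost is O(H * W): each pixel reads a constant number of neighbours.
    """
    height, width = len(x), len(x[0])
    h = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = lam[i][j] * x[i][j]
            if i > 0:
                for k, dj in enumerate((-1, 0, 1)):
                    jj = j + dj
                    if 0 <= jj < width:
                        acc += w[i][j][k] * h[i - 1][jj]
            h[i][j] = acc
    return h
```

Because every pixel touches a fixed number of neighbours, the work grows linearly with the number of pixels, in contrast to the quadratic growth of full self-attention over the same tokens.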

2606.00738 2026-06-02 cs.LG cs.AI cs.CV 版本更新

SORA: Free Second-Order Attacks in Fast Adversarial Training

SORA:快速对抗训练中的免费二阶攻击

Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering, Sharif University of Technology, Tehran, Iran(谢里夫理工大学计算机工程系)

AI总结 针对快速对抗训练中的灾难性过拟合问题,提出通过扰动变异性和梯度对齐指标PertAlign来预测并防止过拟合,并设计自适应步长方法SORA,实现最优鲁棒性和干净准确率。

Comments Accepted at ICML 2026

详情
AI中文摘要

对抗训练(AT)是防御对抗样本的主要手段,但其高效的单步变体常常遭受灾难性过拟合(CO),即尽管单步性能很高,对多步攻击的鲁棒性却崩溃。我们通过两个贡献来解决这种失效模式。首先,我们形式化了epsilon过拟合(EO),即固定的扰动幅度和方向会加剧CO这一视角,并表明引入扰动变异性可以显著提高不同架构和数据集上的鲁棒泛化能力。其次,我们提出了PertAlign(扰动对齐),这是一种理论上合理、计算开销可忽略的指标,通过测量攻击阶段之间的梯度对齐来预测CO的发生。利用这些见解,我们引入了SORA,一种自适应步长的AT方法,它根据损失曲面几何动态调整扰动。SORA始终能防止CO,实现最先进的鲁棒性和干净准确率,并使用一组固定的超参数在数据集和架构上泛化,这对于快速AT的适用性至关重要。在不同数据集和架构上的大量实验表明,SORA在提供更高干净准确率和卓越效率的同时,匹配或超越了先前方法的鲁棒性。代码可在https://github.com/SecondOrderAT/SORA获取。

英文摘要

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.
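
Measuring gradient alignment across attack stages can be illustrated with a plain cosine similarity, sketched below. The exact definition of PertAlign is not given in the abstract, so the choice of cosine similarity and the function name `gradient_alignment` are assumptions, not the paper's metric.

```python
import math


def gradient_alignment(grad_a, grad_b):
    """Cosine similarity between two flattened input gradients.

    Returns a value in [-1, 1]; 0.0 is returned if either gradient is zero.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A monitor of this kind only needs gradients that single-step training already computes, which is consistent with the abstract's description of the metric as computationally negligible.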

2606.00712 2026-06-02 cs.CV 版本更新

CASTLE2026 Team WDL Technical Report

CASTLE2026 团队 WDL 技术报告

Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li, Xu Liu

发表机构 * Key Laboratory of Intelligent Perception and Image Understanding(智能感知与图像理解重点实验室)

AI总结 提出基于 Qwen 的证据感知多模态推理流程,通过提示路由和置信度加权投票解决长视频问答,在 CASTLE 挑战赛中排名第一。

Comments 4 pages

详情
AI中文摘要

CASTLE 挑战赛 @ EgoVis 2026 评估基于 600 多小时多视角记录的长格式自我中心视频问答。每个四选一问题需要来自视频、转录、辅助照片、人物、天数、房间和时间上下文的证据。我们提出了一种基于 Qwen 的证据感知多模态推理流程。我们的系统解析问题提示、检索 ASR 片段、附加辅助图像、采样候选视频帧,并将问题路由到静态视觉、语音/文本、时间和混合类型,并附带专门提示。多次推理通过置信度加权投票进行聚合,并转换为官方 Codabench 格式。在消融实验中,LoRA 将得分从 0.21 提升至 0.50,更多采样帧进一步将其提升至 0.58。我们的最终系统在 CASTLE 挑战赛 @ EgoVis 2026 中排名第一。

英文摘要

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.
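
The aggregation of multiple inference passes by confidence-weighted voting can be sketched as follows. The data layout (a list of answer and confidence pairs) and the tie-breaking rule are assumptions for illustration; the report does not describe them.

```python
from collections import defaultdict


def confidence_weighted_vote(passes):
    """Pick the answer with the largest summed confidence.

    passes: list of (choice, confidence) pairs, one per inference pass.
    Ties go to the alphabetically first choice so the result is deterministic.
    """
    totals = defaultdict(float)
    for choice, confidence in passes:
        totals[choice] += confidence
    return max(sorted(totals), key=lambda c: totals[c])
```

Compared with a simple majority vote, this lets two moderately confident passes outvote one highly confident pass only when their combined confidence is larger.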

2606.00704 2026-06-02 cs.CV 版本更新

VICR: Visual In-Context Restoration for Real-World Image Super-Resolution

VICR: 面向真实图像超分辨率的视觉上下文恢复

Qichang Zhang, Hailong Wang, Baiang Li, Linhao Wang, Rong Fu, Erkang Cheng, Simon James Fong

发表机构 * Faculty of Science and Technology, University of Macau(澳门大学科技学院) Nullmax Hefei University of Technology(合肥工业大学) Shandong Normal University(山东师范大学)

AI总结 提出基于扩散变换器的视觉上下文恢复框架,通过解耦的视觉先验注入机制将真实图像超分辨率建模为图像补全,实现结构保真与细节合成的平衡。

Comments 28 pages, 11 figures, 9 tables

详情
AI中文摘要

真实世界图像超分辨率(Real-ISR)需要在对退化观测的结构保真度与逼真的细节合成之间取得平衡。然而,现有的生成式Real-ISR方法通常依赖于纠缠的条件机制,导致结构漂移或语义不一致的细节。为了解决这个问题,我们提出了视觉上下文恢复(VICR),一种基于扩散变换器(DiT)的框架,将Real-ISR表述为图像补全。具体来说,我们引入了一种解耦的视觉先验注入机制,从低质量(LQ)图像中提取局部和全局线索:局部线索有助于恢复图像结构并支持高频细节合成,而全局线索指导整体生成并促进语义一致性。对于严重退化下的模糊区域,VICR采用推理阶段的智能体,利用LQ输入的视觉证据优化语义提示,同时保持模型参数固定。实验表明,VICR仅用127M可训练参数就在多个Real-ISR基准上实现了最先进的性能。

英文摘要

Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.

2606.00694 2026-06-02 cs.CV 版本更新

FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation

FROST-STA: 用于Ego4D短期物体交互预测的冻结密集特征

Chaoyang Wang, Lexuan Xu

发表机构 * Beihang University(北京航空航天大学)

AI总结 提出FROST-STA模型,利用冻结的密集图像-视频特征和对象中心解码,在Ego4D短期物体交互预测挑战中取得第二名。

详情
AI中文摘要

第一人称视频中的短期预测需要超越对当前场景的识别:系统必须推断摄像头佩戴者将接触哪个物体、将执行什么动作以及接触将在多久后发生。本报告描述了FROST-STA,我们提交至EgoVis 2026 Ego4D短期物体交互预测(STA)挑战的方案。对于每个查询时间,模型输出一组排序的结构化假设,包含主动物体框、名词标签、动词标签、接触时间(TTC)和置信度。FROST-STA基于V-JEPA 2.1 STA评估协议,但通过使用对象中心解码、多头预测以及面向提交的训练和集成方案,使其适应挑战。我们冻结V-JEPA 2.1 ViT-G骨干网络,提取两个密集token流:来自查询前缩放至384像素的短视频片段的视频token,以及来自最后观察到的高分辨率帧的图像token。一个紧凑的对齐模块,由注意力探针和帧引导的时间池化组成,将片段表示映射到最后一帧的空间参考上,然后与图像特征融合。融合后的特征图由Faster R-CNN风格的STA头解码,估计框偏移、名词、动词、TTC值和交互质量。对于最终排行榜提交,我们使用官方训练集加上额外允许的验证标注训练25个epoch,并组合来自8个头和epoch 15-25的检查点的预测。FROST-STA在官方测试服务器上获得5.13总体Top-5 mAP,在挑战中排名第二,表明冻结的密集图像-视频特征可以作为物体级交互预测的坚实基础。

英文摘要

Short-term anticipation in egocentric video requires more than recognizing the current scene: a system must infer which object the camera wearer will contact, which action will follow, and how soon the contact will happen. This report describes FROST-STA, our submission to the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For each query time, the model produces a ranked set of structured hypotheses containing an active-object box, noun label, verb label, time-to-contact (TTC), and confidence. FROST-STA builds on the V-JEPA 2.1 STA evaluation protocol, but adapts it to the challenge by using object-centric decoding, multi-head prediction, and a submission-oriented training and ensembling recipe. We keep the V-JEPA 2.1 ViT-G backbone fixed and extract two dense token streams: video tokens from a short clip resized to 384 pixels before the query, and image tokens from the last observed high-resolution frame. A compact alignment module, consisting of an attentive probe and frame-guided temporal pooling, maps the clip representation onto the spatial reference of the final frame before fusing it with image features. The fused maps are decoded by Faster R-CNN-style STA heads that estimate box offsets, nouns, verbs, TTC values, and interaction quality. For the final leaderboard entry, we train for 25 epochs with the official training split plus additional permitted validation annotations, and combine predictions across eight heads and checkpoints from epochs 15-25. FROST-STA obtains 5.13 Overall Top-5 mAP on the official test server, ranking second in the challenge and showing that frozen dense image-video features can serve as a strong basis for object-level interaction forecasting.

2606.00689 2026-06-02 cs.CV 版本更新

Wavelet-Fusion Diffusion Model for Multimodal Brain MRI Synthesis with Modality and Metadata Conditioning

小波融合扩散模型用于多模态脑MRI合成,具有模态和元数据条件

Muhammad Nabi Yasinzai, Remika Mito, Mangor Pedersen

发表机构 * Department of Psychology & Neuroscience, Auckland University of Technology(奥克兰理工大学心理学与神经科学系) Department of Psychiatry, University of Melbourne(墨尔本大学精神病学系)

AI总结 提出一种小波融合扩散模型(WFDM),结合小波融合变分自编码器(WF-VAE)和条件3D U-Net扩散模型,通过显式模态和元数据条件实现多模态脑MRI合成,解决了数据集模态覆盖不均和异质性问题,在分布对齐上优于现有方法。

Comments 51 pages, 7 figures, including supplementary material. Submitted to Imaging Neuroscience

详情
AI中文摘要

多模态MRI为神经影像分析提供互补信息,不同成像模态捕获独特的解剖、组织和病理特征,支持下游AI应用的开发和评估。尽管大规模结构MRI资源日益可用,但公共和汇集神经影像数据集的模态覆盖往往不均匀。这种不均匀的模态覆盖因站点、扫描仪和采集协议之间的异质性,以及跨研究通常稀疏、不一致记录或不可用的人口统计学和临床变量而进一步复杂化。合成MRI生成可以通过合成目标模态体积用于数据集增强和受控合成队列创建,帮助解决这种不平衡。然而,许多现有的MRI合成方法在狭窄的模态集或相对同质的队列上训练,限制了它们对大型汇集神经影像资源的适用性,其中模态可用性、采集协议和元数据覆盖在不同数据集之间差异很大。扩散模型因其强大的样本保真度和多样性而成为MRI合成的一种有吸引力的方法,但直接在3D体素空间采样在推理时计算昂贵且缓慢。潜在扩散通过在学习的3D潜在空间中合成MRI提高了实用性,尽管生成质量取决于自编码器的重建保真度和由此产生的潜在分布。我们的方法将小波融合变分自编码器(WF-VAE)潜在压缩器与在学习的潜在空间中训练的、使用显式模态和元数据条件的条件3D U-Net扩散模型相结合。我们提出的Wavelet-Fusion Diffusion Model (WFDM) 在评估的合成MRI生成器中实现了最强的分布对齐。

英文摘要

Multimodal MRI provides complementary information for neuroimaging analysis, where different imaging modalities capture distinct anatomical, tissue, and pathological features that support the development and evaluation of downstream AI applications. Although large-scale structural MRI resources are increasingly available, their modality coverage is often uneven across public and pooled neuroimaging datasets. This uneven modality coverage is further complicated by heterogeneity across sites, scanners, and acquisition protocols, as well as demographic and clinical variables that are often sparse, inconsistently recorded, or unavailable across studies. Synthetic MRI generation can help address this imbalance by synthesizing target-modality volumes for dataset augmentation and controlled synthetic cohort creation. However, many existing MRI synthesis approaches are trained on narrow modality sets or relatively homogeneous cohorts, limiting their applicability to large pooled neuroimaging resources where modality availability, acquisition protocols, and metadata coverage vary substantially across datasets. Diffusion models have become an attractive approach for MRI synthesis because of their strong sample fidelity and diversity, but sampling directly in 3D voxel space is computationally expensive and slow at inference. Latent diffusion improves practicality by synthesizing MRI in a learned, 3D latent space, although generation quality depends on the autoencoder's reconstruction fidelity and the resulting latent distribution. Our approach combines a Wavelet-Fusion variational autoencoder (WF-VAE) latent compressor with a conditional 3D U-Net diffusion model trained in the learned latent space using explicit modality and metadata conditioning. Our proposed Wavelet-Fusion Diffusion Model (WFDM) achieved the strongest distributional alignment among the evaluated synthetic MRI generators.

2606.00688 2026-06-02 cs.CV 版本更新

Shape-Prior-Based Point Cloud Completion for Single-Stage Fully Sparse 3D Object Detection

基于形状先验的点云补全用于单阶段全稀疏3D目标检测

Kaizheng Wang, Mingqian Ji, Jian Yang, Shanshan Zhang

发表机构 * School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院)

AI总结 针对单阶段全稀疏3D检测器中点云稀疏和不完整的问题,提出一种基于形状先验的点云补全方法,通过实例选择和对齐补全模块显著提升检测性能。

详情
AI中文摘要

单阶段全稀疏3D目标检测器依赖点云数据在自动驾驶场景中检测目标。然而,点云的稀疏性和不完整性严重限制了3D目标检测的性能。为解决此问题,本文提出一种专门针对单阶段全稀疏检测器的点云补全方法。整个基于形状先验的补全过程由两个连续步骤组成。第一步,我们设计了一个新颖的实例选择模块,即使在基线模型未生成提议的情况下,也能识别对应前景目标的点云,同时有效忽略背景区域的点云。第二步,我们引入了一种新颖的基于对齐的点补全模块,该模块将前景目标的点云在中心和朝向上与原型对齐。随后,从原型中选择点来填充前景目标的缺失部分。我们在KITTI数据集上使用两个单阶段全稀疏检测器评估了我们的方法。实验结果表明,所提方法显著提升了检测性能,证实了其有效性和泛化能力。

英文摘要

Single-stage fully sparse 3D object detectors rely on point cloud data to detect objects in autonomous driving scenarios. However, the sparsity and incompleteness of point clouds significantly limit the performance of 3D object detection. To address this issue, this paper proposes a point cloud completion method specifically designed for single-stage fully sparse detectors. The shape-prior-based completion process consists of two consecutive steps. In the first step, we design a novel Instance Selection module, which identifies the points belonging to foreground objects even when the baseline model does not generate proposals, while effectively ignoring the points of background regions. In the second step, we introduce a novel Alignment-Based Point Completion module, which aligns the point clouds of foreground objects with prototypes in terms of both their centers and orientations. Subsequently, points are selected from the prototype to fill in the missing parts of the foreground object. We evaluate our method on two single-stage fully sparse detectors using the KITTI dataset. The experimental results demonstrate that the proposed method significantly improves detection performance, confirming its effectiveness and generalizability.
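
Aligning an object and a prototype in both center and orientation amounts to a rigid transform. The sketch below places a canonical prototype into an object's frame by a rotation about the vertical axis followed by a translation. Restricting orientation to a single yaw angle, the direction of the alignment, and the function name are assumptions for illustration; the abstract does not specify them.

```python
import math


def align_prototype(prototype_points, center, yaw):
    """Rotate prototype points about the z axis by yaw, then translate to center.

    prototype_points: iterable of (x, y, z) in the prototype's canonical frame.
    center: (cx, cy, cz) of the detected foreground object.
    yaw: object heading in radians (assumed to be the only rotation needed).
    """
    c, s = math.cos(yaw), math.sin(yaw)
    cx, cy, cz = center
    aligned = []
    for x, y, z in prototype_points:
        aligned.append((c * x - s * y + cx, s * x + c * y + cy, z + cz))
    return aligned
```

Once the prototype sits in the object's frame, prototype points that fall in regions with no observed points can be selected to fill in the missing parts, as the second step describes.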

2606.00676 2026-06-02 cs.CV 版本更新

A Modelling and Evaluation Framework for EuroCrops-Driven Sentinel-2 Crop Segmentation

基于EuroCrops驱动的Sentinel-2作物分割的建模与评估框架

Alexandra Nicoleta Scarlat, Ioana Cristina Plajer, Alexandra Baicoianu

发表机构 * Transilvania University of Braşov(布拉索夫特兰西瓦尼亚大学)

AI总结 提出一个可配置的流水线,利用EuroCrops标注和Sentinel-2影像生成语义分割数据集,并训练U-Net模型评估其在域内和域外数据集上的性能。

详情
AI中文摘要

本工作提出了一个可配置的流水线,用于从Sentinel-2影像和EuroCrops地块级标注生成适用于语义分割的农业数据集。该流程通过标签统一、Sentinel-2产品选择、空间对齐、栅格化、图块提取、质量过滤和类别感知样本选择,将异质的矢量作物标注转化为对齐的多光谱图像-掩码对。生成的数据集包含来自五个欧洲国家的67,337个图块,并使用简化的十种作物类别加上背景的分类法。 使用10个Sentinel-2光谱波段和组合损失(类别加权交叉熵和Dice损失)训练了一个带有组归一化的四层U-Net。在基于EuroCrops的内部测试集上,模型实现了平均交并比(mIoU)0.7665、像素准确率0.8693和平均类别准确率0.9072。与光谱和空间上下文随机森林基线相比,U-Net显示了学习多尺度空间表示对于作物分割的重要性。 在未见过的比利时EuroCrops子集、DACIA5和PASTIS上进行了外部评估。结果显示,在外部和跨数据集评估下存在明显的性能差距,尤其是对于具有不同分类法、标注协议、空间覆盖或时间组织的基准。模型更可靠地转移到分类法对齐的优势类别(如玉米和小麦),而对于几个少数类别以及适应后的单日期PASTIS设置,性能仍然有限。这些发现突出了在现实域偏移下使用EuroCrops衍生监督进行Sentinel-2作物分割的潜力和局限性。

英文摘要

This work presents a configurable pipeline for generating semantic-segmentation-ready agricultural datasets from Sentinel-2 imagery and EuroCrops parcel-level annotations. The workflow transforms heterogeneous vector crop annotations into aligned multispectral image--mask pairs through label harmonization, Sentinel-2 product selection, spatial alignment, rasterization, patch extraction, quality filtering, and class-aware sample selection. The generated dataset contains 67,337 patches from five European countries and uses a reduced taxonomy of ten crop classes plus background. A four-level U-Net with Group Normalization was trained using 10 Sentinel-2 spectral bands and a composite loss combining class-weighted cross-entropy and Dice loss. On the internal EuroCrops-based test split, the model achieved a mean Intersection over Union (mIoU) of 0.7665, a pixel accuracy of 0.8693, and a mean class accuracy of 0.9072. Compared with spectral and spatial-context Random Forest baselines, the U-Net showed the importance of learned multi-scale spatial representations for crop segmentation. External evaluation was performed on unseen Belgian EuroCrops subsets, DACIA5, and PASTIS. The results show a clear performance gap under external and cross-dataset evaluation, especially for benchmarks with different taxonomies, annotation protocols, spatial coverage, or temporal organization. The model transfers more reliably to dominant and taxonomically aligned classes such as maize and wheat, while performance remains limited for several minority classes and for the adapted single-date PASTIS setting. These findings highlight both the potential and the limitations of using EuroCrops-derived supervision for Sentinel-2 crop segmentation under realistic domain shifts.
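
The reported mean Intersection over Union can be computed from a class confusion matrix, as in the sketch below. The function name and the convention of skipping classes that never appear are assumptions; the paper's exact evaluation code is not described in the abstract.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[t][p] = number of pixels with true class t predicted as class p.
    Per class: IoU = TP / (TP + FP + FN). Classes absent from both the labels
    and the predictions are skipped.
    """
    n = len(confusion)
    ious = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[t][k] for t in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

Because every class contributes equally to the mean, minority crop classes with poor IoU pull the score down more than they affect pixel accuracy, which is one reason the two metrics can differ noticeably.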

2606.00673 2026-06-02 cs.CV 版本更新

T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining

T-CLIP:面向对比语言-图像预训练的热感知

Tayeba Qazi, Ayush Maheshwari, Prerana Mukherjee, Brejesh Lall

发表机构 * Indian Institute of Technology Delhi, India(印度理工学院德里分校) NVIDIA AI Technology Center, India(NVIDIA AI技术中心) Jawaharlal Nehru University, India(贾瓦哈拉尔·尼赫鲁大学)

AI总结 针对CLIP无法对齐热图像与文本描述的问题,提出物理感知的热描述数据集IR-Cap和解耦双LoRA框架T-CLIP,实现场景级和对象级热理解,在跨模态检索任务上超越所有基线。

Comments 34pages (including references and appendix), 13 figures

详情
AI中文摘要

热成像在低光照和恶劣天气等挑战性条件下提供了可见光谱视觉的强大替代方案,然而像CLIP这样的基础视觉-语言模型由于根本性的热感知差距,无法将热图像与文本描述对齐。我们识别出三个主要挑战:缺乏带标题的热数据集、标准LLM无法推理热现象,以及热成像中的一个关键表示挑战——全局场景上下文和对象级热信号在单个嵌入空间中同时学习时会产生冲突。为了解决这些问题,我们引入了IR-Cap,这是第一个物理感知的热标题生成管道和数据集,在三个公开基准上提供互补的全局和细粒度热描述;以及T-CLIP,一个解耦的双LoRA框架,独立地适配CLIP用于场景级和对象级热理解。T-CLIP在三个热基准的跨模态检索中相对于所有基线取得了一致的改进,并且我们初步展示了其在文本条件热图像生成中的适用性。

英文摘要

Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.

2606.00664 2026-06-02 cs.RO cs.CV 版本更新

SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models

SKIP: 用于高效具身世界模型的稀疏关键帧插值范式

Ziheng He, Yixiang Chen, Ning Yang, Zhanqian Wu, Qisen Ma, Yuan Xu, Jiabing Yang, Peiyan Li, Xiangnan Wu, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu, Yan Huang

发表机构 * UCAS(中国科学院大学) CASIA(中国科学院自动化研究所) NJU(南京大学) GigaAI THU(清华大学) FiveAges

AI总结 提出稀疏关键帧插值范式(SKIP),通过识别任务相关关键帧并仅生成这些帧,再基于机器人动作插值缺失帧,实现高效视频生成,在LIBERO上速度提升4.16倍,FVD降低89%,且生成视频作为训练数据时策略性能下降极小。

Comments 25 pages, 10 figures

详情
AI中文摘要

具身世界模型通过预测机器人动作如何影响周围场景,已成为机器人学中一种有前景的范式。然而,在像素空间中进行 rollout 推理在计算上仍然昂贵,因为长时程操作视频通常必须逐帧生成。这种成本不能通过不加区分地丢弃帧来轻易降低,因为下游策略依赖于对稀疏任务相关事件(如接近、接触、抓取和释放)的完整保留。为了解决这一挑战,我们提出了稀疏关键帧插值范式(SKIP),这是一种事件保留的稀疏到密集框架,避免了密集的逐帧生成。SKIP 首先通过利用机器人感知的多模态特征来识别任务相关的关键帧。然后,它仅用稀疏视频扩散模型合成这些关键帧。一个学习到的间隙预测器和一个动作条件插值器随后根据机器人动作重建缺失的间隔。在 LIBERO 上,SKIP 生成密集 rollouts 的速度比密集基线快 4.16 倍,同时提高了视觉保真度并将聚合 FVD 降低了 89.0%。重要的是,SKIP 生成的视频是有效的策略训练数据。即使它们完全替代真实演示,π_{0.5} 的成功率在 LIBERO 模拟中仅下降 1.3 个百分点,在真实机器人上下降 6.7 个百分点,而完全密集的逐帧生成则下降 48 到 58 个百分点。

英文摘要

Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts 4.16x faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by 89.0%. Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, π_{0.5} success drops only 1.3 pp in LIBERO simulation and 6.7 pp on the real robot, whereas fully dense frame-by-frame generation collapses by 48 to 58 pp.

2606.00662 2026-06-02 cs.CV 版本更新

TAP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation

TAP-JEPA:冻结的未来潜在探测与两阶段分数融合用于EPIC-KITCHENS-100动作预测

Chaoyang Wang, Lexuan Xu

发表机构 * Beihang University(北京航空航天大学)

AI总结 提出TAP-JEPA方法,利用冻结的V-JEPA 2.1特征和两阶段分数融合,在EPIC-KITCHENS-100动作预测挑战中获得第二名。

Comments The runner-up solution for the Action Anticipation Challenge, EPIC-KITCHENS-100 at the CVPR EgoVis Workshop 2026

详情
AI中文摘要

本报告介绍了TAP-JEPA,我们在EgoVis 2026的EPIC-KITCHENS-100(EK-100)动作预测挑战中获得亚军的提交方案。该任务是从目标动作开始前结束的自我中心视频片段中预测下一个动词、名词以及动词-名词动作。TAP-JEPA没有微调大型视频骨干网络,而是在冻结的V-JEPA 2.1特征上构建了一个紧凑的预测模型:ViT-G/384编码器提取可见的动作前令牌,预训练的潜在预测器从观察到的上下文估计近未来的令牌,两组令牌通过带有针对动词、名词和动作对的任务特定查询的注意力探针进行融合。在最终提交中,我们使用官方训练集和大部分验证集扩展了监督训练,保留了一小部分用于合理性检查和定性观察,并采用了两阶段分数融合:首先在每个epoch内平均八个独立初始化的探针副本,然后合并epoch 12-20的候选结果,并应用依赖于字段的权重。在官方开放测试排行榜上,我们的sunshinesky条目达到了27.91%的整体动作平均Top-5召回率(MT5R),排名第二,仅比最高分低0.04个百分点。

英文摘要

This report presents TAP-JEPA, our runner-up submission to the EPIC-KITCHENS-100 (EK-100) Action Anticipation Challenge at EgoVis 2026. The task is to anticipate the next verb, noun, and verb-noun action from an egocentric clip that ends before the target action begins. Instead of fine-tuning a large video backbone, TAP-JEPA builds a compact anticipation model on frozen V-JEPA 2.1 features: a ViT-G/384 encoder extracts visible pre-action tokens, the pre-trained latent predictor estimates near-future tokens from the observed context, and both token groups are fused by attentive probes with task-specific queries for verbs, nouns, and action pairs. For the final submission, we expand supervised training with the official training split and most of the validation split, reserving a small subset for sanity checks and qualitative inspection, and adopt a two-stage score fusion that first averages eight independently initialized probe replicas within each epoch and then merges candidates from epochs 12-20 with field-dependent weights. On the official open-testing leaderboard, our sunshinesky entry achieves 27.91 percent overall action Mean Top-5 Recall (MT5R), ranking second and only 0.04 percentage points behind the top score.
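
The two-stage score fusion can be sketched as an average over replicas within each epoch followed by a weighted merge across epochs. The per-epoch scalar weights and the function name below are assumptions made for illustration; the report uses field-dependent weights whose exact form is not described in the abstract.

```python
def two_stage_fusion(scores, epoch_weights):
    """Fuse class scores in two stages.

    scores[epoch][replica][class] holds one score per class.
    Stage 1 averages the replicas inside each epoch.
    Stage 2 takes a weighted average of the epoch means
    (weights are assumed to be given, one scalar per epoch).
    """
    num_classes = len(scores[0][0])
    fused = [0.0] * num_classes
    total_weight = sum(epoch_weights)
    for replicas, weight in zip(scores, epoch_weights):
        for k in range(num_classes):
            epoch_mean = sum(r[k] for r in replicas) / len(replicas)
            fused[k] += weight * epoch_mean / total_weight
    return fused
```

Averaging within an epoch first reduces the variance coming from random initialization, so the second stage only has to balance checkpoints of different training length.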

2606.00658 2026-06-02 cs.CV cs.AI 版本更新

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Wan2.2双专家视频扩散模型的协同少步蒸馏与低位量化

Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong, Shiqiao Gu, Yang Yong, Jinyang Guo, Xianglong Liu

发表机构 * IEEE ICME 2026 GCC Low-Bit-width Large Model Quantization Challenge(GCC 低精度大模型量化挑战)

AI总结 针对Wan2.2-T2V-A14B视频扩散模型,提出结合少步分布匹配蒸馏与低位量化的部署压缩流程,通过双专家去噪分支校准、敏感层保护及HiF4低位表示,在保持质量的同时降低计算开销。

详情
AI中文摘要

大型视频扩散模型实现了强大的视觉质量,但由于每个样本需要大量去噪步骤和较大的驻留参数足迹,部署成本仍然很高。本文研究了一种面向部署的压缩流程,针对Wan2.2-T2V-A14B模型,结合少步分布匹配蒸馏与低位量化。该流程遵循模型的双专家去噪路线,分别校准高噪声和低噪声分支,保护敏感入口层,并使用HiF4风格的低位表示以改善动态范围覆盖。量化是在蒸馏后的少步学生模型上校准,而非原始的长步轨迹上,从而减少推理过程中的激活分布不匹配。所提出的协同设计使量化模型保持接近同步全精度模型,并在平均8步和20步时超越原始全精度基线。在测试配置中,20步设置提供了最佳的质量-效率权衡。

英文摘要

Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model's dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.
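
For readers unfamiliar with low-bit quantization, the sketch below shows a generic symmetric 4-bit scheme with one scale per tensor. It is not the HiF4-style representation used in the paper, whose format is not described in the abstract; the function names and the integer range of -7 to 7 are assumptions chosen for illustration.

```python
def quantize_int4(values):
    """Symmetric per-tensor 4-bit quantization.

    Returns integer codes in [-7, 7] and the scale needed to decode them.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]
```

The scale is set by the largest magnitude in the tensor, so a single outlier coarsens the grid for every other value. That sensitivity is one reason calibration data should match the activations seen at inference, as the abstract argues when it calibrates on the distilled few-step student.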

2606.00640 2026-06-02 cs.CV 版本更新

An Attribute-Based Measure of Video Complexity

基于属性的视频复杂度度量

Aditya Sarkar, Yi Li, Zihao Wang, Jiacheng Cheng, Sai Vidyaranya Nuthalapati, Aashu Singh, Shlok Kumar Mishra, David Jacobs, Nuno Vasconcelos

发表机构 * UMIACS-University of Maryland College Park(马里兰大学帕克分校高级计算机研究所) University of California San Diego(加州大学圣地亚哥分校) Yale University(耶鲁大学) Meta AI

AI总结 提出VideoABC框架,通过属性空间量化估计视频-问题对在视频大语言模型上的失败概率,实现非参数复杂度度量。

详情
AI中文摘要

提出了一种新的框架,用于估计视频-问题对给视频大语言模型带来的复杂度,即基于属性的视频复杂度(VideoABC)。视频复杂度定义为视频大语言模型在给定视频-问题对上的失败概率。VideoABC是一种非参数复杂度度量,使用参考视频数据集和预定义的视频属性词汇表(这些属性对复杂度有信息量,例如场景复杂度或与问题相关的视频事件速度)。在训练阶段,参考视频被投影到这些属性空间中,然后进行量化。计算每个量化单元的期望ABC。给定一个新视频及其在属性空间中的投影,通过关联量化单元的期望ABC来估计复杂度。为了能够使用小规模参考视频数据集,结合了两种量化器:k-means量化器(能对参考数据集分布内的样本进行准确复杂度估计)和通用格点量化器(保证对分布外样本的泛化)。受心理物理学研究中目标-干扰物操纵的启发,提出了一种合成视频生成程序,用于在训练期间填充格点量化器的单元,从而计算其期望ABC。实验结果表明,即使使用非常低维的属性表示,VideoABC也有效,其性能大大优于“视频大语言模型作为评判者”等方法,且复杂度更低。最后,VideoABC分数在定义良好的属性方面的可解释性,揭示了基准测试的属性组成如何影响其复杂度。

英文摘要

A new framework for the estimation of the complexity posed by video-question pairs to video-LLMs, Video Attribute-Based Complexity (VideoABC), is proposed. Video complexity is defined as the probability of failure of a video-LLM for a given video-question pair. VideoABC is a non-parametric complexity measure, using a reference video dataset and a pre-defined vocabulary of video attributes informative of complexity, e.g., the scene complexity or the speed of the video event informative of the question. In a training phase, reference videos are projected into the space of these attributes, which is then quantized. The expected ABC of each quantization cell is then computed. Given a new video and its projection into the attribute space, complexity is estimated by the expected ABC of the associated quantization cell. To enable the use of VideoABC with small reference video datasets, two quantizers are combined: a k-means quantizer that enables accurate complexity estimates for samples in the distribution of the reference dataset and a universal lattice quantizer that guarantees generalization to out-of-distribution samples. A synthetic video generation procedure, inspired by target-distractor manipulations of psychophysics studies, is proposed to populate the cells of the lattice quantizer during training, enabling the computation of their expected ABCs. Experimental results show that VideoABC is effective even with very low-dimensional attribute representations, substantially outperforming approaches like "video-LLM as judge" with much less complexity. Finally, the explainable nature of the VideoABC score, in terms of well-defined attributes, is shown to provide insights on how the attribute composition of benchmarks affects their complexity.
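
The non-parametric estimate can be sketched as a lookup table over quantization cells: each cell stores the fraction of reference samples in it on which the model failed. The sketch below covers only a uniform cubic lattice; the k-means quantizer and the synthetic-video procedure are omitted, and the function names, the cell size `step`, and the data layout are assumptions for illustration.

```python
from collections import defaultdict


def lattice_cell(attributes, step):
    """Map an attribute vector to the index of its cell on a uniform grid."""
    return tuple(int(a // step) for a in attributes)


def fit_cell_complexity(reference, step):
    """Estimate the failure probability of each occupied cell.

    reference: list of (attributes, failed) pairs, where failed is 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [failures, samples]
    for attributes, failed in reference:
        cell = lattice_cell(attributes, step)
        counts[cell][0] += failed
        counts[cell][1] += 1
    return {cell: f / n for cell, (f, n) in counts.items()}


def estimate_complexity(table, attributes, step, default=None):
    """Look up the stored estimate for the cell containing a new sample."""
    return table.get(lattice_cell(attributes, step), default)
```

An empty cell returns `default`, which makes the coverage problem visible: with a small reference set many lattice cells have no samples, which is the gap the abstract's synthetic video generation is meant to fill.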

2606.00630 2026-06-02 cs.CV stat.ML 版本更新

A Systematic Benchmark of Intraoperative Ultrasound-to-MR Synthesis for Brain Tumour Surgery

脑肿瘤手术中术中超声到MR合成的系统基准测试

Olga Esteban-Sinovas, Santiago Cepeda, Ignacio Arrese, Rosario Sarabia

发表机构 * Department of Neurosurgery, Neurovascular Unit Río Hortega University Hospital(里奥奥尔特加大学医院神经外科神经血管科) Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC)(生物医学成像与计算分析专项组(GEIBAC)) Instituto de Investigación Biosanitaria de Valladolid (IBioVALL)(巴利亚多利德生物卫生研究所(IBioVALL))

AI总结 针对脑肿瘤手术中术中超声(ioUS)到MR图像合成问题,本研究在公共ReMIND数据集上系统比较了6种生成器、4种推理模式和2种目标,结合图像保真度指标和下游分割评估,发现感知质量(LPIPS)与下游效用最相关,而SSIM与效用负相关,SynDiff-2.5D在下游分割中表现最佳。

详情
AI中文摘要

术中超声(ioUS)在脑肿瘤手术中是一种多功能、成本效益高的模态,但其解释困难:采集平面非标准,伪影具有模态特异性,且其外观与术前MRI(手术规划工具、分割模型和外科医生经验所依赖的)显著不同。从ioUS合成类似MRI的图像可以使基于MRI的基础设施在术中无需额外扫描即可重复使用。大多数先前的工作孤立地评估单一架构;据我们所知,没有基准测试在共同协议下涵盖架构范式、推理机制和下游任务端点。我们在公共ReMIND数据集(76名患者;153对ioUS/T2w和104对ioUS/FLAIR研究;60/16患者级训练/保留测试集划分)上填补了这一空白。六个生成器(四个GAN基线:Pix2Pix、SwinPix2Pix、CycleGAN、CUT;Transformer增强的ResViT;以及少步扩散模型SynDiff)分别在四种推理机制(2D、2.5D、2D+3D细化、全3D)和两种目标(仅T2w;T2w+FLAIR多任务)下训练,共产生48个实验。图像保真度指标(SSIM、PSNR、MAE、LPIPS)辅以nnU-Net v2下游分割评估(肿瘤和切除腔)以及按组织学分级和再次手术的亚组分析。没有一种架构在所有轴上占优,而且关键的是,感知质量与下游效用最密切相关(LPIPS,r=-0.66,p<0.001),而更高的SSIM与更差的效用相关(r=-0.64,p<0.001);SynDiff-2.5D最好地保留了下游分割(U_Dice=0.55)。因此,应报告或优先考虑感知和下游任务指标而非全局SSIM,并且架构选择应取决于手术阶段、患者病史和临床目标。

英文摘要

Intraoperative ultrasound (ioUS) is a versatile, cost-effective modality in brain tumour surgery, but its interpretation is difficult: acquisition planes are non-standard, artefacts are modality-specific, and its appearance differs markedly from the preoperative MRI on which surgical-planning tools, segmentation models and the surgeon's experience rely. Synthesising MRI-like images from ioUS could let this MRI-based infrastructure be reused intraoperatively without an extra scan. Most prior work evaluates a single architecture in isolation; to our knowledge, no benchmark has spanned architectural paradigms, inference regimes and downstream-task endpoints under a common protocol. We address this gap on the public ReMIND data set (76 patients; 153 paired ioUS/T2w and 104 paired ioUS/FLAIR studies; 60/16 patient-level train/held-out split). Six generators (four GAN baselines: Pix2Pix, SwinPix2Pix, CycleGAN, CUT; the transformer-augmented ResViT; and the few-step diffusion model SynDiff) were each trained under four inference regimes (2D, 2.5D, 2D + 3D-refinement, full-3D) and two targets (T2w only; T2w + FLAIR multi-task), yielding 48 experiments. Image-fidelity metrics (SSIM, PSNR, MAE, LPIPS) were complemented by an nnU-Net v2 downstream segmentation evaluation (tumour and resection cavity) and by subgroup analyses by histological grade and reoperation. No architecture dominated every axis, and, critically, perceptual quality tracked downstream utility most closely (LPIPS, r=-0.66, p<0.001), whereas higher SSIM was associated with worse utility (r=-0.64, p<0.001); SynDiff-2.5D best preserved downstream segmentation (U_Dice=0.55). Perceptual and downstream-task metrics should therefore be reported alongside or in preference to global SSIM, and architecture choice conditioned on surgical phase, patient history and clinical objective.
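
The size of the experimental grid and the kind of correlation reported between image metrics and downstream utility can be reproduced with the short sketch below. The function name `pearson_r` is an assumption, and no values from the paper are used; only the factor counts (six generators, four inference regimes, two targets) come from the abstract.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Generators x inference regimes x targets, as listed in the abstract.
NUM_EXPERIMENTS = 6 * 4 * 2
```

Since lower LPIPS means better perceptual quality, the negative correlation reported for LPIPS says that perceptually better images give higher downstream utility, whereas the negative correlation for SSIM, where higher is nominally better, points the opposite way.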

2606.00622 2026-06-02 cs.CV 版本更新

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

MM-Snowball:多模态多轮对话中的幻觉雪崩评估与缓解

Yue Jiang, Xue Jiang, Lihua Zhang, Zhiqiang Wang, Yuhang Lu, Peng Wang, Bo Han, Feng Zheng, Dingkang Yang

发表机构 * College of Intelligent Robotics and Advanced Manufacturing, Fudan University(复旦大学智能机器人与先进制造学院) Southern University of Science and Technology(南方科技大学) TMLR Group, Hong Kong Baptist University(香港浸会大学 TMLR 课题组) MM Lab, CUHK(香港中文大学多媒体实验室) RAMS Lab, Huawei Technologies Co., Ltd.(华为技术有限公司 RAMS 实验室)

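AI summary Targets hallucination snowballing in multimodal LLMs, where initial errors accumulate across dialogue turns. Introduces MM-Snowball, the first fine-grained diagnostic benchmark, and CAVR (Conflict-Aware Visual Rectification), a training-free method that mitigates snowballing by refreshing visual grounding at the representation level and rectifying output distributions at the logit level.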

Comments Accepted by The International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

多模态大语言模型(MLLMs)展现出显著的视觉理解能力,但在交互环境中的可靠性受到幻觉雪崩的严重破坏:一种初始错误在对话轮次间放大,导致连贯性崩溃的现象。这种失败揭示了一个根本性的脆弱性,即模型逐渐忽视视觉锚定,转而过度依赖受污染的文本历史。现有基准主要局限于单轮VQA,无法捕捉长程交互中错误传播的复杂动态。为解决这一问题,我们引入了MM-Snowball,这是首个用于细粒度诊断对话中幻觉雪崩的基准。广泛评估表明,我们的基准对即使是先进的MLLMs也构成了重大挑战,并揭示了现有为单轮VQA设计的缓解方法的无效性。为对抗这种退化,我们提出了冲突感知视觉校正(CAVR)。这种无训练方法通过协同双机制缓解雪崩:在表示级刷新视觉锚定,并在logit级修正输出分布,有效地将模型重新锚定到视觉事实。实验表明,CAVR达到了最先进的性能,为更可靠的交互式AI提供了一条有希望的路径。数据和代码可在 https://frenkie-chiang.github.io/MM-Snowball 获取。

英文摘要

Multimodal large language models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon where initial errors amplify across conversational turns, leading to a collapse in coherence. This failure reveals a fundamental vulnerability where models progressively neglect visual grounding in favor of over-relying on polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA and therefore fail to capture the complex dynamics of error propagation in long-horizon interactions. To address this, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing within dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs and reveals the inefficacy of existing mitigation methods designed for single-turn VQA. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR). This training-free method mitigates snowballing through a synergistic dual mechanism that refreshes visual grounding at the representation level and rectifies output distributions at the logit level, effectively re-anchoring the model to visual facts. Experiments demonstrate that CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI. Data and code are available at: https://frenkie-chiang.github.io/MM-Snowball

2606.00620 2026-06-02 cs.CV 版本更新

FlowNar: Scalable Streaming Narration for Long-Form Videos

FlowNar: 面向长视频的可扩展流式叙述

Zeyun Zhong, Manuel Martin, Chengzhi Wu, David Schneider, Frederik Diederichs, Juergen Gall, Juergen Beyerer

发表机构 * Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院) Lamarr Institute for Machine Learning(拉马尔机器学习研究所) University of Bonn(波恩大学)

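AI summary Proposes FlowNar, a framework that uses dynamic context management and the CLAM module to keep visual memory and computational complexity bounded, combining high narration quality with high efficiency in streaming video narration.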

Comments Accepted to ICML 2026

详情
AI中文摘要

近期的大型多模态模型(LMMs)主要针对离线场景设计,难以适应流式视频的动态需求。虽然最近的在线适配改进了实时处理,但仍面临关键的可扩展性挑战,资源需求通常随视频时长至少线性增长。为突破这一瓶颈,我们提出FlowNar,一种用于可扩展流式视频叙述的新型框架。FlowNar的核心是一种用于历史视觉上下文移除的动态上下文管理策略,结合我们的CLAM(跨线性注意力记忆)模块用于流式视觉历史保留,确保有界的视觉内存使用和计算复杂度,这对高效流式处理至关重要。我们还引入了一个现实的自条件评估协议和补充评估指标,以在类似部署的条件下评估流式叙述模型。在Ego4D、EgoExo4D和EpicKitchens100数据集上的实验表明,FlowNar在强基线上显著提高了叙述质量,同时保持高效,支持处理10倍长的视频,并实现3倍更高的吞吐量(FPS)。代码可在https://github.com/zeyun-zhong/FlowNar获取。

英文摘要

Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy for historical visual context removal, combined with our CLAM (Cross Linear Attentive Memory) module for streaming visual history retention, ensuring bounded visual memory usage and computational complexity, crucial for efficient streaming. We also introduce a realistic self-conditioned evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting processing of 10× longer videos and achieving 3× higher throughput (FPS). The code is available at https://github.com/zeyun-zhong/FlowNar.

2606.00606 2026-06-02 cs.CV 版本更新

FiSeR: Fine-Grained Source Representations for Cross-Domain AI Image Detection

FiSeR:用于跨域AI图像检测的细粒度源表示

Shan Zhang, Yongxin He, Mingming Zhang, Huiwen Tian, Lei Ma

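AI summary Addresses the poor generalization of synthetic image detectors under domain shift with FiSeR, a hierarchical contrastive learning framework that jointly optimizes coarse- and fine-grained contrastive objectives, raising average AUROC by +10.22 in cross-domain evaluation.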

详情
AI中文摘要

现实世界的合成图像检测器在域内表现强劲,但在域迁移下通常泛化能力差。通过无监督UMAP投影,我们发现自然和合成特征在未见数据集上仍部分可分,但性能仍然下降,表明分类头过度拟合训练域伪影。因此,关键在于学习更具迁移性的表示,使决策标准对域迁移更稳定和鲁棒。基于合成图像由多种生成器生成的结构事实,我们提出一个层次对比学习框架,在保留生成器身份信息的同时提高自然和合成图像之间的可分离性。它联合优化(i)自然和合成图像之间的粗粒度对比目标和(ii)使用生成器身份的合成图像之间的细粒度对比目标。在WildFake上训练,我们的方法在跨域评估中,在与强基线DIRE相同的设置下,在Chameleon、AIGIBench、Community Forensics和GenImage上平均AUROC提升+10.22。对于少样本适应,我们冻结骨干网络,并在每类10个标记样本上拟合SVM头,在12个广泛使用的检测器上平均,AIGIBench的AUROC提升+10.64,Chameleon提升+17.41。我们的代码公开在:https://github.com/heyongxin233/FiSeR。

英文摘要

Real-world synthetic image detectors often generalize poorly under domain shift despite strong in-domain performance. Using unsupervised UMAP projections, we find that natural and synthetic features remain partially separable on unseen datasets, yet performance still drops, suggesting that the classification head overfits to training-domain artifacts. Therefore, the key is to learn more transferable representations so that the decision criterion is more stable and robust to domain shifts. Based on the structural fact that synthetic images are produced by diverse generators, we propose a hierarchical contrastive learning framework that improves the separability between natural and synthetic images while preserving generator identity information. It jointly optimizes (i) a coarse contrastive objective between natural and synthetic images and (ii) a fine contrastive objective among synthetic images using generator identities. Trained on WildFake, our method achieves an average AUROC gain of +10.22 on cross-domain evaluation over Chameleon, AIGIBench, Community Forensics, and GenImage under the same settings as the strong baseline DIRE. For few-shot adaptation, we freeze the backbone and fit an SVM head on 10 labeled samples per class, improving AUROC by +10.64 on AIGIBench and +17.41 on Chameleon, averaged over 12 widely used detectors. Our code is publicly available at: https://github.com/heyongxin233/FiSeR.

2606.00602 2026-06-02 cs.CV 版本更新

ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

ASAP: 基于解剖感知语义自适应预训练的医学体素表示学习

Rongsheng Wang, Fenghe Tang, Zihang Jiang, Yingtai Li, Xu Zhang, Haoran Lai, Wenxin Ma, Wei Wei, Zhiyang He, Xiaodong Tao, Rui Yan, Qingsong Yao, Shaohua Kevin Zhou

发表机构 * School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China(中国科学技术大学生命科学与医学部生物医学工程学院) Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE) Lab, YRD-RIGHT, USTC Suzhou Institute for Advanced Research(中国科学技术大学苏州高等研究院 YRD-RIGHT 医学影像、机器人、分析计算与学习(MIRACLE)实验室) Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology(江苏省多模态数字孪生技术重点实验室) Biomedical Basic Research Center (BBRC) of Jiangsu Province(江苏省生物医学基础研究中心) Department of Radiology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC(中国科学技术大学生命科学与医学部附属第一医院放射科) Anhui IFLYTEK CO., Ltd(安徽科大讯飞股份有限公司) School of Medicine, Stanford University(斯坦福大学医学院) State Key Laboratory of Precision and Intelligent Chemistry, Hefei, Anhui, China(精密与智能化学国家重点实验室,中国安徽合肥)

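AI summary Proposes ASAP, a framework that learns transferable and interpretable volumetric representations from chest CT scans and radiology reports through anatomy-aware knowledge injection and semantically-adaptive alignment and fusion, achieving state-of-the-art performance on 15 datasets and 22 downstream tasks.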

Comments MICCAI 2025 extension

详情
AI中文摘要

从医学体素扫描中学习可迁移和可解释的表示仍然具有挑战性,因为存在复杂的解剖结构和放射学报告提供的弱、异质监督。在本文中,我们提出了解剖感知语义自适应预训练(ASAP),一个用于从大规模胸部CT扫描及其对应放射学报告中进行细粒度医学体素表示学习的原理性视觉-语言预训练框架。ASAP集成了三个关键组件:(1)解剖感知知识注入模块,通过现成的分割工具融入器官级结构先验,以促进解剖上一致的表示;(2)语义自适应选择性对齐机制,动态地将句子级别的发现与局部体素区域关联;(3)语义自适应融合模块,在双模态掩码建模范式下,实现解剖信息视觉特征与基于文本线索之间的有效交互。除了方法论贡献外,我们还为胸部CT上的医学体素视觉-语言预训练建立了一个全面的基准,涵盖15个数据集和22个下游任务,包括异常分类、分割、疾病预后预测、报告生成、词汇分类、跨模态检索和视觉问答。该基准提供了标准化的评估协议,以系统评估在不同临床设置和数据制度下的表示质量。大量实验表明,ASAP在跨任务和数据集上一致地实现了最先进的性能,在有限监督和分布偏移下尤其显著,验证了其在学习可迁移和临床有意义的体素表示方面的有效性。

英文摘要

Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.

2606.00592 2026-06-02 cs.CV 版本更新

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

通过PRISM:原则感知、可解释和多尺度的视觉设计评估

Mona Gandhi, KJ Joseph, Srinivasan Parthasarathy, Sayan Nag

发表机构 * Ohio State University(俄亥俄州立大学) Adobe Research(Adobe研究院)

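AI summary Introduces the PRISM benchmark and a multi-scale evaluation framework that uses principle-specific perturbations and layered analysis to deliver interpretable assessment of design quality.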

详情
Journal ref CVPR 2026 Findings
AI中文摘要

有效的视觉传达源于多个设计原则的和谐,如可读性、对比度、对齐、重叠和连贯性,这些原则共同支配着传达者的清晰度和意图。虽然人类设计师会整体性地考虑这些原则,但机器智能体通常将它们压缩成一个单一的启发式分数,提供有限的可解释性和诊断精度。为了解决这一差距,我们引入了PRISM(原则感知、可解释和结构引导的设计修改),这是一个基准,它沿着可测量的设计原则系统地扰动Crello数据集中的专业布局。该基准包含10万个扰动训练样本和1万个扰动验证设计,每个样本隔离特定的原则违规,以进行关于设计质量的多模态推理的受控分析。我们表明,像Qwen-2.5-VL和GPT-4o-mini这样的模型对有针对性的原则退化在很大程度上不敏感,而GPT-4o表现出全局意识但缺乏细粒度的解耦。基于这些见解,我们提出了一个多尺度评估框架,该框架集成了用于定量评估的轻量级评分器、用于局部反馈的指令调优视觉语言模型以及用于全局推理的基于提示的方法。我们的框架提供了设计失败的可解释解释。利用这些局部见解,我们展示了改善布局质量的有针对性的改进。PRISM和我们的框架共同为可解释的、具有设计素养的多模态推理系统奠定了基础。

英文摘要

Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern clarity and intent of the communicator. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision. To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement. Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Improving Visual Representation Alignment Generation with GRPO

利用GRPO改进视觉表示对齐生成

Shentong Mo, Sukmin Yun

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Hanyang University(汉阳大学)

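AI summary Proposes VRPO, which uses reinforcement learning to replace the static alignment loss with a generative representation policy optimization objective that dynamically balances representation consistency and generation quality, yielding faster convergence and higher image fidelity in diffusion transformers.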

详情
AI中文摘要

最近的扩散Transformer展示了强大的图像合成能力,但由于生成表示与判别表示之间的弱对齐,训练效率仍然较低。虽然表示对齐框架(如REPA)通过将噪声去噪特征与预训练视觉编码器对齐来改善收敛,但其外部监督的对齐损失是静态的,在训练和推理过程中缺乏自适应性。现有方法依赖于固定的余弦对齐或对比目标,无法动态平衡表示一致性和生成质量,导致判别收益有限,且无法以任务自适应方式优化对齐。为了解决这个问题,我们提出了VRPO,一种基于强化学习的优化策略,用生成式表示策略优化目标取代REPA的静态对齐损失。VRPO不强制执行固定的相似性约束,而是将表示对齐视为一个奖励引导的过程:模型根据生成保真度、感知质量以及扩散特征与预训练视觉嵌入之间的语义一致性获得自适应奖励。这种公式使生成器能够不断优化其内部表示,朝向有语义意义的方向,同时提高图像质量。我们的VRPO驱动训练无缝集成到扩散Transformer中,引入可忽略的计算成本,并保持与SiT和DiT架构的完全兼容性。在ImageNet-256x256上的大量实验表明,我们的VRPO-Alignment显著提高了收敛速度和保真度,在相同计算预算下,与REPA相比,FID提升高达1.8,训练速度加快2.3倍。

英文摘要

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.

2606.00579 2026-06-02 cs.CL cs.CV 版本更新

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

沙盒化编码智能体是竞争性的全模态任务求解器

Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou

发表机构 * University of Maryland(马里兰大学) MBZUAI

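AI summary Shows that sandboxed coding agents with only text+image access and tool use can match or even outperform native omnimodal models on omnimodal tasks, and that skill injection and the Code-X training recipe further improve performance.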

Comments Paper under review

详情
AI中文摘要

随着多模态大语言模型越来越多地针对视频和音频,人们通常认为这类任务需要原生全模态模型。我们表明情况并非总是如此:仅具有文本+图像访问权限和沙盒化工具使用接口的编码智能体,可以在多个音频-视频基准测试中匹配,并在某些设置中超越最先进的原生全模态模型和预定义的多模态智能体框架。我们的轨迹分析表明,它们的优势来自于编写代码和编排工具,以从转录、帧和其他模态信号中提取相关证据,从而将全模态任务转化为检索和信息处理问题,而不是摄取整个媒体流。我们进一步通过失败分类和过程级轨迹分析来刻画它们的局限性,并表明简单的技能注入(包括人工编写和自蒸馏的技能)能显著提高性能。为了探索开源激发,我们引入了Code-X,一种包含OmniCoding轨迹数据集和可验证奖励的训练方案,并在Qwen-3.5-9B和Qwen-3.6-27B上提供了基线。最后,我们认为下一个前沿是多模态处理,并引入了TerminalBench-O,一个用于现实世界全模态处理任务的过程级基准。代码将在https://github.com/Dongping-Chen/OmniCoding提供。

英文摘要

As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.

2606.00571 2026-06-02 cs.LG cs.AI cs.CV 版本更新

On the Difficulty of Learning a Meta-network for Training Data Selection

学习用于训练数据选择的元网络的困难性

Zilin Du, Junqi Zhao, Boyang Albert Li

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

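AI summary Investigates why meta-learning for training-data selection (MTS) underperforms in practice. A mathematical analysis identifies two obstacles, a poor gradient signal-to-noise ratio and a lack of informative features, and the paper proposes larger batch sizes and a set of informative features as remedies.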

详情
AI中文摘要

合成数据越来越多地被用于训练神经网络,但若不加区分地使用,其与真实数据的分布不匹配会限制其有效性。一种常见策略是通过双层优化学习数据权重,我们称之为元学习训练数据选择(MTS)。有趣的是,在实践中,MTS 往往低于预期。我们识别了正确训练 MTS 的两个障碍:梯度信噪比(GSNR)低导致优化困难,以及缺乏与数据质量相关的信息特征。我们对 MTS 进行了数学分析,揭示了归一化数据权重的动态以及不同数据质量与低 GSNR 之间的关系。分析表明,一个简单而有效的解决方案是增大批大小。此外,我们提出了一组信息特征,用于捕捉训练数据在其分布中的位置和训练动态。在四个基准上的实验显示了一致的改进,与无选择的训练相比平均提升 5.49%,与最强基线相比平均提升 2.89%。

英文摘要

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and a lack of informative features that correlate with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.
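The link between batch size and GSNR can be illustrated with a small calculation. The sketch below assumes the common definition of GSNR for a single parameter, the squared mean of per-sample gradients divided by their variance; the abstract does not spell out the paper's exact definition, and the gradient values are hypothetical. Under i.i.d. sampling, averaging over a batch of size B divides the variance by B, so the GSNR of the batch gradient grows linearly with B.

```python
def gsnr(per_sample_grads):
    """Squared mean over variance of per-sample gradients for one parameter."""
    n = len(per_sample_grads)
    mean = sum(per_sample_grads) / n
    var = sum((g - mean) ** 2 for g in per_sample_grads) / n
    if var == 0:
        raise ValueError("GSNR is undefined when all gradients are identical")
    return mean ** 2 / var


def batch_gsnr(per_sample_grads, batch_size):
    """GSNR of a mini-batch mean gradient, assuming i.i.d. sampling.

    The mean is unchanged while the variance shrinks by 1 / batch_size.
    """
    return batch_size * gsnr(per_sample_grads)


# Hypothetical per-sample gradients of one meta-network parameter.
grads = [0.9, -0.7, 1.1, -0.5, 0.2]

for b in (1, 8, 64):
    print(b, round(batch_gsnr(grads, b), 3))
```

The per-sample GSNR here is well below 1, meaning noise dominates a single-sample update; a larger batch raises the ratio without changing the expected gradient direction.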

2606.00564 2026-06-02 cs.CV cs.CL 版本更新

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

面向视觉-语言推理的分解式在策略蒸馏:引导梯度实现视觉定位

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

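AI summary Decomposes the vision-language model distillation loss into a language-prior component and a visual-grounding component whose gradients are nearly orthogonal, and proposes Visual Gradient Steering (VGS), which dynamically reorients the update direction to prioritize the visual subspace, improving grounding in small models on complex multimodal tasks.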

Comments ICML 2026 Spotlight

详情
AI中文摘要

虽然在策略蒸馏为训练小型推理模型提供了密集监督,但其在多模态领域的优化动态仍未得到充分探索。在这项工作中,我们通过数学上将损失分解为两个不同的组成部分:语言先验和视觉定位,挑战了视觉-语言模型(VLM)蒸馏的标准整体观点。我们的分析揭示,这些分量的梯度向量几乎正交,表明与教师语言分布对齐的目标在几何上独立于匹配其视觉感知的目标。因此,标准优化被动地遵循一条次优的折衷轨迹,隐式地平衡这两个目标。假设视觉定位是视觉-语言推理的主要瓶颈,我们引入了视觉梯度引导(VGS),一种动态重新定向更新向量以优先考虑视觉子空间的方法。在多个蒸馏设置和复杂多模态基准上的实验结果表明,VGS显著优于标准的在策略蒸馏整体公式,以最小的训练开销实现了卓越的定位能力。

英文摘要

While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher's language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.
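The geometric picture in this abstract can be sketched with two toy gradient vectors. The code below is an illustration only: the abstract does not give the actual VGS update rule, so the `steer` function, its `alpha` parameter, and the gradient values are all hypothetical. It shows that when the two component gradients are nearly orthogonal, upweighting the visual component rotates the update toward the visual direction while the step norm is kept equal to that of the plain sum.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


def steer(g_lang, g_vis, alpha=4.0):
    """Hypothetical steering rule: upweight the visual gradient by alpha,
    then rescale so the step has the same norm as the plain sum."""
    plain = [l + v for l, v in zip(g_lang, g_vis)]
    raw = [l + alpha * v for l, v in zip(g_lang, g_vis)]
    if norm(raw) == 0:
        return raw
    scale = norm(plain) / norm(raw)
    return [scale * r for r in raw]


# Toy gradients for the language-prior and visual-grounding loss components.
g_lang = [1.0, 0.0, 0.1]
g_vis = [0.0, 0.5, 0.05]

plain = [l + v for l, v in zip(g_lang, g_vis)]
steered = steer(g_lang, g_vis)

print("cos(g_lang, g_vis) =", round(cosine(g_lang, g_vis), 3))  # close to 0
print("cos(plain, g_vis)  =", round(cosine(plain, g_vis), 3))
print("cos(steered, g_vis) =", round(cosine(steered, g_vis), 3))
```

In this toy setting the plain sum is dominated by the larger language gradient, which is the "compromise trajectory" the abstract describes; the steered update aligns more closely with the visual direction.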

2606.00562 2026-06-02 cs.CV cs.LG 版本更新

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

DeepLatent: 通过并行潜在视觉推理用图像思考

Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao

发表机构 * Baidu Inc.(百度公司) Peking University(北京大学)

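AI summary Proposes DeepLatent, a framework in which LatentFormer generates latent visual states in parallel and a continuous-space reinforcement learning algorithm optimizes the latent representations, achieving state-of-the-art performance on multiple benchmarks.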

详情
AI中文摘要

“用图像思考”的新兴范式将视觉状态嵌入中间推理步骤,定义了视觉语言模型的新前沿。现有方法沿两条路线分化。工具辅助方法应用显式视觉操作,但存在高延迟和操作类型受限的问题。潜在推理方法自回归生成隐式视觉状态,但性能不如工具辅助方法,且其潜在标记无法捕获有效的视觉信息。在这项工作中,我们提出DeepLatent,一个用于潜在视觉推理的并行框架。首先,我们引入LatentFormer。它使用可学习的2D标记并行生成上下文条件的潜在状态,将每次视觉更新直接锚定在原始图像特征中。其次,我们设计了一种连续空间强化学习算法。它直接在嵌入空间中优化潜在调制参数,显著提高潜在表示质量。该框架通过知识蒸馏和连续空间强化学习算法进行训练。此外,我们贡献了DeepLatent-180K,一个专为潜在视觉推理定制的大规模数据集。在多个基准上的广泛评估表明,DeepLatent达到了最先进的性能。

英文摘要

The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.

2606.00556 2026-06-02 cs.CV 版本更新

Improving Visual Grounding in Remote Sensing via Cluster-Guided Refinement and Model Ensemble Voting

通过聚类引导精炼和模型集成投票改进遥感中的视觉定位

Panav Shah, Geet Sethi, Ashutosh Gandhe

发表机构 * Indian Institute of Technology Bombay(印度理工学院班加罗尔)

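AI summary Proposes two visual grounding pipelines (SGR and CGR) that combine the remote-sensing-specific model RemoteSAM with the general-purpose segmentation model SAM3, and improves localization accuracy through multi-model ensemble voting.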

Comments Accepted at CVPR 2026 Workshop MORSE

详情
AI中文摘要

视觉定位旨在定位与自然语言描述对应的图像区域,是可解释视觉系统的关键组成部分。在遥感图像中,由于场景复杂、目标小且尺度变化大,定位尤为困难。依赖单一模型往往不足以应对这些多样化的挑战。在这项工作中,我们提出了两种定位流程,即序列定位精炼(SGR)和聚类感知定位精炼(CGR),它们结合了专门用于遥感的视觉定位模型RemoteSAM和强大的通用分割模型SAM3的互补优势。我们的方法首先使用RemoteSAM获得目标位置的初始估计,然后使用SAM3进行精炼,以产生更准确且空间一致的分割。此外,我们探索了一种基于六个不同能力的定位流程的多数投票集成策略。这种多模型框架提高了鲁棒性,并显著提升了定位精度。实验结果表明,所提出的流程和集成方法优于单个模型,从而产生更可靠和精确的视觉定位预测。

英文摘要

Visual grounding aims to locate image regions that correspond to natural language descriptions and is a key component of interpretable vision systems. In remote sensing imagery, grounding is particularly challenging due to complex scenes, small objects, and large variations in scale. Relying on a single model is often insufficient to address these diverse challenges. In this work, we propose two grounding pipelines, Sequential Grounding Refinement (SGR) and Cluster-Aware Grounding Refinement (CGR), that combine the complementary strengths of RemoteSAM, a visual grounding model specialized for remote sensing, and SAM3, a powerful general-purpose segmentation model. Our approach first uses RemoteSAM to obtain an initial estimate of object location, which is then refined using SAM3 to produce more accurate and spatially consistent segmentations. Additionally, we explore an ensemble strategy based on majority voting across six diverse grounding pipelines, each with distinct capabilities. This multi-model framework improves robustness and significantly enhances localization accuracy. Experimental results demonstrate that the proposed pipelines and ensemble approach outperform individual models, leading to more reliable and precise visual grounding predictions.
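The ensemble step described above can be sketched as a pixel-wise majority vote over binary masks. This is an assumption about the granularity of the vote: the abstract does not say whether voting happens per pixel, per box, or per region, nor how ties are handled. In the sketch below a pixel is kept only when a strict majority of pipelines mark it, so a 3-3 tie among six pipelines is rejected.

```python
def majority_vote(masks, threshold=None):
    """Pixel-wise vote over binary masks (lists of rows of 0/1) of equal shape.

    By default a pixel is foreground only if a strict majority of masks agree.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    height, width = len(masks[0]), len(masks[0][0])
    for m in masks:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all masks must have the same shape")
    if threshold is None:
        threshold = len(masks) // 2 + 1  # strict majority
    return [
        [1 if sum(m[i][j] for m in masks) >= threshold else 0 for j in range(width)]
        for i in range(height)
    ]


# Six toy 1x3 masks standing in for the outputs of six grounding pipelines.
pipeline_masks = [
    [[1, 1, 1]],
    [[1, 1, 1]],
    [[1, 1, 1]],
    [[1, 0, 1]],
    [[1, 0, 0]],
    [[1, 0, 0]],
]

fused = majority_vote(pipeline_masks)
print(fused)  # column votes are 6, 3 and 4, so the tie in the middle is dropped
```

Lowering `threshold` trades precision for recall, which may matter for the small objects that the abstract identifies as a difficulty in remote sensing imagery.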

2606.00548 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

CAFOSat:用于基于高分辨率影像的基础设施感知型CAFO制图的高质量标注数据集

Oishee Bintey Hoque, Nibir Chandra Mandal, Mandy L Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga

发表机构 * University of Virginia(弗吉尼亚大学) Biocomplexity Institute, University of Virginia(弗吉尼亚大学生物复杂性研究所)

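AI summary Addresses the difficulty of mapping Concentrated Animal Feeding Operations (CAFOs) at scale with CAFOSat, a dataset that integrates high-resolution NAIP imagery with multi-source CAFO inventories, refines weakly geolocated records through human-in-the-loop annotation, GradCAM-based localization, and geometric clustering, and adds a synthetic augmentation pipeline, enabling infrastructure-level annotation and robust classification.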

Comments Accepted at CVPR Workshop 2026. The first two authors contributed equally.

详情
AI中文摘要

集中式动物饲养操作(CAFO)在农业生产中发挥重要作用,但也与环境、公共卫生和疾病监测问题相关。由于基础设施布局异质、位置记录噪声大、标注不一致以及清单不完整,从遥感影像大规模制图CAFO仍具挑战。我们引入CAFOSat,一个用于美国全境CAFO制图的高质量标注、基础设施感知数据集。CAFOSat集成高分辨率国家农业影像计划(NAIP)影像与跨州收集的多源CAFO清单,并通过结合AI辅助标注、基于GradCAM的定位和几何聚类的人机协同管道,将弱地理定位记录转化为精细标注。为提高数据集质量,我们利用土地覆盖引导采样和空间排除约束筛选具有挑战性的负样本,并通过人工验证提供基础设施级标注,包括畜棚、粪池和放牧相关特征。最终数据集包含超过45,000个图像块,覆盖20个州和四大CAFO类别。我们对多种卷积、基于Transformer和视觉-语言模型进行基准测试,证明了精细标注和精心筛选的负样本在CAFO分类和泛化中的价值。此外,我们引入一个合成增强管道,生成基础设施感知的变体以增加训练多样性并提升分布偏移下的鲁棒性。CAFOSat为推进基础设施感知的农业监测和基于高分辨率遥感影像的CAFO制图提供了大规模基准。

英文摘要

Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental, public health, and disease surveillance concerns. Large-scale mapping of CAFOs from remote sensing imagery remains challenging due to heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories. We introduce CAFOSat, a strongly annotated, infrastructure-aware dataset for CAFO mapping across the United States. CAFOSat integrates high-resolution National Agriculture Imagery Program (NAIP) imagery with multi-source CAFO inventories collected across multiple states and transforms weak geolocation records into refined annotations through a human-in-the-loop pipeline combining AI-assisted annotation, GradCAM-based localization, and geometric clustering. To improve dataset quality, we curate challenging negative samples using land-cover-guided sampling with spatial exclusion constraints and provide infrastructure-level annotations, including barns, manure ponds, and grazing-related features, through manual verification. The resulting dataset contains more than 45,000 image patches spanning 20 states and four major CAFO categories. We benchmark a diverse set of convolutional, transformer-based, and vision-language models, demonstrating the value of refined annotations and curated negative samples for CAFO classification and generalization. In addition, we introduce a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and improve robustness under distribution shifts. CAFOSat provides a large-scale benchmark for advancing infrastructure-aware agricultural monitoring and CAFO mapping from high-resolution remote sensing imagery.

2606.00543 2026-06-02 cs.CV 版本更新

ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

ETC: 通过任务感知的视觉信息蒸馏实现视觉语言模型中的极端令牌压缩

Yiling Gao, Hongchen Wei, Zhenzhong Chen

发表机构 * School of Remote Sensing and Information Engineering, Wuhan University(武汉大学遥感与信息工程学院)

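AI summary Proposes ETC, a framework grounded in variational information distillation that minimizes task loss while reducing the number of input tokens. It weights visual features with text-to-image cross-attention and applies variational information distillation, retaining strong task performance even under single-token compression.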

详情
AI中文摘要

在视觉语言模型(VLM)中,高分辨率图像会产生大量视觉令牌,导致推理时的高计算成本和KV缓存开销。为解决此问题,我们提出极端令牌压缩(ETC)框架,基于变分信息蒸馏原理,在减少输入令牌数量时最小化任务损失。具体而言,从信息论角度,我们表明最小化任务损失需要紧凑表示保留用于预测的指令感知充分统计量。在实践中,ETC利用文本-图像交叉注意力加权原始视觉特征以近似潜在的指令感知预测统计量。此外,ETC引入变分信息蒸馏,使紧凑表示保留必要信息以恢复该预测统计量。在LLaVA-1.5-7B和Qwen3-VL-2B上的实验表明,即使在单令牌压缩下,ETC仍保持有效性,大幅减少KV缓存开销同时保留强任务性能。

英文摘要

In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that minimizes task loss when reducing the number of input tokens based on the principle of variational information distillation. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features to approximate the latent instruction-aware predictive statistic. Moreover, ETC introduces variational information distillation, enabling the compact representation to preserve the information needed to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
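The cross-attention weighting step can be sketched as attention pooling of many visual tokens into one. This is a simplified stand-in, not the ETC implementation: it uses a single text query vector and scaled dot-product scores, omits the variational distillation objective entirely, and the function name and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_to_single_token(text_query, visual_tokens):
    """Pool visual tokens into one vector, weighted by similarity to the text query."""
    dim = len(text_query)
    scores = [
        sum(q * v for q, v in zip(text_query, token)) / math.sqrt(dim)
        for token in visual_tokens
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * token[k] for w, token in zip(weights, visual_tokens))
        for k in range(dim)
    ]
    return pooled, weights


# Toy 2-D features: the first visual token is the one aligned with the instruction.
text_query = [1.0, 0.0]
visual_tokens = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]

pooled, weights = compress_to_single_token(text_query, visual_tokens)
print([round(w, 3) for w in weights])
print([round(p, 3) for p in pooled])
```

The pooled vector is dominated by the token most relevant to the instruction, which is the intuition behind keeping instruction-aware information when the visual input is reduced to a single token.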

2606.00514 2026-06-02 cs.LG cs.CV 版本更新

Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation

在重建空间中生成,在语义空间中匹配:一步生成的传输几何

Hugues Van Assel, Edward De Brouwer, Saeed Saremi, Gabriele Scalia, Aviv Regev

发表机构 * Genentech(基因泰克)

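AI summary Studies the role of self-supervised learning (SSL) features in one-step generative models, using the Sinkhorn divergence in a semantic feature space for distribution matching. This substantially lowers ImageNet FID and exposes a potential conflict between the features used by the evaluation metric and those used for training.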

Comments 26 pages, 4 figures

详情
AI中文摘要

生成建模和自监督表示学习(SSL)优化结构不同的目标:生成训练奖励分布保真度,而SSL奖励语义一致性。然而,最近的研究反复发现SSL特征改善了生成训练,尽管这种协同作用的机制仍不清楚。在这里,我们在一步生成的框架下研究SSL在生成建模中的优势,其中表示的作用是明确的:冻结的SSL特征用于将生成的样本与真实数据匹配。我们在该特征空间中使用Sinkhorn散度,为Wasserstein距离提供了一个可处理的代理,这是由Fréchet风格评估指标(如FID)近似的总体差异。我们发现,当在语义结构化的SSL特征空间中计算时,这个目标变得非常有效(ImageNet FID降低39倍)。我们将这种行为主要归因于匹配估计:抑制无关重建细节的语义SSL特征诱导出更紧凑的几何结构,使分布匹配更易处理。因此,最佳的训练SSL特征不一定与评估指标使用的特征匹配。特别是,我们表明使用Inception作为特征提取器可以改善FID,同时降低匹配稳定性和样本质量,揭示了一种形式的指标黑客攻击。通过在ImageNet上的大量实验,我们确定了哪些SSL特征族能带来最佳的生成性能,并表明匹配稳定性是选择它们的定量标准。代码可在https://github.com/Genentech/semantic-transport-generation获取。

英文摘要

Generative modeling and self-supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one-step generation where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population-level discrepancy approximated by Fréchet-style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39× reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to the best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at https://github.com/Genentech/semantic-transport-generation.
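For readers unfamiliar with the matching objective, the following is a minimal pure-Python sketch of a Sinkhorn divergence between two small point sets. It is written under several assumptions that the abstract does not fix: uniform sample weights, a squared Euclidean cost, a regularization strength `eps` chosen for illustration, and toy 2-D points standing in for frozen SSL features. It is not the authors' implementation.

```python
import math


def _logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def _sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def entropic_ot(xs, ys, eps=0.5, n_iters=200):
    """Entropy-regularized OT value between uniform point clouds (log-domain Sinkhorn)."""
    n, m = len(xs), len(ys)
    log_a, log_b = -math.log(n), -math.log(m)
    cost = [[_sq_dist(x, y) for y in ys] for x in xs]
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iters):
        # Alternate dual updates; each enforces one marginal constraint.
        f = [-eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(n)])
             for j in range(m)]
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(xs, ys, eps=0.5, n_iters=200):
    """Debiased divergence: OT(x, y) - 0.5 * OT(x, x) - 0.5 * OT(y, y)."""
    return (entropic_ot(xs, ys, eps, n_iters)
            - 0.5 * entropic_ot(xs, xs, eps, n_iters)
            - 0.5 * entropic_ot(ys, ys, eps, n_iters))


# Toy "features" for real and generated samples (hypothetical values).
real = [[0.0, 0.0], [1.0, 0.0]]
generated = [[3.0, 3.0], [4.0, 3.0]]

print(sinkhorn_divergence(real, generated))  # clearly positive for distant sets
print(sinkhorn_divergence(real, real))       # zero for identical sets
```

The two self-comparison terms remove the bias that entropic regularization introduces, so the divergence vanishes when the generated set coincides with the real one. In practice a batched tensor implementation would replace these Python loops.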

2606.00511 2026-06-02 cs.LG cs.CV 版本更新

Saliency-Aware Model Merging

显著性感知模型合并

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

发表机构 * Yonsei University, Seoul, South Korea(延世大学,首尔,韩国) Ewha Womans University, Seoul, South Korea(梨花女子大学,首尔,韩国)

AI总结 提出SA-Merging方法,利用结构剪枝中的连通性显著性(如SynFlow)进行数据无关模型合并,通过任务向量显著性评分和合并感知调制减少任务干扰,并在视觉和语言任务上验证有效性。

Comments ICML 2026 Camera-ready

详情
AI中文摘要

模型合并旨在将多个在不同数据集上微调的任务特定模型整合到一个统一架构中,以实现跨领域能力。当前的数据无关模型合并方法通常难以扩展,因为它们依赖于忽略层间依赖性和非均匀专业知识分布的简单参数级启发式方法。本文提出SA-Merging,它基于结构剪枝(如SynFlow)中的连通性显著性公式,并将其扩展到数据无关模型合并设置。我们相对于共享基础模型定义任务向量上的显著性分数,并进一步引入合并感知调制,该调制结合专家间的一致性以减轻任务干扰。基于此公式,迭代的显著性感知合并过程逐步移除非信息性更新,同时保留端到端连通性。此外,我们将SA-Merging扩展到为LoRA引入秩级显著性分解,而不损害其结构完整性。在视觉和语言任务上的大量实验证明了我们基于显著性方法的有效性,进一步缩小了数据无关方法和测试时自适应方法之间的差距。

英文摘要

Model merging aims to consolidate multiple task-specific models fine-tuned on different datasets into a unified architecture that achieves cross-domain proficiency. Current data-free model merging methods often struggle to scale as they rely on simple parameter-level heuristics that ignore inter-layer dependencies and the non-uniform distribution of expertise. This work proposes SA-Merging, which is built upon connectivity-based saliency formulations from structural pruning (e.g., SynFlow) and extends them to the data-free model merging setting. We define a saliency score over task vectors relative to a shared base model, and further introduce merge-aware modulation that incorporates agreement across experts to mitigate task interference. Based on this formulation, an iterative saliency-aware merging procedure progressively removes non-informative updates while preserving end-to-end connectivity. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRAs without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.
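
To make the notion of a task vector concrete, here is a minimal sketch of plain task arithmetic on flat parameter lists. It deliberately does not implement SA-Merging's saliency scoring or merge-aware modulation, which the abstract does not specify; the function names and the `scale` parameter are hypothetical.

```python
def task_vector(finetuned, base):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def merge_task_arithmetic(base, experts, scale=1.0):
    """Add the scaled sum of all task vectors back onto the shared base model.

    This is the simple baseline that saliency-aware methods refine; it treats
    every parameter update as equally important.
    """
    vectors = [task_vector(expert, base) for expert in experts]
    return [
        b + scale * sum(vec[i] for vec in vectors)
        for i, b in enumerate(base)
    ]


if __name__ == "__main__":
    base = [1.0, 1.0]
    experts = [[2.0, 1.0], [1.0, 3.0]]  # two experts, each changing one weight
    print(merge_task_arithmetic(base, experts))  # prints [2.0, 3.0]
```

In this toy case the two experts modify disjoint parameters, so summing is harmless. Task interference arises when experts update the same parameters in conflicting directions, which is the situation the saliency score and agreement-based modulation are meant to address.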

2606.00509 2026-06-02 cs.CV 版本更新

Structure-Aware Consistency Priors for Shape from Polarization in Complex Media

复杂介质中偏振形状恢复的结构感知一致性先验

Kaimin Yu, Puyun Wang, Huayang He, Xianyu Wu

发表机构 * The School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou, China(福州大学机械工程与自动化学院) Research Institute of Highway, Ministry of Transport, Beijing, China(交通运输部公路科学研究院)

AI总结 针对复杂介质(以冰为例)中偏振观测与表面法线间的非线性映射问题,提出基于自相关函数的结构感知偏振先验,并设计双分支网络IceSfP通过跨模态注意力和多尺度特征融合实现精确法线估计,在首个真实冰SfP数据集上达到16.01°的平均角度误差。

详情
Journal ref
ICML 2026
AI中文摘要

在复杂介质中从单视角偏振图像恢复表面法线仍然具有挑战性。本文以冰作为代表性复杂介质,其中复杂的光与物质相互作用导致偏振观测与表面法线之间存在非线性映射。为了解决这一问题,提出了一种基于自相关函数的结构感知偏振先验,以捕获AoLP的局部空间一致性。在此基础上,设计了一个双分支网络(IceSfP),通过跨模态注意力和多尺度特征融合将原始偏振特征与先验集成,从而在复杂介质条件下实现准确的表面法线估计。为了评估该方法,构建了首个真实世界的冰SfP数据集。实验结果表明,该方法在所有指标上均优于现有方法,平均角度误差(MAE)为16.01°,比第二好的方法低2.74°。该框架为复杂介质中的高精度几何感知提供了一种可推广的解决方案。

英文摘要

Recovering surface normals from single-view polarization images in complex media remains challenging. This paper focuses on ice as a representative complex medium, where intricate light-matter interactions lead to a nonlinear mapping between polarization observations and surface normals. To address this, a structure-aware polarization prior based on autocorrelation functions is proposed to capture the local spatial consistency of AoLP. Building on this, a dual-branch network (IceSfP) is designed to integrate raw polarization features with priors via cross-modal attention and multi-scale feature fusion, enabling accurate surface normal estimation under complex media conditions. To evaluate the method, the first real-world ice SfP dataset is constructed. Experimental results show that the method outperforms existing approaches across all metrics, achieving an MAE of 16.01°, which is 2.74° lower than the second-best method. The framework provides a generalizable solution for high-precision geometric perception in complex media.
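
The headline number is an angular error between predicted and ground-truth normals. A minimal sketch of that metric follows, assuming MAE here denotes the mean angular error in degrees (as the AI summary above suggests) and that normals are given as 3-vectors; the function names are hypothetical and not taken from the paper's code.

```python
import math


def angular_error_deg(n_pred, n_true):
    """Angle in degrees between two normal vectors (need not be unit length)."""
    dot = sum(p * t for p, t in zip(n_pred, n_true))
    norm = math.sqrt(sum(p * p for p in n_pred)) * math.sqrt(sum(t * t for t in n_true))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(preds, trues):
    """Average per-pixel angular error over paired lists of normals."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
    print(mean_angular_error_deg(predicted, ground_truth))
```

In the usage example one normal is exact and the other is perpendicular to the ground truth, so the mean sits halfway between 0° and 90°.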

2606.00508 2026-06-02 cs.CV cs.AI 版本更新

V-LynX: Token Interface Alignment for Video+X LLMs

V-LynX: 视频+X 大语言模型的令牌接口对齐

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

发表机构 * Yonsei University, Seoul, South Korea(延世大学,首尔,韩国) Ewha Womans University, Seoul, South Korea(梨花女子大学,首尔,韩国)

AI总结 本文发现视频大语言模型中存在令牌接口连续流形,并提出V-LynX框架,通过轻量辅助路径对齐注意力响应和统计分布,无需配对监督即可集成新模态,在音视频问答、3D推理等任务上达到最优效率。

Comments ICML 2026 Camera-ready

详情
AI中文摘要

本研究揭示了视频大语言模型中的一个有趣现象:视频大语言模型不仅仅是简单地将帧转换为文本嵌入,而是建立了一个连续流形——令牌接口,使得视觉令牌能够在架构内作为独立实体运行。利用这一发现,我们提出了V-LynX,这是一个可扩展的框架,通过重新利用内部化接口,将新模态集成到视频大语言模型中。与需要大量模态特定编码器或配对监督的传统范式不同,V-LynX采用轻量辅助路径与冻结的视觉编码器并行运行。我们的方法通过使用非配对单模态数据集对齐注意力响应和统计分布,将新的感官输入与内在视频先验相结合。这确保了流形兼容性,同时保持了视频大语言模型的完整性。大量基准测试表明,V-LynX在音视频问答、3D推理、高帧率和多视角视频理解方面达到了最先进水平和高效性。代码可在https://github.com/park-jungin/lynx获取。

英文摘要

This study reveals an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, the token interface, allowing visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal datasets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V-LynX achieves state-of-the-art performance and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://github.com/park-jungin/lynx.

2606.00499 2026-06-02 cs.CV 版本更新

OptiWorld: Optimal Control for Video World Generation under Physical Constraints

OptiWorld: 物理约束下的视频世界生成最优控制

Yu Yuan, Jianhao Yuan, Xijun Wang, Daiqing Li, Liu He, Lu Ling, Stanley H. Chan

发表机构 * Purdue University(普渡大学) University of Oxford(牛津大学) SixteenMiles Labs(SixteenMiles 实验室)

AI总结 提出OptiWorld框架,在推理时结合经典最优控制与视频生成,通过提取紧凑世界状态、规划最优轨迹并生成条件视频,实现符合物理约束的动态优化。

Comments Project Page: https://yuyuanspace.com/OptiWorld/

详情
AI中文摘要

视频生成模型正成为一种可扩展的世界模型形式,但它们主要生成合理的运动,而非主动控制或优化底层动态。因此,生成视频中的物体可能遵循不安全、不光滑、低效或物理不一致的轨迹。在这项工作中,我们提出了OptiWorld,一个在推理时将经典最优控制引入视频生成的框架。OptiWorld首先提取紧凑的、与任务相关的世界状态,然后在物理约束下规划最优轨迹,最后基于该轨迹渲染视频。我们将规划表述为连续流形上的几何问题,将3D几何和任务相关的物理约束转化为统一的规划几何。通过添加这一最优控制层,OptiWorld生成具有更优动态的视频,在多个任务中展现出强大潜力,包括目标条件的图像到视频生成、视频动态编辑和反事实生成。

英文摘要

Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively controlling or optimizing the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose OptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.

2606.00477 2026-06-02 cs.CL cs.CV 版本更新

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

文本编辑能否泛化到视觉生成?统一多模态模型中的跨模态知识编辑基准

Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Toronto(多伦多大学) University of Washington(华盛顿大学)

AI总结 提出跨模态知识编辑基准UniKE,发现文本编辑在图像生成中效果显著下降(VQA准确率仅18.5%),并提出推理增强参数编辑方法提升跨模态迁移效果。

Comments Published at ICML 2026; Code and data available at https://github.com/gxx27/UniKE

详情
AI中文摘要

统一多模态模型(UMMs)已成为通用多模态智能的有前途的范式。随着它们在现实世界应用中的部署,有效更新内部知识变得至关重要。虽然知识编辑在纯文本模型中已经成熟,但成功修改文本输出的编辑是否也能迁移到UMMs中的图像生成仍不清楚。为了研究这个问题,我们引入了UniKE,这是第一个用于UMMs中跨模态知识编辑的基准,包含2,971个编辑主题,涵盖属性和关系编辑。使用基于VQA的视觉验证,我们揭示了一个显著的模态差距:文本侧的有效性可以达到约92%,而直接图像生成下的最佳整体VQA准确率仅为18.5%。我们进一步提出了推理增强参数编辑,它在生成前显式激活编辑后的知识,并提高了所有评估模型-编辑器对的整体VQA准确率,提升高达18.6个百分点。机制分析表明,这种差距与编辑后的文本表示与视觉生成的条件路径之间的部分对齐有关,其中足以用于文本输出的编辑可能仍然太弱或未对齐,无法引导图像合成。这些发现表明,文本知识编辑不能保证可靠的跨模态迁移,并激励了模态感知的编辑方法。我们的代码和数据可在https://github.com/gxx27/UniKE获取。

英文摘要

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.

2606.00472 2026-06-02 cs.CV cs.AI cs.HC cs.LG 版本更新

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

CodeCytos: 通过代码增强的智能体动作空间实现AI辅助空间分子成像分析

Hung Q. Vo, Huy Q. Vo, Son T. Ly, Zhihao Wan, Anh-Vu Nguyen, Hong Zhao, Jianting Sheng, Stephen T. C. Wong, Hien V. Nguyen

发表机构 * University of Houston, Department of Electrical and Computer Engineering(休斯顿大学电气与计算机工程系) Houston Methodist Hospital, Department of Systems Medicine and Biomedical Engineering(休斯顿卫理公会医院系统医学与生物医学工程系)

AI总结 提出CodeCytos框架,通过代码驱动的推理智能体实现空间分子成像数据的动态可编程分析,提升自动化与定制化能力,并在多种组织类型数据集上验证其优于基线方法。

详情
AI中文摘要

传统的组织图像分析软件为细胞分析提供了基础功能,包括分割、基本形态特征提取和空间组织分析。然而,这些工具通常需要手动干预,且与代码驱动的自动化集成不佳,限制了复杂空间组织研究的效率和可扩展性。此外,它们对自定义分析的灵活性有限,通常只支持一组固定的预实现空间细胞特征。为了解决这些限制,我们提出了CodeCytos,一个基于编码的推理智能体框架,能够实现与空间分子成像数据的动态、可编程交互,以提高自动化和定制化。CodeCytos旨在简化自定义空间细胞特征的探索,并适应多样化的研究需求。我们通过四个来自不同组织类型(额叶皮层、非小细胞肺癌、胰腺和扁桃体)的专家精选数据集案例研究展示了其实用性。我们在现实的最小提示设置下评估CodeCytos,其中生物科学家提出简单问题,没有任务特定指令或关于空间细胞分析的上下文信息,并基准测试了多个具有强大编码能力的LLM骨干。我们进一步表明,结合定制的、领域无关的少样本上下文编码推理示例(空间分析领域外随机采样的演示)可以显著提高性能,而无需昂贵的、专家制作的领域内演示。总体而言,CodeCytos优于基线方法,突显了代码动作智能体在空间分子成像中辅助自定义特征探索和加速生物标志物发现的潜力。

英文摘要

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.

2606.00471 2026-06-02 cs.CV 版本更新

MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory Forecasting

MUSCLE-NET:面向行人轨迹预测的预测多尺度感知网络

Yu Liu, Ming Huang, Xiao Ren, Zhijie Liu, Youfu Li, He Kong

发表机构 * Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, School of Automation and Intelligent Manufacturing, Southern University of Science and Technology (SUSTech), Shenzhen(广东省全驱系统控制理论与技术重点实验室,自动化与智能制造学院,南方科技大学(SUSTech),深圳) Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China(香港城市大学机械工程系,香港特别行政区,中国)

AI总结 提出MUSCLE-NET,通过多尺度多模态特征提取和尺度自适应预测机制,解决现有方法对观测信息利用不足及忽视未来运动尺度依赖的问题,在JAAD和PIE数据集上取得竞争性能。

Comments This manuscript has been accepted to the IEEE Transactions on Intelligent Transportation Systems as a regular paper

详情
AI中文摘要

准确的行人轨迹预测对于自动驾驶和智能交通系统中的安全导航至关重要。尽管近期方法取得了显著进展,但大多数现有方法在充分利用多样化观测方面存在局限,且往往忽视未来运动的尺度依赖性,无论底层运动动态如何,都统一处理多尺度特征。这限制了它们在多样化行人行为中的鲁棒性。为解决这些挑战,我们提出了一种用于行人轨迹预测的预测多尺度感知网络(MUSCLE-NET),该网络将互补的多模态线索与尺度自适应预测机制相结合。所提出的框架基于多尺度多模态特征提取(MMFE)模块,该模块结合了多尺度表示、模态感知重校准和方向性跨模态融合,从边界框、速度和姿态信息中构建语义对齐的表示。基于这些特征,多尺度增强层次预测(MEHP)模块通过概率粗预测器、尺度对齐融合和渐进细化,执行预测感知的未来运动细化,自适应地选择尺度相关线索以减轻空间漂移。在JAAD和PIE基准上的大量实验表明,所提出的MUSCLE-NET与最先进的轨迹预测方法相比,取得了竞争性能并持续改进。

英文摘要

Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for Pedestrian Trajectory Forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-NET achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.

2606.00461 2026-06-02 cs.CV eess.SP 版本更新

An explainable hierarchical self attention-based approach for tremor detection in the time domain

一种可解释的基于层次自注意力的时域震颤检测方法

Timothy Odonga, Jeanne M. Powell, Mark Saad, Richa Tripathi, Christine D. Esper, Stewart A. Factor, Hyeokhyen Kwon, J. Lucas Mckay

发表机构 * Department of Biomedical Informatics, School of Medicine, Emory University(埃默里大学生物医学信息学系) Jean and Paul Amos Parkinson’s Disease and Movement Disorders Program, Department of Neurology, School of Medicine, Emory University(埃默里大学帕金森病和运动障碍计划,神经学系) Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology(佐治亚理工学院华莱士·H·库尔特生物医学工程系)

AI总结 提出一种可解释的两阶段层次框架,直接从3D运动学时间序列数据学习震颤模式,实现时域震颤检测,并利用注意力权重和Grad-CAM提供后验可解释性。

Comments Submitted to PLOS Digital Health

详情
AI中文摘要

震颤是一种常见的运动障碍,与帕金森病和特发性震颤等疾病相关,传统上通过临床专家评估诊断。当前的自动检测方法依赖于基于临床知识的频域特征。在这项工作中,我们提出了一种可解释的两阶段层次框架,用于时域震颤检测,该框架直接从整个震颤诱发试验的3D运动学标记时间序列数据中学习震颤模式。我们的框架结合了深度卷积和长短期记忆网络,从试验中短时间、离散、非重叠的运动学时间序列数据段学习震颤表示,然后由视觉变换器处理,该变换器对时间段特征的长期时间动态进行建模,以实现试验(会话)级别的分类。在九个身体部位上评估,该框架的F1分数根据身体部位在0.594-0.947之间(平均0.765),虽然低于频域最先进性能(0.909),但所需预处理最少。注意力权重和基于梯度的类激活图(Grad-CAM)识别了不同身体部位的时域震颤特征。这一概念验证证明了数据驱动的时域建模在解剖学上不同身体部位震颤检测中的可行性,同时减少了对专家设计的频谱特征的依赖,并提供了震颤时间和解剖模式的后验可解释性。

英文摘要

Tremor is a common movement disorder associated with conditions like Parkinson's disease and essential tremor, traditionally diagnosed through expert clinician assessment. Current automated detection methods rely on frequency-domain features informed by clinical expertise. In this work, we present an explainable, two-stage hierarchical framework for tremor detection in the time domain that learns tremor patterns directly from 3D kinematic marker time-series data across entire tremor-provoking trials. Our framework combines a deep convolutional and long short-term memory network, which learns tremor representations from short, discrete, non-overlapping time segments of each trial's kinematic time-series data, with a vision transformer that models the long-term temporal dynamics of the segment features for trial (session) level classification. Evaluated across nine body parts, the framework achieved F1-scores of 0.594-0.947 depending on the body part (average: 0.765), falling short of the frequency-domain state-of-the-art performance (0.909) while requiring minimal preprocessing. Attention weights and gradient-based class activation maps (Grad-CAM) identified time-domain features of tremor across body parts. This proof of concept demonstrated the feasibility of data-driven time-domain modeling for tremor detection across anatomically diverse body parts, while reducing reliance on expert-engineered spectral features and providing post hoc interpretability of temporal and anatomical patterns of tremor.

2606.00452 2026-06-02 cs.CV cs.GR 版本更新

Beyond Static Gaussians: An Empirical Investigation of Architectural Paradigms for Dynamic 3D Scene Reconstruction

超越静态高斯:动态3D场景重建架构范式的实证研究

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文通过实证比较结构引导与高斯中心两种动态3D高斯溅射范式,揭示重建质量/紧凑性与渲染速度之间的根本权衡。

Comments Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)

详情
Journal ref
Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025, p. 99
AI中文摘要

通过3D高斯溅射(3DGS)进行动态场景重建已成为表示演化环境的一种引人注目的方法,但理解不同方法之间的权衡仍然至关重要。本文对动态3DGS方法进行了全面分析,将其分为两种范式:结构引导方法,利用辅助表示(变形场、规范空间、网格)来建模时间变化;以及高斯中心方法,通过连续函数或4D表示将动态直接编码到基元中。我们在D-NeRF基准上评估了两种范式的代表性方法。我们的发现表明,结构引导方法实现了优越的重建保真度和紧凑的模型大小,而高斯中心方法则表现出显著更高的渲染速度,能够实现实时性能,但质量变异性更大且可能产生大量存储开销。该分析突出了重建质量/紧凑性与渲染速度之间的根本权衡,为动态场景重建的未来研究和应用开发提供了见解。

英文摘要

Dynamic scene reconstruction via 3D Gaussian Splatting (3DGS) has emerged as a compelling approach for representing evolving environments, yet understanding trade-offs between methodologies remains crucial. This paper presents a comprehensive analysis of dynamic 3DGS methods, categorizing them into two paradigms: structure-guided methods employing auxiliary representations (deformation fields, canonical spaces, grids) to model temporal changes, and Gaussian-centric methods encoding dynamics directly into primitives via continuous functions or 4D representations. We evaluate representative methods from both paradigms on the D-NeRF benchmark. Our findings reveal that structure-guided methods achieve superior reconstruction fidelity and compact model sizes, while Gaussian-centric approaches demonstrate significantly higher rendering speeds enabling real-time performance, though with greater quality variability and potentially substantial storage overhead. This analysis highlights a fundamental trade-off between reconstruction quality/compactness and rendering speed, providing insights to guide future research and application development in dynamic scene reconstruction.

2606.00450 2026-06-02 cs.CV cs.GR 版本更新

Optimizing 3D Gaussian Splatting via Point Cloud Upsampling

通过点云上采样优化3D高斯泼溅

Adrian Ramlal, Yan Song Hu, John S. Zelek

发表机构 * Vision and Image Processing Group, Systems Design Engineering, University of Waterloo(滑铁卢大学视觉与图像处理组,系统设计工程)

AI总结 提出多种点云上采样方法及深度引导点提升技术,改善3D高斯泼溅的初始化质量,实验表明不同场景适用不同策略。

Comments Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)

详情
Journal ref
Journal of Computational Vision and Imaging Systems, Vol. 10, No. 1, p. 47, 2024
AI中文摘要

3D高斯泼溅(3DGS)是一种用于创建和渲染3D场景的技术,但其性能严重依赖于初始种子点的质量。为了改进3DGS初始化,本研究提出并评估了几种点云上采样方法:线性插值、三角插值、基于样条的曲面重建、移动最小二乘曲面拟合和基于Voronoi的点生成。此外,本研究引入了一种深度引导的点提升方法,利用深度图保持与运动恢复结构(SfM)重建的几何一致性。通过在Mip-NeRF360和Replica数据集上的大量实验,所提出的方法在多种场景类型中展示了重建质量的提升。结果表明,不同的上采样策略在不同场景中表现优异:曲面重建方法在处理有机、细节丰富的场景时表现更好,而简单的插值方法更适合以分段平滑几何为主的场景。相比之下,深度引导方法在添加整个场景中的几何感知点方面显示出潜力,尤其是在纹理缺失区域。这些发现为根据场景特征和计算约束选择合适的上采样方法提供了初步实用指南,增进了对点云初始化如何影响3DGS质量的理解。

英文摘要

3D Gaussian Splatting (3DGS) is a technique for creating and rendering 3D scenes, but its performance depends heavily on the quality of initial seed points. To improve 3DGS initialization, this study presents and evaluates several point cloud upsampling approaches: linear interpolation, triangular interpolation, spline-based surface reconstruction, moving least squares surface fitting, and Voronoi-based point generation. Additionally, this research introduces a depth-guided point lifting method that leverages depth maps to maintain geometric consistency with Structure-from-Motion (SfM) reconstructions. Through extensive experiments on the Mip-NeRF360 and Replica datasets, the proposed methods demonstrate improvements in reconstruction quality across diverse scene types. Results indicate that different upsampling strategies excel in different scenarios: surface reconstruction methods perform better with organic, detailed scenes, while simpler interpolation approaches are more suited for scenes dominated by piecewise-smooth geometries. In comparison, the depth-guided approach shows promise for adding geometry-aware points across the entire scene, particularly in texture-less regions. These findings, which provide preliminary practical guidelines for selecting appropriate upsampling methods based on scene characteristics and computational constraints, advance the understanding of how point cloud initialization affects 3DGS quality.

2606.00447 2026-06-02 cs.CV cs.AI 版本更新

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

GeoSAM-3D: 用于从单目视频进行开放词汇3D场景分割的测地线提示传播

Arun Sharma

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学,双城分校)

AI总结 提出GeoSAM-3D方法,利用冻结的视觉基础模型和单目3D高斯泼溅重建,通过可微分的图-测地线传播核在场景图上传播用户提示,实现从单目视频的开放词汇3D场景分割。

详情
AI中文摘要

开放词汇的3D场景分割通常假设有RGB-D视频、校准的多视角图像或重建的网格。GeoSAM-3D研究了一种更轻的设置:用户上传一段短的单目视频,在一帧中点击或命名一个物体,并在高斯场景上接收传播的3D掩码。该实现结合了冻结的图像和视频基础模型、单目3D高斯泼溅重建以及在高斯质心上可微分的图-测地线传播核。核心设计选择是通过重建场景图上的热核距离传播提示,而不是通过3D中的欧几里得最近邻。这保持了曲面周围的连续性,并减少了附近但不相连物体之间的泄漏。本文描述了仓库状态、在geosam3d.propagate中实现的数学核、从Segment Anything掩码训练的特征头以及代码库中已有的验证。评估协议将实现验证、图传播质量、泄漏控制和交互延迟分开。

英文摘要

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in geosam3d.propagate, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.
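
The contrast between graph-based and Euclidean propagation can be shown on a toy graph. The sketch below approximates heat diffusion from a seed node with explicit Euler steps on the graph Laplacian. It is an illustration only, not the kernel in geosam3d.propagate: the adjacency matrix, step count, and function name are assumptions.

```python
def heat_diffuse(adjacency, seed, t=1.0, n_steps=100):
    """Approximate exp(-t * L) applied to a one-hot seed, with L = D - A.

    Uses explicit Euler integration, which is stable here as long as
    dt = t / n_steps is small relative to 1 / (maximum node degree).
    """
    n = len(adjacency)
    u = [0.0] * n
    u[seed] = 1.0
    dt = t / n_steps
    for _ in range(n_steps):
        # (L u)_i = sum_j A_ij * (u_i - u_j)
        lap = [
            sum(adjacency[i][j] * (u[i] - u[j]) for j in range(n))
            for i in range(n)
        ]
        u = [u[i] - dt * lap[i] for i in range(n)]
    return u


if __name__ == "__main__":
    # Nodes 0-1-2 form one object; nodes 3-4 form a separate object that
    # could be spatially close but shares no graph edge with the first.
    adjacency = [
        [0, 1, 0, 0, 0],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    print(heat_diffuse(adjacency, seed=0))
```

Heat injected at node 0 reaches nodes 1 and 2 but never nodes 3 and 4, because no edge connects the two components. A Euclidean nearest-neighbor rule has no such barrier, which is the leakage the abstract describes.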

2606.00445 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

DarkVesselNet: 用于暗船检测的多模态遥感和轨迹推理

Arun Sharma

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学,双城分校)

AI总结 提出DarkVesselNet,融合Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型、AIS轨迹推理、TGARD间隙检测和Pi-DPM异常头,实现多模态遥感暗船检测。

详情
AI中文摘要

暗船检测需要融合船只通过AIS报告的信息与卫星通过雷达和光学传感器观测到的信息。DarkVesselNet是一个多模态遥感堆栈,结合了Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型骨干、AIS轨迹推理、TGARD风格的间隙检测以及受Pi-DPM启发的异常头。该仓库将系统呈现为经过测试的Python包和公开的Hugging Face Space。本文介绍了传感器堆栈、骨干抽象、融合路径、异常头和当前的验证。目前可用的证据是基于软件的:针对SAR散斑滤波、光学波段比、Haversine距离、TGARD间隙发射、传感器配准、骨干token形状和可微分异常评分的测试。

英文摘要

Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkVesselNet is a multi-modal remote sensing stack that combines Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation model backbones, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The repository exposes the system as a tested Python package and a public Hugging Face Space. The paper presents the sensor stack, backbone abstraction, fusion path, anomaly head, and current validation. The evidence currently available is software-grounded: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap emission, sensor coregistration, backbone token shapes, and differentiable anomaly scoring.
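
One of the tested components named above is Haversine distance, the great-circle distance between two latitude/longitude points, which is useful for comparing AIS-reported positions with satellite detections. A minimal sketch follows; the function name and the mean Earth radius constant are assumptions, not taken from the DarkVesselNet package.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed spherical model)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


if __name__ == "__main__":
    # Distance between an AIS-reported position and a detection
    # (made-up coordinates for illustration).
    print(haversine_km(1.25, 103.80, 1.30, 103.95))
```

The spherical model ignores Earth's flattening, which is usually acceptable for associating nearby detections but not for precise geodesy.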

2606.00444 2026-06-02 cs.CV cs.GR 版本更新

Real-Time Physics Simulation with Dynamic Mesh-Gaussian Reconstructions

基于动态网格-高斯重建的实时物理仿真

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对动态重建与物理仿真拓扑不兼容的问题,提出固定拓扑网格与高斯泼溅的双表示框架,实现实时物理仿真,并揭示高质量重建与物理兼容拓扑存在本质冲突。

详情
Journal ref
Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025
AI中文摘要

将动态3D重建集成到物理仿真中需要固定的网格拓扑以实现高效的碰撞检测,但像DG-Mesh这样的先进方法会生成针对几何质量优化的可变拓扑。我们研究了拓扑转换是否能在保持重建保真度的同时实现物理集成。我们提出了一种双表示框架,将用于物理的固定拓扑网格与用于渲染的高斯泼溅相结合,通过运行时顶点缓冲区更新实现了比可变拓扑基线快4.65倍的加速。我们在DG-Mesh数据集上评估了两种转换策略(时间对应跟踪和基于模板的投影)与原生固定拓扑方法(MaGS)的性能。我们的评估表明,两种转换方法都会导致65-80%的几何退化,尽管DG-Mesh具有优越的初始质量,但产生的结果不如MaGS。这表明高质量重建和物理兼容拓扑代表了根本不同的目标,无法通过后处理来调和。我们的发现为未来物理感知重建方法的发展提供了信息,并且我们的框架能够与任何固定拓扑方法实现实时仿真。

英文摘要

Integrating dynamic 3D reconstructions into physics simulation requires fixed mesh topology for efficient collision detection, but state-of-the-art methods like DG-Mesh produce varying topology optimized for geometric quality. We investigate whether topology conversion can enable physics integration while preserving reconstruction fidelity. We propose a dual-representation framework combining fixed-topology meshes for physics with Gaussian splatting for rendering, achieving 4.65× speedup over varying-topology baselines through runtime vertex buffer updates. We evaluate two conversion strategies, temporal correspondence tracking and template-based projection, against native fixed-topology methods (MaGS) on the DG-Mesh dataset. Our evaluation reveals that both conversion approaches incur 65-80% geometric degradation, producing results inferior to MaGS despite DG-Mesh's superior initial quality. This demonstrates that high-quality reconstruction and physics-compatible topology represent fundamentally distinct objectives that cannot be reconciled through post-processing. Our findings inform future development of physics-aware reconstruction methods and our framework enables real-time simulation with any fixed-topology approach.

2606.00439 2026-06-02 cs.CV 版本更新

Physical Object Understanding with a Physically Controllable World Model

基于物理可控世界模型的物理对象理解

Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins

发表机构 * Stanford University(斯坦福大学) OpenAI(开放人工智能公司) Noetik Inc.(Noetik公司) Google(谷歌)

AI总结 提出一类概率世界模型,通过自回归序列建模高效训练,从视频中推断对象及其物理交互,实现对象发现、3D操控和物理关系计算。

Comments CVPR 2026 Highlight. Project page at: https://neuroailab.github.io/psi-website/blog.html

详情
AI中文摘要

视觉智能的一个核心挑战是从原始视频中学习场景的物理结构:区域如何形成对象以及支配它们交互的规律。解决这些任务需要能够从部分观测中推断世界分布状态的世界模型——当前架构无法提供这种能力。我们引入了一类新的概率世界模型,支持估计任何视觉变量(如外观和动态)在给定其他变量条件下的概率。在这里,我们发现这些模型可以通过自回归序列建模高效训练,从而产生能够涌现丰富对象理解的世界模型。首先,我们展示了我们的模型通过顺序推理生成多个合理的未来世界状态,捕捉了支配对象如何运动的物理规律。然后,通过分析这些未来状态中的运动相关性,我们提取出对象及其关节子部分。在发现这些对象后,我们展示了我们的世界模型可以在3D中操控它们。最后,我们演示了如何从世界模型计算对象之间的物理关系,从而实现了诸如视觉叠叠乐等应用。

英文摘要

A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations - capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

2606.00416 2026-06-02 cs.CV 版本更新

4D Radar Meets LiDAR and Camera: Cooperative Perception under Adverse Weather

4D雷达与激光雷达和相机的结合:恶劣天气下的协同感知

Melih Yazgan, Iramm Hamdard, Qiyuan Wu, J. Marius Zoellner

发表机构 * FZI Research Center for Information Technology(FZI信息技术研究中心) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 针对恶劣天气下相机和激光雷达性能下降的问题,提出集成4D成像雷达作为鲁棒模态,并引入多普勒引导的空间注意力机制进行多智能体融合,显著提升雾雨环境下的协同感知鲁棒性。

Comments Accepted by CVPR - DriveX Workshop

详情
AI中文摘要

协同感知对于自动驾驶至关重要,但在恶劣天气下,当相机和激光雷达性能下降时,其可靠性会受到影响。我们通过将4D成像雷达作为一种对天气鲁棒的模态集成到协同感知中,并引入多普勒引导的空间注意力机制用于多智能体融合,来解决这一挑战。我们的方法扩展了两种代表性骨干网络:一种是雷达-相机流水线,其中雷达替代激光雷达;另一种是激光雷达-雷达流水线,其中雷达补充激光雷达。为了支持评估,我们发布了雷达增强的基准数据集OPV2V-R和Adver-City-R,并加入了基于物理的激光雷达退化模拟。实验表明,在雾和雨条件下,该方法获得了显著的鲁棒性提升,特别是在雷达替代退化激光雷达时改进明显。在MAN TruckScenes上的额外验证证明了该方法在仿真之外的迁移能力。总体而言,我们的结果突出了4D成像雷达作为一种适用于全天候协同感知的鲁棒模态。数据集和代码可在以下网址获取:https://url.fzi.de/SlimComm。

英文摘要

Cooperative perception is important for autonomous driving but remains fragile when cameras and LiDAR degrade in adverse weather. We address this challenge by integrating 4D imaging radar as a weather-robust modality into collaborative perception and introducing a Doppler-guided spatial attention mechanism for multi-agent fusion. Our approach extends two representative backbones: a radar-camera pipeline where radar substitutes LiDAR, and a LiDAR-radar pipeline where radar complements LiDAR. To support evaluation, we release radar-augmented benchmarks, OPV2V-R and Adver-City-R, with physics-based LiDAR degradation. Experiments show strong robustness gains in fog and rain, including substantial improvements when radar replaces degraded LiDAR. Additional validation on MAN TruckScenes demonstrates transfer beyond simulation. Overall, our results highlight 4D imaging radar as a robust modality for all-weather collaborative perception. Dataset and code are available at: https://url.fzi.de/SlimComm.

2606.00404 2026-06-02 cs.CV cs.LG 版本更新

Rethinking Amortized Neural Representations for High-Resolution Terrain Elevation Data

重新思考高分辨率地形高程数据的摊销神经表示

Haoan Feng, Xin Xu, Leila De Floriani

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 针对地形高程数据,提出HUVR+SIREN超网络方法,通过替换坐标解码器为平滑可微版本,在统一基准上实现最佳高度和导数保真度,且支持后训练量化压缩。

Comments 12 pages, 7 figures, 10 tables

详情
AI中文摘要

隐式神经表示(INR)将信号建模为连续的坐标到值函数。对于地形高程数据,这支持解析导数、任意分辨率解码以及底层高度场的平滑表面模型。然而,为每个瓦片拟合和存储单独的INR无法扩展到大型地形数据集。摊销神经表示通过共享网络降低了这一成本:新瓦片被映射到紧凑的每瓦片载荷,共享解码器从中重建高度场。大多数此类方法是超网络,通过单次前向传递预测载荷,而其他方法则通过短时的每瓦片优化恢复载荷。这些方法主要针对自然图像开发,其在地形高度场上的适用性尚不清楚。我们在1米/像素的地形数据集上引入了受控基准,并在统一协议下评估了三种代表性方法。观察到明显的跨领域差距后,我们提出了HUVR+SIREN,这是一种超网络,它通过将坐标解码器替换为平滑、解析可微的解码器来适应最强的基准方法(HUVR)。它在基准上实现了最佳的高度和导数保真度,无需额外的每瓦片存储且解码成本更低,并且能够容忍激进的后训练量化而质量损失可忽略,从而形成了紧凑的地形神经格式。消融和诊断进一步确定了哪些设计选择可迁移到地形,并表明每瓦片瓶颈已接近其有用极限,剩下的差距在于共享超网络的架构设计。

英文摘要

Implicit neural representations (INRs) model a signal as a continuous coordinate-to-value function. For terrain elevation data, this supports analytic derivatives, arbitrary-resolution decoding, and a smooth surface model of the underlying heightfield. However, fitting and storing a separate INR for every tile does not scale to large terrain datasets. Amortized neural representations reduce this cost with a shared network: a new tile is mapped to a compact per-tile payload, and a shared decoder reconstructs the heightfield from it. Most such methods are hypernetworks that predict the payload in a single forward pass, while others recover it through a short per-tile optimization. These methods were developed primarily for natural images, and their suitability for terrain heightfields remains unclear. We introduce a controlled benchmark on a 1 m/pixel terrain dataset and evaluate three representative methods under a unified protocol. Observing a clear cross-domain gap, we propose HUVR+SIREN, a hypernetwork that adapts the strongest benchmarked method (HUVR) by replacing its coordinate decoder with a smooth, analytically differentiable one. It attains the best height and derivative fidelity on the benchmark with no additional per-tile storage and lower decode cost, and tolerates aggressive post-training quantization with negligible quality loss, giving a compact terrain neural format. Ablations and diagnostics further identify which design choices transfer to terrain and show that the per-tile bottleneck is already near its useful limit, leaving the remaining gap in the shared hypernetwork's architectural design.

2606.00393 2026-06-02 eess.IV cs.CV 版本更新

AutoIQ: An Ensemble Framework for Automatic Assessment of Geometric Distortion in Prostate Diffusion-Weighted Imaging

AutoIQ:前列腺扩散加权成像中几何畸变自动评估的集成框架

Haoran Sun, Lixia Wang, Yin-Chen Hsu, Hsu-Lei Lee, Chang Gao, Fei Han, Robert Grimm, Vibhas Deshpande, Ziyang Long, Hsin-Jung Yang, Rola Saouaf, Alessandro D'Agnolo, Timothy Daskivich, Hyung Kim, Debiao Li, Yibin Xie

发表机构 * Biomedical Imaging Research Institute, Cedars-Sinai Medical Center(生物医学成像研究所,Cedars-Sinai 医疗中心) Department of Bioengineering, University of California(生物工程系,加州大学) Siemens Medical Solutions USA Inc.(西门子医疗解决方案美国公司) Siemens Healthineers AG(西门子医疗股份公司) Department of Imaging, Cedars-Sinai Medical Center(成像部,Cedars-Sinai 医疗中心) Department of Nuclear Medicine, Cedars-Sinai Medical Center(核医学部,Cedars-Sinai 医疗中心) Department of Urology, Cedars-Sinai Medical Center(泌尿科,Cedars-Sinai 医疗中心)

AI总结 提出AutoIQ集成机器学习框架,结合分割和配准方法量化DWI几何畸变,用于自动分类畸变严重程度,在独立测试集上达到0.95准确率。

Comments Original research; 11 pages, 7 figures, 1 table

详情
AI中文摘要

前列腺扩散加权成像(DWI)中的几何畸变会损害病灶定位并降低基于MRI的临床评估的可靠性。我们提出了AutoIQ,一个用于自动量化和分类DWI几何畸变严重程度的集成机器学习框架。共分析了140例回顾性前列腺双参数MRI检查,包括33次严重畸变需要重复采集的扫描和107次基于放射科专家评估可接受的畸变扫描。AutoIQ结合了两种互补的畸变量化策略:一种基于分割的方法,测量T2加权成像(T2WI)和DWI之间的前列腺边界不匹配;另一种基于配准的方法,估计DWI到T2WI对齐后的变形幅度。由此产生的畸变分数用于训练单个分类器和逻辑回归集成模型。两种计算方法均显著区分了严重和可接受的畸变病例(p < 0.001)。在独立测试集上,集成模型达到了0.95的准确率、0.93的F1分数和0.98的AUC,优于单个模型。这些结果表明,AutoIQ可以为前列腺DWI提供自动化的定量质量评估,并可能有助于识别需要重复采集的扫描。

英文摘要

Geometric distortion in prostate diffusion-weighted imaging (DWI) can impair lesion localization and reduce the reliability of MRI-based clinical assessment. We propose AutoIQ, an ensemble machine learning framework for automatic quantification and classification of DWI geometric distortion severity. A total of 140 retrospective prostate biparametric MRI examinations were analyzed, including 33 scans with severe distortion requiring repeat acquisition and 107 scans with acceptable distortion based on expert radiologist assessment. AutoIQ combines two complementary distortion quantification strategies: a segmentation-based method measuring prostate boundary mismatch between T2-weighted imaging (T2WI) and DWI, and a registration-based method estimating deformation magnitude after DWI-to-T2WI alignment. The resulting distortion scores were used to train individual classifiers and a logistic-regression ensemble model. Both computational methods significantly differentiated severe from acceptable distortion cases (p < 0.001). On an independent test set, the ensemble model achieved an accuracy of 0.95, F1-score of 0.93, and AUC of 0.98, outperforming individual models. These results suggest that AutoIQ can provide automated, quantitative quality assessment for prostate DWI and may help identify scans that require repeat acquisition.
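The segmentation-based score described above can be illustrated with a minimal sketch. It assumes the mismatch is measured as one minus the Dice overlap of two binary prostate masks; the abstract does not name the exact metric, and the function name and the toy masks are hypothetical.

```python
def dice_mismatch(mask_a, mask_b):
    """Return 1 - Dice overlap for two equal-shaped binary masks (nested lists of 0/1).

    Assumption: Dice is used only for illustration; the paper's metric may differ.
    """
    flat_a = [v for row in mask_a for v in row]
    flat_b = [v for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same shape")
    # Pixels labelled as prostate in both masks.
    intersection = sum(1 for a, b in zip(flat_a, flat_b) if a and b)
    total = sum(1 for a in flat_a if a) + sum(1 for b in flat_b if b)
    if total == 0:
        return 0.0  # both masks empty: treat as no mismatch
    return 1.0 - 2.0 * intersection / total


# Toy masks: the DWI mask is shifted one pixel to the right of the T2WI mask.
t2wi_mask = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
dwi_mask = [[0, 0, 1, 1],
            [0, 0, 1, 1]]
score = dice_mismatch(t2wi_mask, dwi_mask)
```

A score of 0 means the two outlines coincide and larger values mean more mismatch. In a pipeline like the one described, such a score would be one input feature to the downstream classifier.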

2606.00390 2026-06-02 cs.CV cs.AI 版本更新

Zamba2-VL Technical Report

Zamba2-VL 技术报告

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 提出基于混合架构Zamba2的视觉语言模型Zamba2-VL,在图像理解等基准上媲美Transformer模型,且首次令牌延迟降低约一个数量级。

Comments 16 pages, 2 figures

详情
AI中文摘要

我们提出Zamba2-VL,这是一套基于Zamba2构建的视觉语言模型,Zamba2是一种混合语言模型架构,结合了Mamba2状态空间层和少量共享的Transformer块。在广泛的图像理解、推理、OCR、定位和计数基准测试中,Zamba2-VL与同等规模的主流基于Transformer的开源VLM(包括Molmo2、Qwen3-VL和InternVL3.5系列)具有竞争力,并且显著优于之前的基于SSM和混合的VLM,如VL-Mamba、Cobra和mmMamba。继承了其Zamba2骨干网络的近线性预填充计算和小的、近乎恒定的循环状态,Zamba2-VL在匹配参数规模下,首次令牌延迟(TTFT)比这些Transformer基线低大约一个数量级,在最适合设备和边缘部署的较小1.2B和2.7B规模上效率差距最为明显。我们发布了三个模型——1.2B、2.7B和7B——以及推理代码,网址为https://huggingface.co/collections/Zyphra/zamba2-vl。

英文摘要

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models (1.2B, 2.7B, and 7B) together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.

2606.00386 2026-06-02 cs.CV 版本更新

αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion

αDepth: 学习用于立体转换的单次软边界分解

Xiang Zhang, Yang Zhang, Lukas Mehl, Karlis Martins Briedis, Markus Gross, Christopher Schroers

发表机构 * ETH Zürich(苏黎世联邦理工学院) DisneyResearch|Studios(迪士尼研究|工作室)

AI总结 提出αDepth表示,通过圆形Alpha表示(CAR)将软边界分解为局部层次,实现高保真立体转换,无需用户干预。

详情
AI中文摘要

精确建模软边界(例如头发和散焦模糊)是立体转换中的一个基本挑战,因为前景和背景的模糊混合。现有的深度模型主要预测单层深度,导致软边界处的深度对应关系模糊。虽然抠图技术可以捕获用于分层建模的不透明度,但它们在具有多个目标的复杂场景中通常表现不佳,并且通常需要用户干预。本文介绍了αDepth,一种分层表示,用于分解软边界以实现高保真立体转换。具体来说,我们首先通过估计软边界处的分层颜色和深度值来解决混合颜色和深度模糊问题。考虑到复杂的多目标场景,我们设计了一种圆形Alpha表示(CAR),将范式从全局目标提取转变为局部边界分解。与先前仅限于单个前景/背景的抠图方法不同,CAR无需手动指导即可实现高效的场景级推理。大量评估表明,αDepth在立体转换中实现了最先进的性能,消除了软边界处的背景渗色和结构失真。

英文摘要

Accurately modeling soft boundaries, e.g., hair and defocus blur, is a fundamental challenge in stereo conversion due to the ambiguous blending of foreground and background. Existing depth models primarily predict single-layer depth, leading to ambiguity in depth correspondence at soft boundaries. While matting techniques can capture opacity for layered modeling, they often struggle in complex scenes with multiple targets and usually require user intervention. This paper introduces αDepth, a layered representation that decomposes soft boundaries for high-fidelity stereo conversion. Specifically, we first resolve mixed color and depth ambiguity by estimating layered color and depth values at soft boundaries. Considering complex multi-target scenes, we design a Circular Alpha Representation (CAR) that shifts the paradigm from global target extraction to local boundary decomposition. Unlike prior matting methods restricted to a single foreground/background, CAR enables efficient scene-level inference without manual guidance. Extensive evaluations demonstrate that αDepth achieves state-of-the-art performance in stereo conversion, eliminating background bleeding and structural distortions at soft boundaries.

2606.00380 2026-06-02 cs.CV cs.AI 版本更新

SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

SUPREME: 一个用于可复现图像遗忘方法评估的多GPU框架

Petros Andreou, Jamie Lanyon, Axel Finke, Georgina Cosma

发表机构 * Department of Computer Science, School of Science, Loughborough University(计算机科学系,科学学院,拉夫堡大学) School of Mathematics, Statistics and Physics, Newcastle University(数学、统计与物理学院,纽卡斯尔大学)

AI总结 提出SUPREME框架,通过多GPU分布式架构加速图像分类遗忘方法的评估,支持新方法注册和多精度模式。

Comments 17 pages. Code available at https://github.com/pedroandreou/supreme-unlearning

详情
AI中文摘要

机器遗忘旨在从已训练模型中移除特定训练数据的影响,而无需从头重新训练。评估遗忘方法需要在多个种子下重复训练、遗忘和评估,计算成本高昂。据我们所知,现有的图像分类遗忘框架在单个GPU上运行,限制了在合理时间内可评估的种子数量。我们提出SUPREME,一个开源框架,将这些阶段分布到多个GPU上。SUPREME做出三项贡献:基于注册表的设计,用于添加新方法、指标、模型和场景;支持多种加速器和精度模式的多GPU架构;以及在Pins Face Recognition上使用ResNet18和ViT在十种种子下进行全类和随机样本遗忘的演示。该框架可在https://github.com/pedroandreou/supreme-unlearning获取。

英文摘要

Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an unlearning method requires repeating training, unlearning, and evaluation across multiple seeds, which is computationally expensive. To our knowledge, existing image classification unlearning frameworks run on a single GPU, which limits how many seeds can be evaluated in reasonable time. We introduce SUPREME, an open-source framework that distributes these stages across multiple GPUs. SUPREME makes three contributions: a registry-based design for adding new methods, metrics, models, and scenarios; a multi-GPU architecture supporting multiple accelerators and precision modes; and a demonstration on Pins Face Recognition using ResNet18 and ViT under full-class and random-sample unlearning across ten seeds. The framework is available at https://github.com/pedroandreou/supreme-unlearning.
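A registry-based design of the kind mentioned above can be sketched as follows. This is a generic illustration of the pattern, not SUPREME's actual API: the names `UNLEARNING_METHODS`, `register_method`, `run_method`, and the toy `retain_only` method are all hypothetical.

```python
# Hypothetical registry mapping a method name to a callable.
UNLEARNING_METHODS = {}


def register_method(name):
    """Decorator that adds a function to the registry under the given name."""
    def decorator(fn):
        if name in UNLEARNING_METHODS:
            raise KeyError(f"method already registered: {name}")
        UNLEARNING_METHODS[name] = fn
        return fn
    return decorator


@register_method("retain_only")
def retain_only(train_set, forget_set):
    """Toy stand-in for a method: return the samples that are not to be forgotten."""
    forget = set(forget_set)
    return [x for x in train_set if x not in forget]


def run_method(name, train_set, forget_set):
    """Look a method up by name, as an evaluation loop over many seeds might do."""
    return UNLEARNING_METHODS[name](train_set, forget_set)


retained = run_method("retain_only", [1, 2, 3, 4], [2, 4])
```

With this pattern, adding a new method only requires defining a decorated function; the evaluation code that iterates over registered names does not change.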

2606.00379 2026-06-02 cs.CV 版本更新

Non-Learning Low-Light Stereo Vision

非学习低光立体视觉

Jason Wang, Lucas Nguyen, Hyunseung Eom, Wei Xu, Qi Guo

发表机构 * Department of Computer Sciences, Purdue University(普渡大学计算机科学系) Elmore Family School of Electrical and Computer Engineering, Purdue University(普渡大学埃尔莫尔家族电气与计算机工程学院)

AI总结 提出一种非学习立体框架,利用Field of Junctions (FoJ)提取粗视觉特征,结合边界感知半全局匹配(SGM)从严重噪声图像中估计视差,在基准数据集上获得比近期立体算法更准确的稀疏视差图。

Comments Accepted to ICIP 2026. Code and data available at https://github.com/guo-research-group/nonlearning-lowlight-stereo

详情
AI中文摘要

我们提出了一种非学习立体框架,用于从严重噪声图像中估计视差。利用Field of Junctions (FoJ),它保留了在严重噪声下稳定的粗视觉特征用于构建代价体,同时丢弃与光子噪声不可分的精细纹理。由此产生的结构信息指导边界感知的半全局匹配(SGM),动态调整平滑惩罚以保留真实的视差不连续性。输出是稀疏视差图,在广泛使用的基准数据集上,在未掩蔽像素上比最近的立体算法更准确。

英文摘要

We present a non-learning stereo framework for disparity estimation from severely noisy images. Using the Field of Junctions (FoJ), it retains coarse visual features that are stable under severe noise for cost volume construction while discarding fine textures inseparable from photon noise. The resulting structural information guides boundary-aware Semi-Global Matching (SGM) that dynamically adapts smoothness penalties to preserve true disparity discontinuities. The output is a sparse disparity map that is more accurate than those of recent stereo algorithms over unmasked pixels on widely used benchmark datasets.

2606.00377 2026-06-02 cs.CV 版本更新

Score-Control for Hallucination Reduction in Diffusion Models

扩散模型中减少幻觉的分数控制

Mahesh Bhosale, Naresh Kumar Devulapally, Abdul Wasi, Chau Pham, Vishnu Suresh Lokhande, David Doermann

发表机构 * University at Buffalo(布法罗大学)

AI总结 针对扩散模型中的幻觉问题,提出基于方差引导的分数调制策略,通过控制分数雅可比矩阵减少幻觉,在保持高保真度和多样性的同时将幻觉最多降低约25%。

详情
AI中文摘要

扩散模型已成为现代生成式AI的支柱,推动了视觉、语言、音频及其他模态的进步。尽管取得了成功,但它们存在幻觉问题,即生成真实数据分布支撑集之外的不可信样本,这降低了可靠性和信任度。在这项工作中,我们首先通过实验证实了先前提出的假设,即分数平滑性导致图像生成扩散模型中的幻觉,并提供了基于密度的视角。我们进一步通过将幻觉概率质量与学习到的分数函数的利普希茨常数联系起来,形式化了这一概念。受此启发,我们引入了一种方差引导的分数调制(VSM)策略,该策略控制分数雅可比矩阵,从而降低分数平滑性并更好地逼近真实分数,进而减少幻觉。在合成和真实世界数据集上的实验结果表明,我们的方法在保持高保真度和多样性的同时,将幻觉最多降低约25%,为更可靠的基于扩散的图像生成提供了原则性步骤。我们还提出了两个具有极端语义变化的基准数据集,用于系统性幻觉评估。代码和数据集公开于https://github.com/bhosalems/VSM。

英文摘要

Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language, audio, and other modalities. Despite their success, they suffer from hallucinations: implausible samples that lie outside the support of the true data distribution and degrade reliability and trust. In this work, we first empirically confirm the previously proposed hypothesis that score smoothness causes hallucinations in image-generation diffusion models, and we provide a density-based perspective. We further formalize this notion by linking the hallucination probability mass to the Lipschitz constant of the learned score function. Motivated by this, we introduce a Variance-Guided Score Modulation (VSM) strategy that controls the score Jacobian, which in turn reduces score smoothness and better approximates the ground-truth score, thereby decreasing hallucinations. Empirical results on synthetic and real-world datasets demonstrate that our approach reduces hallucinations (by up to ~25%) while maintaining high fidelity and diversity, providing a principled step toward more reliable diffusion-based image generation. We also propose two benchmark datasets with extreme semantic variation for systematic hallucination evaluation. Code and datasets are publicly available at https://github.com/bhosalems/VSM.

2606.00372 2026-06-02 cs.CV 版本更新

LFA: Layer Feature Attention for Run-Time Introspection of 2D Object Detectors in Automated Driving

LFA:用于自动驾驶中2D目标检测器运行时自省的分层特征注意力

Mert Keser, Alois Knoll

发表机构 * Automated Driving(自动驾驶)

AI总结 提出LFA方法,通过注意力机制聚合骨干网络多层特征,以提升自动驾驶中2D目标检测器的错误预测性能和可解释性。

详情
AI中文摘要

可靠的目标检测对于自动驾驶至关重要,然而即使是最先进的检测器也不可避免地会犯错误,从而危及安全。预测检测器失败的自省方法通过触发后备机制或提醒人类操作员,能够实现更安全的部署。然而,现有方法仅依赖最后一层特征或手工设计的统计量,丢弃了来自早期层的宝贵信息,这些信息捕捉了不同层次的视觉抽象。我们提出了分层特征注意力(LFA),一种轻量级的自省方法,通过注意力机制学习从多个骨干层聚合特征。我们的关键洞察是,检测错误在特征层次上表现不同:低层捕捉对检测小目标或被遮挡目标至关重要的细粒度细节,而高层编码用于场景理解的语义信息。LFA端到端地学习层重要性权重,从而既改进了错误预测,又实现了对哪些特征级别最能指示检测器失败的可解释分析。在KITTI和BDD100K上的大量实验表明,LFA实现了最先进的自省性能,在多种检测器架构上优于单层基线方法。

英文摘要

Reliable object detection is critical for automated driving, yet even state-of-the-art detectors inevitably make errors that can compromise safety. Introspection methods that predict detector failures enable safer deployment by triggering fallback mechanisms or alerting human operators. However, existing approaches rely solely on last-layer features or hand-crafted statistics, discarding valuable information from earlier layers that capture different levels of visual abstraction. We propose Layer Feature Attention (LFA), a lightweight introspection method that learns to aggregate features from multiple backbone layers through an attention mechanism. Our key insight is that detection errors manifest differently across feature hierarchies: low-level layers capture fine-grained details essential for detecting small or occluded objects, while high-level layers encode semantic information for scene understanding. LFA learns layer importance weights end-to-end, enabling both improved error prediction and interpretable analysis of which feature levels are most indicative of detector failures. Extensive experiments on KITTI and BDD100K demonstrate that LFA achieves state-of-the-art introspection performance, outperforming single-layer baselines across multiple detector architectures.
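The idea of weighting backbone layers by learned importance can be sketched as a softmax-weighted sum. This is a minimal illustration under stated assumptions, not the LFA architecture itself: the per-layer features are assumed to be already pooled to a common dimension, and the scores, which LFA would learn end-to-end, are fixed hypothetical numbers here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_scores):
    """Fuse per-layer feature vectors with softmax attention weights.

    layer_features: list of equal-length vectors, one per backbone layer (assumed pooled).
    layer_scores: one scalar per layer (hypothetical stand-in for learned scores).
    """
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    fused = [0.0] * dim
    for weight, feature in zip(weights, layer_features):
        for i in range(dim):
            fused[i] += weight * feature[i]
    return weights, fused


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # low, mid, high-level layers
weights, fused = aggregate_layers(features, [0.0, 0.0, 0.0])
```

Because the weights sum to one, they can be read directly as the relative importance of each layer, which is what makes this kind of aggregation interpretable.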

2606.00352 2026-06-02 cs.CV cs.GR 版本更新

HiGS: A Hierarchical Rendering Architecture for Real-Time 3D Gaussian Splatting

HiGS:一种用于实时三维高斯泼溅的分层渲染架构

Dawid Pająk, Martin Bisson, Rodolfo Lima

发表机构 * NVIDIA

AI总结 针对3D高斯泼溅中空间分区与光栅化对瓦片尺寸需求矛盾的问题,提出分层瓦片高斯泼溅(HiGS),通过粗粒度宏瓦片分区和细粒度渲染瓦片光栅化实现加速,在保持精确alpha合成的同时实现最高15.8倍加速。

Comments Project Page: https://research.nvidia.com/labs/sil/projects/higs/

详情
AI中文摘要

3D高斯泼溅(3DGS)已成为在商用GPU上实现实时新视角合成的标准。其流程将空间分区和光栅化绑定到同一瓦片尺寸,但两者需求相反:分区(对高斯进行分箱和深度排序)随瓦片增大而成本降低,而光栅化随瓦片减小而成本降低。先前的加速工作降低了单个阶段的成本,但将两者锁定在单一尺度上,其中少数密集瓦片主导帧时间。我们提出分层瓦片高斯泼溅(HiGS),为每个阶段赋予独立尺度:分区在粗粒度宏瓦片上运行,而光栅化在宏瓦片内的细粒度渲染瓦片上运行。光栅化工作根据每个宏瓦片中的高斯数量分配,而非按瓦片分配,因此密集区域分布在多个并行单元上,而非串行通过一个单元。在测试场景中,HiGS比原始3DGS渲染速度快15.8倍,并且优于我们评估的所有其他光栅化器,同时保持精确的前后alpha合成。

英文摘要

3D Gaussian Splatting (3DGS) has become the standard for real-time novel view synthesis on commodity GPUs. Its pipeline ties spatial partitioning and rasterization to one tile size, yet the two pull in opposite directions: partitioning, which bins and depth-sorts gaussians, grows cheaper with larger tiles, while rasterization gets cheaper with smaller ones. Prior acceleration work reduces the cost of individual stages but keeps both locked to that single scale, where a few dense tiles dominate frame time. We present Hierarchically Tiled Gaussian Splatting (HiGS), which gives each its own scale: partitioning runs over coarse macro-tiles, while rasterization runs over the fine render tiles within them. Rasterization work is then issued in proportion to the gaussians in each macro-tile rather than per tile, so dense regions spread across many parallel units instead of serializing through one. Across tested scenes, HiGS renders up to 15.8x faster than the original 3DGS and outperforms every other rasterizer we evaluate, while preserving exact front-to-back alpha compositing.
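The claim that partitioning grows cheaper with larger tiles can be illustrated by counting how many (tile, gaussian) pairs binning produces at two tile sizes. This is a simplified sketch, not the HiGS implementation: gaussians are approximated by hypothetical inclusive screen-space bounding boxes, and the tile sizes are arbitrary.

```python
def count_tile_pairs(boxes, tile_size):
    """Count (tile, gaussian) pairs created when binning boxes into square tiles.

    boxes: iterable of (x0, y0, x1, y1) inclusive pixel extents (hypothetical input).
    Each pair is one entry that a depth sort would later have to process.
    """
    pairs = 0
    for x0, y0, x1, y1 in boxes:
        tiles_x = x1 // tile_size - x0 // tile_size + 1
        tiles_y = y1 // tile_size - y0 // tile_size + 1
        pairs += tiles_x * tiles_y
    return pairs


boxes = [(0, 0, 31, 31), (10, 10, 70, 20)]
fine_pairs = count_tile_pairs(boxes, 16)    # 4 + 10 = 14 pairs with small tiles
coarse_pairs = count_tile_pairs(boxes, 64)  # 1 + 2 = 3 pairs with large tiles
```

Larger tiles duplicate each gaussian into fewer bins, so there is less to sort, whereas smaller tiles mean each pixel blends fewer irrelevant gaussians. Running the two stages at different scales is the trade-off the abstract describes.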

2606.00318 2026-06-02 cs.RO cs.CV 版本更新

Belief Consistency Between Foundation-Model Evidence and Geometric Perception in Persistent Robotic Maps

持久机器人地图中基础模型证据与几何感知之间的信念一致性

Christoffer Heckman, Harel Biggie, Brendan Crowe, Nicholas Roy

发表机构 * Department of Computer Science, University of Colorado, Boulder(科罗拉多大学博尔德分校计算机科学系) Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology(麻省理工学院计算机科学与人工智能实验室)

AI总结 提出一种更新算子,通过每类校准提交门和每事件冲突丢弃窗口,解决基础模型语义通道与几何感知通道在持久地图中的矛盾,显著提升地图精度。

详情
AI中文摘要

自主机器人使用的持久地图越来越多地将一个断言特征良好的几何感知栈与一个产生语义声明但未校准可靠性的基础模型通道融合到同一场景中。当代建图系统通过将基础模型通道视为每个元素后验的额外投票者来集成这两个通道,但未针对其自身的每类可靠性进行校准,也没有机制在给定时刻标记两个通道相互矛盾的情况。我们提出了一种具有两个协作机制的更新算子:一个每类校准的提交门,以及一个每事件冲突丢弃窗口,该窗口拒绝提交在声明时刻与几何通道矛盾的基础模型声明。我们在KITTI-360和ScanNet上进行了评估,使用oracle几何通道(全景真值)和现成的在线语义分割器(Mask2Former)来展示真实世界性能。该算子生成的提交地图精度显著更高(KITTI中汽车提交精度99.7%对比仅校准算子的43.9%;平均每类IoU 0.522对比0.180),并且在更高精度下保留了比整体式组合VLM提示更多的组合真阳性。该框架在oracle和现成分割器几何通道上均达到部署质量,并且对基础模型替换具有不变性。

英文摘要

Persistent maps used by autonomous robots increasingly fuse a geometric perception stack whose assertions are well-characterized with a foundation-model channel that produces semantic claims about the same scene without calibrated reliability. Contemporary mapping systems integrate the two channels by treating the foundation-model channel as an additional voter into a per-element posterior, uncalibrated for its own per-class reliability and without machinery to flag when the two channels contradict each other at a given moment. We propose an update operator with two cooperating mechanisms: a per-class calibrated commit gate, and a per-event conflict-drop window that refuses to commit foundation-model claims contradicted by the geometric channel at the moment of the claim. We evaluate on KITTI-360 and ScanNet, with an oracle geometric channel (panoptic ground truth) and an off-the-shelf online semantic segmenter (Mask2Former) to demonstrate real-world performance. The operator produces substantially more accurate committed maps (on KITTI, car commit precision is 99.7% vs. 43.9% for the calibration-only operator; mean per-class IoU is 0.522 vs. 0.180) and retains more compositional true positives at higher precision than a monolithic compositional VLM prompt. The framework operates at deployment quality across both oracle and off-the-shelf-segmenter geometric channels, and is invariant under foundation-model substitution.

2606.00310 2026-06-02 cs.CV 版本更新

Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation

何处精炼,何时停止:通过潜在差异重新思考高效视觉自回归生成中的冗余

Changwang Mei, Peisong Wang, Zekun Li, Changsheng Li, Shuang Qiu, Qinghao Hu, Gang Li, Yifan Zhang, Zhihui Wei, Jian Cheng

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出基于潜在差异(Latent Discrepancy)的无训练剪枝框架LD-Pruning,通过解码无关区域选择和自适应无条件分支跳过,在视觉自回归模型中实现高达2.35倍加速并保持生成质量。

详情
AI中文摘要

视觉自回归(VAR)模型能够生成高质量图像,但在高分辨率下存在显著的推理延迟。最近的加速方法大多依赖基于层特征的启发式度量来剪枝令牌。这些启发式方法对复杂上下文语义敏感,导致冗余计算识别不准确且跨提示的适应性差。我们从冗余对像素空间生成影响的角度重新思考VAR中的冗余,并引入潜在差异(Latent Discrepancy)。该统一度量通过测量生成过程中模型状态的变化来量化令牌的贡献。我们的分析表明,当以图像潜在或像素空间信号为指导时,冗余识别更准确。我们进一步观察到,在无分类器引导(CFG)中,条件分支与无条件分支之间差异的收敛趋势随不同提示呈现高度动态性。基于这些发现,我们提出LD-Pruning(潜在差异剪枝),一种无训练框架,通过集成解码无关区域选择和自适应无条件分支跳过,利用潜在差异消除冗余。大量实验表明,LD-Pruning在保持高生成质量的同时显著降低推理延迟,在Infinity-8B上实现高达2.35倍加速。

英文摘要

Visual Autoregressive (VAR) models deliver high-quality image generation but suffer from significant inference latency at high resolutions. Recent acceleration approaches mostly rely on heuristic measures over layer features to prune tokens. Such heuristics are sensitive to complex contextual semantics, leading to inaccurate identification of redundant computation and poor adaptability across prompts. We rethink redundancy in VAR from the perspective of its impact on pixel-space generation and introduce Latent Discrepancy. This unified metric quantifies a token's contribution by measuring the change in model states during generation. Our analysis shows that redundancy is more accurately identified when guided by image latent or pixel-space signals. We further observe that in classifier-free guidance (CFG), the convergence trend of the discrepancy between conditional and unconditional branches varies strongly across prompts. Based on these findings, we propose LD-Pruning (Latent Discrepancy Pruning), a training-free framework that removes redundancy via latent discrepancy by integrating decoding-free region selection and adaptive unconditional-branch skipping. Extensive experiments show that LD-Pruning substantially reduces inference latency while maintaining high generation quality, achieving up to 2.35x speedup on Infinity-8B.

2606.00299 2026-06-02 cs.CV cs.AI 版本更新

Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion

Real2SAM2Real: 生成式3D缓存作为视频扩散的互补上下文

Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos

发表机构 * University of Maryland(马里兰大学)

AI总结 提出Real2SAM2Real框架,通过3D提升模型提取可编辑的3D缓存作为几何支架,结合软空间对齐注入和微调策略,实现视频扩散模型对相机轨迹和多实体运动的精确解耦控制。

详情
AI中文摘要

虽然视频扩散模型(VDM)在合成高保真视频方面表现出色,但实现精确的相机和场景控制仍然具有挑战性。现有方法主要依赖隐式扩散先验来生成未观察区域,在高动态运动或复杂遮挡期间不可避免地导致结构崩溃。为了解决这一挑战,我们提出了Real2SAM2Real框架,该框架利用3D提升模型(例如SAM3D)提取显式可编辑的3D缓存,作为VDM的稳健几何支架。通过捕获前景实体的整个3D体积而不仅仅是其可见外壳,该缓存将整体空间先验注入VDM,为复杂场景动态提供可靠的3D感知指导。为了有效利用这种3D指导同时保留预训练先验,我们设计了一种软空间对齐注入机制以及一种针对VDM量身定制的微创微调策略。此外,我们采用掩码法线图作为跨模态桥梁,构建了无3D数据的数据整理和扰动流程。大量实验表明,Real2SAM2Real能够对相机轨迹和多实体运动实现精确、解耦的控制。通过利用生成式3D缓存的互补上下文,我们的框架克服了因过度依赖扩散先验而导致的典型崩溃,在大的相机位移和严重遮挡下保持了卓越的时空一致性。关键的是,通过将几何与外观解耦,我们为VDM定制的3D缓存消除了由结构空洞和错误立面引起的视角歧义,以及反射和折射引起的误导性线索。项目网站见https://jiayi-wu-leo.github.io/real2sam2real。

英文摘要

While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high-dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D-aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre-trained priors, we design a Soft Spatial-Aligned Injection mechanism alongside a minimally invasive fine-tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross-modal bridge to construct a 3D-free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi-entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over-reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM-tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. Project website is available at https://jiayi-wu-leo.github.io/real2sam2real

2606.00275 2026-06-02 cs.CV cs.AI 版本更新

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

面向大型视觉-语言模型的双曲与证据优先专家

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao

发表机构 * China University of Petroleum (Beijing)(中国石油大学(北京)) Hainan Institute of China University of Petroleum (Beijing)(中国石油大学(北京)海南学院) South China Normal University(华南师范大学)

AI总结 针对大型视觉-语言模型中视觉与语言模态的不对称性,提出AsyMoE架构,通过双曲跨模态专家和证据优先语言专家分别建模层级关系与保持上下文基础,在减少参数的同时提升性能。

详情
AI中文摘要

大型视觉-语言模型(LVLMs)通过扩展架构和大量训练在多模态任务上展现了令人印象深刻的性能。近期研究将混合专家(MoE)引入LVLMs以提高计算效率。然而,现有的MoE方法以对称架构处理视觉和语言模态,忽视了这两种模态处理中固有的不对称性。这种不对称性导致两个关键问题。首先,文本和视觉形成层级而非并行关系,因为文本查询通常描述完整视觉场景的部分方面。欧几里得专家空间难以编码这种包含结构。其次,深层语言专家逐渐从基于证据的处理转向参数记忆依赖,失去对提供的视觉和语言信息的立足点。为解决这些问题,我们提出AsyMoE,一种通过三个专门专家组显式建模这种不对称性的新型架构。模态内专家处理模态特定处理。双曲跨模态专家通过负曲率几何捕获层级跨模态关系。证据优先语言专家抑制参数记忆激活并在整个网络深度中保持上下文基础。大量实验表明,AsyMoE相比基线方法取得一致改进,平均比MoE变体提升1.5%,在幻觉敏感任务上提升高达3.8%。与密集模型相比,AsyMoE激活参数减少25.45%。

英文摘要

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.
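To make "negative curvature geometry" concrete, the sketch below computes the geodesic distance in the Poincaré ball with curvature -1. The choice of the Poincaré ball is an assumption made for illustration; the abstract does not say which hyperbolic model AsyMoE uses, and the function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball (curvature -1).

    Assumption: the Poincare ball model is used only as an example of hyperbolic space.
    """
    norm_u = sum(x * x for x in u)
    norm_v = sum(x * x for x in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - norm_u) * (1.0 - norm_v)))


origin = (0.0, 0.0)
near = poincare_distance(origin, (0.5, 0.0))   # equals ln(3)
far = poincare_distance(origin, (0.9, 0.0))    # grows quickly near the boundary
```

Distances blow up toward the boundary of the ball, so there is exponentially more room for leaf-like points far from the origin. That property is why hyperbolic spaces are commonly used to embed hierarchies such as the containment structure the abstract mentions.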

2606.00267 2026-06-02 cs.CV cs.AI cs.LG cs.RO 版本更新

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

StressDream: 引导视频世界模型实现鲁棒的策略评估与改进

Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy

发表机构 * Carnegie Mellon University(卡内基梅隆大学) NVIDIA Research(NVIDIA研究) University of Washington(华盛顿大学) Stanford University(斯坦福大学)

AI总结 提出StressDream方法,通过优化扩散视频世界模型的初始噪声,在推理时引导生成高影响且合理的未来场景,以支持鲁棒的策略评估与改进。

Comments Project page: https://junwon.me/StressDream/

详情
AI中文摘要

视频世界模型通过想象以自我机器人动作为条件的真实未来观察,在策略评估与改进方面展现出潜力。虽然世界模型可以对未来的分布进行建模,但策略评估与改进通常依赖于名义上的想象,这可能会遗漏机器人动作的高影响结果,除非抽取大量样本。为了实现对世界模型想象的鲁棒策略评估与改进,我们提出StressDream,该方法通过在推理时优化扩散世界模型的初始噪声,将想象引导至高影响且合理的结果。然而,优化高维噪声具有挑战性:优化必须推理生成视频中细微的、场景相关的目标事件,同时避免产生不合理想象的分布外噪声。我们通过两个互补目标来解决这一问题:一个语义目标,利用视觉语言模型通过推理生成视频提供信息丰富的梯度;一个合理性目标,防止优化后的噪声漂移到分布外。利用用于自动驾驶和机器人操作的最先进的视频世界模型,我们展示了StressDream能够有效地将想象引导至推理时由文本指定的高影响且合理的结果,例如任务失败,从而通过识别那些合理未来包含不良结果的动作,实现鲁棒的策略评估与改进。视频结果见https://junwon.me/StressDream/。

英文摘要

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.

2606.00261 2026-06-02 cs.CV physics.soc-ph 版本更新

The Harsh Truth: Segment-Level Analysis of Harsh Driving Events in Milan Using Large-Scale Telematics, Street Networks, and Google Street View

残酷真相:基于大规模远程信息处理、街道网络和谷歌街景的米兰激烈驾驶事件路段级分析

Andrea La Grotteria, Paolo Santi, Titus Venverloo, Umberto Fugiglando, Carlo Ratti

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本研究结合大规模远程信息处理、交通指标、街道网络属性和谷歌街景视觉特征,通过非参数检验和机器学习回归分析米兰城市道路网络中激烈驾驶事件的路段级特征,发现更宽的车道、交叉口和公交站以及更开阔的视野与更高激烈事件强度相关,而密集建筑正面与较低强度相关,并针对自行车基础设施案例揭示了不同设施类型间的强度梯度。

详情
AI中文摘要

警方报告的碰撞统计数据仍然是城市道路安全评估的标准输入,但其不完整性和报告滞后限制了其在及时、细粒度干预设计中的实用性。激烈加速和制动事件被广泛用作替代安全指标,但迄今为止仅在相对较小的城市样本中进行了研究。本研究分析了米兰城市道路网络中的激烈事件,结合了来自超过420万辆配备车载单元的车辆的高分辨率远程信息处理数据、TomTom的路段级交通指标、OpenStreetMap的街道网络和基础设施属性,以及通过使用OneFormer模型进行语义分割从谷歌街景中提取的视觉街景特征。我们采用了一个分析框架,结合了高、低激烈组之间路段特征分布的非参数Mann-Whitney U检验和监督机器学习回归器。我们发现,在控制暴露量后,更宽的车道、交叉口和公交站以及更开阔的视野(更高的天空和道路像素比例)与更高的激烈事件强度相关,而更密集的建筑正面与较低的强度相关。最后,自行车基础设施案例研究确定了不同设施类型之间激烈事件强度的梯度:相对于物理隔离的自行车道,仅标线的自行车道与19.5%更高的激烈评分相关,混合交通配置与11.5%更高的评分相关,条件取决于包含的控制变量。这些结果支持针对具体情境而非统一的城市安全干预措施,并说明了大规模远程信息处理结合开放地理空间和视觉数据如何为大都市尺度的零死亡愿景决策提供信息。

英文摘要

Police-reported crash statistics remain the standard input for urban road-safety assessment, but their incompleteness and reporting lag limit their usefulness for timely, fine-grained intervention design. Harsh acceleration and braking events are widely used as surrogate safety indicators, but have so far been studied only in comparatively small urban samples. This study analyses harsh events across the urban road network of Milan, combining high-resolution telematics from more than 4.2 million vehicles equipped with On-Board Units, segment-level traffic metrics from TomTom, street-network and infrastructure attributes from OpenStreetMap, and visual streetscape features extracted from Google Street View via semantic segmentation using a OneFormer model. We employ an analytical framework combining non-parametric Mann-Whitney U tests of segment-feature distributions between high- and low-harshness groups with supervised machine-learning regressors. We find that, once exposure is controlled for, wider carriageways, crossings and transit stops, and more open visual fields (higher sky- and road-pixel proportions) are associated with higher harsh-event intensity, while denser built frontage is associated with lower intensity. Finally, the cycling-infrastructure case study identifies a gradient in harsh-event intensity across facility types: markings-only cycle lanes are associated with a 19.5% higher harshness score, and mixed-traffic configurations with an 11.5% higher score, relative to physically separated cycle paths, conditional on the included controls. These results support context-specific rather than uniform urban-safety interventions and illustrate how large-scale telematics combined with open geospatial and visual data can inform Vision Zero decision-making at the metropolitan scale.
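
The Mann-Whitney U comparison mentioned in the abstract can be illustrated with a minimal sketch. The function name and the sample values are hypothetical and are not taken from the paper; the code only computes the U statistics with average ranks for ties, not a p-value.

```python
from itertools import chain


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples, using average ranks for ties."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted(chain(((v, 0) for v in x), ((v, 1) for v in y)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over the run of values tied with pooled[i].
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y


# Hypothetical carriageway widths (m) for low- and high-harshness segments.
low_harshness = [5.5, 6.0, 6.5]
high_harshness = [7.0, 7.5, 9.0]
u_low, u_high = mann_whitney_u(low_harshness, high_harshness)
```

A U value of 0 for one group means every value in that group ranks below every value in the other, which is the most extreme separation the test can observe.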

2606.00204 2026-06-02 cs.CV 版本更新

APE: Agentic Prompt Enhancer for Image Generation and Editing

APE: 用于图像生成与编辑的智能提示增强器

Zijian Huang, Jay Zhangjie Wu, Zian Wang, Tianshi Cao, Jiasi Chen, Sanja Fidler, Huan Ling, Xuanchi Ren

发表机构 * NVIDIA University of Michigan(密歇根大学)

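AI summary Proposes APE, a framework that post-trains small language models as prompt-enhancement agents, improving prompt quality for text-guided image generation and editing in single-agent or multi-agent form without modifying the downstream visual model.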

Comments Project Page: https://research.nvidia.com/labs/sil/projects/ape/

详情
AI中文摘要

自然语言已成为图像生成和编辑的强大接口,但文本引导的视觉系统对提示表述高度敏感。语义相似的请求可能因措辞、具体性以及视觉约束的明确程度而产生不同输出,这促使将提示增强作为可训练组件而非外围用户选择。现有的强增强器通常依赖大型专有LLM(如ChatGPT或Gemini),增加了视觉生成流水线的成本、延迟和部署依赖性。我们提出智能提示增强器(APE),一种轻量级框架,将小型语言模型(SLM)后训练为提示增强代理。APE支持单代理重写和角色专用多代理增强。其单代理实例SAPE一次性重写提示,而多代理实例MAPE将增强分解为路由器-重写器-组合器过程,以处理对象、属性、空间关系和编辑的组合约束。通过任务感知奖励和后训练协议,APE在不修改下游视觉模型的情况下改善了视觉对齐和提示遵循。在具有挑战性的图像生成和编辑基准上的实验表明,后训练的小型提示增强器可靠地优于其基础对应物,缩小了与闭源提示增强器的差距;此外,MAPE在这些基准中的复杂组合任务上表现尤为强劲。

英文摘要

Natural language has become a powerful interface for image generation and editing, yet text-guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post-trains small language models (SLMs) as prompt-enhancement agents. APE supports both single-agent rewriting and role-specialized multi-agent enhancement. Its single-agent instantiation, SAPE, rewrites the prompt in one pass, while its multi-agent instantiation, MAPE, decomposes enhancement into a router-rewriter-composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task-aware rewards and post-training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post-trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed-source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.

2606.00191 2026-06-02 cs.RO cs.CV 版本更新

Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Safe2Drive: 评估端到端自动驾驶模型的安全驾驶行为

Nishad Sahu, Kalpana Panda, Congyuan Yu, Changzhong Qian, Shounak Sural, Ragunathan Rajkumar

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Birla Institute of Technology and Science Pilani(比尔拉理工学院皮拉尼分校)

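AI summary To address the brittleness of end-to-end autonomous driving models in common safety-critical scenarios, proposes the Safe2Drive scenario set and the SafeDriving Score (SDS); evaluation shows that leading models' driving scores drop sharply on these scenarios and that their SDS is low.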

详情
Journal ref
CVPR Workshops 2026
AI中文摘要

最近的端到端(E2E)自动驾驶策略在闭环模拟中取得了高驾驶得分。然而,这些策略是否能够处理常见的安全关键场景仍不清楚。我们提出了Safe2Drive(S2D),一组与Bench2Drive对齐的场景扩展,重点关注三类常见的道路危险:施工区、行人乱穿马路和被遮挡的弱势道路使用者(VRU)。Safe2Drive增加了100个常见但具有挑战性的场景,并引入了安全驾驶评分(SDS),这是一种以安全为中心的度量,在先前评估器的基础上增加了碰撞前制动、施工区物体接触、车道居中和平滑性检查。在S2D上评估两种最先进的策略(LEAD和SimLingo),我们发现它们的驾驶得分相对于报告的Bench2Drive基线急剧下降(LEAD:从Bench2Drive上的94.70 DS下降到S2D上的39.95 DS;SimLingo:从Bench2Drive上的85.07 DS下降到S2D上的41.00 DS),并且S2D上的SDS较低(LEAD为11.85,SimLingo为15.27)。这些结果与脆弱的安全驾驶行为一致,例如对施工区理解差、闯红灯以及行人制动延迟或缺失。这项研究突显了E2E模型即使在训练集包含的CARLA城镇上进行测试时也缺乏安全行为推理。我们计划发布所有100个S2D场景的代码和视频。

英文摘要

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.
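
To put the reported driving-score (DS) drops on a common scale, the short sketch below converts the four numbers quoted in the abstract into relative drops. The helper name is hypothetical; only the scores come from the abstract.

```python
def relative_drop(baseline, new):
    """Percentage drop from baseline to new."""
    return (baseline - new) / baseline * 100


# (Bench2Drive DS, Safe2Drive DS) as quoted in the abstract.
scores = {"LEAD": (94.70, 39.95), "SimLingo": (85.07, 41.00)}
drops = {name: round(relative_drop(b, s), 1) for name, (b, s) in scores.items()}
# drops == {"LEAD": 57.8, "SimLingo": 51.8}
```

Both policies lose more than half of their Bench2Drive driving score on the S2D scenarios.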

2606.00174 2026-06-02 cs.CV cs.AI 版本更新

MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding

MyoSem: 将肌电图与自然语言动作语义对齐以实现手部动作理解

Chiyue Wang, Dong She, Yang Gao, Zhanpeng Jin

发表机构 * South China University of Technology(华南理工大学)

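AI summary Proposes MyoSem, a framework that enables bidirectional retrieval between EMG signals and text descriptions through multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment; it outperforms baselines on multiple datasets and shows good generalization.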

Comments 16 pages, 9 figures. Preprint

详情
AI中文摘要

肌电图(EMG)直接反映肌肉激活,是手势识别、假肢控制和可穿戴交互的关键传感模态。然而,现有的EMG方法通常将手部动作理解视为固定标签的分类问题,难以支持基于动作描述的查询、检索和泛化。我们提出MyoSem,一个EMG-动作语义对齐框架,将低层EMG信号映射到由多视角动作描述构建的共享语义空间。MyoSem结合多视角动作语义构建、激活感知EMG编码和语义查询对齐,实现了EMG信号与文本描述之间的双向检索。我们在EMG2Pose和NinaPro系列数据集上系统评估了MyoSem。结果表明,MyoSem在EMG-文本双向检索上表现良好,普遍优于大多数基线,并在未见用户、保留动作类别和截肢用户迁移场景中展现出良好的泛化性。消融实验和可视化进一步验证了每个模块的有效性。总体而言,MyoSem将基于EMG的手部动作理解从固定标签识别推进到可查询的双向语义检索,为语言介导的EMG动作理解提供了新的建模范式。

英文摘要

Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and wearable interaction. Existing EMG methods, however, commonly formulate hand action understanding as classification over fixed labels, making it difficult to support querying, retrieval, and generalization based on action descriptions. We present MyoSem, an EMG-action semantic alignment framework that maps low-level EMG signals into a shared semantic space constructed from multi-view action descriptions. MyoSem combines multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment, enabling bidirectional retrieval between EMG signals and text descriptions. We systematically evaluate MyoSem on EMG2Pose and NinaPro-series datasets. Results show that MyoSem performs well on EMG-text bidirectional retrieval, generally outperforms most baselines, and shows favorable generalization to unseen users, held-out action classes, and amputee-user transfer scenarios. Ablations and visualizations further validate the effectiveness of each module. Overall, MyoSem advances EMG-based hand action understanding from fixed-label recognition toward queryable bidirectional semantic retrieval, providing a new modeling paradigm for language-mediated EMG action understanding.
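
Bidirectional retrieval in a shared embedding space is commonly scored with recall@k in both directions. The sketch below is a generic illustration of that evaluation, assuming paired embeddings share an index; the toy vectors and function names are hypothetical, and the abstract does not state which retrieval metric MyoSem reports.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        ranked = sorted(range(len(gallery)),
                        key=lambda j: cosine(q, gallery[j]),
                        reverse=True)
        hits += i in ranked[:k]
    return hits / len(queries)


# Toy 2-D embeddings standing in for encoder outputs (made-up values).
emg = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

emg_to_text = recall_at_k(emg, text, k=1)   # query with EMG, retrieve text
text_to_emg = recall_at_k(text, emg, k=1)   # query with text, retrieve EMG
```

Swapping the roles of queries and gallery is all that distinguishes the two retrieval directions.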

2606.00170 2026-06-02 cs.HC cs.AI cs.CV 版本更新

UF-AMA: A unified framework for cross-domain emotion recognition via adaptive multimodal alignment

UF-AMA: 通过自适应多模态对齐的跨域情感识别统一框架

Zheng Wang, Shuo Wang, Junhong Wang

发表机构 * Institute of Advanced Technology, University of Science and Technology of China(中国科学技术大学先进技术研究院) Department of Electronic Engineering and Information Science, University of Science and Technology of China(中国科学技术大学电子工程与信息科学系) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合国家科学中心人工智能研究院)

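AI summary Proposes UF-AMA, a unified framework that uses adaptive multimodal alignment and a confidence-aware screening mechanism to address distribution shift in cross-subject and cross-session emotion recognition from physiological signals, achieving state-of-the-art performance on the SEED and SEED-IV datasets.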

详情
AI中文摘要

近年来,基于脑电图(EEG)等生理信号的情感识别受到了广泛关注,因为与面部表情等外部行为数据相比,内部生理数据提供了更高的客观性和可靠性。然而,由于个体和情境差异导致的分布偏移,以及各模态样本质量的差异,构建具有高泛化性和鲁棒性的跨域多模态情感识别模型仍然是一个关键挑战。在本研究中,我们提出了一种具有自适应多模态对齐的统一框架(UF-AMA),以使用多模态生理信号解决跨主体和跨会话的情感识别问题。首先,我们构建了一个由Transformer编码器和多头交叉注意力模块组成的跨模态特征融合网络,实现了EEG信号和眼动追踪数据的深度融合。随后,我们引入了一种置信度感知筛选机制,动态评估每个模态分支在目标域样本上的预测可靠性,将样本划分为不同的质量子集,并相应地应用全局一致性对齐和跨模态蒸馏。最后,我们提出了一个多级域自适应框架,联合优化局部模态特定特征和全局融合特征的边际分布和条件分布,从而在多个粒度上减少跨域分布偏移。在SEED和SEED-IV数据集上的大量实验表明,UF-AMA在跨主体和跨会话任务中均达到了最先进的性能。源代码可在 https://github.com/BetterCoderLab/UF-AMA 获取。

英文摘要

In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, as internal physiological data offer greater objectivity and reliability compared to external behavioral data like facial expressions. However, due to distribution shifts caused by individual and contextual differences, along with variations in sample quality across modalities, constructing a cross-domain multimodal emotion recognition model with high generalization and robustness remains a key challenge. In this study, we propose a Unified Framework with Adaptive Multimodal Alignment (UF-AMA) to address cross-subject and cross-session emotion recognition using multimodal physiological signals. First, we construct a cross-modal feature fusion network comprising Transformer encoders and multi-head cross-attention modules, enabling the deep integration of EEG signals and eye-tracking data. Subsequently, we introduce a confidence-aware screening mechanism that dynamically assesses the predictive reliability of each modality branch on target domain samples, partitions samples into different quality subsets, and accordingly applies global consistency alignment and cross-modal distillation. Finally, we propose a multi-level domain adaptation framework that jointly optimizes the marginal and conditional distributions of both local modality-specific and global fusion features, thereby reducing cross-domain distribution shifts at multiple granularities. Extensive experiments on the SEED and SEED-IV datasets demonstrate that UF-AMA achieves state-of-the-art (SOTA) performance in both cross-subject and cross-session tasks. The source code is available at: https://github.com/BetterCoderLab/UF-AMA.

2606.00162 2026-06-02 cs.RO cs.CV cs.LG 版本更新

Modeling Robotics Dataset Construction as an Artifact-Based Build Process

将机器人数据集构建建模为基于工件的构建过程

Leon Pohl, Lukas Beer, George Sebastian, Mirko Maehlisch

发表机构 * Institute for Autonomous Driving, University of the Bundeswehr Munich(慕尼黑联邦国防军大学自动驾驶研究所)

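AI summary Proposes modeling robotics dataset construction as an artifact-based build process and implements the open-source tool Bagzel, which substantially reduces dataset update latency through dependency-graph management and incremental builds; experiments show speedups of up to 386x in iterative workflows.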

Comments Accepted 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), 6 pages, 6 figures, 2 tables

详情
AI中文摘要

机器人系统生成大量多模态传感器数据,但将ROS bag记录转换为机器学习数据集通常由临时的顺序脚本处理,导致工程开销和迭代周期缓慢。我们将数据集构建建模为基于依赖图的工件构建过程,并在Bagzel中实现该方法,这是一个开源的Bazel扩展,用于可重现、增量式的数据集生成(包括nuScenes格式导出)。我们将Bagzel和Bagzel-xattr(服务端摘要管理)与顺序的rosbag2nuscenes基线进行比较。Bagzel在所有评估执行模式下减少了运行时间,在迭代工作流中提升最大(在20.4 GB数据集上,热构建加速高达386.26倍,增量构建加速高达7.21倍)。在5.1至20.4 GB的数据集大小范围内,Bagzel变体显示出比基线明显更好的扩展行为,尤其是在热构建和增量构建模式下。Bagzel-xattr提供了额外增益,在输入粒度研究中相比Bagzel平均运行时间减少5.9%。总体而言,将机器人数据集构建建模为基于工件的构建过程大幅降低了数据集更新延迟,同时保持了支持可重现性的确定性构建设计。Bagzel公开获取地址:https://github.com/UniBwTAS/bagzel。

英文摘要

Robotic systems generate large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is often handled by ad hoc sequential scripts, creating engineering overhead and slow iteration cycles. We model dataset construction as an artifact-based build process over a dependency graph and implement this approach in Bagzel, an open-source Bazel extension for reproducible, incremental dataset generation (including nuScenes-format export). We compare Bagzel and Bagzel-xattr (server-side digest management) against a sequential rosbag2nuscenes baseline. Bagzel reduces runtime in all evaluated execution modes, with the largest gains in iterative workflows (up to 386.26x in warm builds and 7.21x in incremental builds on a 20.4 GB dataset). Across dataset sizes from 5.1 to 20.4 GB, Bagzel variants show markedly better scaling behavior than the baseline, especially in warm and incremental modes. Bagzel-xattr provides additional gains, with a mean runtime reduction of 5.9% compared to Bagzel in the input granularity study. Overall, modeling robotics dataset construction as an artifact-based build process substantially reduces dataset update latency while maintaining a deterministic build design that supports reproducibility. Bagzel is publicly available at https://github.com/UniBwTAS/bagzel.
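
The idea of treating dataset construction as an artifact-based build can be sketched with a content-addressed cache: a step reruns only when the digest of its inputs changes. This is a generic illustration and not Bagzel's or Bazel's API; `ArtifactCache`, `extract`, `export`, and `build_dataset` are hypothetical names.

```python
import hashlib


class ArtifactCache:
    """Caches step outputs keyed by the step name and its input digests."""

    def __init__(self):
        self.store = {}
        self.executions = 0  # number of steps that actually ran

    def build(self, step_name, fn, *inputs):
        h = hashlib.sha256(step_name.encode())
        for item in inputs:
            h.update(hashlib.sha256(item).digest())
        key = h.hexdigest()
        if key not in self.store:  # cache miss: run the step
            self.store[key] = fn(*inputs)
            self.executions += 1
        return self.store[key]


def extract(bag: bytes) -> bytes:
    """Stand-in for converting one recording into samples."""
    return b"frames:" + bag


def export(*parts: bytes) -> bytes:
    """Stand-in for assembling the final dataset."""
    return b"|".join(parts)


def build_dataset(cache, bags):
    parts = [cache.build("extract", extract, bag) for bag in bags]
    return cache.build("export", export, *parts)


cache = ArtifactCache()
build_dataset(cache, [b"bag-a", b"bag-b"])      # cold build
cold = cache.executions
build_dataset(cache, [b"bag-a", b"bag-b"])      # warm build, nothing changed
warm = cache.executions - cold
build_dataset(cache, [b"bag-a", b"bag-b2"])     # one recording changed
incremental = cache.executions - cold - warm
```

In this toy run the cold build executes three steps, the warm build none, and the incremental build two (the changed recording and the export), which mirrors why warm and incremental modes show the largest gains in the abstract.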

2606.00159 2026-06-02 cs.CV cs.AI 版本更新

Digital-to-Physical Transfer of Adversarial Patches for Aerial Vehicle Detection

针对空中飞行器检测的对抗性补丁的数字到物理迁移

Jung Heum Woo, Eun-Kyu Lee

发表机构 * School of Information Technology, Incheon National University(仁川国立大学信息技术学院)

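AI summary Evaluates physical adversarial patch attacks against a YOLOv3 aerial vehicle detector through digital optimization and physical deployment, finding that the ON patch is more robust in physical environments.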

Comments 18 pages, 5 figures, 3 tables, preprint

详情
AI中文摘要

基于深度神经网络(DNN)的目标检测器广泛应用于环境监测和城市分析等领域的航拍和卫星图像分析。尽管性能强劲,但这些模型已知易受对抗性示例攻击,而使用可打印图案的物理对抗性攻击构成了现实的安全威胁。在本文中,我们通过桥接数字优化和实际部署,评估了针对空中飞行器检测器的物理对抗性补丁攻击。对抗性补丁在数字域中使用损失函数进行优化,该函数最小化最大目标性分数,同时结合不可打印性分数(NPS)和总变差(TV)约束以确保可打印性和空间平滑性。优化后的补丁被打印并以三种配置部署:ON、OFF和OFF-Side。使用YOLOv3检测器的实验表明,虽然OFF补丁在数字域中实现了最高有效性(85.51%的平均目标性降低率(AORR)),但ON补丁由于其一贯的可见性,在物理环境中表现出更强的鲁棒性(0.197-0.343的目标性分数比(OSR))。此外,我们的结果表明,基于天气的增强并不一定能改善该领域的补丁优化。这些发现为空中目标检测系统的实际脆弱性提供了关键见解。

英文摘要

Deep neural network (DNN)-based object detectors are widely used for analyzing aerial and satellite imagery in applications such as environmental monitoring and urban analytics. Despite their strong performance, these models are known to be vulnerable to adversarial examples, and physical adversarial attacks using printable patterns pose realistic security threats. In this paper, we evaluate physical adversarial patch attacks against an aerial vehicle detector by bridging digital optimization and real-world deployment. Adversarial patches are optimized in the digital domain using a loss function that minimizes the maximum objectness score while incorporating non-printability score (NPS) and total variation (TV) constraints to ensure both printability and spatial smoothness. The optimized patches are printed and deployed in three configurations: ON, OFF, and OFF-Side. Experiments using a YOLOv3 detector show that while the OFF patch achieves the highest effectiveness in the digital domain (85.51% Average Objectness Reduction Rate (AORR)), the ON patch demonstrates superior robustness in physical environments (0.197-0.343 Objectness Score Ratio (OSR)) due to its consistent visibility. Furthermore, our results indicate that weather-based augmentation does not necessarily improve patch optimization in this domain. These findings provide critical insights into the practical vulnerabilities of aerial object detection systems.
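
The loss described in the abstract (maximum objectness plus NPS and TV terms) can be sketched as follows for a grayscale patch. This is an illustration under stated assumptions: the anisotropic L1 form of total variation, the nearest-printable-value form of NPS, and the weights `w_nps` and `w_tv` are choices made here, not values from the paper.

```python
def total_variation(patch):
    """Sum of absolute differences between horizontally and vertically adjacent pixels."""
    h, w = len(patch), len(patch[0])
    tv = 0.0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                tv += abs(patch[r][c + 1] - patch[r][c])
            if r + 1 < h:
                tv += abs(patch[r + 1][c] - patch[r][c])
    return tv


def non_printability(patch, printable):
    """Sum over pixels of the distance to the nearest printable value."""
    return sum(min(abs(p - q) for q in printable) for row in patch for p in row)


def patch_loss(objectness_scores, patch, printable, w_nps=0.01, w_tv=2.5):
    """Max objectness plus weighted printability and smoothness penalties."""
    return (max(objectness_scores)
            + w_nps * non_printability(patch, printable)
            + w_tv * total_variation(patch))


printable = [0.0, 0.5, 1.0]            # hypothetical printable gray levels
flat = [[0.5, 0.5], [0.5, 0.5]]        # smooth and printable: no penalty
checker = [[0.0, 1.0], [1.0, 0.0]]     # printable but maximally non-smooth
```

Minimizing the maximum objectness pushes down the detector's most confident box, while the two penalties keep the patch printable and free of high-frequency noise that a printer or camera would not reproduce.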

2606.00158 2026-06-02 eess.IV cs.CV 版本更新

Training-Free Continuous Bitrate Control for Scalable Image Coding for Humans and Machines

面向人类与机器的可扩展图像编码的无训练连续码率控制

Yui Tatsumi, Hiroshi Watanabe

发表机构 * University of Tokyo(东京大学)

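AI summary Proposes a training-free variable-rate scalable image coding framework that achieves continuous bitrate control by adjusting quantization steps based on predicted scale values, while preserving high-scale information in the machine and enhancement layers.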

详情
AI中文摘要

连续变码率压缩在实际应用中需求很高,但在面向人类和机器的可扩展图像编码中仍未得到充分探索。在本文中,我们提出了一种无训练的变码率可扩展图像编码框架。通过基于预测尺度值调整量化步长,所提出的方法实现了连续码率控制,同时保留了机器层和增强层中的高尺度信息。实验结果证明了所提出方法的有效性,并强调了两个层之间码率分配的重要性。

英文摘要

Continuous variable-rate compression is highly demanded in real-world applications, but remains underexplored in scalable image coding for humans and machines. In this paper, we propose a training-free variable-rate scalable image coding framework. By adjusting quantization steps based on predicted scale values, the proposed method achieves continuous bitrate control while preserving high-scale information in the machine and enhancement layers. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of bitrate allocation between the two layers.

2606.00153 2026-06-02 cs.CV cs.AI 版本更新

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion

DiffCrossGait:基于潜在扩散的2D-3D跨模态步态识别轨迹级对齐

Zhiyang Lu, Ming Cheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

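AI summary To address the domain discrepancy in 2D-3D cross-modal gait recognition, proposes DiffCrossGait, which achieves continuous modality alignment through trajectory-level alignment in a latent diffusion space and introduces a Tri-Phase Alignment Strategy that enforces identity anchoring, dynamics consistency, and cross-modal structural recoverability, reaching state-of-the-art performance on the SUSTech1K and FreeGait benchmarks.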

Comments Accepted by ICML2026

详情
AI中文摘要

跨模态2D-3D步态识别受到2D轮廓和3D LiDAR距离视图表示之间固有域差异的阻碍。虽然先前的方法仅对齐最终嵌入,我们提出DiffCrossGait,将跨模态匹配重新表述为身份相关潜在扩散空间中的轨迹级对齐,而不是假设2D和3D观测完全等价。通过在潜在空间中使用共享高斯噪声驱动两种模态,我们实现了生成演化过程中的连续对齐。我们引入了一种三阶段对齐策略,利用不同的噪声强度来强制身份锚定、动态一致性和跨模态结构可恢复性,从而约束两种模态共享去噪动态和瓶颈结构,促进模态不变的步态特征。关键的是,我们的框架将生成对齐与判别骨干解耦;扩散机制仅作为训练目标,通过消除迭代去噪的计算开销确保高推理效率。在SUSTech1K和FreeGait基准上的大量实验表明,DiffCrossGait达到了最先进的性能。

英文摘要

Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representations. While prior methods align only final embeddings, we propose DiffCrossGait, which reformulates cross-modal matching as trajectory-level alignment in an identity-relevant latent diffusion space, rather than assuming full equivalence between 2D and 3D observations. By driving both modalities with shared Gaussian noise within a latent space, we enable continuous alignment throughout the generative evolution. We introduce a Tri-Phase Alignment Strategy that exploits varying noise intensities to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, thereby constraining both modalities to share denoising dynamics and bottleneck structure, which promotes modality-invariant gait features. Crucially, our framework decouples generative alignment from the discriminative backbone; the diffusion mechanism serves exclusively as a training objective, ensuring high inference efficiency by eliminating the computational overhead of iterative denoising. Extensive experiments on the SUSTech1K and FreeGait benchmarks demonstrate that DiffCrossGait achieves state-of-the-art performance.

2606.00148 2026-06-02 cs.CV cs.AI 版本更新

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

StemBind: 当多模态大语言模型在抽象视觉推理中迷失于规则与实例之间

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun, Xuanyu Ji, Qingyong Hu

发表机构 * University of Science and Technology of China(中国科学技术大学)

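AI summary Proposes StemBind, a diagnostic benchmark that uses three aligned questions on a shared stem (Perception, Rule, Full) to locate where MLLMs fail in abstract visual reasoning, finding that rule-to-instance binding is the main bottleneck.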

Comments Project page: https://hexixiang.github.io/StemBind

详情
AI中文摘要

多模态大语言模型(MLLM)常常知道规则但选错答案:在抽象视觉推理(AVR)任务中,模型可以描述所见内容并命名底层模式,但仍然无法选择匹配的候选。现有的 AVR 基准无法检测到这一点,因为它们将感知、规则归纳和答案选择合并为一个单一的对错信号。我们引入了 StemBind,一个共享主干的诊断基准,它用三个对齐的问题探测同一视觉主干:感知(图像中有什么)、规则(支配它的模式是什么)和完整(哪个选项完成它),因此最终答案的错误可以归因于同一证据上的特定子步骤。StemBind 包含 2,298 个经过精心策划的知识精简主干,涵盖九种可审计的视觉操作,总计 19,533 个 P/R/F 任务,每个完整项目都通过 Sternberg 的四个推理阶段(S1 编码、S2 推断、S3 映射、S4 应用)进行标注。评估 24 个前沿 MLLM 配置得出四个发现。(i)R-F 鸿沟:在 24 个模型中的 22 个上,规则准确率超过完整项目准确率,因此大多数失败发生在规则被识别之后。(ii)持续的绑定差距:即使在同一主干上 P 和 R 都正确,模型仍有 51.2% 的时间错误回答 F。(iii)瓶颈是 S3:过程诊断和阶段式刺激增强将主要失败定位到规则到实例的映射。(iv)扩展和思考无济于事:更大的模型和显式思考模式都无法可靠地缩小差距,思考甚至降低了规则和完整项目的准确率。StemBind 将 AVR 评估从最终答案排名重新定义为定位抽象视觉推理失败的位置,将规则到实例的绑定确定为视觉基础推理的具体下一个目标。

英文摘要

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.
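
The binding gap in finding (ii) is a conditional error rate: among stems where both Perception and Rule are answered correctly, how often is Full still wrong. A minimal sketch, with a hypothetical record layout and made-up outcomes:

```python
def binding_gap(records):
    """P(Full wrong | Perception correct and Rule correct).

    records: iterable of (p_ok, r_ok, f_ok) booleans, one tuple per stem.
    Returns None when no stem has both P and R correct.
    """
    eligible = [f_ok for p_ok, r_ok, f_ok in records if p_ok and r_ok]
    if not eligible:
        return None
    return sum(not f_ok for f_ok in eligible) / len(eligible)


# Made-up outcomes for four stems.
records = [
    (True, True, False),   # knows the rule, picks the wrong option
    (True, True, True),
    (True, False, False),  # excluded: rule was wrong
    (False, True, True),   # excluded: perception was wrong
]
gap = binding_gap(records)  # 0.5 on this toy data
```

Conditioning on the same stem is what lets the benchmark attribute a wrong final answer to the mapping step rather than to perception or rule induction.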

2606.00146 2026-06-02 eess.IV cs.AI cs.CV 版本更新

Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts

多对比度MRI运动校正:基于参数信息解缠与自适应专家网络

Honglin Xiong, Yuxian Tang, Feng Li, Yulin Wang, Lei Xiang, Dinggang Shen, Qian Wang

发表机构 * ShanghaiTech University(上海科技大学)

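AI summary Proposes a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction: ScanCLIP extracts contrast embeddings to separate out anatomical content, and a Vision Transformer estimates motion severity and routes features to a Mixture-of-Experts network, correcting motion artifacts across contrasts and severities and outperforming existing methods on the IXI and HCP benchmarks.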

详情
AI中文摘要

磁共振成像中的运动伪影降低了诊断可靠性。现有的深度学习方法通常针对特定对比度,无法泛化到不同模态和伪影严重度。我们提出一个统一框架,结合参数信息对比度解缠与严重度感知自适应校正。ScanCLIP在超过30,000个MRI文本-图像对上预训练,从采集参数中导出对比度嵌入,将对比度风格与解剖内容分离,得到无对比度特征。然后,视觉Transformer估计运动严重度,并通过专家混合网络路由特征,实现针对性伪影校正。双路径解码器重建干净图像和残差伪影图,强制执行图像空间一致性。在IXI和HCP基准上,我们的方法在PSNR上提升0.75 dB,SSIM最高提升0.0279,优于现有方法,且在更高伪影严重度下增益更大。该方法在真实临床数据上展现出鲁棒的零样本泛化能力,这些数据使用未见过的扫描参数采集,而现有方法要么无法去除伪影,要么引入额外失真。

英文摘要

Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-specific and fail to generalize across diverse modalities and artifact severities. We propose a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction. ScanCLIP, pretrained on over 30,000 MRI text-image pairs, derives contrast embeddings from acquisition parameters to disentangle contrast style from anatomical content, yielding contrast-free features. A Vision Transformer then estimates motion severity and routes features through a Mixture-of-Experts network, enabling targeted artifact correction. A dual-pathway decoder reconstructs both the clean image and residual artifact map, enforcing image-space consistency. On IXI and HCP benchmarks, our method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art approaches, with larger gains at higher artifact severities. It further demonstrates robust zero-shot generalization on real-world clinical data acquired with unseen scanning parameters, where existing methods either fail to remove artifacts or introduce additional distortions.
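
For readers calibrating the reported 0.75 dB PSNR gain, the sketch below shows the standard PSNR definition and the mean-squared-error ratio that a given gain implies for a single image. The function names are hypothetical; the mapping is exact per image and only approximate for PSNR values averaged over a dataset.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized 2-D images."""
    squared_error, n = 0.0, 0
    for row_r, row_e in zip(reference, estimate):
        for a, b in zip(row_r, row_e):
            squared_error += (a - b) ** 2
            n += 1
    mse = squared_error / n
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_MSE / old_MSE implied by a PSNR gain of gain_db."""
    return 10 ** (-gain_db / 10)


ratio = mse_ratio_for_gain(0.75)  # about 0.84, i.e. roughly 16% lower MSE
```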

2606.00139 2026-06-02 cs.CV cs.AI 版本更新

Geodesics with Unified Tangent-constrained Priors and Curvature Regularization

具有统一切线约束先验和曲率正则化的测地线

Chong Di, Li Liu, Jinglin Zhang, Zhenjiang Li, Da Chen, Laurent D. Cohen

发表机构 * Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences)(山东省人工智能研究院,齐鲁工业大学(山东省科学院)) Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine(元身康复研究院,上海交通大学医学院) School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学) Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences(放疗科,山东省肿瘤医院及研究院,山东第一医科大学,山东省医学科学院) CEREMADE, Université Paris Dauphine, Université-PSL, CNRS, UMR 7534(CEREMADE,巴黎多菲纳大学,巴黎文理研究大学(PSL),CNRS,UMR 7534)

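AI summary Proposes a geodesic framework that combines tangent-constrained priors with curvature penalization in the orientation-lifted space, solves the resulting HJB PDE efficiently with the fast marching method, and improves robustness when segmenting images of objects with complex shapes.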

详情
AI中文摘要

曲率惩罚的测地线模型通过计算全局最优曲线在图像分割中证明了其有效性。不幸的是,当描绘具有复杂形状和图像强度分布的对象时,这些模型仍然容易受到捷径的影响,因为它们缺乏强制执行形状感知切线约束的机制。为了解决这一局限性,我们提出了一种统一的测地线框架,该框架将切线约束先验与曲率惩罚相结合。关键思想是直接在方向提升空间内制定切线可接受性,其中路径切线被限制在由内在形状代表(ISR)(如骨架或内部地标)导出的空间变化角度扇区内。这一公式产生了一系列切线约束的芬斯勒度量,扩展了经典的曲率惩罚测地线模型,同时强制执行强制切线约束。由此产生的Hamilton-Jacobi-Bellman(HJB)偏微分方程(PDE)可以通过快速行进法的变体进行高效数值求解,保持了单次通过的计算复杂度。在合成、自然和医学图像上的实验表明,所提出的测地线框架确实提高了对弱边界和拓扑捷径的鲁棒性,与现有测地线模型相比,产生了具有增强形状保真度的分割结果。

英文摘要

Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunately, these models remain susceptible to shortcuts when delineating objects with complex shapes and image intensity distributions, as they lack mechanisms to enforce shape-aware tangent constraints. To address this limitation, we propose a unified geodesic framework that integrates tangent-constrained priors with curvature penalization. The key idea is to formulate tangent admissibility directly within the orientation-lifted space, where path tangents are restricted to spatially varying angular sectors derived from intrinsic shape representatives (ISR) such as skeletons or interior landmarks. This formulation gives rise to a family of tangent-constrained Finslerian metrics, extending the classical curvature-penalized geodesic models while enforcing mandatory tangent constraints. The resulting Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) admit efficient numerical solutions via variants of the fast marching method, preserving the single-pass computational complexity. Experiments on synthetic, natural, and medical images demonstrate that the proposed geodesic framework indeed improves robustness against weak boundaries and topological shortcuts, yielding segmentation results with enhanced shape fidelity compared to existing geodesic models.

2606.00137 2026-06-02 cs.CV cs.GR 版本更新

Advances in Neural 3D Mesh Texturing: A Survey

神经3D网格纹理化进展:综述

Sai Raj Kishore Perla, Hao Zhang, Ali Mahdavi-Amiri

发表机构 * Simon Fraser University(西蒙弗雷泽大学)

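AI summary Surveys recent advances in neural 3D mesh texturing, covering texture synthesis, transfer, and completion methods, and proposes a unified taxonomy.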

Comments Eurographics STAR (Computer Graphics Forum), 2026. Project Page: https://sairajk.github.io/neural-mesh-texturing/

详情
Journal ref
Eurographics STAR (State of The Art Report), Computer Graphics Forum, Volume 45, Number 2, 2026
AI中文摘要

3D网格纹理化在决定数字对象和场景的视觉真实感中起着至关重要的作用。尽管最近基于神经辐射场和高斯泼溅的生成式3D方法可以直接生成带纹理的资产,但多边形网格仍然是建模、动画、视觉效果和游戏管线中的核心表示。因此,神经3D网格纹理化仍然是一个重要且活跃的研究领域。在本综述中,我们对神经3D网格纹理化的最新进展进行了全面回顾,涵盖了纹理合成、迁移和补全的方法。我们首先总结了网格几何、纹理映射、可微渲染和神经生成模型的关键基础,然后将文献组织成一个统一的分类体系,涵盖从早期基于GAN的方法到现代基于扩散的管线。我们进一步分析了常见的架构和监督策略,回顾了数据集和评估协议,并讨论了新兴应用、实际/商业系统以及开放挑战。这些见解共同为当前格局提供了结构化的视角,并有助于指导基于学习的3D网格纹理化的未来发展。

英文摘要

Texturing 3D meshes plays a vital role in determining the visual realism of digital objects and scenes. Although recent generative 3D approaches based on Neural Radiance Fields and Gaussian Splatting can produce textured assets directly, polygonal meshes remain the core representation across modeling, animation, visual effects, and gaming pipelines. Neural 3D mesh texturing therefore continues to be an essential and active area of research. In this survey, we present a comprehensive review of recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion. We first summarize key foundations in mesh geometry, texture mapping, differentiable rendering, and neural generative models, and then organize the literature into a unified taxonomy spanning early GAN-based methods to modern diffusion-based pipelines. We further analyze common architectures and supervision strategies, review datasets and evaluation protocols, and discuss emerging applications, practical/commercial systems, and open challenges. Together, these insights provide a structured perspective on the current landscape and help guide future developments in learning-based 3D mesh texturing.

2606.00124 2026-06-02 cs.CV cs.LG 版本更新

Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness

位置编码锚定视觉Transformer中的空间结构:基于几何视角的鲁棒性研究

Mahmoud Mannes

发表机构 * ESSTHS

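AI summary Introduces the Spatial Similarity Distance Correlation (SSDC) metric to study how different positional encodings affect the geometry of internal spatial representations in Vision Transformers, finding that positional encodings improve robustness under content-disrupting distribution shifts by establishing an index-anchored spatial organization.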

Comments 16 pages (9 main text, 7 appendix). 5 figures (3 main text, 2 appendix) with 8 graphics total. 5 tables (1 main text, 4 appendix). Submitted to NeurIPS 2026 main conference and the ICML 2026 mechanistic interpretability workshop

详情
AI中文摘要

视觉Transformer中的位置嵌入(PEs)已知会影响性能和鲁棒性,但它们在塑造内部空间表示中的作用尚不明确。本文研究了不同形式的PEs如何影响ViT的表示几何结构,以及这些变化如何与内容破坏性分布偏移下的鲁棒性相关。我们引入了一个度量——空间相似性距离相关性(SSDC),用于量化token表示中的空间结构。利用该度量,我们发现未使用PEs训练的ViT仍会发展出非平凡的空间结构,但这种结构由视觉内容驱动,并在token置换下崩溃。相反,所有考虑的PEs(可学习绝对位置编码、正弦位置编码和旋转位置编码)都与向索引锚定空间组织的一致转变相关。这些模型中的表示在破坏内容的扰动下保持稳定,并对这类分布偏移表现出显著增强的鲁棒性。我们进一步表明,尽管不同的PEs产生不同的空间结构深度轨迹,但其鲁棒性属性大致相似(编码方案间存在次要差异),这表明鲁棒性似乎更依赖于稳定的位置参考框架的存在,而非特定的编码机制。这些结果为位置编码如何塑造内部表示提供了几何解释,并对未来编码方案的原则性设计具有启示意义。

英文摘要

Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts. We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization. Representations in these models remain stable under perturbations that disrupt content, and exhibit substantially improved robustness to such distributional shifts. We further show that while different PEs produce distinct depth-wise trajectories of spatial structure, their robustness properties are largely similar (with secondary variation across encoding schemes), suggesting that robustness appears to depend on the presence of a stable positional reference frame more than it depends on the specific encoding mechanism. These results offer a geometric account of how positional encodings shape internal representations, with implications for the principled design of future encoding schemes.
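
The abstract does not define SSDC. One plausible reading, offered here purely as an assumption and not as the paper's formula, is the Pearson correlation between the grid distance of two tokens and the cosine similarity of their representations. The sketch below implements that reading; all names and the toy tokens are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined; report no structure
    return cov / math.sqrt(vx * vy)


def spatial_similarity_correlation(tokens, grid_w):
    """Correlate token-pair grid distance with representation similarity.

    tokens: feature vectors in raster order on a grid of width grid_w.
    Strongly negative values mean nearby tokens have similar representations.
    """
    dists, sims = [], []
    for i, j in combinations(range(len(tokens)), 2):
        (ri, ci), (rj, cj) = divmod(i, grid_w), divmod(j, grid_w)
        dists.append(math.hypot(ri - rj, ci - cj))
        sims.append(cosine(tokens[i], tokens[j]))
    return pearson(dists, sims)


# Toy 1x4 grid whose features rotate smoothly with position.
angles = [0.0, 0.3, 0.6, 0.9]
tokens = [[math.cos(t), math.sin(t)] for t in angles]
score = spatial_similarity_correlation(tokens, grid_w=4)
```

Under this reading, comparing the score before and after permuting the input tokens would separate structure that follows the content from structure that stays anchored to the token index.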

2606.00123 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

CardioLens: 通过多序列心脏MRI评估揭示MLLMs的临床现实差距

Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu, Jingwei Guo, Nan Zhang, Hui Wang, Zhen Zhou, Kairui Bo, Yan Chen, Yue Ren, Shuai Li, Lei Xu, Henggui Zhang

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Beijing Anzhen Hospital(北京安贞医院) Beihang University(北京航空航天大学) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出CardioLens测试平台,通过多序列心脏磁共振成像评估24个多模态大语言模型,发现其在临床工作流中表现不佳,存在类别崩溃失败模式,且输入选择和推理提示改进效果有限。

AI中文摘要

多模态大语言模型在公共医学基准上表现出色,但现有评估通常依赖于孤立输入和简化识别任务,难以作为临床使用的有效代理。我们提出了CardioLens,一个针对多序列心血管磁共振的无泄漏评估测试平台,通过严格的报告到QA构建和验证流程,从私有医院档案中构建。CardioLens包含473,896张切片和13,494个经过验证的QA对,涵盖4D Cine、LGE、灌注和T2加权成像,并评估CMR解读的三个阶段:图像理解、报告生成和疾病诊断。在24个最先进的MLLM上,CardioLens揭示了显著的临床现实差距:模型整体表现不佳,性能沿真实CMR工作流下降。混淆分析进一步显示一种类别崩溃失败模式,模型倾向于默认频繁出现的异常类别,而不是区分临床不同的发现。为了排除MLLM兼容输入构造是主要原因,我们在不同切片预算下比较了随机、临床动机和数据驱动的切片选择协议;性能变化很小,通常约为1%。显式推理提示也无法挽救性能,往往使模型更加保守,而不是改善视觉证据的使用。这些结果表明,当前MLLM远未达到可靠的CMR解读,临床决策需要跨序列、视图和时间相位整合分布式证据。CardioLens为开发面向真实临床部署的下一代MLLM提供了一个临床基础的测试平台。

英文摘要

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.

2606.00121 2026-06-02 cs.CV cs.AI 版本更新

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

基于语义和结构引导的大脑活动图像重建通用框架

Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang, Huiguang He

发表机构 * State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology(脑认知与脑启发智能技术国家重点实验室) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Future Technology, University of Chinese Academy of Sciences(中国科学院大学未来技术学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出MindDiffuser两阶段框架,结合CLIP文本嵌入和视觉特征,通过Stable Diffusion生成语义图像并迭代优化结构信息,在fMRI、EEG、MEG三种模态上显著提升图像重建性能。

AI中文摘要

从大脑记录中重建视觉刺激一直是脑解码中一项有意义且具有挑战性的任务。特别是,实现精确且可控的图像重建对于推动脑机接口的进步和应用具有重要意义。最近的方法利用文本到图像生成模型的能力,在语义(如概念和对象)方面重建了接近复杂自然刺激的图像。然而,它们在保持与原始刺激在细粒度结构信息(如位置、方向和大小)上的一致性方面存在困难,这削弱了模型的可控性和可解释性。为了解决上述问题,我们提出了一个两阶段图像重建框架,称为MindDiffuser。在第一阶段,从大脑反应解码的对比语言-图像预训练(CLIP)文本嵌入被输入到Stable Diffusion中,生成包含语义信息的初步图像。在第二阶段,我们使用解码的浅层CLIP视觉特征作为监督信号,通过反向传播迭代优化来自第一阶段的特征向量,以对齐结构信息。我们在由视觉刺激引发的三种模态(fMRI、EEG、MEG)的大脑反应数据集上进行了大量实验,结果表明我们的框架显著提升了先前最先进模型的性能,凸显了我们方法的有效性和通用性。空间和时间可视化结果进一步支持了我们框架的神经生物学合理性,为未来跨不同大脑信号模态的神经解码工作提供了指导。

英文摘要

Reconstructing visual stimuli from brain recordings is a meaningful and challenging task in brain decoding. In particular, precise and controllable image reconstruction is important for advancing brain-computer interfaces and their applications. Recent methods, leveraging advances in text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation, and size), which undermines both the controllability and interpretability of the models. To address these issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information. We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly enhances the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.

2606.00115 2026-06-02 cs.CV cs.LG stat.ML 版本更新

Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions

来自视频的物理:最小轨迹条件下时不变二阶ODE的可辨识性

Yuanyuan Wang, Wenjie Wang, Kun Zhang, Mingming Gong

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 研究从原始像素中辨识连续时间物理定律的结构可辨识性,证明在最小轨迹条件下,仅编码器流水线可唯一恢复二阶线性ODE参数,并引入方差下限正则化器稳定无解码器目标。

Comments Accepted at ICML 2026

AI中文摘要

弥合视觉真实感与物理理解之间的差距是基于视频的世界模型的核心挑战。我们研究从原始像素中辨识连续时间物理定律的结构可辨识性,重点关注仅编码器流水线能否唯一恢复二阶线性ODE的参数。我们证明,一个水平集斜率覆盖条件可确保学习到的潜在空间与真实物理状态之间局部相差一个仿射变换,从而实现精确的参数恢复。我们的理论首次刻画了不同阻尼机制下的最小数据需求,证明欠阻尼系统可从单个视频片段辨识,而其他机制需要三条不同的轨迹。我们进一步引入方差下限正则化器,以稳定无解码器目标并防止潜在坍缩。在合成和真实数据上的验证表明,无需计算密集的像素重建,即可从视频中可靠估计可解释的物理常数,确保物理正确性和透明性。代码可在 https://github.com/wenjiewang3/PhysicsFromVideo 获取。

英文摘要

Bridging the gap between visual realism and physical understanding is a core challenge for video-based world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.
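For readers less familiar with the damping regimes mentioned above, here is a small sketch using the textbook unit-mass form x'' + c x' + k x = 0. This parameterization and the function name `damping_regime` are assumptions for illustration; the abstract does not state the paper's exact parameterization. The regime follows from the sign of the discriminant c^2 - 4k of the characteristic polynomial.

```python
def damping_regime(c: float, k: float) -> str:
    """Classify x'' + c*x' + k*x = 0 with c >= 0 and k > 0.

    The characteristic polynomial s^2 + c*s + k has discriminant
    c^2 - 4k: negative gives complex roots (oscillation), zero a
    repeated real root, positive two distinct real roots.
    """
    if c < 0 or k <= 0:
        raise ValueError("expected c >= 0 and k > 0")
    if c == 0:
        return "undamped"
    disc = c * c - 4.0 * k
    if disc < 0:
        return "underdamped"
    if disc == 0:
        return "critically damped"
    return "overdamped"
```

According to the abstract, only the underdamped case is identifiable from a single clip; the other regimes need three diverse trajectories.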

2606.00114 2026-06-02 cs.CV cs.IT math.IT 版本更新

Recursive Vision Transformer with Dynamic Depth and Width Adjustment for Resource-Efficient Image Semantic Communication

递归视觉Transformer与动态深度和宽度调整用于资源高效图像语义通信

Zhilong Zhang, Xinhui Zhang, Gongyu Jin, Sihua Wang, Danpu Liu, Changchuan Yin

发表机构 * Beijing Laboratory of Advanced Information Network(北京先进信息网络实验室) Beijing Key Laboratory of Network System Architecture and Convergence(北京网络系统架构与融合重点实验室) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出一种递归视觉Transformer图像语义通信系统,通过动态深度和宽度调整策略降低参数和计算复杂度,在资源受限设备上实现高效通信。

AI中文摘要

图像语义通信是下一代无线通信系统中的关键组成部分。然而,此类系统通常具有较大的内存占用和较高的计算复杂度,使其难以部署在资源受限的设备上。为了解决这些挑战,我们提出了一种基于视觉Transformer(ViT)的图像语义通信系统。在该系统中,引入递归结构以迭代细化语义特征并减少参数数量。此外,设计了三种动态调整策略以自适应降低计算复杂度:动态深度调整、动态宽度调整以及宽度-深度联合优化。动态深度调整根据图像内容和信道条件自适应确定递归模块的数量,而动态宽度调整则选择性保留重要的神经元和注意力头。宽度-深度联合优化进一步实现了灵活的计算配置。仿真结果验证了所提出的基于递归ViT的系统,结合三种动态调整策略,在相当的计算复杂度下,参数数量减少了48.7%,并且实现了比现有基线更高的重建质量。

英文摘要

Image semantic communication is a critical component in next-generation wireless communication systems. However, such systems typically suffer from large memory footprints and high computational complexity, making them difficult to deploy on resource-constrained devices. To address these challenges, we propose a vision transformer (ViT)-enabled image semantic communication system. In this system, a recursive structure is introduced to iteratively refine semantic features and reduce the parameter count. In addition, three dynamic adjustment strategies are designed to adaptively reduce computational complexity: dynamic depth adjustment, dynamic width adjustment, and joint width-depth optimization. Dynamic depth adjustment adaptively determines the number of recursive modules according to image content and channel conditions, while dynamic width adjustment selectively preserves important neurons and attention heads. The joint width-depth optimization further enables flexible computation configurations. Simulation results verify that the proposed recursive ViT-based system, combined with the three dynamic adjustment strategies, reduces the parameter count by 48.7% and achieves higher reconstruction quality than existing baselines under comparable computational complexity.

2606.00112 2026-06-02 cs.NE cs.CV 版本更新

Evolving to the Aesthetics of a Vision-Language Model

进化到视觉语言模型的美学

Stephen James Krol, Jon McCormack

发表机构 * SensiLab, Monash University Melbourne, Australia(传感实验室,墨尔本莫纳什大学,澳大利亚)

AI总结 本研究探索使用视觉语言模型(VLM)通过CLIP-IQA评分或成对比较结合Glicko评级系统来评估进化设计的美学,并与艺术家排名对比分析两种方法的优劣。

Comments Paper presented at ICCC26, June 29 - July 3, 2026, Coimbra, Portugal

AI中文摘要

进化系统在创意领域已展现出显著成果,最近的应用包括生成式排版、设计和音乐。然而,设计能有效捕捉抽象输出所需美学的适应度函数仍是一个开放问题。在这项工作中,我们探索了两种使用视觉语言模型(VLM)评估种群美学的方法。第一种方法使用CLIP-IQA预测每个设计的美学分数。第二种方法则让候选设计相互对抗,由VLM根据用户指定的自定义提示确定胜者。然后,这些成对比较的结果通过Glicko评级系统用于估计种群排名。我们在一个使用自定义生成系统的案例研究中展示了这些方法,并将所得排名与艺术家的美学排名以及其他美学评估技术产生的排名进行比较。此外,我们记录了艺术家使用这些方法进化设计的体验,批判性地分析了两种方法的优缺点。

英文摘要

Evolutionary systems have demonstrated remarkable results in creative domains, with recent applications in generative typography, design, and music. However, an open problem remains in designing fitness functions that effectively capture the desired aesthetics of abstract outputs. In this work, we explore two methods for evaluating the aesthetics of a population using Vision-Language Models (VLMs). The first method uses CLIP-IQA to predict an aesthetic score for each design. The second method instead pits candidates against each other, with winners determined by a VLM using a custom prompt specified by the user. The outcomes of these pairwise comparisons are then used to estimate a population ranking via the Glicko rating system. We present these methods in the context of a case study using a custom generative system and compare the resulting rankings with an artist's aesthetic ranking and those produced by other aesthetic evaluation techniques. Additionally, we document the artist's experience using these approaches to evolve designs, critically analysing the strengths and weaknesses of both methods.
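To make the ranking step concrete, the sketch below implements one rating-period update of the original Glicko (Glicko-1) system, treating each VLM pairwise comparison as a game. The abstract does not say which Glicko variant or which parameters the authors use, so this is an assumed stand-in; the function names are hypothetical and the VLM call itself is left out.

```python
import math

Q = math.log(10) / 400.0


def g(rd):
    """Attenuation factor for an opponent with rating deviation rd."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def glicko1_update(r, rd, results):
    """One Glicko-1 rating-period update.

    results: list of (opponent_rating, opponent_rd, score), where
    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    Returns the new (rating, rating_deviation).
    """
    if not results:
        return r, rd
    inv_d2 = 0.0   # accumulates 1 / d^2
    delta = 0.0    # accumulates the sum of g * (score - expected)
    for r_j, rd_j, s in results:
        gj = g(rd_j)
        e = 1.0 / (1.0 + 10.0 ** (-gj * (r - r_j) / 400.0))
        inv_d2 += Q * Q * gj * gj * e * (1.0 - e)
        delta += gj * (s - e)
    denom = 1.0 / (rd * rd) + inv_d2
    return r + (Q / denom) * delta, math.sqrt(1.0 / denom)
```

A design at rating 1500 (deviation 200) that wins against a 1400-rated design and loses to 1550- and 1700-rated ones drops to roughly 1464, with its deviation shrinking to about 151. Sorting the population by updated rating then gives the ranking.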

2606.00111 2026-06-02 eess.IV cs.CV cs.LG 版本更新

ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression

ChWDTA:用于学习图像压缩的通道级小波域变换器注意力和熵建模

Haisheng Fu, Runyu Yang, Feng Ding, Siyu Zhu, Jie Liang, Xiaoxiao Li, Zhenman Fang, Jingning Han

发表机构 * Electrical and Computer Engineering Department, The University of British Columbia(英属哥伦比亚大学电气与计算机工程系) School of Engineering Science, Simon Fraser University(西蒙弗雷泽大学工程科学学院) School of Electronic Science and Technology, Eastern Institute of Technology(电子科学与技术学院,东部技术学院) Google LLC(谷歌公司)

AI总结 提出通道级小波域变换器注意力(ChWDTA)和通道级小波包分解,在混合CNN-Transformer图像压缩框架中提升率失真性能,在多个测试集上实现显著BD-rate降低。

Comments 13 pages, 8 figures, 6 tables

AI中文摘要

最先进的学习图像压缩(LIC)方案越来越多地基于混合CNN-Transformer架构。为了进一步提高率失真性能,我们将通道级小波变换引入变换器和熵编码组件。首先,我们提出了一种通道级小波域变换器注意力(ChWDTA)机制。ChWDTA保留了现代LIC骨干中使用的有效窗口化空间自注意力,但在将注意力输出通过逆变换映射回来之前,在通道级小波变换特征上计算Q/K/V投影。因此,得到的通道级小波域变换器块(ChWDTB)保留了窗口化注意力的空间标记化模式,同时稀疏化了注意力投影所见的通道协方差。其次,在熵编码阶段,我们引入了一种通道级小波包(ChWP)分解,产生四个大小相等的子带,这更适合基于通道级切片的自回归熵建模。当每个通道级子带被分成两个切片时,我们使用八个切片进行熵编码。通过这种配置,所提出的方案在Kodak、CLIC Professional Validation和Tecnick测试集上分别获得了-17.82%、-19.15%和-22.56%的BD-rate降低。即使每个通道级子带被编码为单个切片,该方案仍以较低的复杂度保留了大部分编码增益。结果证实了在基于CNN-Transformer的LIC方案中引入小波变换的优势。

英文摘要

State-of-the-art learned image compression (LIC) schemes are increasingly based on hybrid CNN-transformer architectures. To further improve rate-distortion performance, we introduce channel-wise wavelet transforms into both the transformer and entropy-coding components. First, we propose a channel-wise wavelet-domain transformer attention (ChWDTA) mechanism. ChWDTA keeps the efficient windowed spatial self-attention used in modern LIC backbones, but computes the Q/K/V projections on channel-wise wavelet-transformed features before mapping the attention output back with the inverse transform. The resulting Channel-wise Wavelet-Domain Transformer Block (ChWDTB) therefore preserves the spatial tokenization pattern of windowed attention while sparsifying the channel covariance seen by the attention projections. Second, in the entropy-coding stage, we introduce a channel-wise wavelet packet (ChWP) decomposition that produces four equal-sized subbands, which better fit channel-wise slice-based autoregressive entropy modeling. When each channel-wise subband is divided into two slices, we use eight slices for entropy coding. With this configuration, the proposed scheme obtains BD-rate changes of -17.82%, -19.15%, and -22.56% on the Kodak, CLIC Professional Validation, and Tecnick test sets, respectively. Even when each channel-wise subband is coded as a single slice, the scheme still retains most of the coding gains with lower complexity. The results confirm the advantage of introducing wavelet transforms in CNN-transformer-based LIC schemes.
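As a rough illustration of a transform applied along the channel axis, the sketch below uses an orthonormal Haar wavelet. This is an assumption: the abstract does not name the wavelet filter. It shows a one-level transform with its exact inverse, plus a two-level packet split that yields four equal-sized channel subbands. All function names are hypothetical.

```python
import math

S = 1.0 / math.sqrt(2.0)


def haar_channels(x):
    """One-level orthonormal Haar transform along the channel axis.

    x: list of C channel vectors (C even), each a list of floats.
    Returns (low, high), each holding C/2 channel vectors.
    """
    if len(x) % 2:
        raise ValueError("channel count must be even")
    low, high = [], []
    for a, b in zip(x[0::2], x[1::2]):
        low.append([S * (p + q) for p, q in zip(a, b)])
        high.append([S * (p - q) for p, q in zip(a, b)])
    return low, high


def inverse_haar_channels(low, high):
    """Invert haar_channels, restoring the original channel order."""
    x = []
    for lo, hi in zip(low, high):
        x.append([S * (p + q) for p, q in zip(lo, hi)])
        x.append([S * (p - q) for p, q in zip(lo, hi)])
    return x


def packet_four_subbands(x):
    """Two-level Haar packet split into four equal-sized channel subbands."""
    low, high = haar_channels(x)
    ll, lh = haar_channels(low)
    hl, hh = haar_channels(high)
    return [ll, lh, hl, hh]
```

Because the transform is orthonormal, the inverse recovers the input exactly up to floating-point error, which is what lets the attention output be mapped back to the original channel domain.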

2606.00110 2026-06-02 cs.CV cs.RO 版本更新

General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling

广义协变动作建模:通过时空解耦构建广义流形

Huaihai Lyu, Chaofan Chen, Mingyu Cao, Yuheng Ji, Changsheng Xu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出广义动作流形框架,通过时间不变性和几何不变性解耦实现广义协变,提升从稀疏演示中泛化的鲁棒性。

AI中文摘要

从有限数据中实现鲁棒泛化是具身智能的核心挑战。现有方法通过回归绝对坐标失败,这违反了广义协变原理。根本上,这混淆了内在任务几何与刚性执行模式,将策略绑定到特定运动风格和固定速度。为解决此问题,我们提出广义动作流形(GAM)框架,通过结构解耦强制执行广义协变。具体地,GAM通过强制两个正交维度的不变性来实现流形:(1)时间不变性,利用弧长参数化将空间路径几何与时间动力学正交化,确保对速度变化的鲁棒性;(2)几何不变性,其中模式-仿射-分解机制将轨迹映射到姿态归一化坐标框架中的规范“世界线”。这区分了不变几何模式与仿射调制,确保空间泛化性。通过将GAM集成到结构化视觉-语言-动作(VLA)架构中,我们使稀疏演示能够密集填充连续有效的动作流形。实验结果表明,GAM实现了优越的迁移和鲁棒性,优于几何无关基线。

英文摘要

Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Fundamentally, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM realizes the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an Arc-Length Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical "world lines" in a pose-normalized coordinate frame. This distinguishes invariant geometric schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we enable sparse demonstrations to densely populate a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, outperforming geometry-agnostic baselines.
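The temporal-invariance idea can be illustrated with plain arc-length resampling: two recordings of the same path at different speeds map to the same sequence of points. The sketch below is a generic 2D version under that assumption, not the paper's Arc-Length Parameterizer, whose details the abstract does not give. The function name is hypothetical.

```python
import math


def resample_by_arc_length(points, n):
    """Resample a 2D polyline at n points equally spaced in arc length.

    points: list of (x, y) samples taken at arbitrary time steps.
    The result depends only on the path's shape, not on how fast it
    was traversed.
    """
    if n < 2 or len(points) < 2:
        raise ValueError("need n >= 2 and at least two points")
    # Cumulative arc length at every input sample.
    s = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        s.append(s[-1] + math.hypot(x1 - x0, y1 - y0))
    total = s[-1]
    out, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target length.
        while seg < len(s) - 2 and s[seg + 1] < target:
            seg += 1
        span = s[seg + 1] - s[seg]
        t = 0.0 if span == 0 else (target - s[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```

A slow, evenly sampled straight line and a fast, unevenly sampled recording of the same line both resample to the same five points, which is the velocity robustness the abstract describes.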

2606.00109 2026-06-02 cs.CV cs.AI cs.LG 版本更新

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

VDSB-GWSyn: 用于冠状动脉造影中可控且解剖学可行的导丝合成的扩散薛定谔桥

Haoyuan Tang, Zhuo Zhang, Jialin Li, Shuai Xiao, Jiachen Yang

发表机构 * Tianjin University(天津大学)

AI总结 提出基于扩散薛定谔桥的VDSB-GWSyn框架,通过形状先验和血管分割约束生成可控、高保真导丝样本,显著提升下游导丝端点定位精度。

Comments Early accept to MICCAI 2026

AI中文摘要

冠状动脉导丝端点定位是计算机辅助PCI的基本能力,随着机器人辅助PCI逐渐普及以减少操作者辐射暴露,其重要性日益增加。然而,带有导丝的标注CAG图像稀缺以及现有导丝合成模型的适应性有限,仍是导丝端点定位的关键瓶颈。为解决此问题,我们提出VDSB-GWSyn,一个基于扩散薛定谔桥(DSB)模型的框架,能够在复杂解剖背景下合成可控、高保真的导丝样本。VDSB-GWSyn首先使用我们的形状先验算法学习基本导丝几何形状,然后在血管分割掩码的约束下生成导丝掩码并输出对应的端点坐标,最后通过SPADE条件化的DSB在真实CAG图像上合成逼真的导丝样本。实验结果表明,VDSB-GWSyn合成的导丝样本取得了良好的ROI-FID和ROI-KID,以及高IPR分数。此外,将我们的合成数据用于合成预训练后接真实微调,显著改进了下游导丝端点定位,将MPE从16.01像素降低到7.71像素,PCK@3像素从52.63%提高到86.27%,从而实现了更临床可靠的机器人辅助导丝输送系统部署。此外,具有严格背景保留和解剖可行性约束的可控设备合成的核心设计理念,有可能迁移到其他标注数据稀缺的介入设备感知任务中。

英文摘要

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assisted PCI is progressively adopted to reduce operator radiation exposure. However, the scarcity of annotated CAG images with guidewires and the limited adaptability of existing guidewire synthesis models remain key bottlenecks for guidewire endpoint localization. To address this issue, we propose VDSB-GWSyn, a Diffusion Schrödinger Bridge (DSB) model-based framework, enabling synthesis of controllable, high-fidelity guidewire samples under complex anatomical backgrounds. VDSB-GWSyn first uses our shape prior algorithm to learn the basic guidewire geometry. It then generates guidewire masks under constraints imposed by the vessel segmentation masks and outputs the corresponding endpoint coordinates. Finally, it synthesizes realistic guidewire samples on real CAG images using DSB conditioned with SPADE. Experimental results show that the guidewire samples synthesized by VDSB-GWSyn achieve favorable ROI-FID and ROI-KID, as well as high IPR scores. In addition, incorporating our synthesized data for synthetic pre-training followed by real fine-tuning substantially improves downstream guidewire endpoint localization, reducing MPE from 16.01 px to 7.71 px and increasing PCK at 3 px from 52.63% to 86.27%, leading to more clinically reliable deployment of robot-assisted guidewire delivery systems. Moreover, the core design philosophy of controllable device synthesis with strict background preservation and anatomical feasibility constraints has the potential to transfer to other interventional device perception tasks where annotated data are scarce.
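The two localization numbers quoted above can be computed as follows, assuming that MPE stands for mean pixel error (the abstract does not expand the abbreviation) and that PCK is the usual percentage of correct keypoints within a pixel threshold. The function name is hypothetical.

```python
import math


def endpoint_metrics(pred, truth, threshold_px=3.0):
    """Return (mean pixel error, PCK at threshold_px) for 2D endpoints.

    pred, truth: equal-length lists of (x, y) pixel coordinates.
    PCK is the fraction of predictions within threshold_px of truth.
    """
    if not pred or len(pred) != len(truth):
        raise ValueError("need equal-length, non-empty inputs")
    errors = [math.hypot(px - tx, py - ty)
              for (px, py), (tx, ty) in zip(pred, truth)]
    mpe = sum(errors) / len(errors)
    pck = sum(e <= threshold_px for e in errors) / len(errors)
    return mpe, pck
```

PCK is returned as a fraction in [0, 1]; multiply by 100 to compare with the percentages reported in the abstract.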

2606.00105 2026-06-02 cs.CV cs.AI 版本更新

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

视觉噪声引导的上下文蒸馏用于多模态大语言模型遗忘

Junkai Chen, Yuhao He, Junxiang You, Ruiqi Liu, Chenyu Wang, Shu Wu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Advanced Interdisciplinary Sciences, UCAS(中国科学院大学前沿交叉科学学院)

AI总结 提出视觉噪声引导的上下文蒸馏(VGID)框架,通过双模态干预构建教师分布进行蒸馏,实现多模态大语言模型参数级遗忘,平衡遗忘效果与模型效用。

AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务上取得了显著进展,但它们也可能记忆和暴露敏感或受限知识,引发隐私和更广泛的安全风险。机器遗忘(MU)提供了一种有前景的方法,可以从训练好的模型中移除目标不良知识,而无需从头重新训练,同时保持通用模型效用。然而,在MLLMs中实现有效遗忘仍然特别具有挑战性。现有的基于训练的方法通常难以平衡遗忘效果和模型效用。相比之下,无训练方法如上下文遗忘通过避免参数更新来保持模型效用,但它们不会在参数级别移除记忆的知识,可能仍然容易受到逆向工程攻击。更重要的是,上下文遗忘在多模态设置中不足,其中视觉输入可以提供强条件信号并诱导不良输出。为了解决这些挑战,我们提出了视觉噪声引导的上下文蒸馏(VGID),一种基于蒸馏的MLLM遗忘框架。VGID通过结合视觉扰动与文本上下文遗忘的双模态干预,从冻结的基础模型动态构建面向遗忘的教师分布。由此产生的干预诱导分布作为蒸馏的教师信号,引导学生模型实现参数级遗忘,而无需外部教师模型或显式的不良响应注释。实验结果表明,VGID在保持竞争性模型效用的同时实现了强遗忘效果,在代表性设置中,遗忘集ROUGE-L降低了0.371,而保留集ROUGE-L仅下降0.055。

英文摘要

Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose sensitive or restricted knowledge, raising concerns about privacy and broader safety risks. Machine Unlearning (MU) provides a promising way to remove targeted undesirable knowledge from trained models without retraining from scratch while preserving general model utility. Nevertheless, effective unlearning in MLLMs remains particularly challenging. Existing training-based methods often struggle to balance unlearning effectiveness and model utility. In contrast, training-free methods such as in-context unlearning preserve model utility by avoiding parameter updates, but they do not remove memorized knowledge at the parameter level and may remain vulnerable to reverse-engineering attacks. More importantly, in-context unlearning is insufficient in multimodal settings, where visual inputs can provide strong conditioning signals and induce undesirable outputs. To address these challenges, we propose Visual-Noise Guided In-Context Distillation (VGID), a distillation-based framework for MLLM unlearning. VGID dynamically constructs an unlearning-oriented teacher distribution from the frozen base model through dual-modal intervention that combines visual perturbation with textual in-context unlearning. The resulting intervention-induced distribution serves as a teacher signal for distillation, guiding the student model toward parameter-level unlearning without requiring external teacher models or explicit undesirable response annotations. Experimental results show that VGID achieves strong unlearning effectiveness while preserving competitive model utility, reducing forget set ROUGE-L by 0.371 with only a 0.055 drop in retain set ROUGE-L in a representative setting.
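The distillation step can be pictured as matching the student's next-token distribution to a teacher distribution. The sketch below computes KL(teacher || student) from raw logits. KL divergence is a common distillation loss but is an assumption here, since the abstract does not name the objective. In the VGID setting the teacher logits would come from the frozen base model run on the perturbed image plus the in-context unlearning prompt; producing them is outside this sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def distillation_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one next-token distribution."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
```

The loss is zero when the student already matches the teacher and grows as the two distributions diverge; in training it would be averaged over token positions.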

2606.00101 2026-06-02 cs.CV cs.AI 版本更新

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

CoCoVideo: 基于商业模型的高质量对比基准用于AI生成视频检测

Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma, Yinglin Zheng, Yuxin Lin, Ming Zeng

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) China Academy of Information and Communications Technology(中国信息通信技术研究院) AI Transcend Pte. Ltd.(AI Transcend有限公司)

AI总结 针对现有数据集依赖低质量开源模型且商业样本带水印的问题,提出包含13个商业生成器的CoCoVideo-26K对比数据集,并设计结合对比学习与置信门控多模态大语言模型的CoCoDetect检测框架,实现高保真AI生成视频的鲁棒检测。

Comments Accepted by CVPR 2026

AI中文摘要

随着人工智能生成内容(AIGC)技术的快速发展,视频伪造日益普遍,给公共讨论和社会安全带来新挑战。尽管现有深度伪造检测方法取得了显著进展,但AIGC伪造检测仍然具有挑战性,因为现有数据集主要依赖开源视频生成模型,其质量远低于商业AIGC系统。即使包含少量商业样本的数据集也常常保留可见水印,损害真实性并阻碍模型泛化到高保真AIGC视频。为解决这些问题,我们引入了CoCoVideo-26K,一个基于对比学习的商业模型AIGC视频数据集,涵盖13个主流商业生成器,并提供语义对齐的真实-伪造视频对。该数据集能够深入探索真实视频与高质量合成视频之间的差异,并为高逼真视频伪造检测建立新基准。基于该数据集,我们提出了CoCoDetect,一个集成对比学习与置信门控多模态大语言模型(MLLM)推理的检测框架。R3D-18骨干网络提取时空表示,而置信门将不确定案例路由到MLLM进行物理合理性和场景一致性的推理。在CoCoVideo-26K和公共基准上的大量实验证明了最先进的性能,验证了该框架的鲁棒性和泛化能力。我们的代码和数据集可在https://github.com/DonoToT/CoCoVideo获取。

英文摘要

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.
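The confidence gate described above can be sketched as a simple routing rule: trust the fast detector when it is confident, and call the MLLM only for borderline cases. The gating rule, the `margin` value, and the callback interface are hypothetical; the abstract does not describe how CoCoDetect defines uncertainty.

```python
from typing import Callable


def gated_predict(p_fake: float,
                  ask_mllm: Callable[[], bool],
                  margin: float = 0.2) -> bool:
    """Return True if the video is judged fake.

    p_fake: fake probability from the fast detector.
    ask_mllm: zero-argument callback that runs the slower MLLM check;
              it is invoked only when the detector is uncertain.
    margin: assumed half-width of the uncertain band around 0.5.
    """
    if not 0.0 <= p_fake <= 1.0:
        raise ValueError("p_fake must be a probability")
    if abs(p_fake - 0.5) < margin:
        return ask_mllm()
    return p_fake > 0.5
```

Passing the MLLM as a callback keeps the expensive call lazy, so confident cases never pay for it.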

2606.00100 2026-06-02 cs.CV cs.AI 版本更新

CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout

CoilDrop-MRI:基于线圈丢弃的自监督物理引导MRI重建

Tongxi Song, Ziyu Li, Zihan Li, Wen Zhong, Congyu Liao, Yang Yang, Hua Guo, Wenchuan Wu, Qiyuan Tian

发表机构 * School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University(清华大学生物医学工程系) Oxford Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford(牛津大学整合神经影像中心) Department of Radiology & Biomedical Imaging, University of California San Francisco(加州大学旧金山分校放射科与生物医学成像系)

AI总结 提出CoilDrop-MRI方法,通过在线圈维度进行丢弃并作为自监督训练目标,结合图像域和k空间域展开架构,实现无需全采样数据的并行MRI重建,在多站点、多场强、多模态数据集上性能优于现有自监督方法。

AI中文摘要

基于自监督深度学习的方法在加速磁共振成像(MRI)重建中展现出巨大潜力,无需全采样数据即可实现高图像质量。这些方法通常将采集的数据划分为两个不相交的子集,构建输入-目标对以优化重建网络。然而,现有方法仅在空间频率(k空间)域进行划分,未探索线圈维度。为充分利用接收线圈间的信号相关性,我们提出CoilDrop-MRI,该方法对输入应用线圈级丢弃,并将丢弃的数据作为自监督框架中的训练目标。该方法被集成到图像域(SENSE)和k空间(SPIRiT)公式的展开架构中。我们进一步将CoilDrop-MRI扩展到多激发、相位校正的扩散MRI(dMRI)重建,展示了其多功能性。CoilDrop-MRI在多站点、多场强(0.3T、0.55T和3T)和多模态(T1加权、T2加权、T2-FLAIR和dMRI)数据集上进行了广泛验证,始终优于最先进的自监督方法,达到了与监督重建方法相当的质量,且无需全采样参考训练数据。此外,CoilDrop-MRI表现出强大的数据效率和跨成像条件的鲁棒泛化能力,使其成为自监督并行MRI重建的实用且通用的框架。

英文摘要

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.
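A minimal sketch of the coil-wise partition, assuming coils are dropped uniformly at random with a fixed fraction. The drop fraction, the sampling scheme, and the function name are illustrative assumptions; the abstract only says that dropped coils become the training target.

```python
import random


def split_coils(kspace, drop_fraction=0.25, seed=None):
    """Partition multi-coil k-space along the coil axis.

    kspace: list with one entry per receiver coil (each entry is that
            coil's sampled k-space data, in any container).
    Returns (kept, dropped): kept coils form the network input and the
    dropped coils serve as the self-supervised training target.
    """
    n = len(kspace)
    if n < 2:
        raise ValueError("need at least two coils")
    # Always drop at least one coil and keep at least one.
    n_drop = min(n - 1, max(1, round(n * drop_fraction)))
    rng = random.Random(seed)
    dropped_idx = set(rng.sample(range(n), n_drop))
    kept = [c for i, c in enumerate(kspace) if i not in dropped_idx]
    dropped = [c for i, c in enumerate(kspace) if i in dropped_idx]
    return kept, dropped
```

This contrasts with the k-space partitioning the abstract attributes to earlier self-supervised methods, where the split is over sampled frequency locations and every coil appears on both sides.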

2606.00098 2026-06-02 cs.CV eess.IV 版本更新

Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

分割引导的空间索引用于可泛化和可解释的深度伪造检测

Izaldein Al-Zyoud, Abdulmotaleb El Saddik

发表机构 * University of Central Florida(中佛罗里达大学)

AI总结 提出分割引导的空间索引方法,通过冻结的FaRL解析器为DINOv3 ViT-L/16的patch token分配语义标签,仅选择语义相关的区域进行分类,实现可泛化且可解释的深度伪造检测。

AI中文摘要

我们引入了分割引导的空间索引,用于可泛化和可解释的深度伪造检测。关键思想颠倒了标准设计顺序:不是先汇集所有人脸token再分类,而是先选择语义上有意义的patch token,然后仅汇集这些token。一个冻结的FaRL解析器为每个DINOv3 ViT-L/16 patch token分配一个语义标签;丢弃非目标token;一个线性探针对保留的区域进行分类。这种空间索引利用了DINOv3的patch级空间一致性(即产生涌现分割的相同属性),向探针呈现一个更纯净的区域子空间,其中与操作相关的证据较少被全脸线索稀释。区域归因是结构性的:当嘴部模型预测为假时,决策仅使用了嘴部token,而不是叠加的显著性图。在Celeb-DF v2上,嘴部索引探针的AUC达到0.905,优于LipForensics(+8.1个百分点)和Xception(+16.9个百分点),且无需对DINOv3或FaRL进行微调,也无需目标域数据。消融实验隔离了机制:用DINOv3的CLS token替换区域选择,Celeb-DF v2 AUC下降26.4个百分点;用FaRL特征替换DINOv3,AUC下降20.9个百分点。DINOv3表示和空间索引都是独立必要的;单独任何一个都无法达到完整系统的性能。

英文摘要

We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
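The select-then-pool order described above can be sketched in a few lines. Tokens and labels are plain lists here; in the paper they would come from DINOv3 and the FaRL parser respectively. Mean pooling and the probe parameters `weights` and `bias` are assumptions for illustration, as is the function name.

```python
def pooled_region_score(tokens, labels, target, weights, bias=0.0):
    """Mean-pool the patch tokens labelled `target`, then apply a linear probe.

    tokens: list of patch-token vectors (lists of floats).
    labels: one semantic label per token (e.g. from a face parser).
    weights, bias: parameters of a hypothetical trained linear probe.
    Returns the probe logit, or None if no token has the target label.
    """
    selected = [t for t, lab in zip(tokens, labels) if lab == target]
    if not selected:
        return None
    dim = len(weights)
    pooled = [sum(t[d] for t in selected) / len(selected) for d in range(dim)]
    return sum(w * p for w, p in zip(weights, pooled)) + bias
```

Because tokens outside the target region never enter the pooled vector, the prediction is attributable to that region by construction, which is the structural attribution the abstract refers to.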

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO 版本更新

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

弥合2D-3D鸿沟:面向视觉语言导航的分层语义几何地图

Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) Bosch Corporate Research(博世企业研究) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出分层语义几何地图(HSGM),将3D几何信息转化为VLM可理解的结构化表示,结合VLM高层语义规划与经典路径规划,实现零样本视觉语言导航,在R2R-CE和RxR-CE基准上达到最先进性能。

AI中文摘要

视觉语言导航(VLN)使具身智能体能够通过遵循语言指令在未知环境中到达目标位置。尽管近期视觉语言模型(VLM)取得了进展,但仍存在关键的语义-几何鸿沟:VLM擅长语言和2D视觉理解,但在3D空间推理方面表现不佳,且无法捕捉动作与空间转换之间的因果动态,导致导航不可靠,尤其在零样本设置中。为弥合这一鸿沟,我们提出分层语义几何地图(HSGM),将3D几何信息转化为与VLM兼容的结构化表示,有效将其与物理世界连接。具体而言,HSGM表示为多通道俯视图,组织为三个层次:(1)几何层,记录可导航区域和障碍物;(2)语义层,表示物体及其关系;(3)决策层,支持高层任务推理和目标选择。导航过程中,VLM作为高层语义规划器,解释HSGM编码的空间布局以选择几何有效航点,而航点间的低层无碰撞运动由经典路径规划算法执行,从而将语义推理与动作执行完全解耦。此外,复杂指令被分解为子任务,以缓解长程导航中的进度遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明,我们的零样本框架达到了最先进性能,甚至优于若干监督方法。代码见 https://github.com/Teacher-Tom/HSGM_public。

英文摘要

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.
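The low-level half of this split can be illustrated with breadth-first search on an occupancy grid, standing in for the unspecified "classical path-planning algorithm". The grid encoding and the function name are assumptions; the waypoint itself would be chosen by the VLM from the map.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected occupancy grid.

    grid: list of rows; 0 marks a navigable cell, 1 an obstacle.
    start, goal: (row, col) tuples.
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    if grid[start[0]][start[1]] or grid[goal[0]][goal[1]]:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

Keeping the planner purely geometric means a waypoint picked by the VLM is either reachable without collision or rejected outright, independent of the language reasoning.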

2606.00092 2026-06-02 cs.CV cs.AI 版本更新

Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization

对齐细胞层与分类器注意力以实现可解释的弱监督病理定位

Devansh Lalwani, Swapnil Bhat, Maulik Shah

Affiliations * Turocrates AI Private Limited

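AI summary To address the imprecise localization of attention maps in weakly-supervised whole-slide image classification, the paper proposes a consistency training method that couples cellular sheaves with classifier attention, reaching a patch-level AUC of 0.940 for the disagreement field on Camelyon16 and raising the attention AUC from 0.717 to 0.953.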

详情
AI中文摘要

基于基础特征的注意力多实例学习(ABMIL)在Camelyon16切片级别性能上接近饱和,但相应的注意力图作为定位信号并不完美:在临床解释中,一个正确分类但未激活实际病灶的模型难以被信任。我们通过细胞层(cellular sheaves)来解决这一差距,细胞层为图的每个顶点和边赋予有限维向量空间及它们之间一致的线性映射,提供了一种在图结构数据上检测局部不一致性的原则性方法。我们将细胞层应用于全切片图像的弱监督肿瘤定位,结合了细胞层不一致场与ABMIL。自然的训练目标——鼓励相似特征之间的一致性——产生的不一致场追踪的是组织级纹理而非诊断内容。我们提出注意力条件一致性,利用分类器的注意力来定义哪些相邻补丁应该一致。在此目标下联合训练分类器和细胞层,在Camelyon16上产生的不一致场达到补丁级AUC 0.940,并将注意力头从单独ABMIL的0.717提升至0.953。两阶段消融实验(分类器冻结在ABMIL值)仅在不一致场上达到0.727,注意力保持0.717,证实增益来自投影器在两个目标下的共同适应,而非单独的损失变化。训练后的模型无需重新训练即可迁移至Camelyon17的标注切片,保持Delta AUC 0.932 +/- 0.083和注意力AUC 0.955 +/- 0.099。结果是注意力图和细胞层不一致图同时激活相同的诊断区域,为每个切片级预测提供两种互补的解释。

英文摘要

Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation features now reaches near-saturation on Camelyon16 slide-level performance, but the corresponding attention maps are an imperfect localization signal: in clinical interpretation, a model that classifies correctly without firing on the actual lesion is hard to trust. We address this gap with cellular sheaves, which equip each vertex and edge of a graph with a finite-dimensional vector space and consistent linear maps between them, providing a principled way to detect local disagreement on graph-structured data. We apply cellular sheaves to weakly-supervised tumour localization on whole-slide images, combining a sheaf disagreement field with ABMIL. The natural training objective, encouraging consistency between similar features, produces a disagreement field that tracks tissue-level texture rather than diagnostic content. We propose attention-conditional consistency, which uses the classifier's attention to define which neighbouring patches should agree. Joint training of the classifier and the sheaf under this objective produces a disagreement field with patch-level AUC 0.940 on Camelyon16 and raises the attention head from its ABMIL-alone level of 0.717 to 0.953. Two-stage ablation with the classifier frozen at its ABMIL values reaches only 0.727 on the disagreement field and leaves attention at 0.717, confirming that the gain comes from the projector co-adapting under both objectives, not from the loss change in isolation. The trained model transfers without retraining to annotated slides from Camelyon17, maintaining Delta AUC 0.932 +/- 0.083 and attention AUC 0.955 +/- 0.099. The result is an attention map and a sheaf-disagreement map that fire on the same diagnostic regions, giving clinicians two complementary explanations for each slide-level prediction.
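
As a rough illustration of what a sheaf disagreement field measures, the sketch below uses the standard construction for cellular sheaves on graphs: each endpoint feature is mapped into a shared edge space by a linear restriction map, and the squared distance between the two images is the disagreement on that edge. The patch names, the two-dimensional features, and the identity restriction maps are assumptions made for illustration; the abstract does not give the paper's learned maps or its attention-conditional objective, so neither is reproduced here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def edge_disagreement(map_u, x_u, map_v, x_v):
    """Squared distance between the two endpoint features after each is
    mapped into the shared edge space by its restriction map."""
    pu, pv = matvec(map_u, x_u), matvec(map_v, x_v)
    return sum((a - b) ** 2 for a, b in zip(pu, pv))


def disagreement_field(features, edges, restriction):
    """Sum edge disagreements onto both endpoints to get a per-patch score."""
    field = {v: 0.0 for v in features}
    for u, v in edges:
        map_u, map_v = restriction[(u, v)]
        d = edge_disagreement(map_u, features[u], map_v, features[v])
        field[u] += d
        field[v] += d
    return field


# Toy example: three neighbouring patches, identity restriction maps.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
features = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
edges = [("a", "b"), ("b", "c")]
restriction = {edge: (IDENTITY, IDENTITY) for edge in edges}
field = disagreement_field(features, edges, restriction)
```

With identity maps the field reduces to plain feature differences, which is the texture-tracking behaviour the abstract warns about; learned restriction maps are what let the field respond to something other than raw feature similarity.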

2606.00087 2026-06-02 cs.CV cs.AI 版本更新

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

结构化视觉证据分解用于阻塞性睡眠呼吸暂停低通气综合征的证据驱动多模态筛查

Chen Zhan, Yingchen Wei, Xiaoyu Tan, Jingjing Huang, Xihe Qiu

Affiliations * School of Electronic and Electrical Engineering, Shanghai University of Engineering Science; Tencent Youtu Lab; ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital of Fudan University; National University of Singapore

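AI summary Proposes EviOSAHS, a framework that decomposes each facial image into seven anatomical queries, generates structured evidence cards, and combines them with clinical information for high-sensitivity OSAHS screening.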

详情
AI中文摘要

有效的阻塞性睡眠呼吸暂停低通气综合征(OSAHS)多导睡眠图前筛查需要结合临床风险因素与可见的颅面和颈部线索。直接提示通用多模态基础模型进行医学是/否决策可能产生不稳定、校准不良的输出。我们提出EviOSAHS,一个证据驱动的多模态推理框架,将仅基于图像的解剖证据获取与最终临床判定分离。每张正面面部图像被分解为七个固定的解剖查询,涵盖颈部、下巴、嘴巴、面/颈脂肪、下颌、中面部和鼻子。视觉响应被转换为结构化证据卡,记录目标解剖结构、可见性、风险方向、证据强度、置信度和简洁摘要。这些卡片仅在最后阶段与清理后的临床档案结合,由大型语言模型进行平衡的二元筛查判定。我们在642名受试者队列上评估了EviOSAHS,将正常受试者映射为筛查阴性,轻度、中度或重度OSAHS受试者映射为筛查阳性。EviOSAHS实现了88.47%的准确率、94.86%的灵敏度、93.74%的F1分数和5.14%的假阴性率,在统一协议下优于仅临床提示、直接多模态提示和朴素两阶段流水线。消融实验表明,七问题视觉分解和平衡最终判定对高灵敏度工作点至关重要。对4,494个视觉输出的问题级审计显示100%的结构化解析率和93.88%的高可见率。EviOSAHS为二元多导睡眠图前OSAHS筛查提供了一个可审计、高灵敏度的工作流程,但应被视为分诊助手而非诊断系统。在临床部署前需要进行前瞻性验证、外部测试和校准的工作点控制。

英文摘要

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.
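
The evidence-card record and the reported figures can be sanity-checked with a short sketch. The field names follow the attributes listed in the abstract, but the field types and example value sets are assumptions, and `sensitivity_and_fnr` is a hypothetical helper that applies the usual definitions rather than the authors' evaluation code.

```python
from dataclasses import dataclass

# The seven fixed anatomical queries named in the abstract.
ANATOMY_QUERIES = [
    "neck", "chin", "mouth", "face/neck fat", "lower jaw", "midface", "nose",
]


@dataclass
class EvidenceCard:
    """One structured card per (image, query); value types are assumptions."""
    target_anatomy: str
    visibility: str         # e.g. "high" or "low" (assumed value set)
    risk_direction: str     # e.g. "increases", "decreases", "neutral" (assumed)
    evidence_strength: str  # e.g. "weak", "moderate", "strong" (assumed)
    confidence: float
    summary: str


def sensitivity_and_fnr(true_pos, false_neg):
    """Sensitivity and false-negative rate in percent; they sum to 100."""
    positives = true_pos + false_neg
    return 100.0 * true_pos / positives, 100.0 * false_neg / positives


# 642 subjects, one frontal image each, seven queries per image.
expected_visual_outputs = 642 * len(ANATOMY_QUERIES)

# The reported false-negative rate is the complement of the sensitivity.
reported_fnr = round(100.0 - 94.86, 2)
```

Both checks agree with the abstract: 642 subjects times seven queries gives the 4,494 audited visual outputs, and 94.86% sensitivity implies the stated 5.14% false-negative rate.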

2606.00080 2026-06-02 cs.CV cs.AI cs.LG cs.NE 版本更新

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Planktonzilla: 用于理解浮游生态系统的多模态数据集与模型

Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi

Affiliations * Inria Chile Research Center

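AI summary To address the poor generalization of plankton classifiers, introduces Planktonzilla-17M, a unified dataset of 17.4 million images, of which 3.74 million are plankton images spanning 602 taxonomic classes, and compares supervised learning with CLIP-style training; a supervised classifier matches or exceeds CLIP-style training with taxonomic lineage as text, and existing biological foundation models (BioCLIP, BioCLIP2) perform poorly on plankton.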

详情
AI中文摘要

海洋浮游生物支撑着水生食物网,并在全球二氧化碳封存中发挥关键作用,因此可靠的物种识别对于理解海洋健康和气候反馈至关重要。现有的分类模型在单个数据集上表现良好,但由于训练数据集孤立且标签不一致,无法跨仪器和环境泛化。为解决这一问题,我们引入了Planktonzilla-17M,这是一个统一的数据集,整合了来自13个成像系统的公开浮游生物图像集合。它包含1740万张图像,具有标准化的分类学和地理环境元数据,其中包括374万张浮游生物图像,涵盖602个分类类群,其中201个在物种级别被识别,使其成为迄今为止最大、最全面的浮游生物图像数据集。利用这一大规模数据集,我们在共享ViT骨干网络上进行了监督学习与CLIP风格图像-文本训练的对比实验。我们发现,当使用分类谱系作为文本时,监督分类器的表现与CLIP风格训练相当或更优。我们进一步观察到,BioCLIP和BioCLIP2在零样本和少样本设置下对浮游生物表现不佳。利用Planktonzilla-17M提高了浮游生物分类性能,凸显了当前生物基础模型在海洋成像领域的局限性。

英文摘要

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image-text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.

2606.00078 2026-06-02 cs.CV cs.AI 版本更新

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

基于流的生成建模优化压缩感知应用中的采样策略

Roman Pavelkin, Luis A. Zavala-Mondragon, Christiaan G. A. Viviers, Fons van der Sommen

Affiliations * Eindhoven University of Technology

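AI summary Proposes a task-aware flow-based generative framework that trains a flow model to optimize subsampling masks in compressed sensing, substantially improving performance on image classification, image reconstruction, and MRI acceleration.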

详情
AI中文摘要

信号处理和医学成像中的许多现代应用需要在严格的资源约束下获取高维信号。传统采样理论表明,准确重建信号所需的测量次数与信号的维数成正比,这一要求往往过于昂贵或不切实际。压缩感知通过证明稀疏信号可以在较少的测量下恢复(前提是测量算子满足某些条件)挑战了这一观念。这项概念验证研究提出了一种任务感知的基于流的生成框架——对传统流匹配训练范式的重新表述,其中流模型被训练用于优化压缩感知应用中的子采样。我们建立了所提出的学习子采样掩码框架的基本可行性,该框架显著提升了压缩感知在图像分类、图像重建和MRI加速中的性能。在图像重建任务中,我们的方法展示了最先进的性能,在CelebA数据集上以5%的子采样率实现了25.17 dB的峰值信噪比,在重建8倍加速的MRI测量(fastMRI数据集)时以最小的计算开销达到了29.24 dB。这些结果突显了生成流模型中任务条件化的有效性,并揭示了表示学习策略的一个有前景的方向。总体而言,所提出的框架提供了一种统一、灵活的方法来设计数据和任务驱动的感知方案,有望适用于广泛的逆问题。

英文摘要

Numerous modern applications in signal processing and medical imaging require acquiring high-dimensional signals under tight resource constraints. Traditional sampling theory suggests that accurate signal reconstruction requires a number of measurements proportional to the signal's ambient dimension, a requirement that is often too expensive or impractical. Compressed sensing challenges this notion by demonstrating that sparse signals can be recovered from fewer measurements, provided the measurement operator meets certain conditions. This proof-of-concept study presents a task-aware flow-based generative framework, a reformulation of the conventional Flow Matching training paradigm in which a flow model is trained to optimize subsampling in compressed sensing applications. We establish the fundamental feasibility of the proposed framework for learning subsampling masks that substantially enhance the performance of compressed sensing for image classification, image reconstruction, and MRI acceleration. For the image reconstruction task, our method demonstrates state-of-the-art performance, achieving a peak signal-to-noise ratio (PSNR) of 25.17 dB at a subsampling rate of 5% on the CelebA dataset and 29.24 dB when reconstructing 8× accelerated MRI measurements (fastMRI dataset) with minimal computational overhead. These results highlight the effectiveness of task-conditioning within generative flow models and reveal a promising direction for representation learning strategies. Overall, the proposed framework offers a unified, flexible approach to designing data- and task-driven sensing schemes that can potentially be adapted to a broad range of inverse problems.
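
For readers unfamiliar with the quantities quoted in this abstract, the sketch below shows the usual definition of peak signal-to-noise ratio (PSNR), how a subsampling rate translates into a binary mask, and how an acceleration factor maps to a sampling fraction. The uniformly random mask is only a baseline stand-in and is an assumption of this example; it is not the learned, flow-based sampling policy the paper proposes, and the function names are hypothetical.

```python
import math
import random


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def random_mask(n, rate, seed=0):
    """Baseline binary mask keeping round(n * rate) of n measurements.
    A learned policy would choose these positions instead of sampling them."""
    keep = round(n * rate)
    chosen = set(random.Random(seed).sample(range(n), keep))
    return [1 if i in chosen else 0 for i in range(n)]


def sampling_rate_from_acceleration(factor):
    """An R-fold accelerated acquisition keeps 1/R of the measurements."""
    return 1.0 / factor


mask = random_mask(1000, 0.05)                    # 5% subsampling
mri_rate = sampling_rate_from_acceleration(8)     # 8x acceleration
example_psnr = psnr([0.0, 1.0], [0.1, 0.9])       # MSE = 0.01
```

Under these definitions a 5% rate keeps 50 of 1,000 measurements, 8× acceleration corresponds to a 12.5% sampling fraction, and a mean squared error of 0.01 on signals scaled to [0, 1] gives 20 dB.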

2606.00077 2026-06-02 cs.CV cs.AI 版本更新

Improved Belief-Attention in Vision Task

视觉任务中的改进信念注意力

Guoqiang Zhang

Affiliations * University of Exeter

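AI summary Proposes Belief2-Attention, which extends Belief-Attention by using both the perpendicular and projected components and adds an extra inner-product matrix to capture richer token correlation, improving performance on vision tasks.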

详情
AI中文摘要

最近,Belief-Attention \cite{Guoqiang25BeliefAttention} 被提出,它首先对基于 softmax 的 $V$ 向量加权求和进行关于原始 $V$ 向量的正交投影,然后将垂直分量作为 Transformer 中的残差信号以提升性能。在本文中,我们首先进行消融研究,表明投影分量也携带关于标记相关性的信息,不应被忽略。然后,我们提出通过同时利用垂直分量和投影分量来扩展 Belief-Attention。具体地,投影分量经过某种激活函数,然后进行线性映射,再与所考虑的标记合并。概念上讲,投影分量的神经块可以视为新注意力块内的两层前馈网络(FFN)。此外,注意到标准注意力通过内积矩阵 $QK^T$ 捕获标记相关性。我们提出向 $QK^T$ 引入额外的内积矩阵 $ZZ^T$ 以捕获更丰富的标记相关性。我们将新模块称为 Belief2-Attention。可以很容易地证明 Belief2-Attention 比标准注意力更具表达能力。然后,我们验证了 Belief2-Attention 在图像分类和分割等视觉任务中的有效性。

英文摘要

Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} was proposed. It first performs an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then takes the perpendicular component as the residual signal in the Transformer to improve performance. In this paper, we first conduct an ablation study showing that the projected component also carries information about the token correlation and should not be ignored. We then propose to extend Belief-Attention by making use of both the perpendicular and projected components. In particular, the projected component passes through an activation function and then a linear mapping before merging with the considered token. Conceptually, the neural block for the projected component can be viewed as a two-layer feedforward network (FFN) within the new attention block. We also note that standard attention captures the token correlation via the inner-product matrix $QK^T$. We propose to introduce an additional inner-product matrix $ZZ^T$ alongside $QK^T$ to capture richer token correlation. We refer to the new module as Belief2-Attention. It can be easily shown that Belief2-Attention is more expressive than standard attention. We then verify the effectiveness of Belief2-Attention on the vision tasks of image classification and segmentation.
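
The two ingredients named in this abstract, an orthogonal split into projected and perpendicular components and an extra inner-product term in the attention scores, can be sketched in a few lines. This is an interpretation of the abstract only: projecting each token's aggregated vector onto that token's own $V$ vector and adding $ZZ^T$ to $QK^T$ elementwise are both assumptions, and the activation, linear mapping, scaling, and softmax are omitted.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_components(aggregated, v):
    """Decompose `aggregated` into its orthogonal projection onto `v`
    and the perpendicular remainder (aggregated = projected + perpendicular)."""
    scale = dot(aggregated, v) / dot(v, v)
    projected = [scale * x for x in v]
    perpendicular = [a - p for a, p in zip(aggregated, projected)]
    return projected, perpendicular


def gram(a, b):
    """Return the matrix a @ b^T for lists of row vectors a and b."""
    return [[dot(x, y) for y in b] for x in a]


def combined_scores(Q, K, Z):
    """Pre-softmax scores QK^T + ZZ^T (additive combination is assumed)."""
    qk, zz = gram(Q, K), gram(Z, Z)
    return [[s + t for s, t in zip(r1, r2)] for r1, r2 in zip(qk, zz)]


projected, perpendicular = split_components([3.0, 4.0], [1.0, 0.0])
scores = combined_scores([[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1], [2]])
```

Because $ZZ^T$ is symmetric while $QK^T$ need not be, the extra term contributes a symmetric token-similarity signal on top of the usual query-key scores, which is one plausible reading of "richer token correlation".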

2606.00076 2026-06-02 cs.CV 版本更新

DefocusTrackerAI -- A Generalized Framework for the Automatic Detection of Defocused Particle Images

DefocusTrackerAI -- 一种用于自动检测离焦粒子图像的通用框架

Gonçalo Coutinho, Ana S. Moita, António L. N. Moreira, Massimiliano Rossi

Affiliations * IN+ Center for Innovation, Technology and Policy Research, Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal; CINAMIL - Military Academy Research Center, Military Academy, Portugal; Department of Industrial Engineering, Alma Mater Studiorum University of Bologna, Bologna, Italy

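AI summary Proposes DefocusTrackerAI, a generalized deep-learning framework based on YOLOv9 for automatic detection and position estimation of defocused particle images, achieving high recall and low uncertainty across multiple optical configurations.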

Comments 24 pages, 10 figures

详情
AI中文摘要

本工作介绍了DefocusTrackerAI,一个通用的深度学习框架,用于自动检测和位置估计来自任何光学配置的离焦粒子图像,同时不损害不确定性和召回率,旨在作为开源项目DefocusTracker的后续。我们从两个知名的目标检测模型Faster R-CNN和YOLOv9的直接比较中选择了深度神经网络架构,这些模型在包含不同直径的像散和非像散离焦粒子图像的多样化且特征丰富的合成图像集上进行了训练。对合成数据的模型评估表明,首先,YOLOv9优于Faster R-CNN,实现了更高的召回率和更低的不确定性,特别是在高粒子图像密度下;其次,YOLOv9提供了增强的空间分辨率,对于粒子图像密度N_s高达0.5,不确定性值在0.1到0.4像素之间,优于最先进的算法。我们证明了我们的模型能够在多种光学设置和不同光照条件下检测像散和非像散离焦粒子图像。此外,我们成功地将模型应用于真实的DPT实验,包括荧光和阴影图数据,表明它们可以用于传统DPT应用之外,包括喷雾和液滴的跟踪。基于YOLOv9的预训练、即用型DefocusTrackerAI版本可在https://gitlab.com/goncalo.coutinho/defocustrackerAI-main/-/tree/7e0f11f649ebad50e20dca5b9545f26ca303ebe0获取,并可用于高精度自动检测任何类型的离焦粒子图像。结合合适的深度位置校准方法,它可作为三维离焦粒子跟踪的有效第一步。

英文摘要

The present work introduces DefocusTrackerAI, a generalized deep-learning framework for the automatic detection and position estimation of defocused particle images from any kind of optical configuration without compromising uncertainty and recall, intended as a follow-up to the open-source project DefocusTracker. We selected the deep neural network architecture from a direct comparison of two well-known object detection models, Faster R-CNN and YOLOv9, trained on a diverse and feature-rich synthetic image set containing astigmatic and non-astigmatic defocused particle images of varying diameters. The model evaluation on synthetic data showed, first, that YOLOv9 outperforms Faster R-CNN, achieving higher recall and lower uncertainty, particularly at high particle image densities; and second, that YOLOv9 provides enhanced spatial resolution, with uncertainty values between 0.1 and 0.4 pixels for particle image densities N_s up to 0.5, outperforming state-of-the-art algorithms. We demonstrated that our models are able to detect astigmatic and non-astigmatic defocused particle images in multiple optical setups with varying lighting conditions. In addition, we successfully applied our models to real DPT experiments, including fluorescence and shadowgraph data, showing that they can be used beyond conventional DPT applications, including the tracking of sprays and droplets. A pre-trained, ready-to-use version of DefocusTrackerAI based on YOLOv9 is available at https://gitlab.com/goncalo.coutinho/defocustrackerAI-main/-/tree/7e0f11f649ebad50e20dca5b9545f26ca303ebe0 and can be used for automatic detection of defocused particle images of any kind with high accuracy. In combination with a suitable calibration approach for the depth position, it can be used as an effective first step for three-dimensional defocusing particle tracking.
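
Recall and position uncertainty, the two figures of merit quoted in this abstract, can be computed from matched detections as in the sketch below. The greedy nearest-neighbour matching, the 1-pixel tolerance, and the use of root-mean-square error as the uncertainty are assumptions made for illustration; the abstract does not specify the authors' evaluation protocol, and the coordinates are made up.

```python
import math


def match_detections(truth, detections, tolerance):
    """Greedily match each ground-truth centre to its nearest unused
    detection within `tolerance` pixels; return the matched distances."""
    unused = list(detections)
    errors = []
    for tx, ty in truth:
        best = None
        for det in unused:
            dist = math.hypot(det[0] - tx, det[1] - ty)
            if dist <= tolerance and (best is None or dist < best[0]):
                best = (dist, det)
        if best is not None:
            errors.append(best[0])
            unused.remove(best[1])
    return errors


def recall_and_rms(truth, detections, tolerance=1.0):
    """Recall = matched / ground truth; RMS error over matched pairs (pixels)."""
    errors = match_detections(truth, detections, tolerance)
    recall = len(errors) / len(truth)
    rms = math.sqrt(sum(e * e for e in errors) / len(errors)) if errors else float("nan")
    return recall, rms


truth = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
detections = [(10.3, 10.4), (20.0, 19.8), (50.0, 50.0)]
recall, rms = recall_and_rms(truth, detections)
```

In this toy case two of three particles are matched, so recall is 2/3, and the matched distances of 0.5 and 0.2 pixels give an RMS error of about 0.38 pixels, which sits inside the 0.1 to 0.4 pixel range the abstract reports for YOLOv9.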

2606.00054 2026-06-02 cs.RO cs.AI cs.CV 版本更新

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

从人类视频到机器人操作:基于人类中心数据的可扩展视觉-语言-动作学习综述

Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo

Affiliations * Tsinghua University; HKUST; Xi’an Jiaotong University; Fudan University; Microsoft Research Asia; Peking University; Microsoft Zurich Project

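AI summary Surveys how abundant human videos can be transformed into effective knowledge for vision-language-action (VLA) models, groups existing methods into four classes (latent action representations, predictive world models, explicit 2D supervision, explicit 3D reconstruction), and identifies three open challenges: structuring unstructured videos, mapping video-derived supervision to robot actions across embodiments and viewpoints, and designing evaluation protocols.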

Comments Accepted to IJCAI 2026 Survey Track. Project page: https://aaronfengzy.github.io/HumanCentricToVLA-Survey/

详情
AI中文摘要

近期在可泛化具身控制方面的进展由大规模预训练的视觉-语言-动作(VLA)模型驱动。然而,大多数现有方法依赖于大量机器人演示数据,这些数据获取成本高昂且与特定具身紧密耦合。相比之下,人类视频丰富且捕捉了丰富的交互,为真实世界操作提供了多样的语义和物理线索。然而,具身差异以及任务对齐标注的频繁缺失使得它们直接用于VLA模型具有挑战性。本综述提供了一个统一的视角,探讨如何将人类视频转化为VLA模型的有效知识。我们根据所提取的动作相关信息将现有方法分为四类:(i) 编码帧间变化的潜在动作表示;(ii) 预测未来帧的预测世界模型;(iii) 提取图像平面线索的显式2D监督;(iv) 恢复几何或运动的显式3D重建。除分类外,我们强调了该领域的三个关键开放挑战:将非结构化视频结构化为可训练的片段、在具身和视角异质性下将视频导出的监督接地到机器人可执行动作中,以及设计能更好预测真实世界部署性能和迁移效率的评估协议,从而为未来研究方向提供参考。论文和资源的精选列表见 https://github.com/AaronFengZY/HumanCentricToVLA-Survey。

英文摘要

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA-Survey.

2606.00046 2026-06-02 cs.MM cs.AI cs.CV cs.CY 版本更新

When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts

当玩笑越界:分析YouTube Shorts中的常规幽默与黑色幽默

Sydney Johns, Sanjeev Parthasarathy, Shantnu Bhalla, Vaibhav Garg

Affiliations * Virginia Polytechnic Institute and State University

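AI summary Builds the TwistedHumor dataset (1,211 YouTube Shorts paired with 33,041 comments, with hand annotations) and, through a multi-view analysis (LLooM-based concept induction, comment sentiment analysis, and large language model evaluation), shows how regular and dark humor in short-form video differ in themes, audience reactions, and model detection, underscoring the need for context-aware moderation.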

详情
AI中文摘要

YouTube等视频平台重塑了用户参与娱乐和信息的方式,强调简短、高参与度的内容,如Shorts。在这个生态系统中,某些内容处于灰色地带:虽然允许存在,但仍可能对部分观众产生意想不到的负面影响。为了研究这一问题,我们引入了TwistedHumor数据集,包含1,211个YouTube Shorts及其配对的33,041条相关评论,并手工标注了幽默存在性、幽默类型、伤害性、主题、修辞手法和单口喜剧背景。除了数据集构建,我们还提出了对短格式社交媒体中幽默与伤害表现的多视角分析。通过使用基于LLooM的概念归纳对视频描述进行分析,我们发现黑色幽默经常围绕批评、应对、尴尬和身份表达等主题聚集,而不是作为一个单一的类别出现。我们进一步通过关联评论分析观众反应,表明常规幽默与更积极的情感相关,而黑色幽默则收到更多混合、中性甚至有时更有毒的反馈。最后,我们评估了大语言模型与人类标注的一致性,发现它们在单口喜剧上的表现优于短笑话。综合来看,这些结果将TwistedHumor不仅定位为一个新的基准,而且是对短格式视频中幽默与伤害灰色地带的实证研究,强调了需要上下文感知的审核和更稳健的多模态评估。

英文摘要

Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging content such as Shorts. Within this ecosystem, certain content occupies a gray area where it remains allowed but may still have unintended negative effects on some audiences. To study this problem, we introduce TwistedHumor, a dataset of 1,211 YouTube Shorts paired with 33,041 related comments, with hand annotations for humor presence, humor type, harm, topic, rhetorical devices, and stand-up context. Beyond dataset creation, we present a multi-view analysis of how humor and harm appear in short-form social media. Using LLooM-based concept induction over video descriptions, we find that dark humor frequently clusters around themes of critique, coping, awkwardness, and identity expression rather than appearing as a single uniform category. We further analyze audience response through linked comments and show that regular humor is associated with more positive sentiment, while dark humor receives more mixed, neutral, and sometimes more toxic reactions. Finally, we evaluate large language models against human annotations and find that they perform better on stand-up comedy than on shorter jokes. Together, these results position TwistedHumor not only as a new benchmark, but as an empirical study of the gray area between humor and harm in short-form video, highlighting the need for context-aware moderation and more robust multimodal evaluation.

2606.00001 2026-06-02 cs.HC cs.CV cs.MM 版本更新

Shu Dao: A Calligraphy Score Framework Linking Calligraphy, Music, and Performance

书道:连接书法、音乐与表演的评分框架

Lican Huang

Affiliations * Hangzhou Domain Zones Technology Co., Ltd.

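AI summary Proposes the CWSR notation and the Shu Dao framework, which model East Asian calligraphy as a structured performance analogous to a musical score and support human-AI co-creation.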

Comments 47 pages

详情
Journal ref
Journal of Advances in Information Science and Technology, 2026 4(2), 1-47. https://yvsou.com/journal/index.php/jaist/article/view/43
AI中文摘要

本文介绍了书法书写评分表示法(CWSR),并提出了书道框架,将东亚书法解读为一种表演艺术而非静态视觉产物。受日本书道和茶道等体现文化实践的启发,该框架将书法建模为类似于音乐符号的结构化表演。该方法不将字符表示为固定图像,而是将每个笔画编码为有序且可执行的动作,形成书法评分。字符在结构化空间网格中组织,笔画标注有类型、执行顺序、空间坐标、轨迹、构图角色以及动态属性(如笔压和节奏)。这种表示捕捉了书法书写中通常图像表示所缺失的时间和表达方面。本文做出三项主要贡献:首先,引入CWSR作为结构化符号系统,在笔画、字符结构和构图组织(如布局和章法)等多个层面表示书法,及其节奏和表演动态;其次,将书道概念化为基于评分的框架,将书法建模为结构化表演;第三,为基于AI的书法智能体分析、可视化和可执行生成书法作品建立计算基础。这些贡献共同连接了书法、音乐符号和表演文化实践,支持计算书法和数字人文研究中的人机共创。

英文摘要

This paper introduces Calligraphy Writing Score Representation (CWSR) and proposes Shu Dao as a framework that interprets East Asian calligraphy as a performative art rather than a static visual artifact. Inspired by traditions such as Japanese Shodō and embodied cultural practices such as Chadao, the framework models calligraphy as a structured performance analogous to musical notation. Instead of representing characters as fixed images, the proposed approach encodes each brush stroke as an ordered and executable action, forming a calligraphy score. Characters are organized within a structured spatial grid, and strokes are annotated with attributes including stroke type, execution order, spatial coordinates, trajectory, compositional role, and dynamic properties such as brush pressure and pacing. This representation captures temporal and expressive aspects of calligraphic writing that are typically absent from image-based representations. The paper makes three main contributions. First, it introduces CWSR as a structured notation system for representing calligraphy across multiple levels, including strokes, character structures, and compositional organization (e.g., layout and zhangfa), together with their rhythmic and performative dynamics. Second, it conceptualizes Shu Dao as a score-mediated framework that models calligraphy as structured performance. Third, it establishes a computational foundation for the analysis, visualization, and executable generation of calligraphic works by AI-based calligraphic agents. Together, these contributions bridge calligraphy, musical notation, and performative cultural practices, supporting human-AI co-creation in computational calligraphy and digital humanities research.

2605.31597 2026-06-02 cs.CV 版本更新

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

SOCO: 视觉基础模型中语义对象对应关系的基准测试

Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski

Affiliations * Max Planck Institute for Informatics; Saarland Informatics Campus; CISPA Helmholtz Center for Information Security; University of Freiburg

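AI summary Proposes the SOCO benchmark, which introduces a taxonomy of correspondence types and functionally meaningful keypoint annotations covering 100 categories and over 1M correspondence pairs, to systematically evaluate semantic correspondence in vision foundation models; it reveals weak cross-category transfer and a gap between language-grounded localization and visual correspondence.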

Comments Project page: https://genintel.github.io/SOCO/

详情
AI中文摘要

由于评估协议不一致和部分级监督有限,测量视觉基础模型中的结构化对象理解仍然具有挑战性。语义对应(SC)通过测试对象部分是否能在外观、视角和几何形状的巨大变化下跨实例和类别匹配来评估这种能力。为了实现系统的SC评估,我们引入了SOCO,一个新的语义对象对应基准,它引入了对应类型的分类法,并在100个类别和超过100万对应对上提供了一致、功能上有意义的关键点注释。此外,SOCO包括关键点语言描述,使得能够评估大型视觉语言模型(LVLMs)及其细粒度部分级理解。综合实验揭示:(i) 视觉基础骨干编码了强大的语义结构,但在相关类别之间转移对应关系较差,且仅部分捕捉对象部分位置;(ii) LVLMs在文本提示的部分定位方面比视觉参考的跨图像匹配更强,暴露了语言引导定位与细粒度视觉对应之间的差距;(iii) 对应性能比ImageNet分类更能预测密集下游任务(包括分割、跟踪、3D姿态估计和3D检测)的性能。总之,这些发现将SOCO定位为视觉和多模态基础模型中结构化、部分级表示质量的基准。

英文摘要

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.

2605.31557 2026-06-02 cs.CV 版本更新

EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision

EGOSTREAM: 面向第一人称视角的流式情景记忆诊断基准

Rosario Forte, Giuseppe Lando, Antonino Furnari

Affiliations * Department of Mathematics and Computer Science, University of Catania

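AI summary Proposes the EGOSTREAM benchmark, which uses seven cognitive dimensions and Answer Validity Windows to diagnose episodic memory in streaming video models, and evaluates several memory-management mechanisms.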

详情
AI中文摘要

连续情景记忆是自主代理在动态真实环境中运行的核心能力,然而当前的流式视频基准为诊断模型记住什么以及记忆多久提供的工具有限。我们引入Egostream,一个面向第一人称视角的流式情景记忆评估诊断基准。Egostream沿七个认知维度组织了2250个精心设计的问题:细节、空间、时间、事件、社会、因果和前瞻记忆。我们引入了答案有效期窗口(AVW),它指定了随着观察场景演变答案保持有效的时间跨度。这使得我们将问题扩展为8528个回忆条件评估,从而能够控制从即时到超长期的回忆测试,同时将模型真正的遗忘与自然世界状态变化区分开来。我们通过一个统一的流式多模态大语言模型框架严格建立了基线性能,该框架比较了几种最先进的记忆管理机制,包括滑动窗口、注意力汇聚、KV缓存剪枝、合并和卸载。在统一的Qwen3-VL骨干网络上的实验表明,可比较的总体准确率掩盖了截然不同的记忆特征。例如,token剪枝在保留细粒度细节和时间结构方面显著优于token合并,而量化卸载则挽救了超长期回忆。最终,所有机制都远低于实时运行(>1秒每帧),且表现最好的方法准确率上限约为45%,揭示了当前架构中的关键差距。Egostream提供了弥合这些差距所需的诊断测试平台。项目网站、新闻和更新请访问:https://saroo25.github.io/Egostream/

英文摘要

Continuous episodic memory is a core capability for autonomous agents operating in dynamic, real-world environments, yet current streaming video benchmarks provide limited tools for diagnosing what models remember and for how long. We introduce Egostream, a diagnostic benchmark for streaming episodic memory evaluation in egocentric vision. Egostream organizes 2,250 curated questions along seven cognitive dimensions: detail, spatial, temporal, event, social, causal, and prospective memory. We introduce the Answer Validity Window (AVW), which specifies the temporal span over which an answer remains valid as the observed scene evolves. This allows us to expand the questions into 8,528 recall-conditioned evaluations, enabling controlled testing from instant to ultra-long-term recall while separating genuine model forgetting from natural world-state changes. We establish baseline performance through a unified streaming MLLM framework that compares several state-of-the-art memory-management mechanisms, covering sliding windows, attention sinks, KV-cache pruning, merging, and offloading. Experiments with a unified Qwen3-VL backbone reveal that comparable aggregate accuracies mask starkly different memory profiles. For instance, token pruning preserves fine-grained details and temporal structure significantly better than token merging, while quantized offloading rescues ultra-long-term recall. Ultimately, all mechanisms operate well below real-time (>1s per frame), and top-performing methods plateau at about 45% accuracy, exposing critical gaps in current architectures. Egostream provides the diagnostic testbed needed to close these gaps. Project website, news and updates at: https://saroo25.github.io/Egostream/
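
The Answer Validity Window idea can be made concrete with a small sketch: a question is only asked at recall delays for which its answer is still valid, so a wrong answer reflects forgetting rather than a change in the world. The field names, the bucket thresholds, and the example question are all hypothetical; the abstract does not give the benchmark's actual delay bins or data schema.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    evidence_time: float  # seconds into the stream when the answer is observable
    valid_until: float    # end of the Answer Validity Window, in seconds


# Hypothetical recall-delay buckets: (name, lower bound in seconds).
BUCKETS = [("instant", 0), ("short", 10), ("long", 60), ("ultra-long", 600)]


def bucket_for(delay):
    """Return the name of the last bucket whose lower bound is <= delay."""
    label = BUCKETS[0][0]
    for name, lower in BUCKETS:
        if delay >= lower:
            label = name
    return label


def recall_queries(question, delays):
    """Expand one question into (query_time, bucket) pairs, keeping only
    query times that still fall inside the Answer Validity Window."""
    queries = []
    for delay in delays:
        query_time = question.evidence_time + delay
        if query_time <= question.valid_until:
            queries.append((query_time, bucket_for(delay)))
    return queries


q = Question("Where did I leave the keys?", evidence_time=30.0, valid_until=200.0)
queries = recall_queries(q, delays=[0, 20, 120, 900])

# Average number of recall-conditioned evaluations per curated question.
evaluations_per_question = round(8528 / 2250, 2)
```

Here the 900-second delay is dropped because the keys may have moved by then, which is the mechanism that separates forgetting from world-state change. The reported counts imply roughly 3.79 recall-conditioned evaluations per curated question.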

2605.31487 2026-06-02 cs.CV 版本更新

Enhancing Computer Vision Model Generalization in Warehouse Facilities: A Case Study on Anomaly Detection in Vertical Material Handling Systems

提升仓库设施中计算机视觉模型泛化能力:垂直物料搬运系统异常检测案例研究

Ruiliang Liu, Tina Dongxu Li, Joshua Migdal, Ken Meszaros, Trevor Dardik

发表机构 * Amazon, USA(亚马逊公司)

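AI summary Using optimal camera placement, an image-triggering strategy, model selection, and model ensembling in a laboratory setting, the study generalizes anomaly detection models for vertical material handling systems from the lab to diverse warehouse environments, simplifying deployment and saving annotation and retraining effort.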

Comments 6 pages, 10 figures. Accepted at IEEE International Conference on Mechatronics and Automation (ICMA) 2026

详情
AI中文摘要

在仓库设施中部署计算机视觉模型传统上需要大量资源用于相机安装、图像采集、标注、训练和部署——由于相机安装限制和环境变化,这一过程通常需要在每个新环境中重复。本文探索了一种创新方法,通过仅在实验室环境中执行标准流程来简化这一过程,重点关注垂直物料搬运系统及其叉的异常检测。通过大量实验,我们发现结合最优相机布置、策略性图像触发、谨慎的模型选择和模型集成,能够实现从实验室条件到多种仓库设施环境的有效泛化,可能通过将仓库设施部署简化为仅需相机安装、图像采集和模型部署,从而节省通常用于图像标注和模型重训练的大量资源和时间,改变仓库自动化实施方式。这是一项实验研究,并非生产部署。

英文摘要

Deploying computer vision models in warehouse facilities traditionally requires extensive resources for camera mounting, image collection, annotation, training, and deployment, a process that often must be repeated in each new environment because of camera mounting constraints and environmental variability. This paper explores an approach that streamlines this process by conducting the standard procedure solely in a laboratory setting, focusing on vertical material handling systems and anomaly detection in the systems' forks. Through extensive experimentation, we found that combining optimal camera placement, strategic image triggering, careful model selection, and model ensembling enables effective generalization from laboratory conditions to diverse warehouse environments. This could reduce warehouse deployment to camera mounting, image collection, and model deployment alone, saving the significant resources and time typically spent on image annotation and model retraining. This is an experimental research study and not a production deployment.

2605.31437 2026-06-02 cs.CV 版本更新

Astra: a generalizable report generation foundation model for 3D computed tomography

Astra:一种用于三维计算机断层扫描的通用报告生成基础模型

Zhuhao Wang, Fang Chen, Chaohui Yu, Zihan Li, Yuchao Zheng, Jing Wang, Xuan Yang, Jia Guo, Zhenlu Yang, Xingju Zheng, Yihua Sun, Haojie Han, Xiaoxiao Qin, Zhan Feng, Wenbo Xiao, Chao Zhu, Yuehua Li, Shipeng Zhang, Hao Luo, Yunsong Peng, Fan Wang, Hongen Liao

发表机构 * School of Biomedical Engineering, Tsinghua University(清华大学生物医学工程学院) School of Biomedical Engineering, Shanghai Jiao Tong University(上海交通大学生物医学工程学院) DAMO Academy, Alibaba Group(阿里巴巴达摩院) Hupan Laboratory(湖畔实验室) Department of Biomedical Engineering, National University of Singapore(新加坡国立大学生物医学工程系) Department of Radiology, Guizhou Provincial People’s Hospital(贵州省人民医院放射科) Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine(浙江大学医学院附属第一医院放射科) Department of Radiology, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine(上海交通大学医学院附属第六人民医院放射科) College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院)

AI总结 提出Astra模型,通过风格统一和强化学习,在8个器官系统的CT报告生成中实现高精度,平均细粒度诊断指标提升44.1%,并加速临床工作流。

详情
AI中文摘要

CT解读需要放射科医生每次检查审查数百个容积切片,使得报告耗时且高度依赖专业知识。自动CT报告生成为提高临床效率提供了一条有前景的途径,但该领域仍缺乏一个支持多区域报告并在外部真实世界队列中保持鲁棒性的通用CT报告生成基础模型。不同队列间报告风格和诊断术语的内在不一致性使得朴素联合训练容易受到噪声文本监督的影响,从而限制了模型的泛化能力。本文提出Astra,一个通用的CT报告生成基础模型,在包含90,678个胸腹部CT-报告对(CTRgDB)的数据集上训练,涵盖8个器官系统的353,671个异常。通过统一报告风格并进一步通过强化学习细化诊断一致性,Astra实现了跨不同解剖区域和机构的风格一致且诊断准确的报告生成。在CTRgDB和六个外部队列上的评估显示,Astra在细粒度诊断指标上平均提升44.1%(P<0.001),达到最先进性能。在实际临床工作流中,Astra辅助将胸部报告起草速度提高29.6%,并将腹部报告完整性提高11.3%(P<0.001)。此外,Astra作为CT AI开发的基础也展现出广泛实用性,通过高质量报告合成改善下游诊断性能并扩展视觉-语言预训练。总体而言,Astra作为一个广泛可用的临床助手和下一代AI医疗的关键基础设施。

英文摘要

CT interpretation requires radiologists to review hundreds of volumetric slices per examination, making reporting time-consuming and highly expertise-dependent. Automated CT report generation offers a promising route to improving clinical efficiency, yet the field still lacks a generalizable CT report generation foundation model that supports multi-region reporting and remains robust across external real-world cohorts. Intrinsic inconsistencies in reporting style and diagnostic terminology across cohorts make naive joint training prone to noisy textual supervision, thereby limiting model generalizability. Here we present Astra, a generalizable CT report generation foundation model trained on 90,678 thoracoabdominal CT-report pairs (CTRgDB) with 353,671 abnormalities spanning eight organ systems. By harmonizing report style and further refining diagnostic consistency via reinforcement learning, Astra achieves style-consistent and diagnostically accurate report generation across diverse anatomical regions and institutions. Evaluated on CTRgDB and six external cohorts, Astra achieves state-of-the-art performance with a 44.1% average improvement in fine-grained diagnostic metrics (P<0.001). In real-world clinical workflows, Astra assistance accelerates chest report drafting by 29.6% and improves abdominal report completeness by 11.3% (P<0.001). Furthermore, Astra demonstrates broad utility as a foundation for CT AI development, improving downstream diagnostic performance and scaling vision-language pretraining through high-quality report synthesis. Overall, Astra serves as a broadly accessible clinical assistant and pivotal infrastructure for the next generation of AI-powered healthcare.

2605.31162 2026-06-02 cs.CV cs.LG 版本更新

Guidance for Low-Level Perceptual Editing in Unconditional Diffusion Models

无条件扩散模型中低级感知编辑的引导

Shreyansh Modi, Akshat Tomar, Aarush Aggarwal

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基)

AI总结 针对无条件扩散模型在美学和感知增强中难以进行全局低级变换的问题,提出一种无需训练的推理时机制,通过提取退化概念向量并结合瓶颈修补与无分类器引导,实现图像编辑与质量提升。

Comments 11 pages, 12 figures, Generative Models for Computer Vision Workshop CVPR 2026

详情
AI中文摘要

无条件扩散模型提供了强大的生成先验,但将其引导至美学增强的输出仍未被充分探索。我们表明,h-空间修补(用于无训练扩散编辑的主导范式)在美学和感知细化所需的全局低级变换中系统性失败。我们引入了一种新颖的、通用的框架,用于在无条件扩散模型中进行图像编辑,无需显式训练。这种推理时机制通过提取退化概念向量并组合瓶颈修补与无分类器引导来操作低级特征,从而引导采样远离退化流形,无需任何模型重训练即可持续生成改进的图像。

英文摘要

Unconditional diffusion models offer powerful generative priors, yet steering them toward aesthetically enhanced outputs remains largely unexplored. We show that h-space patching, the dominant paradigm for training-free diffusion editing, systematically fails for global, low-level transformations required for aesthetic and perceptual refinement. We introduce a novel, generalized framework for image-editing in unconditional diffusion models without explicit training. This inference-time mechanism operates on low-level features by extracting degradation concept vectors and combining bottleneck patching with classifier-free guidance to guide sampling away from the degraded manifold, producing consistently improved images without any model retraining.

2605.30855 2026-06-02 cs.CV 版本更新

Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

Robust Dreamer: 用于动作控制AR视频生成的偏差感知潜在高斯记忆

Hanlin Chen, Jiaxin Wei, Xibin Song, Yifu Wang, Steve Wang, Hongdong Li, Pan Ji, Gim Hee Lee

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院) Technische Universität München(慕尼黑工业大学) Vertex Lab(Vertex实验室) Australian National University(澳大利亚国立大学)

AI总结 提出Robust Dreamer框架,通过潜在高斯记忆和动态偏差存档解决自回归视频生成中的漂移问题,实现长程3D一致性。

详情
AI中文摘要

逐帧动作控制的图像到视频生成是交互式世界模拟的一种有前景的范式,其中每个控制信号应引发即时的视觉响应。然而,在长自回归展开中保持视觉保真度和3D一致性仍然具有挑战性。现有的3D感知方法常常因两个障碍而遭受灾难性漂移:来自潜在-RGB循环的信息丢失,其中生成的潜在被反复解码为RGB并重新编码用于未来条件;以及由无误差假设引起的训练-推理差距,其中干净的训练记忆无法匹配预测损坏的推理记忆。为了解决这些挑战,我们提出了Robust Dreamer,这是一个围绕如何设计3D记忆以及如何稳健使用它而构建的记忆增强框架。首先,我们引入了潜在高斯记忆,它将从生成过程中继承的扩散潜在锚定到高斯基元,并通过潜在空间高斯泼溅召回它们。这提供了密集、几何感知、视图对齐的条件,同时避免了重复VAE转换导致的累积退化。其次,我们提出了带有动态偏差存档的偏差学习,它通过一步近似合成展开引起的潜在偏差,按自回归阶段和去噪时间戳存储,并在训练期间将其注入历史记忆。这使得生成器暴露于现实的损坏记忆状态,并在推理前学习内部修正。在ScanNet、DL3DV和OmniWorldGame上的实验证明了最先进的长程性能。

英文摘要

Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent-RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training-inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. To address these challenges, we present Robust Dreamer, a memory-augmented framework built around how to design 3D memory and how to use it robustly. First, we introduce Latent Gaussian Memory, which anchors diffusion latents inherited from the generation process to Gaussian primitives and recalls them via latent-space Gaussian splatting. This provides dense, geometry-aware, view-aligned conditioning while avoiding accumulated degradation from repeated VAE conversion. Second, we propose Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations through a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training. This exposes the generator to realistic corrupted memory states and teaches internal correction before inference. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.

2605.30581 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

工业视觉模拟到现实中的先验可用性:CAD引导与CAD不可用机制的综述

Chenxi Tao, Seung-Kyum Choi

发表机构 * George W. Woodruff School of Mechanical Engineering(乔治·W·伍德鲁夫机械工程学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文通过先验可用性视角重新组织工业视觉模拟到现实问题,区分CAD可用、CAD不可用和边界先验三种机制,并基于T-LESS/BOP、MVTec AD和VisA数据集进行实证分析,揭示了源分布设计、检测器容量和真实校准的重要性,以及CAD在测试时提供的独特验证通道。

Comments Review article; 103 references; 9 main figures; empirical anchors on T-LESS/BOP, MVTec AD, and VisA

详情
AI中文摘要

工业视觉模拟到现实通常被描述为从合成图像到真实图像的迁移,但工业部署通常涉及可用证据与所需决策之间更广泛的错配。系统可能基于CAD渲染、模拟RGB-D观测、正常参考图像、合成缺陷、预训练特征空间或语言提示构建,却在不同的传感器、光照、材料、夹具、校准、生产变化和罕见缺陷模式下部署。本综述将工业视觉模拟到现实重新定义为由先验可用性组织的域差距问题。我们区分了CAD可用设置(其中显式物体几何可支持渲染、校准、姿态估计、分割和测试时几何验证)、CAD不可用设置(其中几何被正常参考外观、特征分布、师生残差、合成异常假设、基础特征或视觉语言先验取代)以及边界先验设置(其中近似模型、模板、参考视图或语义对应仅保留CAD的部分作用)。这一框架将基于CAD的检测和6D姿态估计文献与通常单独综述的工业异常和表面检测文献联系起来。为使分类具体化,我们使用T-LESS/BOP、MVTec AD和VisA上的实证锚点。这些锚点表明,仅靠CAD渲染数量并不能弥合迁移;源分布设计、检测器容量和小规模真实校准可能更为重要。它们还表明,测试时的CAD通过掩码、姿态和深度一致性创建了独特的验证通道,而CAD不可用的检测则依赖于校准的正常性和特征偏差。因此,本综述反对单一跨任务排行榜,而是询问什么先验支撑了部署决策。

英文摘要

Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.

2605.30380 2026-06-02 cs.CV 版本更新

Lightweight SAR Ship Detection via Contrastive Distillation

基于对比蒸馏的轻量级SAR舰船检测

Surendar Devasundaram, Banafsheh Saber Latibari, Abhijit Mahalanobis

发表机构 * University of Arizona Department of Electrical and Computer Engineering(亚利桑那大学电气与计算机工程系)

AI总结 提出结构化统一关系知识蒸馏框架SURGE,通过对比InfoNCE目标在共享嵌入空间中转移关系几何,实现轻量级SAR舰船检测,在SSDD和HRSID上提升6.2 mAP和8.0 AP75。

Comments Accepted in GLSVLSI'26 special session 74: Efficiency In Computer Vision: From Image Generation to Decision

详情
AI中文摘要

深度卷积和基于Transformer的检测器在SAR舰船检测中表现出色,但通常计算成本高昂,难以用于实时或机载部署。轻量级模型提高了效率,但难以捕捉SAR后向散射中固有的复杂结构关系。大多数现有的SAR知识蒸馏方法依赖于特征或logit匹配,强制局部激活相似性,而忽略了对象表示之间的几何关系。我们提出了一种用于SAR舰船检测的结构化统一关系知识蒸馏框架(SURGE),该框架通过对比InfoNCE目标在共享投影嵌入空间中从强大的教师检测器向紧凑的学生检测器转移关系几何。据我们所知,这项工作提出了SAR领域中首个基于Transformer的SAR舰船检测器知识蒸馏框架。该框架与架构无关,为两阶段、一阶段和基于Transformer的检测器提供了通用的区域级蒸馏接口,无需修改其部署架构。在SSDD和HRSID基准上的实验表明,所提出的方法为两阶段检测器带来了显著改进,相比基线学生模型实现了高达6.2 mAP和8.0 AP75的提升,甚至超越了教师性能。

英文摘要

Deep convolutional and transformer-based detectors achieve strong performance for SAR ship detection but are often computationally prohibitive for real-time or onboard deployment. Lightweight models offer improved efficiency yet struggle to capture the complex structural relationships inherent in SAR backscatter. Most existing SAR knowledge-distillation approaches rely on feature or logit matching, which enforces localized activation similarity while neglecting the geometric relationships among object representations. We propose a Structured Unified Relational knowledGE distillation framework for SAR Ship detection (SURGE) that transfers relational geometry from a powerful teacher detector to a compact student detector using a contrastive InfoNCE objective in a shared projection embedding space. To the best of our knowledge, this work presents the first knowledge distillation framework for transformer-based ship detectors in the SAR domain. The framework is architecture-agnostic in the sense that it provides a common region-level distillation interface for two-stage, one-stage, and transformer-based detectors without modifying their deployed architectures. Experiments on the SSDD and HRSID benchmarks demonstrate that the proposed method yields substantial improvements for two-stage detectors, achieving gains of up to 6.2 mAP and 8.0 AP75 over the baseline student and even surpassing teacher performance.
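
The contrastive objective named in this abstract can be illustrated with a generic InfoNCE loss. This is a minimal sketch, not the SURGE implementation: it assumes that student region embedding i and teacher region embedding i form the positive pair, that all other teacher embeddings act as negatives, and that embeddings are compared by cosine similarity. The function names and the temperature value are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(student, teacher, temperature=0.1):
    """Mean InfoNCE loss; student[i] and teacher[i] are assumed to be the positive pair."""
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, si in enumerate(s):
        # Similarity of this student embedding to every teacher embedding.
        logits = [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max before exponentiating for stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(s)

# Toy 2-D embeddings: three regions seen by a teacher and a student.
teacher_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_student = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled_student = [aligned_student[1], aligned_student[2], aligned_student[0]]

loss_aligned = info_nce(aligned_student, teacher_emb)
loss_shuffled = info_nce(shuffled_student, teacher_emb)
```

The loss is small when each student embedding sits closest to its own teacher embedding and grows when the pairing is scrambled, which is the sense in which such an objective rewards matching relational structure rather than individual activations.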

2605.27590 2026-06-02 cs.CV cs.MM 版本更新

ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning over Large-Scale Forest Scenes

ForestHG-Trace: 大规模森林场景下的可追踪长程生态推理

Zihang Cheng, Duanchu Wang, Cheng Li, Jing Huang, Huanzhao Fu, Di Wang

AI总结 提出ForestHG-Trace框架,通过生态超图表示和LLM引导的确定性工具链,实现森林场景中可追踪的多步生态推理,并构建ForestTraceQA基准,显著提升长程生态问答的准确性和执行忠实度。

Comments It has theoretical flaws and experimental errors

详情
AI中文摘要

遥感问答(RS-QA)通常需要超越直接语义预测的能力,尤其是在大规模森林场景中,生态分析涉及多步过滤、数值聚合、邻域推理和可验证证据。我们提出ForestHG-Trace,一个用于森林环境中可追踪长程生态推理的框架。它将多模态NEON森林场景表示为生态超图,其中树木实例、空间单元、语义组和邻域关系支持超越成对场景图的高阶推理。然后,一个LLM引导的智能体调用确定性工具进行读取、过滤、扩展、聚合、比较和审计,生成可重放的执行轨迹和紧凑的证据记录,而不仅仅是自由形式的答案。我们进一步构建了ForestTraceQA,一个可执行的基准,用于评估跨不同任务类型和推理深度的生态问答。实验表明,ForestHG-Trace在答案准确性和执行忠实度上显著优于单步基线和场景图智能体,同时指出执行深度是长程生态问答的主要瓶颈。

英文摘要

Remote sensing question answering (RS-QA) often requires more than direct semantic prediction, especially in large-scale forest scenes where ecological analysis involves multi-step filtering, numerical aggregation, neighborhood reasoning, and verifiable evidence. We introduce ForestHG-Trace, a framework for traceable long-horizon ecological reasoning over forest environments. It represents multimodal NEON forest scenes as ecological hypergraphs, where tree instances, spatial units, semantic groups, and neighborhood relations support higher-order reasoning beyond pairwise scene graphs. An LLM-guided agent then invokes deterministic tools for reading, filtering, expansion, aggregation, comparison, and auditing, producing replayable execution traces and compact evidence records rather than only free-form answers. We further construct ForestTraceQA, an executable benchmark for evaluating ecological QA across diverse task types and reasoning depths. Experiments show that ForestHG-Trace substantially improves answer accuracy and execution faithfulness over single-step baselines and scene-graph agents, while highlighting execution depth as the main bottleneck for long-horizon ecological QA.

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

融合异质注意力结构的Transformer模型通用解释方法

Yongjin Cui, Xiaohui Fan, Huajun Chen

发表机构 * Zhejiang University(浙江大学)

AI总结 针对Transformer中异质注意力结构(如共注意力)带来的多源信息融合挑战,提出一种通用解释方法,并通过实验分析范式对代表性模型进行语义和逻辑解释。

详情
AI中文摘要

Transformer极大地推动了人工智能的发展,也推动了智能体(agent)的发展。我们将Transformer的注意力结构根据输入信息的来源分为两类:同质注意力结构和异质注意力结构。异质注意力结构以共注意力(co-attention)为典型例子,处理来自不同来源的信息。异质注意力结构是Transformer模型实现更复杂功能、融合更多模态信息的基础。无论是出于研究目的还是政策要求,对具有异质注意力结构的Transformer模型进行解释都是一项重要任务。来自不同来源的信息融合带来了新的挑战。我们的工作主要包括方法和实验两部分。在方法方面,我们提出了一种针对具有异质注意力结构的Transformer模型的解释方法。在实验方面,基于我们的实验分析范式,我们解释代表性模型的操作机制,进行语义解释和逻辑解释。

英文摘要

The Transformer has significantly propelled the development of artificial intelligence and, with it, the development of agents. We categorize the attention structures of Transformers into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. They are the foundation for Transformer models to achieve more complex functions and integrate information from more modalities. Whether for research purposes or policy requirements, interpreting Transformer models with heterogenous attention structures is an important task, and the fusion of information from different sources brings new challenges. Our work has two parts: method and experimentation. For the method, we propose an interpretation method for Transformer models with heterogenous attention structures. For the experiments, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models and conduct semantic and logical interpretation.

2605.26292 2026-06-02 cs.CV cs.CL 版本更新

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Evi-Steer:通过高效且可泛化的证据调优学习引导生物医学视觉-语言模型

Taha Koleilat, Hassan Rivaz, Yiming Xiao

发表机构 * Concordia University(康科迪亚大学)

AI总结 提出Evi-Steer框架,通过证据跨模态低维引导实现BiomedCLIP的不确定性感知参数高效微调,仅更新0.11%参数,在15个生物医学数据集上少样本学习和域泛化设置中优于现有方法。

Comments MICCAI 2026 Early Accept; Project Page: https://tahakoleilat.github.io/Evi-Steer. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October

详情
AI中文摘要

视觉-语言基础模型的参数高效适配对于生物医学图像的精确多模态理解至关重要,但现有方法仍具有确定性,且在域偏移或模糊的图像-文本对齐下常常表现不佳。这一限制在临床中尤为关键,因为模型应在低数据场景和域偏移下保持鲁棒性。我们提出了Evi-Steer,一个用于BiomedCLIP的证据跨模态低维引导框架,能够在仅更新总模型参数0.11%的情况下实现不确定性感知的参数高效微调。我们的方法在视觉和文本编码器中执行轻量级低维令牌更新,同时估计认知不确定性。这些不确定性估计更新门控残差,使模型在证据较弱时能够保守地适应。此外,我们引入了基于Dempster-Shafer理论的跨模态置信度融合,使视觉适应能够以文本置信度为条件,并抑制冲突或不确定的跨模态更新。我们在涵盖8个器官和8种成像模态的15个生物医学成像数据集上,在少样本学习和域泛化设置下进行了全面评估。Evi-Steer在少样本学习和域偏移设置下始终优于最先进的方法,展示了在真实临床环境中部署视觉-语言模型的实用且鲁棒的途径。代码可在https://github.com/HealthX-Lab/Evi-Steer获取。

英文摘要

Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at https://github.com/HealthX-Lab/Evi-Steer.
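
For readers unfamiliar with Dempster-Shafer theory, the sketch below shows the textbook Dempster rule of combination applied to two belief sources. It is an illustration only: the abstract does not say how Evi-Steer builds or fuses its mass functions, and the two-class frame, the mass values, and every name here are hypothetical.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass landing on contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalise by the non-conflicting mass.
    fused = {k: v / (1.0 - conflict) for k, v in combined.items()}
    return fused, conflict

BENIGN = frozenset({"benign"})
MALIGNANT = frozenset({"malignant"})
EITHER = BENIGN | MALIGNANT  # mass on the whole frame expresses "don't know"

# Hypothetical beliefs from a vision branch and a text branch.
vision_mass = {BENIGN: 0.6, MALIGNANT: 0.1, EITHER: 0.3}
text_mass = {BENIGN: 0.5, MALIGNANT: 0.2, EITHER: 0.3}

fused_mass, conflict_mass = dempster_combine(vision_mass, text_mass)
```

When both sources lean the same way, the fused belief in that class rises and the mass left on the whole frame shrinks. The returned conflict term measures how strongly the two sources disagree, which is one plausible signal for suppressing conflicting cross-modal updates.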

2605.24634 2026-06-02 cs.CV 版本更新

Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction

通过校准交互解决组合图像检索中的歧义

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

发表机构 * Amsisan Tran Baogh Le Tuan Kiet Pham Sui Yang Guang

AI总结 本文提出将组合图像检索重新定义为不确定性下的校准意图解析,通过共形预测层提供覆盖保证的候选集,并利用期望信息增益策略提出最有效的澄清问题,从而解决查询歧义和假阴性问题。

详情
AI中文摘要

组合图像检索(CIR)使用参考图像和描述如何修改它的文本搜索语料库。尽管从三元组训练的合成器到零样本和生成方法取得了快速进展,但所有系统本质上都共享一个假设:查询映射到单个目标,通过Recall@K针对一个标注进行评分。我们认为这与任务根本不一致。诸如“使其更正式”之类的查询并不命名一个图像,而是命名语料库的一个区域,用户意图中的哪个成员真正是不确定的。这种欠指定是众所周知的假阴性问题的根源,并使得当前模型无法区分精确查询和模糊查询。我们将CIR重新定义为不确定性下的校准意图解析:检索器被包裹在一个共形预测层中,该层返回一个具有覆盖保证的候选集,其大小是歧义的原则性度量;当集合很大时,期望信息增益策略从可解释的歧义轴中提出一个最有用的澄清问题,然后集合收缩。我们引入了AmbiCIR,一个基准和经过人工验证的用户模拟器,它复活了CIRR中休眠的辅助和对话标注,并扩展了CIRCO的多正例设置。在开放域和时尚基准上,我们的方法匹配了单轮最先进水平,确认了校准解析在精确查询上是无成本的,同时以朴素对话基线所需交互预算的一小部分达到预期目标,并且它是第一个为任务报告有效覆盖和校准的方法。

英文摘要

Composed image retrieval (CIR) searches a corpus with a reference image and a text describing how to modify it. Despite rapid progress from triplet-trained compositors to zero-shot and generative methods, essentially all systems share one assumption: that a query maps to a single target, scored by Recall@K against one annotation. We argue this is fundamentally at odds with the task. A query such as "make it more formal" does not name an image but a region of the corpus, and which member the user intends is genuinely underdetermined. This underspecification is the root of the well-known false-negative problem and leaves current models unable to tell a precise query from an ambiguous one. We reframe CIR as calibrated intent resolution under uncertainty: a retriever is wrapped in a conformal prediction layer that returns a candidate set with a coverage guarantee and whose size is a principled measure of ambiguity; when the set is large, an expected-information-gain policy asks the single most useful clarifying question, drawn from interpretable ambiguity axes, and the set contracts. We introduce AmbiCIR, a benchmark and human-validated user simulator that revive the dormant auxiliary and dialogue annotations of CIRR and extend the multiple-positive setting of CIRCO. Across open-domain and fashion benchmarks, our method matches the single-turn state of the art, confirming that calibrated resolution is cost-free on precise queries, while reaching the intended target in a fraction of the interaction budget required by naive conversational baselines, and it is the first to report valid coverage and calibration for the task.
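
The idea of wrapping a retriever in a conformal layer can be sketched with standard split conformal prediction. The example below is a hedged illustration, not the paper's procedure: it assumes the nonconformity score is one minus the retriever's similarity to the true target, and the calibration scores, image names, and similarity values are invented for the demonstration.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Finite-sample quantile of calibration nonconformity scores (split conformal)."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if rank > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]

def candidate_set(similarities, threshold):
    """Keep each candidate whose nonconformity (1 - similarity) is within the threshold."""
    return [name for name, sim in similarities.items() if 1.0 - sim <= threshold]

# Hypothetical calibration scores: 1 - similarity(query, true target) on held-out queries.
calibration = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40,
               0.08, 0.18, 0.28, 0.33, 0.11, 0.21, 0.31, 0.14, 0.24]
q_hat = conformal_threshold(calibration, alpha=0.1)

precise_query = {"img_a": 0.95, "img_b": 0.40, "img_c": 0.30}
ambiguous_query = {"img_a": 0.72, "img_b": 0.70, "img_c": 0.69}

precise_set = candidate_set(precise_query, q_hat)
ambiguous_set = candidate_set(ambiguous_query, q_hat)
```

With these numbers the precise query yields a single candidate while the ambiguous query keeps all three, so the set size acts as the ambiguity signal. Under the usual exchangeability assumption, the true target falls inside the returned set with probability at least 1 - alpha.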

2605.26102 2026-06-02 cs.CV 版本更新

InstructSAM: Segment Any Instance with Any Instructions

InstructSAM: 根据任意指令分割任意实例

Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin, Juncheng Li, Siliang Tang, Jun Xiao, Yueting Zhuang, Wenqiao Zhang

发表机构 * Zhejiang University(浙江大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学)

AI总结 提出InstructSAM框架,通过将指令驱动实例分割建模为集合结构查询预测问题,并设计显式推理到实例查询接口,结合视觉语言模型和SAM3实现单次前向传播中的多实例分割。

Comments 19 pages, 8 figures, code: https://github.com/DCDmllm/InstructSAM

详情
AI中文摘要

在本文中,我们介绍了InstructSAM,一个统一且精简的框架,旨在任意指令下进行多实例分割。我们将指令驱动的实例分割形式化为一个集合结构查询预测问题,并提出了一个显式的推理到实例查询接口,优雅地桥接了视觉语言模型(VLM)和SAM3。具体来说,一组可学习的实例查询被注入到VLM中,并与指令和视觉信息进行上下文关联,使每个查询成为一个实例感知槽。混合注意力机制进一步促进了这些查询、视觉令牌和指令令牌之间的交互,改进了实例枚举并减少了重复预测。得到的LLM条件查询被投影到SAM3的检测器查询空间中,以在单次前向传播中驱动准确的多实例分割。这种设计赋予了SAM3高级指令理解、组合推理和实例级集合预测的能力,而无需修改其核心架构。为了支持训练和评估,我们进一步构建了Inst2Seg,一个高质量、大规模的基于指令的实例分割数据集和基准,将自由形式的指令与实例级掩码配对。大量实验表明,仅2B规模的InstructSAM在复杂的指令驱动和短语级指代分割基准上取得了强劲的结果,超越了之前的端到端方法和SAM3的代理流水线,同时实现了高效的单次多实例预测。

英文摘要

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.

2605.26089 2026-06-02 cs.CV cs.AI 版本更新

Channel-wise Vector Quantization

通道级向量量化

Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Min Li, Jiaqi Wang, Kaicheng Yu

发表机构 * Shanghai Innovation Institute(上海创新研究院) Westlake University(西湖大学) Zhejiang University(浙江大学) Fudan University(复旦大学) JD.COM(京东公司) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出通道级向量量化(CVQ)代替补丁级量化,并基于此设计通道级自回归(CAR)模型,通过逐通道预测实现渐进式细节生成,在图像重建和文本到图像生成中取得优异性能。

详情
AI中文摘要

我们提出了通道级向量量化(CVQ),一种新颖的图像标记化范式,用通道级标记取代补丁级标记。与传统的向量量化(为每个补丁特征向量分配一个离散标记)不同,CVQ 对特征图的每个通道进行量化。这种表示将图像表示为视觉细节的离散层级,而不是空间补丁的网格。基于 CVQ,我们引入了一种新的视觉自回归框架,采用“下一通道预测”。我们的通道级自回归(CAR)模型不是按光栅顺序逐补丁渲染图像,而是顺序预测图像通道,逐步生成更丰富的视觉细节。具体来说,它首先勾勒全局结构,然后细化细粒度属性,类似于人类艺术家的创作流程。实验表明:(1)CVQ 在 16K+ 的码本大小下实现了 100% 的码本利用率,无需任何额外技巧,并且显著提高了传统 VQ 的重建质量;(2)CAR 在 DPG 评分中达到 86.7,在 GenEval 评分中达到 0.79,展示了其在文本到图像生成中的强大有效性。

英文摘要

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
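
The difference between patch-wise and channel-wise tokenization can be made concrete with a toy nearest-neighbour quantizer. This is a sketch under stated assumptions rather than the CVQ implementation: the feature map is stored as a list of channels, each flattened over space, and the two tiny codebooks are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry with the smallest squared distance to vector."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(vector, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def patch_wise_tokens(feature_map, codebook):
    """One token per spatial position; each token quantizes a C-dimensional vector."""
    channels = len(feature_map)
    positions = len(feature_map[0])
    return [nearest_code([feature_map[c][p] for c in range(channels)], codebook)
            for p in range(positions)]

def channel_wise_tokens(feature_map, codebook):
    """One token per channel; each token quantizes a flattened (H*W)-dimensional map."""
    return [nearest_code(channel, codebook) for channel in feature_map]

# Toy feature map: 2 channels, 4 flattened spatial positions (a 2x2 grid).
features = [[0.9, 1.1, 1.0, 0.8],      # channel 0
            [-1.0, -0.9, -1.2, -1.1]]  # channel 1

patch_codebook = [[1.0, -1.0], [-1.0, 1.0]]            # entries live in R^C
channel_codebook = [[1.0] * 4, [-1.0] * 4, [0.0] * 4]  # entries live in R^(H*W)

patch_tokens = patch_wise_tokens(features, patch_codebook)
channel_tokens = channel_wise_tokens(features, channel_codebook)
```

Patch-wise quantization emits one token per spatial location (four here), whereas channel-wise quantization emits one token per channel (two here), so the sequence an autoregressive model predicts runs over channels instead of raster-ordered patches.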

2605.29977 2026-06-02 cs.CV cs.LG 版本更新

EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation

EVL-ECG:面向多视角异构知识蒸馏的高效心电图解读

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

发表机构 * University of Notre Dame(诺丁汉大学)

AI总结 提出EVL-ECG框架,通过多头交叉注意力对齐、最优传输视觉特征匹配和几何结构关系匹配三种创新方法,实现跨架构知识蒸馏,在资源受限环境下高效解读心电图。

Comments 7Accepted at the SD4H Workshop at ICML 2026. 7 pages, 3 figures

详情
AI中文摘要

高保真心电图解读越来越依赖于大规模基础模型,但其在临床边缘护理中的部署仍受到极端计算需求的阻碍。虽然知识蒸馏(KD)是一种有前景的解决方案,但传统方法在跨异构架构传递知识时,无法捕捉心电图信号的复杂时空依赖关系。本文提出EVL-ECG,一个专门用于心脏诊断逻辑跨架构蒸馏的框架。EVL-ECG引入了三种心电图感知创新:(1)多头交叉注意力对齐,协调架构差异以保留细粒度形态特征;(2)基于最优传输的视觉特征匹配,利用最优传输在标记表示不匹配的情况下保持跨心电图导联的全局结构关系;(3)几何结构内关系匹配,蒸馏教师模型的潜在诊断推理。在心电图基准测试上的评估表明,EVL-ECG相比现有基线,AUC提升高达2.4%,临床准确率提升1.1%。值得注意的是,EVL-ECG建立了一个高效的20亿参数心电图基础模型,适用于资源受限的临床环境。

英文摘要

High-fidelity ECG interpretation is increasingly reliant on massive foundation models, yet their deployment in clinical edge-care remains hindered by extreme computational demands. While knowledge distillation (KD) is a promising solution, traditional methods fail to capture the complex spatio-temporal dependencies of ECG signals when transferring knowledge across heterogeneous architectures. In this paper, we propose EVL-ECG, a framework specifically designed for cross-architecture distillation of cardiac diagnostic logic. EVL-ECG introduces three ECG-aware innovations: (1) Multi-Head Cross-Attention Alignment, which harmonizes architectural discrepancies to preserve fine-grained morphological features; (2) Optimal Transport-based Visual Feature Matching, utilizing optimal transport to maintain global structural relationships across ECG leads despite mismatched token representations; and (3) Geometric Intra-Architecture Relation Matching, which distills the latent diagnostic reasoning of the teacher model. Evaluations across ECG benchmarks demonstrate that EVL-ECG yields improvements of up to 2.4% AUC and 1.1% clinical accuracy over existing baselines. Notably, EVL-ECG establishes an efficient 2B-parameter ECG foundation model, suitable for resource-constrained clinical environments.
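
Optimal transport is what allows features to be matched when the student and teacher produce different numbers of tokens. The abstract does not state which solver EVL-ECG uses, so the sketch below swaps in the common entropy-regularised Sinkhorn iteration with uniform marginals. The 1-D token positions, the squared-distance cost, and the regularisation strength are all hypothetical.

```python
import math

def sinkhorn_plan(cost, epsilon=0.1, iterations=200):
    """Entropy-regularised transport plan between uniform marginals (Sinkhorn iterations)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two student tokens matched against three teacher tokens (mismatched counts).
student_pos = [0.0, 1.0]
teacher_pos = [0.0, 0.5, 1.0]
cost_matrix = [[(s - t) ** 2 for t in teacher_pos] for s in student_pos]

plan = sinkhorn_plan(cost_matrix)
transport_cost = sum(plan[i][j] * cost_matrix[i][j]
                     for i in range(len(student_pos))
                     for j in range(len(teacher_pos)))
```

The plan spreads each student token's mass over nearby teacher tokens instead of forcing a one-to-one pairing, and the resulting transport cost could serve as a distillation loss term.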

2605.16415 2026-06-02 cs.CV cs.LG 版本更新

Diffusion Models, Denoiser Architecture and Creativity

扩散模型、去噪器架构与创造力

Itamar Levine, Yair Weiss

发表机构 * The Hebrew University of Jerusalem(海法大学)

AI总结 本文通过理论和实验表明,扩散模型的创造力源于去噪器架构与目标分布之间的相互作用,并指出去噪器架构的归纳偏差必须与真实目标分布高度一致才能成功。

详情
AI中文摘要

扩散模型的创造力是指它们生成与训练数据不同但高度逼真图像的能力。创造力有些令人惊讶,因为已知如果扩散模型中使用的去噪器是给定训练集的贝叶斯最优去噪器,那么模型将简单地复制训练样本。在本文中,我们提出经验和理论结果,表明扩散模型的创造力源于去噪器架构与目标分布之间的相互作用。理论上,我们针对三种不同的去噪器架构(线性、多项式、瓶颈)给出了生成样本分布作为目标分布和去噪器函数的显式形式。经验上,我们表明流行的UNET去噪器架构的微小变化会导致非常不同的创造力形式,并且这些微小变化通常会产生高度不真实的样本。综合来看,我们的结果表明,只有当去噪器架构的归纳偏差与真实目标分布高度一致时,扩散模型才能成功。

英文摘要

The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture lead to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.
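
The claim that a Bayes optimal denoiser copies its training set can be checked numerically. The sketch below assumes the standard setting in which the clean signal is uniform over a finite training set and the observation adds isotropic Gaussian noise. The posterior mean is then a softmax-weighted average of the training samples. The toy points and noise levels are hypothetical.

```python
import math

def bayes_optimal_denoise(y, training_set, sigma):
    """Posterior mean E[x | y] when x is uniform over training_set and y = x + sigma * noise."""
    log_w = [-sum((a - b) ** 2 for a, b in zip(y, x)) / (2 * sigma ** 2)
             for x in training_set]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]  # subtract the max for numerical stability
    z = sum(w)
    dim = len(y)
    return [sum(wi * x[d] for wi, x in zip(w, training_set)) / z for d in range(dim)]

train = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
noisy = [3.0, 0.5]

low_noise = bayes_optimal_denoise(noisy, train, sigma=0.3)
high_noise = bayes_optimal_denoise(noisy, train, sigma=10.0)
```

At low noise the weights collapse onto the nearest training point, so the output is essentially a copy of that sample. At high noise the output drifts toward the training-set mean. A sampler driven by this exact denoiser therefore ends on training samples, and the abstract attributes any novelty to the architecture's departure from this optimum.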

2605.29539 2026-06-02 cs.CV cs.AI 版本更新

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

GiPL: 用于跨域小样本目标检测的生成增强迭代伪标签方法

Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 提出GiPL双分支训练框架,通过迭代伪标签自训练和生成数据增强,解决跨域小样本目标检测中支持集利用不足和过拟合问题。

Comments CVPR 2026 Workshop

详情
AI中文摘要

视觉语言基础模型在跨域小样本目标检测(CD-FSOD)中展现出有前景的零样本泛化能力。然而,它们在微调过程中面临两个关键挑战:由于稀疏的单实例标注导致支持集利用不足,以及在极有限的目标域样本下严重过拟合。为解决这些问题,本文提出GiPL,一个高效的双分支训练框架。在第一个分支中,我们设计了一种迭代伪标签自训练范式,该范式对支持集进行零样本推理以生成可靠的伪标注,将其与真实标签融合,并迭代优化模型以充分利用支持集数据。在第二个分支中,我们引入了使用大型视觉语言模型的生成数据增强流程,该流程合成域对齐、多目标标注的图像以丰富训练样本并抑制过拟合。在三个具有挑战性的CD-FSOD数据集(RUOD、CARPK、CarDD)上,在1/5/10样本设置下的大量实验表明,GiPL始终以显著的性能提升优于最先进的方法。代码可在https://github.com/z-yaz/CDiscover获取。

英文摘要

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single-instance annotations, and severe overfitting under extremely limited target-domain samples. To address these issues, this paper proposes GiPL, an efficient two-branch training framework. In the first branch, we design an iterative pseudo-label self-training paradigm, which performs zero-shot inference on the support set to generate reliable pseudo-annotations, fuses them with ground-truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce a generative data augmentation pipeline using large vision-language models, which synthesizes domain-aligned, multi-object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD-FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10-shot settings demonstrate that GiPL consistently outperforms state-of-the-art methods with significant performance gains. Code is available at https://github.com/z-yaz/CDiscover.

2605.29488 2026-06-02 cs.CV cs.AI 版本更新

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

AnyMo: 基于掩码建模的任意模态条件运动生成

Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan

发表机构 * Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China(中国科学院智能信息处理重点实验室(中国科学院计算技术研究所,中国)) University of Chinese Academy of Sciences, China(中国科学院大学)

AI总结 提出AnyMo框架,结合残差FSQ运动分词器和可扩展掩码建模Transformer,利用大规模多模态对齐数据集OmniHuMo实现任意模态组合下的高质量人体运动生成。

详情
AI中文摘要

条件人体运动生成仍然是计算机视觉和机器人学中的一个基本挑战。尽管取得了显著进展,当前方法通常受限于固定的模态配置和特定任务架构,跨模态交互和多模态条件合成的扩展规律在很大程度上仍未得到充分探索。一个关键瓶颈是缺乏大规模模态对齐的运动数据,限制了跨不同控制信号的泛化能力。在这项工作中,我们引入了OmniHuMo,一个大规模、高质量的数据集,包含超过5000小时的运动和320万条序列,并带有精确对齐的多模态注释(例如,文本、语音、音乐和轨迹)。利用OmniHuMo,我们提出了AnyMo,一个统一的多模态框架,结合了基于残差FSQ的运动分词器与可扩展的掩码建模Transformer,能够在任意模态组合下实现高质量的运动合成。大量实验表明,AnyMo在提供对空间和风格属性的灵活控制的同时,实现了高保真合成。

英文摘要

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

2605.29341 2026-06-02 cs.CV cs.CL 版本更新

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

WorldMemArena: 通过动作-世界交互评估多模态智能体记忆

Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang

发表机构 * University of California, Santa Barbara(加州大学圣芭芭拉分校) J.P. Morgan Chase(摩根大通) ETH Zurich(苏黎世联邦理工学院) Stanford University(斯坦福大学) Johns Hopkins University(约翰霍普金斯大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出WorldMemArena基准,通过动作-世界交互循环的四阶段生命周期评估多模态智能体记忆,揭示现有方法在写入、维护、检索和使用中的失败点。

Comments 25 pages, 8 figures

详情
AI中文摘要

多模态大语言模型越来越多地被部署为长周期智能体,其中记忆必须做的不仅仅是回忆:它必须跟踪不断变化的世界,修正过时的信息,并在决策时提供正确的证据。现有基准衡量静态对话中的回忆,将记忆压缩为单一的任务结束准确率,并将视觉观察简化为字幕,使我们无法将失败定位到写入、维护、检索或使用。能够自主管理记忆的智能体框架的兴起加剧了这一差距,因为我们没有原则性的方法来比较手工设计的流水线与自我管理的替代方案。为了弥补这些差距,我们将多模态智能体记忆形式化为一个具有可观察四阶段生命周期的动作-世界交互循环,并在WorldMemArena中实例化:400个多会话多模态任务,涵盖终身演化(演化的个人和任务状态)和智能体执行(来自真实观察、动作和反馈的记忆),并标注了黄金记忆点、更新、干扰项和用于阶段级诊断的证据链。这使得长上下文、手工设计(RAG和外部记忆系统)和基于框架的记忆智能体之间首次进行直接比较。结果表明:(1)更好的记忆写入和存储并不能保证更好的性能;(2)多模态记忆仍然难以充分利用视觉证据;(3)系统在不同领域不稳定,并在真实的智能体轨迹上性能下降;(4)框架记忆更灵活,但成本更高且可靠性较低。

英文摘要

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.

2605.29287 2026-06-02 cs.IR cs.CV 版本更新

UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

UniNote: 一种用于多模态表示和排序的统一嵌入模型

Jinghan Zhao, Wenwei Jin, Anqi Li, Jintao Tong, Luya Mo, Jiawei Li, Bin Li, Yao Hu

发表机构 * Xiaohongshu, Beijing, China(小红书,中国北京) Shanghai Jiao Tong University(上海交通大学) Huazhong University of Science and Technology(华中科技大学) Beijing Institute of Technology(北京理工大学)

AI总结 提出UniNote统一嵌入模型,通过两阶段训练(对比SFT和强化学习)解决工业级Item-to-Item检索中全局表示与局部检索的平衡、解耦流水线效率及精度-延迟权衡问题,在小红书部署后显著提升检索质量和成本效率。

Comments Accepted by KDD Ads Track 2026

详情
AI中文摘要

Item-to-Item (I2I) 检索是现代内容平台的基础部分,支持从推荐引擎到内容审核的关键工业工作流。虽然多模态嵌入方法在通用检索中取得了进展,但由于全局内容表示与细粒度局部检索之间的平衡挑战、解耦的嵌入-排序流水线的系统性低效,以及模型精度与服务延迟之间的固有权衡,它们通常在 I2I 场景中表现不佳。为了解决这些问题,我们提出了 UniNote,一种专为工业 I2I 检索设计的统一嵌入模型。引入了定制的检索策略,以支持在不同粒度上对复杂多模态内容进行表示学习。为了实现这些策略,UniNote 采用了两阶段训练范式:第一阶段利用对比 SFT 建立稳健的基础嵌入,第二阶段通过强化学习 (RL) 过程优化排序质量,使模型与内容相关性对齐。我们的结果表明,UniNote 在多种 I2I 任务上达到了最先进的性能。在小红书部署并与 Matryoshka 表示学习 (MRL) 集成后,UniNote 在大规模应用中显著提升了检索质量和成本效率。

英文摘要

Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to the challenges of balancing global content representation with fine-grained local retrieval, the systemic inefficiency of decoupled embedding-and-ranking pipelines, and the inherent trade-offs between model precision and serving latency. To solve these issues, we propose UniNote, a unified embedding model designed for industrial I2I retrieval. Tailored retrieval strategies are introduced to support representation learning over complex, multimodal content at varying granularities. To operationalize these strategies, UniNote employs a two-stage training paradigm: the first stage leverages contrastive SFT to establish robust base embeddings, while the second stage refines ranking quality through a reinforcement learning (RL) process that aligns the model with content relevance. Our results show that UniNote achieves SOTA performance across diverse I2I tasks. Deployed at Xiaohongshu and integrated with Matryoshka Representation Learning (MRL), UniNote achieved significant improvements in retrieval quality and cost efficiency in large-scale applications.
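The cost-efficiency benefit of Matryoshka Representation Learning comes from the fact that a prefix of the embedding is itself usable as a smaller embedding. The sketch below shows only that serving-side idea, not UniNote's actual pipeline; the function names and the example vector are hypothetical.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in prefix]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 3-dimensional embedding served at 2 dimensions.
full = [3.0, 4.0, 12.0]
small = truncate_and_normalize(full, 2)  # [0.6, 0.8]
```

A shorter prefix reduces index size and similarity cost at some loss of precision, which is one way to trade accuracy against serving latency.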

2605.29260 2026-06-02 cs.CV 版本更新

Deep Psychovisual Image Representations

深度心理视觉图像表示

Wendi Ma, Aryaman Sharma, Wei Dai, Shekhar S. Chandra

发表机构 * School of EECS, The University of Queensland(昆士兰大学电气工程与计算机科学学院)

AI总结 受心理视觉模型启发,提出深度视觉编码方法,利用频域表示和复值图像表示实现心理视觉风格的抽象,构建首个基于心理视觉的深度学习框架,通过数据驱动频谱滤波器学习任务相关语义结构,实验表明该模型提取可解释性强的物体部分,且对深度依赖较小。

详情
AI中文摘要

心理视觉模型表明,人类视觉通过首先形成中间抽象来将低级特征提取与高级认知解耦。相比之下,基于深度学习的视觉模型通常使用同质空间层堆叠来提取和聚合特征,导致其决策过程不透明。在本文中,我们提出了深度视觉编码,这是一种受20世纪90年代图像编码启发的学习频域表示,该编码量化了感知显著的频率,与复值图像表示一起产生心理视觉风格的抽象。该方法实现了首个基于心理视觉的深度学习框架,利用数据驱动的频谱滤波器学习在不同频率子带内编码任务相关的语义结构。显著性分析表明,与常规卷积神经网络产生的无定形区域相比,我们的心理视觉模型提取了高度可解释的物体部分。此外,我们发现对于模型缩放,我们的模型对深度的依赖小于CNN,因为我们的复值表示和学习抽象取代了深层空间层的作用。这些发现共同表明,心理视觉编码为更高效和透明的视觉模型提供了一条有前景的路径。

英文摘要

Psychovisual models suggest human vision decouples low-level feature extraction from higher cognition by first forming intermediate abstractions. In contrast, deep learning-based vision models routinely extract and aggregate features using homogeneous stacks of spatial layers, rendering their decision-making processes opaque. In this paper, we propose Deep Visual Coding, a learned frequency-domain representation inspired by 1990s image codes that quantised perceptually salient frequencies, which together with complex-valued image representations produces psychovisual-style abstractions. This approach enables the first psychovisual-based deep learning framework, utilizing data-driven spectral filters that learn to encode task-relevant semantic structures within distinct frequency sub-bands. Salience analyses reveal that our psychovisual models extract highly interpretable object parts compared to the amorphous regions produced by regular Convolutional Neural Networks (CNNs). Furthermore, we find that our models are less depth dependent than CNNs for model scaling, since our complex-valued representations and learned abstractions subsume the role of the deep spatial layers. Together, these findings demonstrate that psychovisual coding provides a promising path toward more efficient and transparent vision models.

2605.28995 2026-06-02 cs.CV 版本更新

GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation

GAP3D: 将VLM潜在表示与补丁级嵌入进行生成式对齐以实现3D生成

Polytimi Anna Gkotsi, Andrii Zadaianchuk, Mohammad Mahdi Derakhshani


AI总结 提出GAP3D,一种基于扩散的模块化方法,将VLM生成的潜在表示直接对齐到预训练图像编码器的完整补丁级特征空间,使冻结的下游生成模型能够利用VLM作为提示编码器,同时保持空间结构化的条件信号,在3D资产生成中无需大规模3D数据训练,并展现出多模态提示的零样本能力。

详情
AI中文摘要

最近将视觉语言模型(VLM)作为生成模型条件提示编码器的方法通常依赖于昂贵的端到端训练或将特征映射到压缩表示,丢弃了像3D资产生成这类几何感知任务所需的密集空间结构。为了解决这个问题,我们提出了GAP3D,一种基于扩散的模块化方法,它将VLM生成的潜在表示直接对齐到预训练图像编码器的完整补丁级特征空间,使得冻结的下游生成模型能够利用VLM作为提示编码器,同时保持空间结构化的条件信号。在3D资产生成上的评估表明,我们的方法主要通过训练通用领域的图像-文本对来绕过对大规模3D数据的需求。尽管仅使用文本输入进行训练,但它还展现出对多模态提示的涌现零样本能力。最后,虽然目前优先考虑高级语义而非细粒度细节,但GAP3D表明,通过基于扩散的对齐,VLM和图像编码器特征空间之间的表示差距可以部分弥合,这为通过生成式对齐到密集嵌入空间实现基础模型的模块化集成迈出了第一步。

英文摘要

Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation. To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to utilize a VLM as a prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.

2505.11158 2026-06-02 eess.IV cs.CV 版本更新

Diffusion Models for Hyperspectral Image Analysis: A Comprehensive Review

扩散模型在高光谱图像分析中的应用:综述

Xing Hu, Xiangcheng Liu, Qianqian Duan, Lian Zhang, Huiliang Shang, Linhua Jiang, Haima Yang, Dawei Zhang

发表机构 * School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology(上海理工大学光学电子与计算机工程学院) School of Electronics and Electrical Engineering, Shanghai University of Engineering Science(上海工程技术大学电子与电气工程学院) Medical Artificial Intelligence Lab, The First Hospital of Hebei Medical University, Hebei Medical University(河北医科大学第一医院医学人工智能实验室) Hangzhou Institute of Technology, Xidian University(西安电子科技大学杭州研究院)

AI总结 本文系统综述了扩散模型(包括去噪扩散概率模型和基于随机微分方程的生成框架)在高光谱图像处理中的最新进展,分类现有方法,强调其处理高维数据的优势,并与传统方法比较性能,特别关注变化检测和灾后异常识别等关键应用,同时讨论计算成本和训练稳定性等局限,并展望未来研究方向。

Comments Published in Neural Networks

详情
Journal ref
Neural Networks (2026) 109109
AI中文摘要

高光谱图像(HSI)分析在遥感、农业和环境监测中起着关键作用。然而,传统方法通常难以处理HSI数据中固有的高维度、光谱冗余和噪声,限制了其准确性和可扩展性。最近,扩散模型(包括去噪扩散概率模型和其他基于随机微分方程的生成框架)在捕捉复杂光谱空间结构和生成高保真HSI数据方面显示出强大潜力。这些模型为噪声抑制、数据增强、分类和异常检测等任务提供了有效解决方案。本文系统总结了扩散模型在HSI处理中的最新进展。我们对现有方法进行分类,强调其处理高维数据的优势,并与传统方法进行性能比较。特别关注变化检测和灾后异常识别等关键应用。本文还讨论了当前局限性,如计算成本和训练稳定性,并概述了潜在的研究方向。我们的主要贡献可总结如下:提供了基于扩散的HSI方法的系统分类,考察了它们在主要遥感任务中的应用,并提供了对未来研究潜在方向的见解。通过这些努力,本综述旨在支持社区利用深度学习模型实现更有效和高效的高光谱图像分析。

英文摘要

Hyperspectral image (HSI) analysis plays a critical role in remote sensing, agriculture, and environmental monitoring. However, traditional methods often struggle to handle the high dimensionality, spectral redundancy, and noise inherent in HSI data, limiting their accuracy and scalability. Recently, diffusion models, including denoising diffusion probabilistic models and other generative frameworks based on stochastic differential equations, have shown strong potential in capturing complex spectral-spatial structures and generating high-fidelity HSI data. These models offer effective solutions for tasks such as noise suppression, data augmentation, classification, and anomaly detection. This review presents a systematic summary of recent advances in diffusion models for HSI processing. We categorize existing methods, highlight their strengths in handling high-dimensional data, and compare their performance with conventional approaches. Special attention is given to critical applications such as change detection and post-disaster anomaly identification. The review also discusses current limitations, such as computational cost and training stability, and outlines potential research directions. Our main contributions can be summarized as follows: we provide a systematic taxonomy of diffusion-based HSI methods, examine their applications across major remote sensing tasks, and offer perspectives on potential directions for future research. With these efforts, this review seeks to support the community in harnessing deep learning models to achieve more effective and efficient hyperspectral image analysis.

2605.25195 2026-06-02 cs.CV 版本更新

Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

Baton: 用于联合视频-音频生成的显式语义蓝图

Shuyuan Tu, Qi Tian, Zihan Yang, Yue Wu, Xintong Han, Weijie Kong, Jiangfeng Xiong, Jian-Wei Zhang, Zhao Zhong, Liefeng Bo, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Tencent Hunyuan(腾讯混元)

AI总结 提出Baton框架,通过VA-Planner生成语义对齐的模态感知规划令牌作为蓝图,注入扩散骨干以协调视频和音频去噪,解决现有方法因缺乏共享长期规划导致的跨模态对齐脆弱问题。

详情
AI中文摘要

当前的开源扩散模型难以生成稳定且同步的视听内容,尤其是在需要复杂语义推理的场景中。根本原因在于现有方法依赖现成编码器生成的粗糙文本嵌入来引导音频-视频去噪,这丢弃了细粒度语义,并且关键的是缺乏共享的长期规划,导致去噪轨迹不协调和跨模态对齐脆弱。我们提出Baton,这是第一个将显式语义规划引入联合视频-音频生成的框架。我们的关键洞察是,用语义丰富、模态感知的规划令牌(在去噪前经过联合推理和相互对齐)补充粗糙文本引导,可以同时恢复细粒度语义细节并建立协调音频和视频去噪轨迹的共享蓝图。具体来说,Baton首先引入VA-Planner,这是一个配备双语义对齐塔的多模态语言模型,其中可学习查询与视频和音频特征进行交叉注意力,生成一对语义对齐的视频和音频规划令牌作为关键帧级别的蓝图。这些规划令牌通过交叉注意力层注入扩散骨干,提供与粗糙文本嵌入互补的时域引导。由于规划令牌与扩散潜变量不具有一一对应的时空对应关系,我们进一步提出相对语义RoPE,一种相对位置编码,将规划令牌和潜变量映射到共享的时空坐标框架中,使每个潜变量能够准确关注其位置对应的语义线索。基准实验在定性和定量上均证明了Baton的有效性。

英文摘要

Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.

2605.25144 2026-06-02 cs.CV 版本更新

SpikeReg: Energy-Efficient 3D Deformable Medical Image Registration with Spiking Neural Networks

SpikeReg: 基于脉冲神经网络的高能效3D可变形医学图像配准

Ali Mikaeili Barzili, Behzad Moshiri, Hamid Azadegan, Mohammad-Reza A. Dehaqani

发表机构 * School of Electrical and Computer Engineering, College of Engineering, University of Tehran(德黑兰大学电气与计算机工程学院) Max Planck Institute for Brain Research(马克斯·普朗克脑科学研究所) School of Computer Engineering, Iran University of Science and Technology (IUST)(伊朗科学技术大学计算机工程学院) Department of Electrical and Computer Engineering, University of Waterloo(滑铁卢大学电气与计算机工程系)

AI总结 提出SpikeReg,一种脉冲U-Net,通过层间权重迁移和激活百分位阈值校准从模拟ANN教师初始化,结合局部互相关、扩散正则化和脉冲率稀疏性的代理梯度微调,在OASIS Learn2Reg验证集上达到Dice 0.7474,与ANN教师无显著差异,同时实现12.8%平均脉冲率和55.5倍算术能量降低。

详情
AI中文摘要

可变形医学图像配准对齐图像中的解剖结构,但在3D分辨率下计算密集。脉冲神经网络(SNN)提供稀疏事件驱动计算,但尚未系统研究用于可变形医学图像配准。我们提出SpikeReg,一种用于3D脑MRI配准的脉冲U-Net。SpikeReg从模拟ANN配准教师初始化,通过层间权重迁移和激活百分位阈值校准进行转换,并使用结合局部互相关、扩散正则化和脉冲率稀疏性的代理梯度目标进行微调。在OASIS Learn2Reg验证集(19对图像)上,SpikeReg达到Dice 0.7474 ± 0.032,与ANN教师(0.7480 ± 0.037,p = 0.67)无显著配对Dice差异,平均脉冲率为12.8%,相对于密集ANN基线,在事件稀疏SynOps/MAC代理下投影算术能量降低55.5倍。我们还报告了两个负面发现:来自ANN教师的位移蒸馏损害性能,以及使用标签Dice损失训练的ANN教师无法通过速率编码转换。这些结果共同表明,密集几何预测可以在稀疏事件驱动计算下进行,为神经形态医学图像配准开辟了道路。

英文摘要

Deformable medical image registration aligns anatomical structures across images but remains computationally dense at 3D resolution. Spiking neural networks (SNNs) offer sparse event-driven computation, yet have not been systematically studied for deformable medical image registration. We introduce SpikeReg, a spiking U-Net for 3D brain MRI registration. SpikeReg is initialized from an analog ANN registration teacher, converted by layer-wise weight transfer and activation-percentile threshold calibration, and fine-tuned with a surrogate-gradient objective combining local cross-correlation, diffusion regularization, and spike-rate sparsity. On the OASIS Learn2Reg validation split (19 image pairs), SpikeReg reaches Dice 0.7474 ± 0.032, with no significant paired Dice difference from the ANN teacher (0.7480 ± 0.037, p = 0.67), at a 12.8% mean spike rate and a 55.5× projected arithmetic-energy reduction under an event-sparse SynOps/MAC proxy relative to the dense-ANN baseline. We additionally report two negative findings: displacement distillation from the ANN teacher hurts performance, and ANN teachers trained with a label-Dice loss fail to transfer through rate-code conversion. Together these results show that dense geometric prediction can be performed under sparse event-driven computation, opening a path toward neuromorphic medical image registration.
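Two quantities named in this abstract are easy to make concrete: the Dice overlap used for evaluation, and a firing threshold chosen as a percentile of recorded activations. The sketch below is illustrative only. The nearest-rank percentile rule and both function names are assumptions, and the abstract does not state which percentile SpikeReg uses.

```python
import math


def dice(a, b):
    """Dice overlap 2|A∩B| / (|A| + |B|) of two flat binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2.0 * inter / total if total else 1.0


def percentile_threshold(activations, p):
    """Nearest-rank p-th percentile of recorded ANN activations.

    In ANN-to-SNN conversion this value could serve as a layer's firing
    threshold, so that rare outlier activations do not inflate it.
    """
    ordered = sorted(activations)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Using a percentile rather than the maximum activation is a common calibration choice because a single outlier would otherwise push the threshold so high that most neurons rarely spike.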

2605.24716 2026-06-02 cs.CV eess.SP 版本更新

Physics-Guided Self-Supervised Statistical Residual Learning for Sonar Despeckling with Improved Generalization

物理引导的自监督统计残差学习用于声纳图像去斑及泛化改进

Swapna Pillai, Siddharth Singh Savner, Sujit Kumar Sahoo

发表机构 * School of Electrical Sciences, Indian Institute of Technology Goa(印度理工学院Goa电子科学学院) Inria, Sophia Antipolis, France(法国Sophia Antipolis Inria)

AI总结 提出一种物理引导的自监督框架,通过同态对数域残差一致性约束,结合方差统计损失、边缘感知正则化和中值引导课程学习,实现无需干净监督的声纳图像去斑,并在多个真实数据集上达到最优性能且具有跨数据集鲁棒性。

详情
Journal ref
IEEE Signal Processing Letters, Early Access, pp. 1-5, 2026
AI中文摘要

本文介绍了一种物理引导的自监督框架用于声纳图像去斑,该框架将去斑重新表述为同态对数域中的残差一致性。通过约束对数比残差服从乘性散斑统计,所提方法无需干净监督即可防止恒等解退化。结合方差目标统计损失、边缘感知结构正则化以及中值引导的课程学习,该方法在保持结构保真度的同时实现了有效的散斑抑制。该公式与轻量级神经网络相结合,在多个真实声纳数据集上实现了最先进的性能,并展现出优异的跨数据集鲁棒性,同时适用于实时部署。

英文摘要

This letter introduces a physics-informed self-supervised framework for sonar image despeckling that reformulates despeckling as residual consistency in the homomorphic log domain. By constraining the log-ratio residual to obey multiplicative speckle statistics, the proposed method eliminates the need for clean supervision while preventing degenerate identity solutions. A variance-targeted statistical loss combined with edge-aware structural regularization and median-guided curriculum stabilization enables effective speckle suppression with preserved structural fidelity. This formulation along with a lightweight neural network achieves state-of-the-art performance across multiple real sonar datasets and demonstrates excellent cross-dataset robustness, while remaining suitable for real-time deployment.
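The homomorphic log-domain idea can be written out in a few lines. The notation below ($y$ for the observed image, $x$ for the clean reflectivity, $n$ for speckle, $\hat{x}$ for the network estimate, $\sigma^2_{\log n}$ for the assumed log-speckle variance) is introduced here for illustration, and the final loss is only one plausible form of a variance-targeted penalty, not necessarily the one used in the letter.

```latex
\begin{align}
  y &= x \cdot n
    && \text{(multiplicative speckle model)} \\
  \log y &= \log x + \log n
    && \text{(speckle becomes additive in the log domain)} \\
  r &= \log y - \log \hat{x}
    && \text{(log-ratio residual; } r = \log n \text{ when } \hat{x} = x\text{)} \\
  \mathcal{L}_{\mathrm{var}} &= \bigl(\operatorname{Var}[r] - \sigma^2_{\log n}\bigr)^2
    && \text{(assumed variance-matching penalty)}
\end{align}
```

The identity solution $\hat{x} = y$ gives $r = 0$ and hence $\operatorname{Var}[r] = 0$, which such a penalty rejects whenever $\sigma^2_{\log n} > 0$. This is one way to see how constraining the residual statistics can rule out degenerate identity solutions without clean targets.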

2603.09095 2026-06-02 cs.CL cs.CV 版本更新

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

阅读,而非思考:理解并弥合多模态大语言模型中文本变为像素时的模态差距

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Amazon(亚马逊) New York University(纽约大学) Texas A&M University(德克萨斯农工大学)

AI总结 本文系统诊断多模态大语言模型在处理图像文本时的模态差距,发现其源于模型推理意愿不足而非感知失败,并提出一种轻量级自蒸馏方法有效弥合该差距。

详情
AI中文摘要

多模态大语言模型(MLLMs)能够处理以图像形式呈现的文本,但它们的表现往往不如相同内容以文本令牌形式提供时。我们通过在五种输入模式下跨七个基准评估七个MLLM,系统性地诊断了这种“模态差距”,涵盖了从合成渲染文本到来自arXiv PDF和Wikipedia页面的真实文档图像。我们发现,该差距对字体和分辨率等渲染选择高度敏感,并且自然文档图像通常表现出更小的差距,这表明性能差异部分反映了评估伪影而非根本性限制。通过对超过4000个示例进行基于扎根理论的错误分析,我们确定了主要原因:仅图像输入抑制了推理努力,模型产生的输出短5-19倍,跳过了逐步计算或推理。不愿推理,而非感知或知识检索失败,驱动了性能差距,尤其是在需要多步推理的任务上。我们展示了一种简单的、轻量级的在线自蒸馏方法,通过让模型在其自身的文本模式推理轨迹与图像输入配对上进行微调,弥合了这一差距,将图像模式准确率提升至匹配或超过文本模式性能,提升超过50%,并且增益可迁移到未见过的基准而不会灾难性遗忘。总体而言,我们的结果和分析提供了对模态差距的系统理解,并指出了在多模态语言模型中改进视觉文本理解的实际路径。

英文摘要

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the gap is highly sensitive to rendering choices such as font and resolution, and that natural document images often exhibit much smaller gaps, suggesting the performance difference partly reflects evaluation artifacts rather than fundamental limitations. Through a grounded-theory error analysis of over 4,000 examples, we identify the primary cause: image input alone suppresses reasoning effort, with models producing 5-19x shorter outputs that skip step-by-step computation or reasoning. The reluctance to reason, not a failure of perception or knowledge retrieval, drives the performance gap, particularly on tasks requiring multi-step reasoning. We show that a simple, lightweight on-policy self-distillation method, which fine-tunes models on their own text-mode reasoning traces paired with image inputs, closes this gap, raising image-mode accuracy to match or exceed text-mode performance with over 50% improvement, and the gains transfer to unseen benchmarks without catastrophic forgetting. Overall, our results and analyses provide a systematic understanding of the modality gap and suggest a practical path toward improving visual text understanding in multimodal language models.

2605.23500 2026-06-02 cs.CV cs.LG 版本更新

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

B-GRTO: 引导式分组相对工具优化用于指代分割

Mario Markov, Stefan Maria Ailuro, Mohammad Mahdi, Luc Van Gool, Danda Pani Paudel

发表机构 * INSAIT, Sofia University "St. Kliment Ohridski"(索菲亚大学"圣克莱门特·奥赫里德斯基"INSAIT研究所)

AI总结 提出B-GRTO框架,通过引导式预训练和分组相对工具优化,联合优化策略与可微分割解码器,显著提升复杂指代分割性能。

详情
AI中文摘要

分割是计算机视觉中的基本任务,支撑像素级场景理解,并作为从自主感知到医学图像分析等应用的基石。对于复杂的指代分割,近期方法将大型视觉-语言模型与分割解码器配对:前者分析图像和提示,后者预测目标掩码。尽管强化学习改进了推理密集型视觉-语言系统,但可训练工具(如分割解码器)通常使用可微目标单独优化,而将这些目标原则性地整合到强化学习中仍未被充分探索。因此,我们引入了分组相对工具优化(GRTO),这是一个数学上严谨的框架,用于联合优化具有可微工具使用的策略。GRTO重用分组相对策略优化(GRPO)的采样结果来优化辅助工具目标,使解码器梯度补充策略奖励。此外,我们推导出引导式GRTO(B-GRTO),一种廉价引导工具的预训练方法,从而实现更快的收敛和更优的性能。在三个具有挑战性的指代分割设置中,B-GRTO相比普通GRPO取得了显著改进,匹配或超越了领域特定的最新方法。这证明了将强化学习与可微辅助目标统一用于推理密集型分割的价值。

英文摘要

Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.
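GRTO builds on the rollouts produced by group relative policy optimization (GRPO). As background, the group-relative advantage that GRPO assigns to each rollout can be sketched as below. This is the commonly used formulation (reward minus group mean, divided by group standard deviation), not code from the paper; the use of the population standard deviation and the `eps` constant are assumptions, and the tool objective that GRTO adds on top is not shown because the abstract does not define it.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one rollout group to zero mean, unit scale.

    rewards: scalar rewards of all rollouts sampled for the same prompt.
    Returns one advantage per rollout; a group with identical rewards
    yields all-zero advantages and therefore no policy gradient signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumed choice)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is computed relative to the group, no separate value network is needed. The same sampled rollouts are what the abstract describes as being reused to optimize the auxiliary decoder objective.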

2605.23231 2026-06-02 cs.CV 版本更新

Beyond Normal References: Discriminative Few-Shot Anomaly Detection

超越正常参考:判别式少样本异常检测

Huan Wang, Jun Shen, Jun Yan, Guansong Pang

发表机构 * Singapore Management University, Singapore(新加坡管理大学) University of Wollongong, Australia(伍伦贡大学)

AI总结 提出IDEAL框架,通过内在偏差学习同时利用正常和异常参考,抑制正常变化并提取判别性偏差向量,实现少样本异常检测的泛化。

Comments 31 pages, 7 figures

详情
AI中文摘要

本文考虑一种实用的少样本异常检测(FSAD)设置,称为判别式FSAD,其中在推理时仅有有限数量的正常和异常样本作为参考可用。现有的FSAD方法依赖于仅正常参考进行正常性匹配,忽略了异常参考中的判别性线索,而直接拟合两种参考可能导致对已知异常的过拟合。我们引入了IDEAL,一种内在偏差学习框架,它利用两种参考类型来学习表征可泛化异常(即偏离正常性)的内在偏差模式。IDEAL将学习过程分解为两个新颖的组件:1)正常变化擦除器,用于抑制可能导致偏离正常性的噪声正常变化,从而突出异常相关的偏差表示;2)内在偏差编码器,用于将这些去噪后的偏差表示分解为内在偏差向量,捕捉最具判别性的正交偏差方向。在推理时,IDEAL对投影到学习到的内在偏差向量上的查询-正常偏差进行评分,从而实现对已知和未知异常的泛化。在八个真实世界数据集上的大量实验表明,IDEAL有效泛化到未知异常,并持续优于现有最先进的FSAD方法。代码和数据将在 https://github.com/mala-lab/IDEAL 提供。

英文摘要

This paper considers a practical few-shot anomaly detection (FSAD) setting, termed discriminative FSAD, where a limited number of both normal and anomalous examples are available as references during inference. Existing FSAD methods rely on normal-only references through normality matching, ignoring the discriminative clues in anomalous references, while directly fitting both references can overfit to the seen anomalies. We introduce IDEAL, an intrinsic deviation learning framework that leverages both reference types to learn intrinsic deviation patterns characterizing generalizable abnormality as deviations from normality. IDEAL decomposes the learning process into two novel components: 1) a Normal Variation Eraser to suppress nuisance normal variations that may lead to noisy deviations from normality, thereby highlighting anomaly-relevant deviation representations; 2) an Intrinsic Deviation Encoder to decompose these denoised deviation representations into intrinsic deviation vectors capturing the most discriminative orthogonal deviation directions. At inference, IDEAL scores query-to-normal deviations preserved after projection onto the learned intrinsic deviation vectors, enabling generalization for both seen and unseen anomalies. Extensive experiments on eight real-world datasets show that IDEAL generalizes effectively to unseen anomalies and consistently outperforms existing state-of-the-art FSAD methods. Code and data will be available at https://github.com/mala-lab/IDEAL.
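The inference rule described above (score the part of the query-to-normal deviation that survives projection onto orthogonal deviation directions) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the deviation directions are taken to be orthonormal and are hard-coded rather than learned, the score is the Euclidean norm of the projected deviation, and the minimum over normal references is used as the final score.

```python
import math


def deviation_score(query, normal_refs, basis):
    """Anomaly score from deviations projected onto deviation directions.

    query:       feature vector of the test sample
    normal_refs: feature vectors of the normal reference samples
    basis:       orthonormal "intrinsic deviation" vectors (hypothetical)
    """
    best = None
    for ref in normal_refs:
        dev = [q - r for q, r in zip(query, ref)]
        # Coordinates of the deviation along each deviation direction.
        coords = [sum(d * b for d, b in zip(dev, vec)) for vec in basis]
        score = math.sqrt(sum(c * c for c in coords))
        if best is None or score < best:
            best = score
    return best
```

Deviation components outside the span of `basis` are ignored, which mirrors the stated goal of discarding nuisance normal variation while keeping anomaly-relevant directions.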

2605.22671 2026-06-02 cs.CV 版本更新

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

从抽象到实例化:学习视觉-语言-动作模型的行为表示

Bing Hu, Zaijing Li, Rui Shao, Junda Chen, April Hua Liu, Wei-Shi Zheng, Liqiang Nie

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen Loop Area Institute(深圳河套学院) PengCheng Laboratory(鹏城实验室) Sun Yat-sen University(中山大学) Shanghai University of Finance(上海财经大学)

AI总结 提出BehaviorVLA框架,通过因果Mamba架构的视觉运动行为编码器和相位条件行为解码器学习时间一致的行为表示,在分布偏移下实现鲁棒操作,在多个基准上达到最优成功率并展现数据效率。

Comments ICML 2026 Oral

详情
AI中文摘要

视觉-语言-动作(VLA)模型在分布偏移下常出现性能下降,因为它们在跨不同环境学习泛化行为表示方面存在困难。现有方法尝试通过以动作为中心的潜变量构建行为表示,但常受限于短时间跨度的时间碎片化和静态执行对齐,导致复杂场景中的行为不一致。为解决这些限制,我们提出BehaviorVLA,一个通过学习时间一致的行为表示来促进鲁棒操作的框架。我们的方法包含两个对称组件:(1)视觉运动行为编码器(VBE),利用基于因果Mamba的架构将长时间跨度的轨迹信息聚合为统一的行为表示;(2)相位条件行为解码器(PBD),通过动态对齐任务级先验与实时执行进度,将该表示解码为精确动作。在RoboTwin 2.0、LIBERO和CALVIN上的实验分别达到了58%、98%和4.36(平均长度)的最优成功率。值得注意的是,在真实世界的仿真到现实迁移中,BehaviorVLA仅使用50%的演示数据就匹配了OpenVLA-OFT的性能,展示了其优越的数据效率和泛化能力。

英文摘要

Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios. To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation by learning temporally coherent behavioral representations. Our approach features two symmetric components: (1) the Visuomotor Behavior Encoder (VBE), which utilizes a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the Phase-conditioned Behavior Decoder (PBD), which decodes this representation into precise actions by dynamically aligning task-level priors with real-time execution progress. Experiments on RoboTwin 2.0, LIBERO, and CALVIN demonstrate state-of-the-art success rates of 58%, 98%, and 4.36 (Avg.Len), respectively. Notably, in real-world sim-to-real transfer, BehaviorVLA matches the performance of OpenVLA-OFT using only 50% of the demonstration data, showcasing its superior data efficiency and generalization.

2605.00941 2026-06-02 cs.LG cs.CV 版本更新

Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching

散度即不确定性:流匹配的闭式后验协方差

Jiarui Xing, Song Wang, Jian Wang

发表机构 * Yale University(耶鲁大学) Shanxi University(山西大学) Harvard Medical School(哈佛医学院)

AI总结 本文通过扩展Tweedie公式到流匹配插值,推导出生成轨迹上每一点后验协方差的精确闭式表达式,该表达式仅依赖于学习速度场的散度,可在预训练模型上事后计算,无需重新训练或修改架构。

Comments 9 Pages, 5 figures

详情
AI中文摘要

流匹配已成为生成建模的领先框架,但量化其样本的不确定性仍是一个开放问题。现有方法使用辅助方差头重新训练模型、维护昂贵的集成或通过多个积分步骤传播近似协方差,在训练成本、推理成本或准确性之间进行权衡。我们表明这些权衡都不是必需的。通过将Tweedie公式从去噪设置扩展到流匹配插值,我们推导出生成轨迹上每一点后验协方差的精确闭式表达式。结果仅依赖于一个量,即学习速度场的散度,该散度可以在任何预训练的流匹配模型上事后计算,无需重新训练和架构修改。对于像MeanFlow这样的单步生成器,相同的公式在单次前向传递中产生端到端的生成不确定性,消除了所有先前方法所需的多步方差传播。在MNIST上的实验证实,得到的逐像素不确定性图在语义上有意义,集中在样本间变化最大的数字边界上,并且标量不确定性分数跟踪实际预测误差,所有计算量大约比集成或蒙特卡洛丢弃法少$10^4$倍。

英文摘要

Flow matching has become a leading framework for generative modeling, but quantifying the uncertainty of its samples remains an open problem. Existing approaches retrain the model with auxiliary variance heads, maintain costly ensembles, or propagate approximate covariance through many integration steps, trading off training cost, inference cost, or accuracy. We show that none of these trade-offs is necessary. By extending Tweedie's formula from the denoising setting to the flow matching interpolant, we derive an exact, closed-form expression for the posterior covariance at every point along the generative trajectory. The result depends on a single quantity, namely the divergence of the learned velocity field, which can be computed post-hoc on any pre-trained flow matching model, requiring no retraining and no architectural modification. For one-step generators such as MeanFlow, the same formula yields the end-to-end generation uncertainty in a single forward pass, eliminating the multi-step variance propagation required by all prior methods. Experiments on MNIST confirm that the resulting per-pixel uncertainty maps are semantically meaningful, concentrating on digit boundaries where inter-sample variation is highest, and that the scalar uncertainty score tracks actual prediction error, all at roughly $10^4 \times$ less total compute than ensembling or Monte Carlo dropout.
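The quantity at the center of this abstract is the divergence of the velocity field, i.e. the trace of its Jacobian. The sketch below only shows how that scalar can be computed for a given field by central finite differences; it does not reproduce the paper's closed-form covariance expression, which the abstract does not state. The toy linear field stands in for a trained network and is purely hypothetical.

```python
def divergence(velocity, x, h=1e-5):
    """Divergence (trace of the Jacobian) of `velocity` at point `x`.

    Uses a central finite difference along each coordinate, so it needs
    2 * len(x) evaluations of the velocity field.
    """
    total = 0.0
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        total += (velocity(plus)[i] - velocity(minus)[i]) / (2.0 * h)
    return total


def linear_field(x):
    """Toy field v(x) = A x with A = [[2, 1], [-1, 3]], so div v = 5."""
    return [2.0 * x[0] + x[1], -x[0] + 3.0 * x[1]]
```

For image-sized inputs, coordinate-wise differencing is impractical. Automatic differentiation or a stochastic trace estimator would be the usual substitutes, though the abstract does not say which one the authors use.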

2604.17473 2026-06-02 cs.CV cs.AI 版本更新

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

双锚定:解决视觉语言导航中的状态漂移问题

Kangyi Wu, Pengna Li, Kailin Lyu, Xi Lin, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) National Engineering Research Center for Visual Information and Applications(视觉信息与应用国家工程研究中心) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究院) Xi’an Jiaotong University(西安交通大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Johns Hopkins University(约翰霍普金斯大学) Joy Future Academy, JD(京东未来学院)

AI总结 提出双锚定框架,通过指令进度锚定和记忆地标锚定分别解决进度漂移和记忆漂移,显著提升长场景导航成功率。

详情
AI中文摘要

视觉语言导航(VLN)要求智能体通过遵循自然语言指令在3D环境中导航。尽管最近的视频大语言模型(Video-LLMs)极大地推进了VLN,但在长场景中它们仍然非常容易受到状态漂移的影响。在这些情况下,智能体的内部状态偏离真实的任务执行状态,导致无目的漫游和无法执行指令中的关键操作。我们将这种失败归因于两种不同的认知缺陷:进度漂移,即智能体无法区分已完成的子目标和剩余的子目标;以及记忆漂移,即智能体的历史表示退化,使其无法跟踪已访问的地标。在本文中,我们提出了一个双锚定框架,明确锚定指令进度和历史表示。首先,为了解决进度漂移,我们引入了指令进度锚定,监督智能体生成结构化的文本标记,以描述已完成与剩余的子目标。其次,为了缓解记忆漂移,我们提出了记忆地标锚定,利用以地标为中心的世界模型回顾性地预测由Segment Anything模型提取的以对象为中心的嵌入,迫使智能体显式验证过去的观察并保留已访问地标的独特表示。为促进该框架,我们整理了两个大规模数据集:360万个带有显式进度描述的样本,以及93.7万个用于回顾性验证的接地地标数据。在模拟和真实环境中的大量实验证明了我们方法的优越性,在成功率上提高了15.2%,在长时程轨迹上获得了24.7%的显著提升。为促进进一步研究,我们将发布我们的代码、数据生成流程以及收集的数据集。

英文摘要

Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, causing it to lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two large datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

2602.02214 2026-06-02 cs.CV 版本更新

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

因果强迫:自回归扩散蒸馏的正确方法,用于高质量实时交互式视频生成

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

发表机构 * Hongzhou Zhu(朱洪洲) Min Zhao(赵敏) Guande He(何冠德) Hang Su(苏hang) Chongxuan Li(李崇轩) Jun Zhu(朱军)

AI总结 针对双向扩散模型蒸馏为自回归模型时的架构差距问题,提出因果强迫方法,通过自回归教师进行ODE初始化并应用DMD过程,显著提升视频生成质量。

Comments Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing. ICML 2026

详情
AI中文摘要

为了实现实时交互式视频生成,当前方法将预训练的双向视频扩散模型蒸馏为少步自回归(AR)模型,当全注意力被因果注意力替代时面临架构差距。然而,现有方法并未从理论上弥合这一差距。它们通过ODE蒸馏初始化AR学生模型,这需要帧级单射性,即在AR教师的PF-ODE下,每个噪声帧必须映射到唯一的干净帧。从双向教师蒸馏AR学生违反了这一条件,阻止了教师流映射的恢复,反而诱导出条件期望解,导致性能下降。为解决此问题,我们提出因果强迫(Causal Forcing),它使用自回归教师进行ODE初始化以弥合架构差距,然后应用与Self Forcing相同的DMD过程。实验结果表明,我们的方法在所有指标上优于所有基线,在动态程度、VisionReward和指令跟随上分别超过SOTA Self Forcing 19.3%、8.7%和16.7%。项目页面:https://thu-ml.github.io/CausalForcing.github.io/;代码:https://github.com/thu-ml/Causal-Forcing。

英文摘要

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing.

2605.21964 2026-06-02 cs.CV physics.optics 版本更新

Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection

用于目标检测的双集成低延迟单透镜红外计算成像

Xuquan Wang, Guishuo Yang, Dapeng Yan, Yujie Xing, Xuanyu Qian, Kai Zhang, Xiong Dun, Jiande Sun

发表机构 * MOE Key Laboratory of Advanced Micro-Structured Materials(教育部先进微结构材料重点实验室) Institute of Precision Optical Engineering(精密光学工程研究院) School of Physics Science and Engineering(物理科学与工程学院) Shanghai Frontiers Science Center of Digital Optics(上海前沿科学中心数字光学中心) School of Computer Science and Artificial Intelligence(计算机科学与人工智能学院) Shandong Normal University(山东师范大学) Shandong Engineering Research Center for Multimodal Computing and Intelligent Decision Making(山东省多模态计算与智能决策中心)

AI总结 提出物理感知双集成网络(PDI-Net),通过嵌入光学先验并共享编码器特征,在单透镜红外相机上实现低延迟高精度目标检测。

Comments 15 pages, 11 figures; supplementary material: 3 pages, 2 figures

详情
AI中文摘要

计算成像能够实现紧凑的红外系统,但结合图像重建和目标检测的深度学习流程通常会引入显著的推理延迟。大多数现有的加速策略压缩重建网络,而忽略了来自光路的物理先验,从而在准确性和速度之间留下权衡。我们提出了物理感知双集成网络(PDI-Net),这是一个低延迟框架,它将红外重建与目标检测集成在一起,并进一步将光学先验嵌入到学习过程中。PDI-Net在训练期间使用监督U-Net,而在推理期间,半U-Net编码器直接与基于YOLO的检测器共享特征,避免了完整的图像重建。为了弥合面向保真度的重建特征与面向检测的语义之间的差距,我们引入了物理感知大小桥接(PALS-Bridge),它使用与视场相关的点扩散函数先验自适应地调制多尺度卷积分支。还开发了物理信息的光学退化模拟流程用于训练和验证。该方法部署在单透镜红外相机上,与传统多透镜设计相比,系统重量减轻约50%。在低信噪比条件下的M3FD基准上,与采用剪枝策略的Rec+Det相比,PDI-Net将推理时间减少了84.06%,同时将mAP@0.5:0.95提高了5.07%。这些结果展示了在资源受限平台上用于实时目标检测的紧凑、低延迟计算红外成像。

英文摘要

Computational imaging enables compact infrared systems, but deep-learning pipelines that combine image reconstruction and object detection often introduce substantial inference latency. Most existing acceleration strategies compress the reconstruction network while overlooking physical priors from the optical path, leaving a trade-off between accuracy and speed. We present Physics-aware Dual-Integrated Network (PDI-Net), a low-latency framework that integrates infrared reconstruction with object detection and further embeds optical priors into the learning process. PDI-Net uses a supervised U-Net during training, while a semi-U-Net encoder shares features directly with a YOLO-based detector during inference, avoiding full image reconstruction. To bridge the gap between fidelity-oriented reconstruction features and detection-oriented semantics, we introduce a physics-aware large-small bridge (PALS-Bridge), which uses field-dependent point spread function priors to adaptively modulate multiscale convolutional branches. A physics-informed optical degradation simulation pipeline is also developed for training and validation. The method is deployed on a single-lens infrared camera, reducing system weight by about 50% compared with traditional multi-lens designs. On the M3FD benchmark under low-SNR conditions, PDI-Net reduces inference time by 84.06% compared with the Rec+Det with pruning strategy while improving mAP@0.5:0.95 by 5.07%. These results demonstrate compact, low-latency computational infrared imaging for real-time object detection on resource-constrained platforms.

2605.20823 2026-06-02 cs.CV 版本更新

RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses

RelWitness: 基于视觉-几何关系见证者的开放词汇3D场景图生成

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, Tuan Kiet Pham, Sui Yang Guang

发表机构 * Phenikaa University(费恩基亚大学)

AI总结 提出RelWitness框架,通过视觉-几何关系见证者从不完整关系监督中生成开放词汇3D场景图,解决关系标注稀疏和词汇扩展问题。

详情
AI中文摘要

开放词汇3D场景图生成旨在用灵活的自然语言谓词描述对象实例及其关系。核心难点不仅在于词汇扩展,还在于监督可靠性:3D场景图数据集中的关系标注具有选择性,许多有效的对象对关系未被标注。我们提出RelWitness,一个从带有位姿的RGB-D序列中生成开放词汇3D场景图的框架,可在不完整关系监督下工作。关键概念是关系见证者:一种具体的视觉-几何线索,使关系在捕获场景中可观察。支持关系需要接触和垂直排序;包含关系需要包围;邻近关系需要度量接近;朝向关系需要面对方向;稳定关系应在两个对象可见的视角间持续存在。RelWitness从RGB视图、深度图、重建的3D几何、角色敏感文本、对象先验空视图和多视角一致性构建关系见证记录。视觉-几何见证验证器将未标注的关系候选分配给验证的缺失正例、可靠负例或不确定未标注案例。然后,见证引导的正-无标记目标从不完整标注中学习,而不将每个缺失标签视为负例。我们进一步引入见证一致解码和RGB-D缺失关系审计协议。在3DSSG/3RScan和ScanNet派生的开放词汇分割上的模拟手稿规划实验显示了预期行为:改进的未见关系识别、更高的见证精度、更低的幻觉和减少的关系短语冗余。所有数值结果均为规划值,在提交前必须替换为复现的测量值。

英文摘要

Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission.
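The geometric side of a relation witness can be illustrated with axis-aligned bounding boxes. This is a simplified sketch, not the RelWitness verifier: the `Box` type, the function names, and the tolerance values are all hypothetical, and the abstract's witnesses also draw on RGB, depth, text, and multi-view cues that are not modelled here. It shows only how "contact and vertical ordering", "enclosure", and "metric closeness" could be tested on reconstructed geometry.

```python
from dataclasses import dataclass


@dataclass
class Box:
    lo: tuple  # (x, y, z) minimum corner, with z pointing up
    hi: tuple  # (x, y, z) maximum corner


def overlaps_xy(a, b):
    """True when the horizontal footprints of the two boxes intersect."""
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in (0, 1))


def supports(lower, upper, contact_tol=0.02):
    """Support witness: footprints overlap and upper rests on lower's top face."""
    return overlaps_xy(lower, upper) and abs(upper.lo[2] - lower.hi[2]) <= contact_tol


def contains(outer, inner):
    """Containment witness: inner lies fully inside outer on every axis."""
    return all(outer.lo[i] <= inner.lo[i] and inner.hi[i] <= outer.hi[i] for i in range(3))


def near(a, b, max_gap=0.5):
    """Proximity witness: Euclidean gap between the boxes is at most max_gap."""
    gaps = [max(a.lo[i] - b.hi[i], b.lo[i] - a.hi[i], 0.0) for i in range(3)]
    return sum(g * g for g in gaps) ** 0.5 <= max_gap
```

A candidate relation with no passing check would not automatically become a negative; in the abstract's framing it could still fall into the uncertain unlabeled group.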

2605.21421 2026-06-02 cs.CV 版本更新

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

AIGaitor: 面向所有人的隐私保护与无云端运动分析——基于边缘计算

Lauhitya Reddy, Trisha M. Kesar, Hyeokhyen Kwon

发表机构 * Department of Biomedical Informatics, Emory University(埃默里大学生物医学信息学系) Department of Rehabilitation Medicine, Emory University(埃默里大学康复医学系) The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology(埃默里大学和佐治亚理工学院的Wallace H. Coulter生物医学工程系)

AI总结 提出AIGaitor系统,在智能手机上利用边缘计算实现无标记单目运动捕捉与深度学习分析,解决成本、隐私和易用性问题。

Comments 18 pages, 3 figures, 2 tables

详情
AI中文摘要

运动捕捉是测量人体运动的金标准,但临床使用仍受成本、技术复杂性和隐私问题限制。AIGaitor是一个隐私保护、无云端的运动分析系统,完全在消费级智能手机上使用设备上的神经加速器运行无标记单目运动捕捉流程和下游深度学习分析。为激励其设计,我们调查了74位康复临床医生:92%表示会采用准确、经济、易用的AI步态分析工具,而79.7%认为运营成本、68.9%认为培训不足、64.9%认为隐私问题是主要障碍。然后,我们优化并基准测试了当前单目流程组件的移动iOS实现,包括2D和3D姿态估计、姿态优化、基于骨架的深度学习和视觉语言模型。一个时间优先的端到端设备上流程在iPhone 14上处理10秒4K 60fps视频片段耗时77秒,与高端NVIDIA H200云服务器(含网络传输)相比,在全局移动平均上行链路下为94秒,在发达地区Wi-Fi下为66秒,匹配或优于后者。轻量级模型如ViTPose-s实现实时关键点提取,基于骨架的动作识别模型在同一片段上提供亚毫秒级步态分类。据我们所知,AIGaitor是首个展示端到端设备上运动捕捉和下游深度学习分析的单目系统,支持低成本、私密且对智能手机用户可及的临床适用运动分析。

英文摘要

Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.

2605.20301 2026-06-02 cs.CV cs.AI 版本更新

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

Co-Fusion4D:面向鲁棒3D目标检测的时空协同融合

Wenxuan Li, Qin Zou, Shoubing Chen, Chi Chen, Yingyi Yang, Qingxiang Meng

发表机构 * Tsinghua University(清华大学)

AI总结 提出Co-Fusion4D框架,通过当前帧主导-历史帧互补机制和双注意力融合模块,解决BEV检测器中跨帧时空不一致问题,在nuScenes上达到74.9% mAP和75.6% NDS。

详情
AI中文摘要

在自动驾驶中,3D目标检测对于准确感知和可靠决策至关重要。然而,目标运动和自车运动常常在基于BEV的检测器中引起跨帧时空不一致,导致时序BEV特征错位和时空一致性退化。为了解决这些挑战,我们提出了Co-Fusion4D,一个统一框架,显式地保持跨帧时空一致性并抑制时序特征漂移。Co-Fusion4D采用当前帧中心策略,将当前帧作为主要信息源,同时在时空滤波和对齐后选择性地融入历史帧。这种主从互补机制有效减轻了累积对齐误差,抑制了噪声特征传播,并利用可靠的时序线索获得更一致的BEV表示。此外,Co-Fusion4D集成了双注意力融合(DAF)模块,以进一步增强时空特征交互。DAF联合利用帧内空间注意力和帧间时序注意力,自适应地对齐和融合多帧特征,强调运动一致区域同时抑制虚假相关性。通过偏离传统的均匀融合范式,该设计显著提高了BEV表示的时序稳定性和判别能力。在nuScenes基准上的大量实验表明,Co-Fusion4D实现了最先进的性能,mAP为74.9%,NDS为75.6%,且不依赖测试时增强或外部数据。

英文摘要

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.

2605.20282 2026-06-02 cs.CV cs.AI 版本更新

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

视觉模型真的能遗忘吗?Mirage:表示层面的视觉遗忘认证

Zhenyu Yu, Yangchen Zeng, Chunlei Meng, Guangzhen Yao, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Southeast University(东南大学) Northeast Normal University(东北师范大学)

AI总结 提出Mirage框架,通过表示层面诊断揭示现有垂直联邦学习遗忘方法在输出层面通过认证后仍保留类别结构信息,并发现遗忘三元组困境和类别-样本不对称性。

详情
AI中文摘要

垂直联邦学习中的机器遗忘引起了越来越多的关注,但现有方法仅使用输出层面指标来认证遗忘。我们通过引入Mirage(一个表示层面审计框架,包含四种互补诊断方法:线性探针恢复、中心核对齐、特征可分性评分和逐层恢复分析)来挑战这些说法。通过在七个数据集和七种基线方法上遵循最近的VFL遗忘协议进行实验,Mirage揭示了三个关键发现:(i)遗忘差距:通过输出层面认证的方法在其表示中仍然保留了大量的类别结构,线性探针恢复比重新训练的基线高出最多15.4个百分点;中心核对齐显示这些模型在结构上更接近原始模型而非重新训练的参考模型,而可分性评分表明存在持续的几何区分。(ii)遗忘三元组困境:没有现有方法能同时实现高效用、输出层面遗忘和表示层面遗忘。(iii)类别-样本不对称性:类别级遗忘留下强烈的表示痕迹(线性探针恢复高达97%),而样本级遗忘与随机无异(线性探针恢复约50%);逐层分析进一步表明残差类别信息在网络深度中持续存在。这些发现呼吁在联邦遗忘研究中采用表示层面感知的评估标准。

英文摘要

Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.
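Of the four diagnostics, Centered Kernel Alignment has a compact standard form. The sketch below implements linear CKA on two feature matrices (rows are samples, columns are features) in plain Python. It assumes the linear variant; the abstract does not say which kernel Mirage uses, and the function names are illustrative.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    x, y = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return cross_frobenius_sq(y, x) / denom
```

In an audit of the kind described, one would compare `linear_cka(unlearned, original)` against `linear_cka(unlearned, retrained)` on the same inputs; a larger first value would suggest the unlearned model still resembles the original in representation space.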

2605.05945 2026-06-02 cs.CV cs.CL 版本更新

MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

MobileEgo Anywhere:基于商用硬件的长时域自我中心数据开放基础设施

Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathore, Pratyush Patnaik, Shubhanshu Khatana, Ekaksh Janweja

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of California, Los Angeles(加州大学洛杉矶分校) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 提出MobileEgo Anywhere框架,利用智能手机传感器实现超过一小时的自我中心轨迹采集,并发布开源处理流水线STERA、移动应用及200小时数据集,验证其在视觉-语言-动作模型训练中的有效性。

详情
AI中文摘要

视觉-语言-动作(VLA)模型推动了对大规模自我中心数据集的需求,但用于收集长时域数据的硬件和基础设施仍然难以获取。当前数据集通常只有几分钟长的片段,无法捕捉复杂机器人任务执行所需的长时域时间依赖。我们提出MobileEgo Anywhere,一个在商用移动硬件上收集超过一小时自我中心轨迹的框架,利用现代智能手机传感器进行长期姿态跟踪,避免了传统机器人数据收集的硬件障碍。我们发布三个组件:(1)STERA,一个开源视频处理流水线,将原始移动捕获转换为标准化、训练就绪的格式,用于VLA和基础模型研究;(2)一个免费的移动应用,让任何用户记录自我中心活动;(3)一个200小时的数据集,包含多样化的长格式自我中心数据,跨584个会话具有持久状态跟踪。我们进一步展示该数据是可用的训练信号:在其上对VLA进行中期训练可降低保留动作预测误差。

英文摘要

Vision-language-action (VLA) models have driven demand for large-scale egocentric datasets, yet the hardware and infrastructure to collect long-horizon data remain inaccessible. Datasets today typically have episodes only a few minutes long, which fails to capture the long-horizon temporal dependencies that complex robotic task execution requires. We present MobileEgo Anywhere, a framework for collecting hour-plus egocentric trajectories on commodity mobile hardware that uses modern smartphone sensors for long-term pose tracking without the hardware barriers of traditional robotics data collection. We release three components: (1) STERA, an open-source video-processing pipeline that converts raw mobile captures into standardized, training-ready formats for VLA and foundation-model research; (2) a free mobile app that lets any user record egocentric activity; and (3) a 200-hour dataset of diverse, long-form egocentric data with persistent state tracking across 584 sessions. We further show this data is a usable training signal: mid-training a VLA on it lowers held-out action-prediction error.

2411.19093 2026-06-02 cs.CV cs.CY cs.LG 版本更新

Seeing SDG 6 from space: local-scale monitoring of piped water and sewage system access across Africa using satellite imagery and self-supervised learning

从太空看SDG 6:利用卫星图像和自监督学习对非洲管道水和污水系统接入进行局部尺度监测

Othmane Echchabi, Aya Lahlou, Nizar Talty, Josh Malcolm Manto, Tongshu Zheng, Ka Leung Lam

发表机构 * Mila – Quebec AI Institute(魁北克人工智能研究所) School of Computer Science, McGill University(麦吉尔大学计算机科学学院) Department of Earth and Environmental Engineering, Columbia University(哥伦比亚大学地球与环境工程系) Center for Learning the Earth with Artificial Intelligence and Physics (LEAP)(人工智能与物理学习地球中心(LEAP)) Division of Natural and Applied Sciences, Duke Kunshan University(杜克-昆山大学自然科学与应用科学系)

AI总结 本研究利用Sentinel-2图像、Afrobarometer调查数据、30米人口数据和DINO自监督视觉Transformer特征,开发了一个可扩展的遥感框架,以约2.56公里分辨率估计管道水和污水系统接入情况,最佳模型AUROC分别达到91.54%和93.24%,与WHO/UNICEF JMP统计数据高度一致,并在尼日利亚案例中揭示了细粒度环境不平等。

Comments Under Review

详情
AI中文摘要

获得饮用水和卫生设施对健康和福祉至关重要,但主要差距仍然存在,尤其是在非洲等数据稀缺地区。SDG 6旨在实现普遍接入,但目前的监测依赖于成本高昂、频率低且空间不均匀的调查和普查,且报告延迟较长。 本研究开发了一个可扩展的遥感框架,利用Sentinel-2图像、Afrobarometer调查响应、30米人口数据和DINO自监督视觉Transformer特征,以约2.56公里分辨率估计管道水和污水系统接入情况。最佳模型在管道水和污水接入方面分别达到91.54%和93.24%的AUROC值。在50个非洲国家中,人口加权估计与WHO/UNICEF JMP统计数据在管道水方面高度一致($R^2 = 0.92$),在污水接入方面也有显著一致性($R^2 = 0.72$)。在无Afrobarometer覆盖的国家,平均绝对误差分别为9.5%和10.7%,估计值分别与1.214亿和1.597亿人口的JMP值相差在15%以内。 一项覆盖尼日利亚767个地方政府区域的案例研究表明,该框架揭示了细尺度的环境不平等。管道水和污水无接入的最大负担分别达到115.5万和145.2万人,是地方政府区域中位数负担的7.9倍和8.3倍,而最高十分位无接入阈值分别为0.805和0.952,表明匮乏普遍存在。这些发现表明,基于DINO的卫星模型可以以低成本、空间详细的方式补充家庭调查,为SDG 6监测、基础设施定位和环境公平评估提供证据。

英文摘要

Access to drinking water and sanitation is essential for health and well-being, yet major disparities remain, especially in data-scarce regions such as Africa. SDG 6 aims for universal access, but current monitoring relies on costly, infrequent, and spatially uneven surveys and censuses with long reporting delays. This study develops a scalable remote-sensing framework to estimate piped water and sewage system access at approximately 2.56 km resolution using Sentinel-2 imagery, Afrobarometer survey responses, 30 m population data, and DINO self-supervised Vision Transformer features. The best model achieves AUROC values of 91.54% for piped water and 93.24% for sewage access. Across 50 African countries, population-weighted estimates strongly align with WHO/UNICEF JMP statistics for piped water ($R^2 = 0.92$) and show meaningful agreement for sewage access ($R^2 = 0.72$). In countries without Afrobarometer coverage, MAEs are 9.5% and 10.7%, with estimates within 15% of JMP values for 121.4 million and 159.7 million people, respectively. A Nigeria case study across 767 Local Government Areas (LGAs) shows that the framework reveals fine-scale environmental inequality. The largest no-access burdens reach 1.155 million people for piped water and 1.452 million for sewage, 7.9 and 8.3 times the median LGA burden, while top-decile no-access thresholds of 0.805 and 0.952 indicate that deprivation is widespread. These findings show that DINO-based satellite models can complement household surveys with low-cost, spatially detailed evidence for SDG 6 monitoring, infrastructure targeting, and environmental equity assessment.
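The country-level comparison rests on two simple computations: aggregating per-cell access estimates into a population-weighted national figure, and scoring agreement with reference statistics. The sketch below shows both under assumptions: the function names are hypothetical, and R^2 is taken to be the coefficient of determination, although the abstract does not say whether it instead reports a squared correlation.

```python
def population_weighted_access(cell_probs, cell_pops):
    """National access share: per-cell access estimates weighted by cell population."""
    total = sum(cell_pops)
    return sum(p * n for p, n in zip(cell_probs, cell_pops)) / total


def r_squared(reference, estimate):
    """Coefficient of determination of estimates against reference values."""
    mean = sum(reference) / len(reference)
    ss_res = sum((a - b) ** 2 for a, b in zip(reference, estimate))
    ss_tot = sum((a - mean) ** 2 for a in reference)
    return 1 - ss_res / ss_tot
```

Weighting by population matters because sparsely populated cells would otherwise pull the national figure toward rural access levels regardless of how few people live there.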

2605.17921 2026-06-02 cs.CV 版本更新

An Efficient Streaming Video Understanding Framework with Agentic Control

一种具有代理控制的高效流式视频理解框架

Jinming Liu, Jianguo Huang, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Zongyu Guo, Bin Li, Wenjun Zeng, Yan Lu, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology, Ningbo, China(宁波东方理工大学) Microsoft Research Asia(微软亚洲研究院)

AI总结 提出R3-Streaming框架,通过级联控制(记忆压缩、响应判断、计算路由)和年龄感知遗忘策略及目标平衡强化学习(TB-GRPO),在严格延迟预算下实现流式视频理解,性能达到SOTA并减少95-96%视觉令牌使用。

详情
AI中文摘要

流式视频需要在严格的延迟预算下处理动态信息密度。然而,现有方法通常采用静态策略,例如固定记忆压缩或依赖单一模型,这迫使做出权衡:快速模型无法处理复杂查询,而始终开启的重模型违反实时约束并使简单查询过于复杂。我们不预先固定这些决策,而是提出R3-Streaming(记忆、响应、推理),它将流式视频理解表述为级联控制问题:对于每个查询,系统压缩记忆、判断响应就绪状态,并顺序路由计算,使得每个下游决策建立在逐步精化的信息状态上。为了优化这一流水线,我们引入了一种年龄感知的遗忘策略用于记忆压缩,因为激进地压缩历史帧可以带来显著的性能提升。对于计算路由,我们提出了TB-GRPO,一种目标平衡的强化学习目标,它将困难查询路由到更强的模型,同时防止模式崩溃。大量评估表明,R3-Streaming在流式多模态大模型中取得了最先进的结果,在OVO-Bench上达到57.92,在StreamingBench上达到76.36,同时将视觉令牌使用量减少了95%到96%。

英文摘要

Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.
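One way to picture an age-aware forgetting policy is as a per-frame token budget that shrinks as frames get older. The schedule below is purely illustrative: the geometric decay, the floor, and every parameter value are assumptions made for this sketch, and the abstract does not describe the actual policy used in R3-Streaming.

```python
def token_budgets(num_frames, full_tokens=256, decay=0.5, floor=4):
    """Visual tokens to keep per frame; index 0 is the oldest frame.

    Hypothetical schedule: the newest frame keeps full_tokens, each step back
    in time multiplies the budget by decay, and no frame drops below floor.
    """
    budgets = []
    for idx in range(num_frames):
        age = num_frames - 1 - idx  # 0 for the newest frame
        budgets.append(max(floor, int(full_tokens * decay ** age)))
    return budgets
```

With these defaults, old frames quickly collapse to the floor while recent frames stay close to full resolution, which is the general shape one would expect from aggressive compression of history.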

2605.16740 2026-06-02 cs.CV 版本更新

TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

TRACE:基于证据定位的多视频事件理解与声明生成

Pengyu Yan, Akhil Gorugantu, Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, David Doermann

发表机构 * University at Buffalo, SUNY(布法罗大学) New York University(纽约大学)

AI总结 提出TRACE框架,通过先构建文本可搜索时间线进行证据定位,再引导视觉语言模型生成声明和跨视频引用,显著提升多视频事件理解的事实完整性和归因准确性。

Comments Accepted at ACL 2026 Workshop

详情
AI中文摘要

多视频事件理解要求模型能够定位并归因于分布在长且异构的视频语料库中的查询相关证据。现有大型视觉语言模型(LVLMs)在此场景下表现不佳,因为它们很快耗尽上下文预算,难以精确定位证据重要的片段,经常错过密集的信息线索,如广播图形、字幕和记分牌。我们引入TRACE,一个基于证据定位的框架,采用先定位后推理的策略进行多视频事件推理。我们的方法首先使用OCR和物体检测为每个视频构建结构化的、可文本搜索的时间线。然后,一个纯文本LLM进行查询感知的证据定位,在后续视觉推理之前选择相关时刻。检索到的帧及其定位摘要随后用于引导基于LVLM的声明生成和跨视频引用整合。在MAGMaR 2026和WikiVideo上的实验表明,结构化定位显著提升了事实完整性和归因保真度。在MAGMaR验证集上,与未引导的Qwen3-VL-30B基线相比,TRACE将宏平均MiRAGE F1从0.705提升至0.811,引用召回率从0.440大幅提升至0.628。该方法还在官方MAGMaR 2026排行榜上取得了最先进的结果。代码已发布在https://github.com/pengyu965/TRACE。

英文摘要

Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are subsequently used to steer LVLM-based claim generation and cross-video citation consolidation. Experiments on MAGMaR 2026 and WikiVideo demonstrate that structured grounding markedly boosts factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 compared to an unguided Qwen3-VL-30B baseline, with especially strong improvements in citation recall from 0.440 to 0.628. The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard. Code is released at https://github.com/pengyu965/TRACE.

2604.26283 2026-06-02 cs.CV cs.AI 版本更新

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

MedSynapse-V:通过潜在记忆演化桥接视觉感知与临床直觉

Chunzheng Zhu, Jiaqi Zeng, Junyu Jiang, Jianxin Lin, Yijun Wang

发表机构 * Hunan University(湖南大学)

AI总结 提出MedSynapse-V框架,通过潜在诊断记忆演化模拟临床专家经验调用,解决医学视觉语言模型因离散分词导致的量化损失、长程信息消散和案例适应性问题,在诊断准确性上显著超越现有方法。

Comments Medical latent reasoning; Memory evolution

详情
AI中文摘要

高精度医学诊断不仅依赖于静态成像特征,还依赖于专家在图像解读过程中即时调用的隐式诊断记忆。我们指出了医学视觉语言模型中由于离散分词导致的基本认知错位,表现为量化损失、长程信息消散以及缺乏案例自适应专业知识。为弥合这一差距,我们提出了MedSynapse-V,一个用于潜在诊断记忆演化的框架,通过在模型隐藏流中动态合成隐式诊断记忆来模拟临床医生的经验调用。具体而言,它从元查询先验记忆机制开始,其中可学习的探针从解剖先验编码器中检索结构化先验,以生成压缩的隐式记忆。为确保临床保真度,我们引入了因果反事实细化(CCR),利用强化学习和基于区域级特征掩蔽的反事实奖励来量化每个记忆的因果贡献,从而修剪冗余并将潜在表示与诊断逻辑对齐。这一演化过程最终达到内在记忆转换(IMT),一种特权-自主双分支范式,通过全词汇散度对齐将教师分支的诊断模式内化到学生分支中。跨多个数据集的全面实证评估表明,通过将外部专业知识转化为内源参数,我们的方法在诊断准确性上显著优于现有最先进方法,特别是思维链范式。代码可在https://github.com/zhcz328/MedSynapse-V获取。

英文摘要

High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose MedSynapse-V, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model's hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that MedSynapse-V, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy. The code is available at https://github.com/zhcz328/MedSynapse-V.

2605.15141 2026-06-02 cs.CV 版本更新

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Causal Forcing++:用于实时交互式视频生成的可扩展少步自回归扩散蒸馏

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu

发表机构 * Tsinghua University(清华大学) ShengShu(盛数) Renmin University of China(中国人民大学)

AI总结 提出Causal Forcing++框架,通过因果一致性蒸馏(causal CD)实现帧级1-2步自回归扩散蒸馏,在降低延迟和训练成本的同时提升视频生成质量。

详情
AI中文摘要

实时交互式视频生成需要低延迟、流式处理和可控展开。现有的自回归(AR)扩散蒸馏方法通过将双向基础模型蒸馏为少步AR学生模型,在分块4步机制中取得了强劲结果,但仍受限于粗粒度响应和不可忽略的采样延迟。本文研究了一种更激进的设置:仅用1-2采样步的帧级自回归。在此机制下,我们识别出少步AR学生模型的初始化是关键瓶颈:现有策略要么目标不对齐,要么无法进行少步生成,要么成本过高难以扩展。我们提出Causal Forcing++,一个原则性且可扩展的流水线,使用因果一致性蒸馏(causal CD)进行少步AR初始化。核心思想是:因果CD学习与因果ODE蒸馏相同的AR条件流映射,但通过相邻时间步之间的单个在线教师ODE步获得监督,避免了预计算和存储完整PF-ODE轨迹的需要。这使得初始化既更高效又更易优化。由此产生的流水线在帧级2步设置下,VBench总分、VBench质量和VisionReward分别超过SOTA 4步分块Causal Forcing 0.1、0.3和0.335,同时首帧延迟降低50%,阶段2训练成本降低约$4\times$。我们进一步将流水线扩展到以动作条件的世界模型生成,秉承Genie3的精神。项目页面:https://github.com/thu-ml/Causal-Forcing 和 https://github.com/shengshu-ai/minWM 。

英文摘要

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by $\sim 4\times$. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .

2605.14709 2026-06-02 cs.CV 版本更新

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

打破双重瓶颈:将统一多模态模型演化为自适应交错视觉推理器

Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li, Feng Wang, Shuochen Chang, Shaobo Wang, Yali Wang, Keming Ye, Jiangtong Li, Li Niu

发表机构 * Tsinghua University(清华大学)

AI总结 针对统一多模态模型在理解与生成之间的鸿沟导致的注意力纠缠和视觉细化瓶颈,提出一种自适应切换生成策略的框架,通过分层数据流水线和两阶段训练(SFT+RL)提升X2I任务性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

最近的统一模型在单一框架内集成了多模态理解和生成。然而,“理解-生成鸿沟”仍然存在,模型能够捕捉用户意图,但往往难以将这种语义知识转化为精确的像素级操作。这种鸿沟在任意到图像任务(X2I)中导致了两个瓶颈:注意力纠缠瓶颈,即盲目规划难以处理复杂提示;以及视觉细化瓶颈,即非结构化反馈无法有效纠正缺陷。在本文中,我们提出了一种新颖的框架,使统一模型能够根据指令复杂性和模型能力自主切换生成策略。为此,我们构建了一个分层数据流水线,在三种自适应模式中构建执行路径:简单情况的直接生成、质量细化的自我反思以及分解复杂场景的多步规划。基于该流水线,我们贡献了一个包含超过50,000个样本的高质量数据集,并实施了一个包含SFT和RL的两阶段训练策略。具体地,我们设计了逐步推理奖励以确保逻辑一致性,以及组内复杂度惩罚以防止冗余计算开销。大量实验表明,我们的方法在X2I上优于现有基线,在简单到复杂指令中实现了优越的生成保真度。代码已发布在 https://github.com/WeChatCV/Interleaved_Visual_Reasoner。

英文摘要

Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image (X2I) tasks: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and an intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity across simple-to-complex instructions. The code is released at https://github.com/WeChatCV/Interleaved_Visual_Reasoner.

2605.08193 2026-06-02 cs.CV cs.AI 版本更新

Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising

任意骨干网络的归一化等变性及其在图像去噪中的应用

Youssef Saied, François Fleuret

发表机构 * University of Cambridge(剑桥大学) DeepMind

AI总结 提出无参数包装器WNE,通过输入归一化、任意骨干网络处理、输出反归一化实现归一化等变,在盲去噪中提升CNN和Transformer对噪声水平失配的鲁棒性且无GPU开销。

详情
AI中文摘要

归一化等变性(NE)是一种结构先验,可提高图像到图像任务中对分布偏移的鲁棒性。函数 $f$ 是归一化等变的当且仅当对于所有 $a>0$ 和 $b\in\mathbb{R}$,有 $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$。现有的NE方法将每个内部层约束为与NE兼容的操作。这些约束增加了运行时成本,并排除了标准的Transformer组件,如softmax注意力和LayerNorm。我们引入了包装归一化等变性(WNE),这是一种无参数包装器,它对输入进行归一化,应用任意骨干网络,然后对输出进行反归一化。我们证明了每个NE函数都允许这种分解,因此该包装器精确参数化了NE函数类。在盲去噪中,包装CNN和Transformer架构在噪声水平失配下提高了鲁棒性,且没有可测量的GPU开销,而架构性NE基线则慢达 $1.6$ 倍。

英文摘要

Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f$ is normalization equivariant iff $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$ for all $a>0$ and $b\in\mathbb{R}$. Existing NE methods constrain every internal layer to NE-compatible operations. These constraints add runtime cost and exclude standard transformer components such as softmax attention and LayerNorm. We introduce Wrapped Normalization Equivariance (WNE), a parameter-free wrapper that normalizes the input, applies any backbone, and denormalizes the output. We prove every NE function admits this factorization, so the wrapper exactly parameterizes the class of NE functions. On blind denoising, wrapping CNN and transformer architectures improves robustness under noise-level mismatch with no measurable GPU overhead, while architectural NE baselines are up to $1.6\times$ slower.
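The wrapper idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the normalization uses the global mean and standard deviation of the input, and `backbone` is a hypothetical stand-in for any network. The names `wne_wrap` and `backbone` are made up for this sketch.

```python
import numpy as np

def wne_wrap(backbone, eps=1e-12):
    """Wrap any array-to-array function so the result is normalization equivariant."""
    def wrapped(y):
        mu = y.mean()             # shift statistic: mean(a*y + b) = a*mean(y) + b
        sigma = y.std() + eps     # scale statistic: std(a*y + b) = a*std(y) for a > 0
        z = (y - mu) / sigma      # normalized input no longer depends on a or b
        return sigma * backbone(z) + mu   # denormalize the backbone output
    return wrapped

# Hypothetical stand-in backbone: nonlinear and not equivariant on its own.
def backbone(z):
    return np.tanh(2.0 * z) + 0.1 * z ** 2

rng = np.random.default_rng(0)
y = rng.normal(size=(8, 8))
f = wne_wrap(backbone)

a, b = 3.5, -2.0
lhs = f(a * y + b)      # f(a*y + b*1)
rhs = a * f(y) + b      # a*f(y) + b*1
print(np.allclose(lhs, rhs))
```

The check compares both sides of $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$ numerically. The tiny `eps` guards against constant inputs and perturbs the identity only at floating-point scale.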

2605.13178 2026-06-02 cs.CV cs.AI 版本更新

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

CLIP Tricks You: 面向大型视觉-语言模型中高效像素定位的无训练令牌剪枝

Sangin Lee, Yukyung Choi

发表机构 * KAIST(韩国科学技术院)

AI总结 提出LiteLVLM,一种无需训练、文本引导的令牌剪枝策略,通过反转CLIP视觉-文本相似度排序,保留指代区域令牌并恢复上下文令牌,实现高效像素定位推理,在多种令牌预算下性能提升超5%,保持90%原始性能同时加速22%并减少2.3倍内存。

Comments Accepted by ICML 2026

详情
AI中文摘要

在大型视觉-语言模型中,视觉令牌通常构成输入令牌的大部分,导致大量计算开销。为了解决这个问题,最近的研究探索了为图像理解任务剪枝冗余或信息量较少的视觉令牌。然而,这些方法在像素定位任务中表现不佳,因为令牌重要性高度依赖于输入文本。通过对CLIP的深入分析,我们观察到指代区域内的视觉令牌与其文本表示的相似度通常较低。受此启发,我们引入了LiteLVLM,一种无需训练、文本引导的令牌剪枝策略,用于高效的像素定位推理。通过反转CLIP视觉-文本相似度的排序,LiteLVLM有效地保留了覆盖指代区域的视觉令牌,同时恢复上下文令牌以实现清晰的前景-背景分离。大量实验表明,LiteLVLM在不同令牌预算下均显著优于现有方法,性能提升超过5%。无需任何训练或微调,LiteLVLM在保持90%原始性能的同时,实现了22%的加速和2.3倍的内存减少。我们的代码可在https://github.com/sejong-rcv/LiteLVLM获取。

英文摘要

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens within referent regions often exhibit low similarity to their textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3X memory reduction. Our code is available at https://github.com/sejong-rcv/LiteLVLM.
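The reversed-ranking step described above can be illustrated with a small sketch. It assumes cosine similarity between each visual token and a single pooled text embedding, and it omits the context-token recovery step, whose details the abstract does not give. The function name `select_tokens_reversed` and the random embeddings are hypothetical.

```python
import numpy as np

def select_tokens_reversed(visual, text, keep):
    """Return indices of the `keep` visual tokens LEAST similar to the text, plus all similarities."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text)
    sim = v @ t                              # cosine similarity per visual token
    order = np.argsort(sim, kind="stable")   # ascending order: lowest similarity first
    return np.sort(order[:keep]), sim        # kept indices restored to original token order

rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 32))   # 16 visual tokens, 32-dim embeddings (random stand-ins)
text = rng.normal(size=32)           # pooled text embedding (random stand-in)

kept, sim = select_tokens_reversed(visual, text, keep=4)
pruned = np.setdiff1d(np.arange(len(sim)), kept)
print(kept, sim[kept].max() <= sim[pruned].min())
```

A conventional pruner would keep the highest-similarity tokens instead; flipping the sort direction is the only change this sketch makes.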

2604.24919 2026-06-02 cs.CV 版本更新

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Agentic AI 在遥感中的应用:技术挑战与研究方向

Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan, Xiao Xiang Zhu, Begüm Demir, Salman Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文指出遥感中的多步分析工作流存在结构性的地理空间约束,提出面向地球观测的原生智能体设计原则,包括结构化地理空间状态、工具感知推理、验证器引导执行和有效性感知学习评估。

Comments 31 pages. Position Paper

详情
AI中文摘要

地球观测(EO)正从静态预测转向多步分析工作流,这些工作流需要对数据、工具和地理空间状态进行协调推理。尽管基础模型和视觉语言模型在遥感中推进了表示学习和语言基础交互,并且智能体AI在长期推理和工具使用方面显示出强大潜力,但EO并非通用智能体AI的直接扩展。EO工作流处理的是地理参考、多模态和时间结构化的数据,其中重投影、重采样、合成和聚合等操作会改变底层状态,并可能限制后续分析。因此,错误可能跨步骤无声传播,正确性不仅取决于内部一致性,还取决于地理空间一致性、时间有效比较和物理有效性。本文立场是这些挑战是结构性的而非偶然的。我们审视了通用智能体系统中常见的假设,分析了它们在地理空间工作流中失效的方式,并描述了多步EO流水线中的故障模式。然后,我们概述了面向EO的原生智能体设计原则,这些原则围绕结构化地理空间状态、工具感知推理、验证器引导执行以及有效性感知学习和评估。因此,构建可靠的地理空间智能体需要围绕控制EO分析的物理、地理空间和工作流约束重新思考智能体设计。

英文摘要

Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have advanced representation learning and language-grounded interaction in remote sensing, and agentic AI has shown strong potential for long-horizon reasoning and tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate on georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation transform the underlying state and can constrain later analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We examine the assumptions commonly made in generic agentic systems, analyze how they break in geospatial workflows, and characterize failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning and evaluation. Building reliable geospatial agents, therefore, requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.

2605.12377 2026-06-02 cs.CV 版本更新

Fast Image Super-Resolution via Consistency Rectified Flow

通过一致性修正流实现快速图像超分辨率

Jiaqi Xu, Wenbo Li, Haoze Sun, Fan Li, Zhixin Wang, Long Peng, Jingjing Ren, Haoran Yang, Xiaowei Hu, Renjing Pei, Pheng-Ann Heng

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Noah’s Ark Lab(华为诺亚实验室) HKUST (GZ)(香港科技大学(广州)) South China University of Technology(华南理工大学)

AI总结 提出FlowSR方法,将超分辨率问题重构为从低分辨率到高分辨率图像的修正流,利用改进的一致性学习策略实现单步高质量超分辨率。

Comments Accepted by ICCV 2025; Code: https://github.com/jiaqixuac/FlowSR

详情
AI中文摘要

扩散模型在真实世界图像超分辨率中取得了显著成功,但其依赖耗时的多步采样严重阻碍了实际应用。尽管近期工作引入了少步或单步解决方案,但现有方法要么从噪声输入低效建模,要么未能充分利用迭代生成先验,损害了重建图像的保真度和质量。为解决此问题,我们提出FlowSR,一种将超分辨率问题重构为从低分辨率到高分辨率图像的修正流的新方法。我们的方法利用改进的一致性学习策略实现单步高质量超分辨率。具体而言,我们通过引入高分辨率正则化来优化原始一致性蒸馏过程,确保学习的SR流不仅强制自一致性,而且精确收敛到真实高分辨率目标。此外,我们引入快慢调度策略,其中用于一致性学习的相邻时间步从两个不同的调度器采样:快速调度器使用较少时间步以提高效率,慢速调度器使用更多时间步以捕捉细粒度纹理细节。大量实验表明,FlowSR在效率和图像质量方面均取得了出色性能。代码:https://github.com/jiaqixuac/FlowSR 。

英文摘要

Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality. Code: https://github.com/jiaqixuac/FlowSR.

2605.05057 2026-06-02 cs.CV 版本更新

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

ScriptHOI:学习脚本化状态转换用于开放词汇人-物交互检测

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, SuiYang Guang, Tuan Kiet Pham, Linh Chi Vo

发表机构 * Phenikaa University(费因克斯大学)

AI总结 提出ScriptHOI框架,将交互短语分解为软脚本化状态转换,通过视觉状态分词器和槽位匹配器校准HOI逻辑,并引入区间部分标签学习和反事实脚本对比损失,提升开放词汇HOI检测中稀有和未见交互的识别,减少功能冲突误报。

详情
AI中文摘要

开放词汇人-物交互(HOI)检测需要识别在训练期间可能未作为注释类别出现的交互短语。最近的视觉-语言HOI检测器通过将人-物特征与文本嵌入匹配来改进语义迁移,但其预测通常受物体功能性和短语级共现主导。因此,模型可能仅凭刀和蛋糕的存在就预测“切蛋糕”,而未验证手、工具、目标、接触模式和物体状态是否共同支持该动作。我们提出ScriptHOI,一个结构化框架,将每个交互短语表示为软脚本化状态转换。ScriptHOI不将短语视为单个类别标记,而是将其分解为身体角色、接触、几何、功能性、运动和物体状态槽位。视觉状态分词器将每个检测到的人-物对解析为相应的状态标记,槽位匹配器估计脚本覆盖率和脚本冲突。这两个量校准HOI逻辑值,暴露缺失的视觉证据,并为不完整注释提供训练约束。为避免抑制有效但未注释的交互,我们进一步引入区间部分标签学习,该学习使用脚本导出的下界和上界概率约束未注释的候选,而不是分配封闭世界的负例。反事实脚本对比损失交换单个脚本槽位以阻止仅物体捷径。在HICO-DET、V-COCO和开放词汇HOI划分上的实验表明,ScriptHOI改善了稀有和未见交互的识别,同时大幅减少了功能冲突假阳性。

英文摘要

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.

2602.08058 2026-06-02 cs.CV cs.AI cs.RO cs.SY eess.SY 版本更新

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Picasso: 基于物理约束采样的整体场景重建

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

发表机构 * Massachusetts Institute of Technology(麻省理工学院) National University of Singapore(新加坡国立大学)

AI总结 提出Picasso,一种通过快速拒绝采样推理多物体交互并考虑几何、非穿透和物理约束的整体场景重建方法,在物理合理性和重建精度上显著优于现有技术。

Comments 15 pages, accepted to Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

在存在遮挡和测量噪声的情况下,几何精确的场景重建(即拟合传感器数据)仍然可能在物理上不正确。例如,当估计场景中物体的姿态和形状并将结果导入模拟器时,微小误差可能导致不合理的配置,包括物体相互穿透或不稳定平衡。这使得使用数字孪生预测场景的动态行为变得困难,而这是基于模拟的接触丰富行为规划和控制的重要步骤。在本文中,我们认为物体姿态和形状估计需要对场景进行整体推理(而不是孤立地推理每个物体),考虑物体交互和物理合理性。为此,我们的第一个贡献是Picasso,一个受物理约束的重建流水线,通过考虑几何、非穿透和物理来构建多物体场景重建。Picasso依赖于一种快速拒绝采样方法,该方法推理多物体交互,利用推断的物体接触图来指导采样。其次,我们提出了Picasso数据集,这是一个包含10个接触丰富真实场景的集合,带有真实标注,以及一个量化物理合理性的指标,我们将其作为基准测试的一部分开源。最后,我们在新引入的数据集和YCB-V数据集上对Picasso进行了广泛评估,结果表明它在提供物理合理且更符合人类直觉的重建的同时,大幅优于现有技术。

英文摘要

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.

2605.09883 2026-06-02 cs.CV cs.AI 版本更新

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

笛卡尔捷径:在极坐标空间中重新评估视觉推理

Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

发表机构 * Stanford University(斯坦福大学) Google Research(谷歌研究院)

AI总结 针对多模态大语言模型在视觉推理中利用笛卡尔坐标捷径的问题,提出Polaris-Bench基准,将任务转换至极坐标空间,揭示模型缺乏拓扑不变性视觉推理。

详情
AI中文摘要

随着当前多模态大语言模型迅速饱和标准视觉推理基准,一个关键问题浮现:这些高分是否真正反映了鲁棒的视觉理解?我们发现了一个普遍存在的漏洞,即笛卡尔捷径:视觉推理基准普遍基于正交网格布局,这些布局可以轻易地离散化为显式的文本坐标。模型系统地利用这一特性,大量依赖基于文本的演绎推理来辅助视觉问题解决。为了系统地消除这一捷径,我们引入了Polaris-Bench,该基准将53个视觉推理任务重新表述在极坐标空间中,并配有对应的笛卡尔坐标作为参考,同时保持一致的逻辑约束和任务语义——从而从根本上打破了模型所利用的正交先验。对14个最先进MLLM的全面评估显示,在笛卡尔布局上达到70%-83%的前沿模型在极坐标等价布局上骤降至31%-39%,即使在完全逻辑等价的情况下,性能下降依然持续。此外,在笛卡尔布局上观察到的推理增益在极坐标等价布局上严重减弱。这些发现揭示了当前MLLM的一个关键缺陷:缺乏拓扑不变的视觉推理。

英文摘要

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics, thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70-83% on Cartesian layouts collapse to 31-39% on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.

2605.09503 2026-06-02 cs.CV 版本更新

PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

PermuQuant:通过重新排列通道降低扩散模型每组量化误差

Yongsen Cheng, Kai Liu, Kaiwen Tao, Junxian Li, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 提出PermuQuant框架,通过基于联合二阶矩的通道重排序和校准接受规则,降低低比特扩散模型每组量化误差,实现显著加速和内存压缩。

详情
AI中文摘要

大规模视觉生成模型取得了显著性能。然而,其高计算和内存成本使得在资源受限场景(如交互应用和个人单GPU使用)中部署具有挑战性。训练后量化(PTQ)通过压缩预训练模型而无需昂贵的重新训练,提供了一种实用解决方案。然而,现有的PTQ方法在极低比特设置下仍然存在严重的质量下降。在本文中,我们识别出通道排序是每组量化中一个重要但未被充分探索的因素。在此设置中,每个连续组共享一个量化尺度。当具有非常不同统计特性的通道被放置在同一个组中时,尺度可能被异常值主导,导致大的量化误差。基于这一观察,我们提出了PermuQuant,一种简单有效的低比特扩散模型PTQ框架。PermuQuant在每组量化之前通过联合二阶矩准则对通道进行排序,将具有相似激活和权重统计的通道放入同一组。它进一步使用基于校准的接受规则,仅当所选排列在校准数据上降低量化误差时才应用重排序。选定的排列被吸收到相邻模块中或离线应用于权重,避免了显式的运行时排列操作。在多个大型扩散模型上的大量实验表明,PermuQuant一致地降低了量化误差,并优于现有的PTQ基线。在搭载RTX 5090的FLUX.1-dev上,PermuQuant在W4A4 NVFP4量化下实现了高达1.7倍的单步加速,并将DiT内存占用减少了3.5倍。代码将在https://github.com/yscheng04/PermuQuant提供。

英文摘要

Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a $1.7\times$ single-step speedup and reduces the DiT memory footprint by $3.5\times$ under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
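A rough sketch of the three ingredients named in the abstract (per-group scales, a second-moment channel ordering, and a calibration-based acceptance rule) might look as follows. Several parts are assumptions made for illustration: the scoring formula (product of activation and weight second moments), the symmetric integer-style grid standing in for NVFP4, and all function names. None of this is the authors' code.

```python
import numpy as np

def quantize_per_group(x, group=8, levels=7):
    """Symmetric absmax fake-quantization; each contiguous group of channels shares one scale."""
    n, c = x.shape
    g = x.reshape(n, c // group, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / levels
    scale = np.where(scale == 0, 1.0, scale)    # guard against all-zero groups
    q = np.clip(np.round(g / scale), -levels, levels)
    return (q * scale).reshape(n, c)

def output_error(x, w, perm, group=8):
    """MSE of the layer output when activations are quantized in the permuted channel order."""
    xq = quantize_per_group(x[:, perm], group)
    return float(np.mean((xq @ w[perm] - x @ w) ** 2))

rng = np.random.default_rng(0)
channel_scale = rng.choice([1.0, 10.0], size=32, p=[0.75, 0.25])  # a few large-magnitude channels
x = rng.normal(size=(256, 32)) * channel_scale   # calibration activations (tokens x channels)
w = rng.normal(size=(32, 16))                    # weights (channels x outputs)

# Assumed criterion: product of per-channel activation and weight second moments.
score = (x ** 2).mean(axis=0) * (w ** 2).mean(axis=1)
identity = np.arange(32)
candidate = np.argsort(score)

err_identity = output_error(x, w, identity)
err_candidate = output_error(x, w, candidate)

# Acceptance rule: keep the reordering only if it lowers the calibration error.
perm = candidate if err_candidate < err_identity else identity
err_accepted = output_error(x, w, perm)
print(err_identity, err_candidate, err_accepted)
```

Permuting the activation columns and the weight rows with the same index vector leaves the full-precision product `x @ w` unchanged, which is why the permutation can be folded into the weights offline. Sorting is not guaranteed to help under every error metric, which is the motivation for the acceptance check.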

2605.09382 2026-06-02 cs.LG cs.CV cs.DS math.OC 版本更新

Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts

学习增强的可扩展线性分配问题优化:基于神经对偶热启动

Ilay Yavlovich, Jad Agbaria, Muhamed Mhamed, Nir Weinberger, Jose Yallouz

发表机构 * Department of Electrical and Computer Engineering, Technion -- Israel Institute of Technology, Haifa, Israel(以色列理工学院电气与计算机工程系,以色列海法)

AI总结 提出一种学习增强框架,通过预测对偶变量热启动精确求解器,并设计轻量级行独立架构RowDualNet避免O(N^2)内存瓶颈,实现可扩展的神经热启动,在保持最优性的同时获得超过2倍加速。

Comments Accepted to ICML 2026. 23 pages, 18 figures

详情
AI中文摘要

线性分配问题是一个基本的组合优化任务,经典精确求解器能保证最优性但受限于O(N^3)瓶颈,而最近的神经近似方法在可扩展性和精确性上存在困难。我们提出一个学习增强框架,通过预测对偶变量来热启动搜索,加速精确求解器,并配备回退机制以保持最坏情况保证。我们的核心是RowDualNet,一种轻量级、行独立的架构,避免了图模型的O(N^2)内存瓶颈,实现了高达N=16,384的可扩展神经热启动。通过Min-Trick机制,可行性由构造保证,完全消除了昂贵的迭代投影。实验上,我们的方法大幅减少了Jonker-Volgenant (LAPJV)算法的搜索努力,实现了鲁棒的零样本泛化,在复杂合成数据上获得超过2倍的端到端加速,在真实世界跟踪上获得1.25倍加速,在交通网络上获得1.5倍加速,同时严格保持最优性。

英文摘要

The Linear Assignment Problem is a fundamental combinatorial optimization task where classical exact solvers ensure optimality but suffer from an $\mathcal{O}(N^{3})$ bottleneck, while recent neural approximations struggle with scalability and exactness. We propose a learning-augmented framework that accelerates exact solvers by predicting dual variables to warm-start the search, backed by a fallback mechanism to preserve worst-case guarantees. Central to our approach is RowDualNet, a lightweight, row-independent architecture that avoids the $\mathcal{O}(N^{2})$ memory bottleneck of graph models, enabling scalable neural warm-starting up to $N=16{,}384$. Feasibility is guaranteed by construction via the Min-Trick mechanism, completely eliminating the need for costly iterative projections. Empirically, our method drastically reduces the search effort of the Jonker-Volgenant (LAPJV) algorithm, yielding robust zero-shot generalization with strict optimality and end-to-end speedups of over 2x on complex synthetic data, 1.25x on real-world tracking, and 1.5x on transportation networks.
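The abstract does not spell out the Min-Trick, so the following sketch shows one standard construction that fits the description and may differ from the paper's. Given any predicted row duals $u$, set $v_j = \min_i (c_{ij} - u_i)$, so that $u_i + v_j \le c_{ij}$ holds by construction. The random "prediction" and the name `complete_duals` are hypothetical.

```python
import numpy as np

def complete_duals(cost, u):
    """Given any row duals u, return the largest column duals v with u_i + v_j <= c_ij."""
    return (cost - u[:, None]).min(axis=0)

rng = np.random.default_rng(0)
n = 6
cost = rng.uniform(0.0, 10.0, size=(n, n))

# Stand-in for a network's output: a noisy guess at the row duals.
u_pred = cost.min(axis=1) + rng.normal(scale=0.5, size=n)

v = complete_duals(cost, u_pred)
reduced = cost - u_pred[:, None] - v[None, :]   # reduced costs; non-negative means dual feasible
dual_value = u_pred.sum() + v.sum()             # lower bound on the optimal assignment cost

# Any feasible assignment (here the identity matching) costs at least the dual value.
identity_cost = cost[np.arange(n), np.arange(n)].sum()
print(reduced.min() >= 0.0, dual_value <= identity_cost)
```

Feasible duals of this kind are what a Jonker-Volgenant style solver can start from. The closer the predicted $u$ is to an optimal dual, the more zero reduced-cost entries already line up with an optimal matching and the less augmentation work remains.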

2605.08398 2026-06-02 cs.LG cs.CV 版本更新

Exploring and Exploiting Stability in Latent Flow Matching

探索和利用潜流匹配中的稳定性

Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文证明潜流匹配模型对数据缩减和模型容量收缩具有鲁棒性,并利用这种稳定性提出更高效的训练和推理算法,包括数据节省和超过两倍的推理加速。

Comments Accepted at ICML 2026

详情
AI中文摘要

在这项工作中,我们展示了潜流匹配(LFM)模型对不同类型的扰动具有鲁棒性,包括数据缩减和模型容量收缩。我们通过这些模型在相同噪声种子下倾向于生成相似输出来表征这种稳定性。我们提供了一个视角,将这种现象与流匹配理论联系起来,表明这种稳定性是FM目标固有的。我们进一步利用这种稳定性推导出更高效训练和推理的实用算法。具体来说,首先,我们表明通过在显著减少的数据集上训练LFM模型,性能得以保持,并且在计算受限的情况下,模型在保持质量的同时收敛更快。这带来了多种优势,包括由于更快的收敛而节省训练时间,以及在训练条件模型时减轻标注工作。其次,LFM在架构收缩下的稳定性产生了一种双模型由粗到细的方法,一个使用轻量级架构用于FM轨迹的第一阶段,另一个具有更高容量用于第二阶段,从而大幅降低推理成本。为了确定哪些样本具有信息量,我们引入了三个样本评分标准,并在生成模型的标准指标下进行评估。我们的结果在多个数据集上进行了彻底评估,展示了这种稳定性的实际优势,包括数据节省和超过两倍的推理加速,同时生成可比较的输出。

英文摘要

In this work, we show that Latent Flow-Matching (LFM) models are robust to different types of perturbations, including data reduction and model capacity shrinkage. We characterize this stability by these models' tendency to generate similar outputs under identical noise seeds. We provide a perspective relating this phenomenon to flow matching theory, which indicates that this stability is inherent to the FM objective. We further exploit this stability to derive practical algorithms for more efficient training and inference. Concretely, first, we show that by training LFM models on significantly reduced datasets, performance is preserved, and in compute-constrained regimes, the model converges faster while maintaining quality. This yields multiple advantages, including savings in the training time due to faster convergence, and alleviating annotation effort when training conditional models. Second, LFM stability under architectural shrinkage gives rise to a two-model coarse-to-fine approach, one using a light-weight architecture for the first phase of the FM trajectory, and one with higher capacity for the second, thereby reducing the inference cost substantially. To determine which samples are informative, we introduce three sample-scoring criteria and evaluate them under standard metrics for generative models. Our results are thoroughly evaluated on multiple datasets, demonstrating the practical advantage of this stability, including data savings and a more than two-fold inference speedup while generating comparable outputs.

2605.07061 2026-06-02 cs.SD cs.AI cs.CV cs.MM 版本更新

Do Joint Audio-Video Generation Models Understand Physics?

联合音视频生成模型是否理解物理?

Zijun Cui, Xiulong Liu, Hao Fang, Mingwei Xu, Jiageng Liu, Zexin Xu, Weiguo Pian, Shijian Deng, Feiyu Du, Chenming Ge, Yapeng Tian

发表机构 * University of Texas at Dallas(德克萨斯大学达拉斯分校) University of Washington(华盛顿大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 针对联合音视频生成模型,提出AV-Phys Bench基准测试其物理常识,发现所有模型在物理一致性上表现不足,尤其是事件驱动和环境驱动转换场景。

Comments Preprint. Project Page: https://zijuncui.com/AV-Phys/. Full abstract appears in the PDF

详情
AI中文摘要

联合音视频生成模型正迅速接近专业制作质量,这引发了一个核心问题:它们是否理解音视频物理,还是仅仅生成看似合理但违反现实一致性的声音和帧?我们引入了AV-Phys Bench,一个用于评估联合音视频生成中物理常识的基准。AV-Phys Bench测试模型在三种场景类别上的表现:稳态、事件转换和环境转换。它涵盖了从现实场景中提取的基于物理的子类别,以及故意要求物理不一致音视频行为的反AV物理提示。每个生成结果沿五个维度评估:视觉语义遵循、音频语义遵循、视觉物理常识、音频物理常识和跨模态物理常识。在三个专有模型和四个开源模型中,我们发现Seedance 2.0整体表现最佳,但所有模型距离鲁棒的物理理解仍有很大差距。在事件驱动和环境驱动转换上性能急剧下降,即使是强大的专有系统在反AV物理提示上也崩溃。我们进一步引入了AV-Phys Agent,一个结合多模态语言模型与确定性声学测量工具的ReAct风格评估器,产生的排名与人类评分高度一致。我们的结果指出,跨模态物理一致性和转换驱动的场景动态是联合音视频生成的关键开放挑战。

英文摘要

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.

2602.04672 2026-06-02 cs.CV cs.GR cs.RO 版本更新

AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

AGILE: 通过代理生成从视频重建手-物体交互

Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu, Xiaoyang Liu, Zeju Li, Hao Chen, Chunhua Shen

发表机构 * State Key Lab of CAD & CG, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室) Zhejiang University of Technology(浙江工业大学)

AI总结 提出AGILE框架,利用视觉语言模型引导生成完整物体网格,结合锚定-跟踪策略和接触感知优化,从单目视频鲁棒重建手-物体交互,生成可直接用于仿真的资产。

Comments 16 pages, SIGGRAPH 2026

详情
AI中文摘要

从单目视频重建动态手-物体交互对于灵巧操作数据收集以及为机器人和VR创建逼真的数字孪生至关重要。然而,当前方法面临两个难以逾越的障碍:(1) 依赖神经渲染通常在严重遮挡下产生碎片化、不可用于仿真的几何体;(2) 依赖脆弱的运动恢复结构(SfM)初始化导致在野外视频中频繁失败。为克服这些限制,我们提出AGILE,一个鲁棒的框架,将范式从重建转变为交互学习的代理生成。首先,我们采用代理流水线,其中视觉语言模型(VLM)引导生成模型合成一个完整、水密的物体网格,具有高保真纹理,不受视频遮挡影响。其次,完全绕过脆弱的SfM,我们提出一种鲁棒的锚定-跟踪策略。我们使用基础模型在单个交互起始帧初始化物体姿态,并通过利用生成资产与视频观测之间的强视觉相似性在时间上传播姿态。最后,接触感知优化整合语义、几何和交互稳定性约束以强制执行物理合理性。在HO3D、DexYCB、ARCTIC和野外视频上的大量实验表明,AGILE在全局几何精度上优于基线,同时在先前技术经常崩溃的具有挑战性的序列上表现出卓越的鲁棒性。通过优先考虑物理有效性,我们的方法生成可直接用于仿真的资产,并通过真实到仿真重定向在机器人应用中验证。项目页面:https://agile-hoi.github.io。

英文摘要

Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior arts frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.

2604.17415 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

奖励分数匹配:统一流模型和扩散模型的基于奖励的微调

Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, Korea(人工智能研究生院,韩国科学技术院)

AI总结 提出奖励分数匹配(RSM)框架,统一了多种基于奖励的微调方法,通过分数匹配与值引导目标对齐,简化了设计空间并提高了效率。

Comments 43 pages, 15 figures

详情
AI中文摘要

基于奖励的微调引导预训练的扩散或基于流的生成模型生成更高奖励的样本,同时保持接近预训练模型。尽管现有方法源自不同视角,但我们表明许多方法可以写在一个共同框架下,我们称之为奖励分数匹配(RSM)。在此视角下,对齐变为针对值引导目标的分数匹配,方法间的主要差异归结为值引导估计器的构建和跨时间步的有效优化强度。这种统一澄清了现有设计的偏差-方差-计算权衡,并将核心优化组件与增加复杂性而无明显益处的辅助机制区分开来。在此视角指导下,我们针对代表性的可微和黑盒奖励对齐任务开发了更简单、更高效的重新设计。总体而言,RSM将看似分散的基于奖励的微调方法集合转变为更小、更可解释且更可操作的设计空间。代码可在 https://github.com/jaylee2000/rsm 获取。

英文摘要

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space. Code is available at https://github.com/jaylee2000/rsm

2604.09063 2026-06-02 cs.CV cs.AI 版本更新

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

频率增强扩散模型:基于课程引导语义对齐的零样本骨架动作识别

Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing(测绘遥感信息工程国家重点实验室) Wuhan University(武汉大学) Information Systems Technology and Design Pillar(信息系统技术与设计学院) Singapore University of Technology and Design(新加坡科技与设计大学) School of Geodesy and Geomatics(测绘学院) School of Mathematics and Statistics(数学与统计学院) Wuhan University Shenzhen Research Institute(武汉大学深圳研究院)

AI总结 提出频率感知扩散模型FDSM,通过语义引导频谱残差模块、时间步自适应频谱损失和课程语义抽象,解决扩散模型频谱偏差导致的高频动态过度平滑问题,实现零样本骨架动作识别,在多个数据集上达到最优性能。

Comments Accepted by The Visual Computer

详情
AI中文摘要

人体动作识别在计算机视觉中至关重要,应用范围从监控到人机交互。尽管基于监督的骨架方法有效,但其对详尽标注的依赖限制了对新动作的泛化能力。零样本骨架动作识别(ZSAR)成为一种有前景的范式,但由于扩散模型的频谱偏差(过度平滑高频动态)而面临挑战。在此,我们提出频率感知扩散用于骨架-文本匹配(FDSM),集成了语义引导频谱残差模块、时间步自适应频谱损失和基于课程的语义抽象以应对这些挑战。我们的方法有效恢复了细粒度运动细节,在NTU RGB+D、PKU-MMD和Kinetics-skeleton数据集上实现了最先进的性能。代码已公开于https://github.com/yuzhi535/FDSM。项目主页:https://yuzhi535.github.io/FDSM.github.io/

英文摘要

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/

2405.15491 2026-06-02 cs.CV 版本更新

GSDeformer: Direct, Real-time and Extensible Cage-based Deformation for 3D Gaussian Splatting

GSDeformer:面向3D高斯泼溅的直接、实时且可扩展的笼形变形方法

Jiajun Huang, Shuolin Xu, Hongchuan Yu, Tong-Yee Lee

发表机构 * National Centre for Computer Animation(国家计算机动画中心) Bournemouth University(伯恩茅斯大学) Department of Computer Science and Information Engineering(计算机科学与信息工程系)

AI总结 提出GSDeformer,通过代理点云表示桥接笼形变形与3D高斯泼溅,实现无需重新训练、实时且兼容多种3DGS变体的直接变形。

Comments Project Page: https://jhuangbu.github.io/gsdeformer, Video: https://www.youtube.com/watch?v=-ecrj48-MqM

详情
AI中文摘要

我们提出了GSDeformer,一种能够在3D高斯泼溅(3DGS)上实现笼形变形的方法。我们的方法通过使用代理点云表示来桥接笼形变形和3DGS。该点云从3D高斯生成,施加于点云的变形被转换为对3D高斯的变换。为了处理变形可能引起的弯曲,我们引入了一个分裂过程来近似它。我们的方法不修改或扩展3D高斯泼溅的核心架构,因此与任何训练好的原始3DGS或其变体兼容。此外,我们使用渲染-重建方法自动为3DGS及其变体构建笼子。实验表明,与现有方法相比,GSDeformer提供了更优的变形结果,在极端变形下具有鲁棒性,无需重新训练即可编辑,实时运行,并且可以扩展到其他3DGS变体。项目页面:https://jhuangbu.github.io/gsdeformer/

英文摘要

We present GSDeformer, a method that enables cage-based deformation on 3D Gaussian Splatting (3DGS). Our approach bridges cage-based deformation and 3DGS by using a proxy point-cloud representation. This point cloud is generated from 3D Gaussians, and deformations applied to the point cloud are translated into transformations on the 3D Gaussians. To handle potential bending caused by deformation, we incorporate a splitting process to approximate it. Our method does not modify or extend the core architecture of 3D Gaussian Splatting, making it compatible with any trained vanilla 3DGS or its variants. Additionally, we automate cage construction for 3DGS and its variants using a render-and-reconstruct approach. Experiments demonstrate that GSDeformer delivers superior deformation results compared to existing methods, is robust under extreme deformations, requires no retraining for editing, runs in real-time, and can be extended to other 3DGS variants. Project Page: https://jhuangbu.github.io/gsdeformer/

2605.00310 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

超越视觉保真度:通过下游任务集成评估大规模遥感影像的超分辨率模型

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

发表机构 * University of Maryland(马里兰大学) University of Pittsburgh(匹兹堡大学) Worcester Polytechnic Institute(伍斯特理工学院) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 针对现有超分辨率评估依赖PSNR/SSIM等保真度指标而忽略下游任务效用的问题,提出GeoSR-Bench基准数据集,集成土地覆盖分割、基础设施映射等下游任务,评估GAN、Transformer等9种SR模型在270种设置下的性能,发现保真度指标与任务性能弱相关甚至负相关。

Comments Under review at IEEE TPAMI

详情
AI中文摘要

超分辨率(SR)技术在从低分辨率输入重建高分辨率图像方面取得了重大进展。分辨率的提高为监测任务提供了视觉增强和实用性。特别是,SR已越来越多地用于基于卫星的地球观测,应用于城市规划、农业、生态学和灾害响应。然而,现有的SR研究和基准通常使用保真度指标如PSNR或SSIM,而超分辨率图像的真实效用在于支持下游任务,如土地覆盖分类、生物量估计和变化检测。为弥合这一差距,我们引入了GeoSR-Bench,一个下游任务集成的SR基准数据集,用于评估超越保真度指标的SR模型。GeoSR-Bench包含来自约36,000个地点的空间共位、时间对齐和质量控制的图像对,覆盖多种土地覆盖类型,分辨率从500米到0.6米。据我们所知,GeoSR-Bench是第一个直接将SR模型提高的图像分辨率与下游地球监测任务(包括土地覆盖分割、基础设施映射和生物物理变量估计)联系起来的SR基准。利用GeoSR-Bench,我们对基于GAN、Transformer、神经算子和扩散的SR模型在感知质量和下游任务性能上进行了基准测试。我们进行了270种设置的实验,涵盖2个跨平台SR任务、9个SR模型、3个下游任务模型以及每个SR任务的5个下游任务。结果表明,传统SR指标的改进通常与任务性能的提升不相关,甚至可能负相关,表明这些指标为选择适用于下游任务的优越模型提供的指导有限。这揭示了将下游任务集成到SR模型开发和评估中的必要性。

英文摘要

Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increased resolution provides visual enhancement and utility for monitoring tasks. In particular, SR has been increasingly developed for satellite-based Earth observation, with applications in urban planning, agriculture, ecology, and disaster response. However, existing SR studies and benchmarks typically use fidelity metrics such as PSNR or SSIM, whereas the true utility of super-resolved images lies in supporting downstream tasks such as land cover classification, biomass estimation, and change detection. To bridge this gap, we introduce GeoSR-Bench, a downstream task-integrated SR benchmark dataset to evaluate SR models beyond fidelity metrics. GeoSR-Bench comprises spatially co-located, temporally aligned, and quality-controlled image pairs from about 36,000 locations across diverse land covers, spanning resolutions from 500m to 0.6m. To the best of our knowledge, GeoSR-Bench is the first SR benchmark that directly connects improved image resolution from SR models with downstream Earth monitoring tasks, including land cover segmentation, infrastructure mapping, and biophysical variable estimation. Using GeoSR-Bench, we benchmark GAN, transformer, neural operator, and diffusion-based SR models on perceptual quality and downstream task performance. We conduct experiments with 270 settings, covering 2 cross-platform SR tasks, 9 SR models, 3 downstream task models, and 5 downstream tasks for each SR task. The results show that improvements in traditional SR metrics often do not correlate with gains in task performance, and the correlations can be negative, indicating that these metrics provide limited guidance for selecting superior models for downstream tasks. This reveals the need to integrate downstream tasks into SR model development and evaluation.
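The claim that fidelity metrics and downstream performance can be weakly or negatively correlated is usually checked with a rank correlation across models. The following is a minimal sketch of Spearman's rank correlation in plain Python; the abstract does not say which correlation statistic GeoSR-Bench uses, so the choice of Spearman and the function names `average_ranks` and `spearman` are assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Passing one fidelity score per SR model as `x` and the matching downstream score as `y` gives a value in [-1, 1]; a negative value would correspond to the situation the abstract describes, where a higher PSNR or SSIM goes with a lower task score.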

2506.05412 2026-06-02 cs.CV cs.CL 版本更新

Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

视觉-语言模型将头部方向误认为注视方向:非语言对话线索

Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo

发表机构 * Brown University(布朗大学) Columbia University(哥伦比亚大学) Emory University(埃默里大学) Johns Hopkins University(约翰霍普金斯大学) University of Washington(华盛顿大学) Carnegie Mellon University(卡内基梅隆大学) University of Michigan(密歇根大学) UC San Diego(加州大学圣地亚哥分校)

AI总结 本研究通过控制头部方向的实验发现,视觉-语言模型(VLMs)在推断注视目标时主要依赖头部方向而非眼睛外观,导致与人类存在显著性能差距,并指出数据偏差是主要原因。

Comments Accepted by ACL 2026. Project page at https://zoryzhang.github.io/gaze/

详情
AI中文摘要

一个人的注视方向是儿童和成人常用的非语言交流线索。视觉-语言模型(VLMs)推断注视目标的能力如何?为了构建评估刺激,我们拍摄了1,360张真实场景照片,其中一个人注视着桌子上几个物体之一。重要的是,我们还控制了注视者的头部方向:有时朝向注视目标,有时朝向干扰物,有时不加约束。我们发现VLMs与人类之间存在显著的性能差距,排除了分辨率、物体命名能力等替代解释,并确定了差距的主要原因是VLMs使用头部方向而非眼睛外观来推断注视方向。这种偏差可能源于数据而非架构,正如基于transformer的视觉模型微调的概念验证实验所表明的那样。未来的工作应研究这些发现是否广泛适用于基于现有数据训练的各种深度学习方法,以及更好的数据是否能缓解所有架构的这一问题。准确定位原因将为能够解读注视目标的技术奠定基础,从而与人类进行更高效的交互。

英文摘要

Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained. We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as resolution and object-naming skills, and identified the main reason for the gap as VLMs inferring gaze direction using head orientation rather than eye appearance. Such a bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision model. Future work should investigate whether these findings hold broadly across various deep learning methods trained on existing data, and whether better data mitigates this problem for all architectures. Pinpointing the reason sets the stage for technologies that can interpret gaze targets to have more efficient interactions with humans.

2602.08580 2026-06-02 q-bio.TO cs.CV 版本更新

retinalysis-vascx: An explainable software toolbox for the extraction of retinal vascular biomarkers

retinalysis-vascx: 一个用于提取视网膜血管生物标志物的可解释软件工具箱

Jose D. Vargas Quiros, Michael J. Beyeler, Sofia Ortin Vela, EyeNED Reading Center, Sven Bergmann, Caroline C. W. Klave, Bart Liefers, VascX Research Consortium

发表机构 * Department of Ophthalmology, Erasmus University Medical Center(伊拉斯姆斯大学医学中心眼科系) Department of Epidemiology, Erasmus University Medical Center(伊拉斯姆斯大学医学中心流行病学系) Department of Ophthalmology, Radboud University Medical Center(拉德堡德大学医学中心眼科系) Institute of Molecular and Clinical Ophthalmology, University of Basel(巴塞尔大学分子与临床眼科研究所) Dept. of Computational Biology, University of Lausanne(洛桑大学计算生物学系) Swiss Institute of Bioinformatics, Lausanne, Switzerland(瑞士生物信息学研究所,洛桑,瑞士) Dept. of Integrative Biomedical Sciences, University of Cape Town(开普敦大学整合生物医学科学系)

AI总结 提出开源Python工具箱VascX,从彩色眼底图像中提取视网膜血管生物标志物,包括血管密度、中央视网膜等效值和迂曲度等,并通过可重复性分析和敏感性分析验证其稳健性。

详情
AI中文摘要

从彩色眼底图像(CFI)中自动提取视网膜血管生物标志物对于大规模视网膜血管研究至关重要。我们提出VascX,一个开源的Python工具箱,可从CFI动静脉分割中提取生物标志物。VascX从血管分割掩膜开始,提取其骨架,构建无向和有向血管图,并将血管段解析为更长的血管。导出一组全面的生物标志物,包括血管密度、中央视网膜等效值(CRE)和迂曲度。空间局部化的生物标志物可相对于中央凹和视盘放置的网格进行计算。VascX通过GitHub和PyPI发布,附有全面的文档和示例。我们对同一眼睛在不同设备上重复成像的测试-重测再现性分析表明,大多数VascX生物标志物具有中等至极佳的一致性(ICC > 0.5),不同生物标志物的稳健性水平存在重要差异。我们对生物标志物对图像扰动和启发式参数值的敏感性分析支持这些差异,并进一步表征了VascX生物标志物。最终,VascX提供了一个可解释且易于修改的特征提取工具箱,补充了分割以产生可靠的视网膜血管生物标志物。我们基于图的生物标志物计算阶段支持可重复、区域感知的测量,适用于大规模临床和流行病学研究。通过支持轻松提取现有生物标志物和快速实验新生物标志物,VascX支持眼组学研究。其稳健性和计算效率便于在大型数据库中可扩展部署,而开源分发降低了眼科研究人员和临床医生的采用门槛。

英文摘要

Automatic extraction of retinal vascular biomarkers from color fundus images (CFI) is crucial for large-scale studies of the retinal vasculature. We present VascX, an open-source Python toolbox that extracts biomarkers from CFI artery-vein segmentations. VascX starts from vessel segmentation masks, extracts their skeletons, builds undirected and directed vessel graphs, and resolves vessel segments into longer vessels. A comprehensive set of biomarkers is derived, including vascular density, central retinal equivalents (CREs), and tortuosity. Spatially localized biomarkers may be calculated over grids placed relative to the fovea and optic disc. VascX is released via GitHub and PyPI with comprehensive documentation and examples. Our test-retest reproducibility analysis on repeat imaging of the same eye by different devices shows that most VascX biomarkers have moderate to excellent agreement (ICC > 0.5), with important differences in the level of robustness of different biomarkers. Our analyses of biomarker sensitivity to image perturbations and heuristic parameter values support these differences and further characterize VascX biomarkers. Ultimately, VascX provides an explainable and easily modifiable feature-extraction toolbox that complements segmentation to produce reliable retinal vascular biomarkers. Our graph-based biomarker computation stages support reproducible, region-aware measurements suited for large-scale clinical and epidemiological research. By enabling easy extraction of existing biomarkers and rapid experimentation with new ones, VascX supports oculomics research. Its robustness and computational efficiency facilitate scalable deployment in large databases, while open-source distribution lowers barriers to adoption for ophthalmic researchers and clinicians.
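To make the tortuosity biomarker concrete, here is a minimal sketch of one common definition, the ratio of a vessel's path length to the straight-line distance between its endpoints. The abstract does not state which tortuosity measure VascX implements, so this arc-to-chord ratio and the function names `arc_length` and `tortuosity` are illustrative assumptions, not the toolbox's API.

```python
import math


def arc_length(points):
    """Sum of Euclidean distances between consecutive centerline points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def tortuosity(points):
    """Arc-to-chord ratio: 1.0 for a straight vessel, larger when it bends."""
    if len(points) < 2:
        raise ValueError("need at least two centerline points")
    chord = math.dist(points[0], points[-1])
    if chord == 0:
        raise ValueError("start and end points coincide; ratio is undefined")
    return arc_length(points) / chord
```

Here `points` stands for an ordered list of (x, y) skeleton coordinates of one resolved vessel, such as a path read off the vessel graph described above.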

2604.18326 2026-06-02 cs.CV 版本更新

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

OmniHuman:面向以人为中心的视频生成的大规模数据集与基准

Lei Zhu, Xing Cai, Yingjie Chen, Yiheng Li, Binxin Yang, Hao Liu, Jie Chen, Chen Li, Jing LYu

发表机构 * Peking University(北京大学) WeChat Lab(微信实验室) Chinese Academy of Sciences(中国科学院)

AI总结 为解决现有数据集在场景多样性、交互建模和属性对齐方面的结构性缺陷,提出OmniHuman大规模多场景数据集及全自动标注流程,并建立OHBench三级评估体系,实现与人类感知高度一致的诊断。

Comments 19 pages, 6 figures

详情
AI中文摘要

近期音频-视频联合生成模型在内容创作方面展现出令人印象深刻的能力。然而,在复杂的真实世界物理场景中生成高保真以人为中心的视频仍然是一个重大挑战。我们指出根本原因在于现有数据集在三个维度上的结构性缺陷:有限的全局场景和相机多样性、稀疏的交互建模(包括人与人以及人与物体),以及不足的个体属性对齐。为弥补这些差距,我们提出了OmniHuman,一个大规模、多场景数据集,专为细粒度人体建模而设计。OmniHuman提供了层次化标注,涵盖视频级场景、帧级交互和个体级属性。为此,我们开发了一个全自动流水线,用于高质量数据收集和多模态标注。作为数据集的补充,我们建立了OmniHuman基准(OHBench),一个三级评估系统,为以人为中心的音频-视频合成提供科学诊断。关键的是,OHBench引入了与人类感知高度一致的指标,通过提供跨全局场景、关系交互和个体属性的全面诊断,填补了现有基准的空白。

英文摘要

Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.

2604.17625 2026-06-02 cs.CV 版本更新

FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation

FlowC2S:从当前帧流向后续帧以实现快速且内存高效的视频延续

Hovhannes Margaryan, Quentin Bammey, Christian Sandor

发表机构 * Team ARAI, Université Paris-Saclay, CNRS, LISN, France(ARAI团队,巴黎萨克雷大学,法国国家科学研究中心,LISN,法国) LTCI, Télécom Paris, Institut Polytechnique de Paris, France(LTCI,巴黎电信学院,巴黎理工学院,法国)

AI总结 提出FlowC2S方法,通过微调预训练文本到视频流模型学习当前与后续视频块之间的向量场,利用固有最优耦合和目标反演实现快速、内存高效的视频延续。

详情
AI中文摘要

本文介绍了一种生成快速且内存高效的视频延续的新方法。我们的方法名为FlowC2S,它微调预训练的文本到视频流模型,以学习当前视频块与后续视频块之间的向量场。两个设计选择是关键。首先,我们引入固有最优耦合,在训练期间利用时间上相邻的视频块作为真实最优耦合的实用代理,从而产生更直的流。其次,我们纳入目标反演,将目标块的反演潜变量注入输入表示中,以加强对应关系并提高视觉保真度。通过直接从当前帧流向后续帧,而不是常见的将当前帧与噪声组合以生成视频延续的方式,我们将模型输入的维度减少了一半。所提出的方法从LTXV和Wan微调而来,在FID和FVD的定量评估中超越了最先进的分数,且仅需五次神经函数评估。

英文摘要

This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.
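The idea of flowing from the current chunk to the succeeding chunk can be sketched with the standard linear flow-matching construction. The abstract does not give FlowC2S's exact parameterization, so the linear interpolant, the Euler sampler, and the names `interpolate`, `target_velocity`, and `euler_integrate` below are assumptions; `velocity_fn` stands in for the fine-tuned network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from the current chunk x0 to the next chunk x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for a linear path: the constant displacement x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, steps=5):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with `steps` Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)  # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With `steps=5` the sampler makes five calls to `velocity_fn`, in line with the five function evaluations quoted in the abstract. The straighter the learned flow, the less error a coarse Euler schedule like this introduces, which is the stated motivation for pairing temporally adjacent chunks.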

2601.14750 2026-06-02 cs.CL cs.CV 版本更新

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Render-of-Thought: 将文本思维链渲染为图像以进行视觉潜在推理

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei

发表机构 * Tencent BAC(腾讯BAC) Shenzhen International Graduate School, Tsinghua University(深圳国际研究生院,清华大学) School of Electronic and Computer Engineering, Peking University(北京大学电子与计算机工程学院) School of Mathematics and Statistics, University of Glasgow(格拉斯哥大学数学与统计学学院)

AI总结 提出Render-of-Thought框架,通过将思维链的文本步骤渲染为图像,利用视觉语言模型的视觉编码器进行语义对齐,实现3-4倍令牌压缩和推理加速,同时保持竞争性能。

Comments Accepted by ACL 2026 Main Conference

详情
AI中文摘要

思维链提示在解锁大型语言模型的推理能力方面取得了显著成功。尽管思维链提示增强了推理能力,但其冗长性带来了巨大的计算开销。最近的工作通常只关注结果对齐,缺乏对中间推理过程的监督。这些缺陷掩盖了潜在推理链的可分析性。为了解决这些挑战,我们引入了Render-of-Thought,这是第一个通过将文本步骤渲染为图像来具体化推理链的框架,使潜在推理过程显式且可追溯。具体来说,我们利用现有视觉语言模型的视觉编码器作为语义锚点,将视觉嵌入与文本空间对齐。这种设计确保了即插即用的实现,而无需额外的预训练开销。在数学和逻辑推理基准上的大量实验表明,与显式思维链相比,我们的方法实现了3-4倍的令牌压缩和显著的推理加速。此外,它与其他方法相比保持了竞争性能,验证了这种范式的可行性。我们的代码可在https://github.com/TencentBAC/RoT获取。

英文摘要

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT

2604.17007 2026-06-02 cs.CV cs.AI 版本更新

MobileAgeNet: Lightweight Facial Age Estimation for Mobile Deployment

MobileAgeNet:面向移动部署的轻量级面部年龄估计

Arun Kumar, Aswathy Baiju, Radu Timofte, Dmitry Ignatov

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,维尔茨堡大学,德国)

AI总结 提出基于MobileNetV3-Large的轻量级年龄回归框架MobileAgeNet,通过两阶段微调和有界年龄回归策略,在UTKFace测试集上达到4.65年MAE,移动端延迟14.4ms,参数量3.23M。

Comments 9 Pages including references, 3 figures

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3810-3818, 2026
AI中文摘要

面部年龄估计的移动部署需要模型在预测准确性、低延迟和小尺寸之间取得平衡。在这项工作中,我们提出了MobileAgeNet,一个轻量级年龄回归框架,在UTKFace保留测试集上实现了4.65年的MAE,同时使用AI Benchmark应用程序测量,平均延迟为14.4毫秒,保持了高效的设备端推理。该模型基于预训练的MobileNetV3-Large骨干网络,结合紧凑的回归头,支持移动设备上的实时预测。训练和评估流程集成到NN LEMUR数据集框架中,支持可重复实验、结构化超参数优化和一致评估。我们采用有界年龄回归以及两阶段微调策略,以提高训练稳定性和泛化能力。实验结果表明,MobileAgeNet以3.23M参数实现了具有竞争力的准确性,并且从PyTorch训练通过ONNX导出到TensorFlow Lite转换的部署流程,在实际设备条件下保持了预测行为,没有可测量的退化。总体而言,这项工作为面向移动的面部年龄估计提供了一个实用、可部署的基线。

英文摘要

Mobile deployment of facial age estimation requires models that balance predictive accuracy with low latency and compact size. In this work, we present MobileAgeNet, a lightweight age-regression framework that achieves an MAE of 4.65 years on the UTKFace held-out test set while maintaining efficient on-device inference with an average latency of 14.4 ms measured using the AI Benchmark application. The model is built on a pretrained MobileNetV3-Large backbone combined with a compact regression head, enabling real-time prediction on mobile devices. The training and evaluation pipeline is integrated into the NN LEMUR Dataset framework, supporting reproducible experimentation, structured hyperparameter optimization, and consistent evaluation. We employ bounded age regression together with a two-stage fine-tuning strategy to improve training stability and generalization. Experimental results show that MobileAgeNet achieves competitive accuracy with 3.23M parameters, and that the deployment pipeline, from PyTorch training through ONNX export to TensorFlow Lite conversion, preserves predictive behavior without measurable degradation under practical on-device conditions. Overall, this work provides a practical, deployment-ready baseline for mobile-oriented facial age estimation.
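One plausible reading of bounded age regression is that the regression head's raw output is squashed into a fixed age interval, so predictions can never fall outside it. The abstract does not specify the formulation, so the sigmoid mapping, the function name `bounded_age`, and the default bounds of 0 and 116 years below are assumptions for illustration only.

```python
import math


def bounded_age(logit, min_age=0.0, max_age=116.0):
    """Map an unbounded head output to an age in [min_age, max_age] via a sigmoid."""
    # Numerically stable sigmoid: avoid exp() of a large positive number.
    if logit >= 0:
        s = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        s = e / (1.0 + e)
    return min_age + (max_age - min_age) * s
```

A bounded output of this kind keeps early-training predictions inside a sensible range, which fits the abstract's stated goal of improving training stability.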

2601.02997 2026-06-02 cs.LG cs.CV 版本更新

From Memorization to Creativity: LLM as a Designer of Novel Neural Architectures

从记忆到创造:LLM作为新型神经架构的设计者

Waleed Khalid, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,维尔茨堡大学,德国)

AI总结 本文提出NNGPT框架,通过闭环架构合成流水线,利用代码型LLM的监督微调循环,结合MinHash-Jaccard新颖性过滤和低保真性能信号,迭代提升生成架构的有效性、性能和多样性,实现从记忆到创造的转变。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3252-3261, 2026
AI中文摘要

大型语言模型(LLM)在程序合成方面表现出色,但其在神经架构设计中的能力——平衡语法可靠性、性能和结构新颖性——仍未得到充分探索。我们提出了NNGPT框架内的闭环架构合成流水线,其中代码型LLM经过22次监督微调循环的演化。在每个循环中,LLM合成PyTorch卷积网络,通过低保真性能信号验证,并通过MinHash-Jaccard标准过滤以防止结构冗余,然后纳入LEMUR数据集。具有新颖架构的高性能候选被转换为提示-代码对,用于参数高效的LoRA微调。这种反馈循环驱动了可测量的分布偏移,逐步内化经验架构先验,使得有效且高性能的输出从稀缺变为主导。在CIFAR-10上,有效生成率稳定在50.6%(峰值74.5%),平均第一轮准确率从28.1%上升到51.0%,超过40%准确率的候选从2.0%增长到96.8%。跨数据集迁移到CIFAR-100和SVHN证实了改进的有效性、偏移的准确率分布和持续的新颖性在不同难度和视觉领域的基准测试中泛化。在22个循环中,有455个原始语料库中不存在的新颖架构被新颖性过滤器接受。通过将合成基于执行反馈和新颖性过滤,我们证明了迭代自监督微调将LLM重塑为任务特化的架构先验——提高了生成可靠性、代理性能和结构多样性——为手工设计的搜索空间提供了一种可复现、无需标注的替代方案。

英文摘要

Large language models (LLMs) excel in program synthesis, yet their capacity for neural architecture design -- balancing syntactic reliability, performance, and structural novelty -- remains underexplored. We present a closed-loop architecture synthesis pipeline within the NNGPT framework, in which a code-oriented LLM evolves over 22 supervised fine-tuning cycles. At each cycle, the LLM synthesizes PyTorch convolutional networks, validated via low-fidelity performance signals and filtered via a MinHash--Jaccard criterion to prevent structural redundancy before being incorporated into the LEMUR dataset. High-performing candidates with novel architectures are converted into prompt--code pairs for parameter-efficient LoRA fine-tuning. This feedback loop drives a measurable distributional shift, progressively internalizing empirical architectural priors such that valid and high-performing outputs evolve from scarce to dominant across cycles. On CIFAR-10, the valid generation rate stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.1% to 51.0%, and candidates exceeding 40% accuracy grow from 2.0% to 96.8%. Cross-dataset transfer to CIFAR-100 and SVHN confirms that improved validity, shifted accuracy distributions, and sustained novelty generalize across benchmarks of varying difficulty and visual domain. Across 22 cycles, 455 unique architectures absent from the original corpus are admitted under the novelty filter. By grounding synthesis in execution feedback and novelty filtering, we demonstrate that iterative self-supervised fine-tuning reshapes an LLM into a task-specialized architectural prior -- improving generation reliability, proxy performance, and structural diversity -- offering a reproducible, annotation-free alternative to hand-crafted search spaces.
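The MinHash-Jaccard novelty filter can be illustrated with a small standard-library sketch: each generated program is turned into a set of token shingles, a MinHash signature approximates the Jaccard similarity between sets, and a candidate is admitted only if it is not too similar to anything already in the corpus. The shingle size, number of hash functions, threshold of 0.8, and all function names here are assumptions; the abstract does not give the settings used in NNGPT.

```python
import hashlib


def shingles(code, k=3):
    """Set of overlapping k-token windows from whitespace-tokenized code."""
    tokens = code.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function; items must be a non-empty set of str."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_novel(candidate, corpus, threshold=0.8):
    """True if the candidate's estimated similarity to every corpus entry is below threshold."""
    cand_sig = minhash_signature(shingles(candidate))
    return all(
        estimated_jaccard(cand_sig, minhash_signature(shingles(existing))) < threshold
        for existing in corpus
    )
```

In practice the corpus signatures would be computed once and cached rather than rebuilt per candidate; the sketch recomputes them only to stay short.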

2512.24120 2026-06-02 cs.CV cs.AI 版本更新

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

增强基于LLM的神经网络生成:面向自动化架构设计的少样本提示与高效验证

Raghuvir Duvvuri, Chandini Vysyaraju, Avi Goyal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,维尔茨堡大学,德国)

AI总结 本文提出少样本架构提示(FSAP)和空白归一化哈希验证方法,以提升基于LLM的计算机视觉架构自动生成效率,并通过大规模实验验证其有效性。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3242-3251, 2026
AI中文摘要

自动化神经网络架构设计仍然是计算机视觉中的一个重大挑战。任务多样性和计算约束要求既有效又高效的架构与搜索方法。大型语言模型(LLMs)为计算密集型的神经架构搜索(NAS)提供了一种有前景的替代方案,但它们在计算机视觉架构生成中的应用尚未被系统研究,特别是在提示工程和验证策略方面。基于任务无关的NNGPT/LEMUR框架,本文引入并验证了两项针对计算机视觉的关键贡献。首先,我们提出了少样本架构提示(FSAP),这是首个针对基于LLM的架构生成中支持示例数量(n = 1, 2, 3, 4, 5, 6)的系统研究。我们发现使用n = 3个示例能在视觉任务的架构多样性和上下文聚焦之间取得最佳平衡。其次,我们引入了空白归一化哈希验证,一种轻量级去重方法(耗时小于1毫秒),相比AST解析实现了100倍加速,并防止了重复计算机视觉架构的冗余训练。在七个计算机视觉基准(MNIST、CIFAR-10、CIFAR-100、CelebA、ImageNette、SVHN、Places365)的大规模实验中,我们生成了1,900个独特架构。我们还引入了一种数据集平衡的评估方法,以应对跨异构视觉任务比较架构的挑战。这些贡献为计算机视觉中基于LLM的架构搜索提供了可操作的指导,并建立了严格的评估实践,使计算资源有限的研究人员也能更便捷地进行自动化设计。

英文摘要

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
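Whitespace-normalized hash validation can be sketched in a few lines: collapse every run of whitespace to a single space, hash the result, and skip any program whose digest has already been seen. The regex-based normalization, the use of SHA-256, and the function names `normalized_hash` and `deduplicate` are assumptions for illustration; the abstract does not specify these details.

```python
import hashlib
import re


def normalized_hash(source):
    """SHA-256 digest of the source after collapsing whitespace runs to one space."""
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(sources):
    """Keep the first occurrence of each whitespace-normalized program."""
    seen = set()
    unique = []
    for src in sources:
        digest = normalized_hash(src)
        if digest not in seen:
            seen.add(digest)
            unique.append(src)
    return unique
```

The trade-off relative to AST parsing is that this check only catches duplicates that differ in spacing, indentation, or line breaks. Programs that differ in variable names or comments hash differently, and because indentation is meaningful in Python, collapsing it could in rare cases merge two programs that are not equivalent.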

2603.18373 2026-06-02 cs.CV cs.AI 版本更新

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

看见还是取悦:揭示视觉语言模型中的视觉谄媚与分裂信念

Rui Hong, Shuxue Quan

发表机构 * George Mason University(乔治·玛斯纳大学) Independent Researcher(独立研究者)

AI总结 提出三层诊断框架,通过反事实干预实验发现视觉语言模型中普遍存在视觉谄媚(内部证据保留但输出幻觉答案)现象,并证明扩展模型规模无法解决该问题。

Comments 14 pages, 1 figure

详情
AI中文摘要

当视觉语言模型正确回答时,它们是否真正依赖视觉信息?我们引入了一个三层诊断框架,包含三个每样本指标:潜在异常检测、视觉必要性分数和竞争分数,用于解耦感知、依赖和对齐失败。在9个视觉语言模型和9000个模型-样本对中,通过反事实盲、噪声和冲突干预,72.9%的样本表现出视觉谄媚,这是一种分裂信念模式,即内部证据被保留但解码出幻觉答案,而零样本表现出稳健拒绝,表明当前的对齐训练已消除拒绝作为解码结果。在Qwen-VL系列中,无论是代内还是代间扩展,都单调减少了语言捷径,但加剧了视觉谄媚,表明仅靠规模和更新的后训练无法解决接地问题。诊断分数进一步实现了一种无需训练的选择性预测策略,在50%覆盖率下准确率提升高达9.5个百分点。

英文摘要

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-sample metrics: Latent Anomaly Detection, Visual Necessity Score, and Competition Score, which disentangle perception, dependency, and alignment failures. Across 9 VLMs and 9,000 model-sample pairs under counterfactual blind, noise, and conflict interventions, 72.9% of samples exhibit Visual Sycophancy, a Split Beliefs pattern in which internal evidence is preserved yet a hallucinated answer is decoded, while zero samples show Robust Refusal, indicating that current alignment training has eliminated refusal as a decoding outcome. Scaling within the Qwen-VL family, both within- and across-generation, monotonically reduces Language Shortcuts but amplifies Visual Sycophancy, showing that scale and newer post-training alone cannot resolve the grounding problem. Diagnostic scores further enable a training-free selective-prediction strategy yielding up to +9.5 percentage points accuracy at 50% coverage.
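Selective prediction at a fixed coverage means answering only the fraction of samples the system is most confident about and measuring accuracy on that subset. The sketch below shows the evaluation step only; `confidences` is a generic placeholder for whatever per-sample score is used, since the abstract does not say how the three diagnostic metrics are combined, and the function name `accuracy_at_coverage` is hypothetical.

```python
def accuracy_at_coverage(confidences, correct, coverage=0.5):
    """Accuracy over the top `coverage` fraction of samples ranked by confidence."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    n_keep = max(1, int(coverage * len(confidences)))
    # Rank sample indices from most to least confident and keep the top n_keep.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    kept = ranked[:n_keep]
    return sum(1 for i in kept if correct[i]) / n_keep
```

Comparing `accuracy_at_coverage(..., coverage=0.5)` against the full-coverage value (`coverage=1.0`) gives the kind of accuracy gain at 50% coverage that the abstract reports.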

2604.11283 2026-06-02 cs.CV 版本更新

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

多模态大语言模型驱动的视频翻译:面向角色的综述

Bingzheng Qu, Kehai Chen, Xuefeng Bai, Min Zhang

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院)

AI总结 本文通过面向角色的分类法,系统综述了多模态大语言模型在视频翻译中的应用,将其分为语义推理器、表达执行器和视觉合成器三个功能角色,并总结了数据集、基准和评估指标,指出了端到端视频翻译的挑战与未来方向。

详情
AI中文摘要

多模态大语言模型(MLLMs)的最新进展正在将视频翻译从自动语音识别、机器翻译、文本到语音和唇形同步的级联管道重塑为统一的多模态推理和生成问题。高质量的视频翻译不仅需要语义保真度,还需要跨视觉、听觉和语言流的时间对齐、说话者一致性和情感表现力。本综述通过面向角色的分类法,对MLLM驱动的视频翻译进行了重点回顾。我们将MLLM驱动和MLLM相关的研究组织为三个功能角色:语义推理器,将翻译基于视频理解、时间推理和多模态融合;表达执行器,支持可控和上下文感知的语音生成;视觉合成器,实现唇形同步和视觉连贯的说话者渲染。我们进一步总结了每个角色的代表性数据集、基准和指标,并讨论了当前评估协议如何未能满足端到端视频翻译的要求。最后,我们指出了长视频理解、时间建模、多模态对齐、多语言鲁棒性和负责任部署方面的开放挑战,为自然和可信的跨语言视频通信勾勒了未来方向。

英文摘要

Recent progress in multimodal large language models (MLLMs) is reshaping video translation from a cascaded pipeline of automatic speech recognition, machine translation, text-to-speech, and lip synchronization into a unified multimodal reasoning and generation problem. High-quality video translation requires not only semantic fidelity, but also temporal alignment, speaker consistency, and emotional expressiveness across visual, acoustic, and linguistic streams. This survey provides a focused review of MLLM-enabled video translation through a role-oriented taxonomy. We organize MLLM-enabled and MLLM-relevant studies into three functional roles: Semantic Reasoner, which grounds translation in video understanding, temporal reasoning, and multimodal fusion; Expressive Performer, which supports controllable and context-aware speech generation; and Visual Synthesizer, which enables lip synchronization and visually coherent speaker rendering. We further summarize representative datasets, benchmarks, and metrics for each role, and discuss how current evaluation protocols fall short of end-to-end video translation requirements. Finally, we identify open challenges in long-form video understanding, temporal modeling, multimodal alignment, multilingual robustness, and responsible deployment, outlining future directions for natural and trustworthy cross-lingual video communication.

2604.09877 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Genie 4D:语义先验引导的4D动态场景重建

Yiru Yang, Zhuojie Wu, Nishant Kumar Singh, Max Schulthess

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出Genie 4D框架,结合实时视觉惯性高斯泼溅前端和前馈4D骨干网络,利用冻结的DINOv3特征作为结构先验抑制身份漂移,并通过条件扩散精炼器恢复高频细节,最终通过轻量级潜在动作头实现用户可控的4D世界模型重建。

详情
AI中文摘要

在计算机视觉与机器人感知的交汇处,动态场景的4D重建将低层几何感知与高层语义理解联系起来。我们提出Genie 4D,一个将手持手机拍摄转化为语义化、动作可控的4D世界模型的框架。Genie 4D将用于度量几何的实时视觉惯性高斯泼溅前端与由冻结的DINOv3特征(作为结构先验)正则化的前馈4D骨干网络相结合。语义先验抑制了动态跟踪中的身份漂移,而短条件扩散精炼器恢复了回归骨干网络平滑掉的高频表面细节。最后,一个轻量级潜在动作头将重建的4D状态暴露给以JEPA风格下一嵌入目标训练的Genie式世界模型,使得场景可以在用户动作下向前推进。在Point Odyssey和TUM-Dynamics基准测试上,Genie 4D保留了前馈基线的线性时间复杂度O(T),同时提高了3D跟踪精度(APD)和重建完整性,并且可以在单个消费级GPU(RTX 5090)上通过iPhone、Mac、Windows和Linux采集客户端交互式运行。Genie 4D为走向物理基础的世界模型提供了一条实用的、语义先验引导的路径。

英文摘要

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.

2604.02941 2026-06-02 cs.CV 版本更新

MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion

MMTalker: 多分辨率3D说话头合成与多模态特征融合

Bin Liu, Zhixiang Xiong, Zhifen He, Bo Li

发表机构 * IEEE Publication Technology Group(IEEE出版技术组) Piscataway, NJ(新泽西州皮萨卡威)

AI总结 提出一种基于多分辨率表示和多模态特征融合的3D语音驱动面部动画合成方法MMTalker,通过网格参数化、非均匀可微采样、残差图卷积网络和双交叉注意力机制,实现高唇同步精度和逼真面部表情。

Comments This article presents only the preliminary research results, which are not yet complete and lack necessary supplementary experiments. The author has decided to withdraw it to improve the research work, and will submit a more complete version in the future

详情
AI中文摘要

语音驱动的三维(3D)面部动画合成旨在建立从一维(1D)语音信号到时变3D面部运动信号的映射。当前方法在保持唇同步精度和生成逼真面部表情方面仍面临挑战,主要由于这种跨模态映射的高度病态性。本文通过多分辨率表示和多模态特征融合,提出一种新颖的3D音频驱动面部动画合成方法MMTalker,能够准确重建3D面部运动的丰富细节。我们首先通过网格参数化和非均匀可微采样实现带有细节的3D面部连续表示。网格参数化技术建立了UV平面与3D面部网格之间的对应关系,并用于为连续学习提供真值。可微非均匀采样通过在每个三角面中设置可学习的采样概率,实现精确的面部细节获取。接着,我们采用残差图卷积网络和双交叉注意力机制,从多个输入模态中提取判别性面部运动特征。所提出的多模态融合策略充分利用了语音的分层特征和面部网格的显式时空几何特征。最后,一个轻量级回归网络通过联合处理规范UV空间中的采样点和编码的面部运动特征,预测合成说话头的逐顶点几何位移。综合实验表明,与现有最先进方法相比,该方法在唇部和眼部运动的同步精度上取得了显著提升。

英文摘要

Speech-driven three-dimensional (3D) facial animation synthesis aims to build a mapping from one-dimensional (1D) speech signals to time-varying 3D facial motion signals. Current methods still face challenges in maintaining lip-sync accuracy and producing realistic facial expressions, primarily due to the highly ill-posed nature of this cross-modal mapping. In this paper, we introduce MMTalker, a novel 3D audio-driven facial animation synthesis method based on multi-resolution representation and multi-modal feature fusion, which can accurately reconstruct the rich details of 3D facial motion. We first achieve a continuous, detailed representation of the 3D face through mesh parameterization and non-uniform differentiable sampling. The mesh parameterization technique establishes the correspondence between the UV plane and the 3D facial mesh and is used to provide ground truth for the continuous learning. Differentiable non-uniform sampling enables precise facial detail acquisition by setting a learnable sampling probability in each triangular face. Next, we employ a residual graph convolutional network and a dual cross-attention mechanism to extract discriminative facial motion features from multiple input modalities. The proposed multimodal fusion strategy makes full use of the hierarchical features of speech and the explicit spatiotemporal geometric features of the facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the sampled points in the canonical UV space and the encoded facial motion features. Comprehensive experiments demonstrate significant improvements over state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements.

2510.22276 2026-06-02 cs.CV cs.CL 版本更新

WAON: A Large-Scale Japanese Image-Text Dataset for Cultural Adaptation in Contrastive Vision-Language Models

WAON:用于对比视觉语言模型文化适应的大规模日语图像-文本数据集

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, Naoaki Okazaki

发表机构 * Kyoto University(京都大学) NII LLMC(国立情报学研究所大规模语言模型研究开发中心) NII(国立情报学研究所) Waseda University(早稻田大学) Institute of Science Tokyo(东京科学大学)

AI总结 提出WAON,一个从Common Crawl构建的包含约1.55亿样本的最大公开原生日语图像-文本数据集,并通过微调实验证明其在日语文化基准上优于翻译数据。

Comments 13 pages, 7 figures

详情
AI中文摘要

对比视觉语言模型通过大规模预训练取得了显著进展。最近的研究表明,去除仅英文的标题过滤器并在全球数据上进行预训练对于提升多元文化表现是有效的。我们研究了这种全球预训练是否足以实现特定文化的理解,或者进一步使用原生数据进行适应能否超越仅全球预训练所达到的性能。为了进行这项研究,我们提出了WAON,这是从Common Crawl中的原生日语网络内容构建的最大公开可用的原生日语图像-文本数据集,包含约1.55亿个样本。我们还引入了WAON-Bench,一个手动策划的涵盖374个类别的日本文化基准。通过在多个日语图像-文本数据集上的比较微调实验,我们观察到在WAON上微调的模型在日本文化基准上始终比在英语到日语翻译数据上微调的模型表现更强。我们发布了数据集和代码。

英文摘要

Contrastive vision-language models have achieved remarkable progress through large-scale pretraining. Recent work has shown that removing English-only caption filters and pretraining on global data is effective for improving multicultural performance. We study whether such global pretraining is sufficient for culture-specific understanding, or whether further adaptation with natively sourced data can boost performance beyond what global pretraining alone achieves. To enable this investigation, we present WAON, the largest publicly available native Japanese image-text dataset constructed from native Japanese web content in Common Crawl, containing approximately 155 million examples. We also introduce WAON-Bench, a manually curated Japanese cultural benchmark spanning 374 classes. Through comparative fine-tuning experiments on multiple Japanese image-text datasets, we observe that models fine-tuned on WAON consistently achieve stronger performance on Japanese cultural benchmarks than those fine-tuned on English-to-Japanese translated data. We release our dataset and code.

2510.01009 2026-06-02 cs.CV cs.MM 版本更新

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

POVQA: 面向数据效率的带推理依据的偏好优化视频问答

Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

发表机构 * University of Southern Mississippi(南密西西比大学)

AI总结 提出POVQA方法,通过时间池化压缩视频帧、监督微调加偏好优化,在长视频问答中实现数据高效推理。

Comments Accepted in MAR at CVPR Workshop (Proceedings Track)

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 11533-11542
AI中文摘要

长视频多模态问答需要对视觉证据和对话进行结构化推理,但大型视觉语言模型(LVLMs)受限于上下文窗口和计算限制。我们提出POVQA,将每秒压缩为时间池化图像(1 fps池化图像),以在固定token预算下保持密集的时间覆盖。然后,我们在推理+答案目标上对Qwen2.5-VL-7B进行监督微调(SFT),并可选地应用直接偏好优化(DPO)进行偏好对齐。我们引入ReasonVQA作为初步诊断数据集,包含12部电影和239个人工标注的QA+推理三元组,用于在压缩下对长上下文多模态推理进行受控分析。在ReasonVQA上,SFT将最佳纯池化基线从0.212 F1提升至0.550 F1,表明池化证据加推理监督在此设置中提供了主要性能提升。在零样本迁移中,POVQA在SFT+DPO后在TVQA上也达到64.7%。这些结果是初步的:ReasonVQA规模小,池化可能丢失细粒度时间顺序,且DPO效果在不同设置中并非一致正面。代码、数据集和额外定性评估见 https://povqa.github.io。

英文摘要

Long-video multimodal question answering requires structured reasoning over visual evidence and dialogue, but Large Vision-Language Models (LVLMs) are constrained by context-window and compute limits. We propose POVQA, which compresses each second into a temporally pooled image (1 fps pooled images) to maintain dense temporal coverage under a fixed token budget. We then train Qwen2.5-VL-7B with supervised fine-tuning (SFT) on rationale+answer targets, and optionally apply Direct Preference Optimization (DPO) for preference alignment. We introduce ReasonVQA as a pilot diagnostic dataset with 12 movies and 239 human-annotated QA+rationale triplets for controlled analysis of long-context multimodal reasoning under compression. On ReasonVQA, SFT improves the best pooled-only baseline from 0.212 to 0.550 F1, showing that pooled evidence plus rationale supervision provides the main performance gains in this setting. In zero-shot transfer, POVQA also reaches 64.7% on TVQA after SFT+DPO. These results are preliminary: ReasonVQA is small, pooling can lose fine-grained temporal order, and DPO effects are not uniformly positive across settings. Code, dataset, and additional qualitative evaluations are available at https://povqa.github.io.
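
As a rough illustration of the 1 fps temporal pooling described in the abstract, the sketch below groups frames by second and averages each group into one pooled frame. The abstract does not say which pooling operator POVQA uses; the per-pixel mean, the function name `pool_per_second`, and the flat-list frame representation are all assumptions made for illustration only.

```python
from typing import List

# A frame flattened to a list of pixel intensities (an assumption for this sketch).
Frame = List[float]


def pool_per_second(frames: List[Frame], fps: int) -> List[Frame]:
    """Average every `fps` consecutive frames into a single pooled frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    pooled: List[Frame] = []
    for start in range(0, len(frames), fps):
        chunk = frames[start:start + fps]  # the last chunk may be shorter
        # Per-pixel mean across the frames that fall into this second.
        pooled.append([sum(px) / len(chunk) for px in zip(*chunk)])
    return pooled
```

With this kind of pooling the number of images passed to the model grows with video duration in seconds rather than with the raw frame count, which is how a fixed token budget can still cover the whole video.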

2603.28759 2026-06-02 cs.CV 版本更新

FlowIt: Global Matching via Hierarchical Transformers and Optimal Transport for Optical Flow

FlowIt: 通过分层Transformer和最优传输实现全局匹配的光流估计

Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney

发表机构 * Department of Computer Engineering and KUIS AI Center, Koç University, Istanbul, Turkey(计算机工程系和KUIS人工智能中心,科克大学,伊斯坦布尔,土耳其) Department of Computer Science and Engineering (DISI), University of Bologna, Italy(计算机科学与工程系(DISI),博洛尼亚大学,意大利)

AI总结 提出FlowIt架构,结合分层Transformer和最优传输进行全局匹配,并通过置信度与遮挡引导的细化步骤,在多个基准上达到最先进性能。

Comments Project Page: https://kuis-ai.github.io/FlowIt/

详情
AI中文摘要

我们提出FlowIt,一种新颖的光流估计架构,结合了全局匹配与置信度和遮挡引导的细化。其核心是利用分层Transformer架构捕获广泛的全局上下文,使模型能够有效建模长距离对应关系。为了克服局部匹配的局限性,我们将流初始化表述为一个最优传输问题。这种表述产生了一个高度鲁棒的初始流场,以及显式推导的遮挡和置信度图。然后,这些线索无缝集成到引导细化阶段,网络将可靠的运动估计从高置信度区域主动传播到模糊的低置信度区域。在Sintel、KITTI、Spring和LayeredFlow数据集上的大量实验验证了我们方法的有效性。FlowIt在具有挑战性的Sintel基准上取得了最先进的结果,并在Sintel、Spring和LayeredFlow上建立了新的跨数据集零样本泛化性能的最先进水平,同时在KITTI基准和KITTI零样本泛化设置上也提供了有竞争力的性能。

英文摘要

We present FlowIt, a novel architecture for optical flow estimation that combines global matching with confidence and occlusion-guided refinement. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the effectiveness of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel benchmark and establishes new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow, while also delivering competitive performance on both the KITTI benchmark and KITTI zero-shot generalization settings.
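
To make the optimal-transport formulation of flow initialization more concrete, here is a minimal sketch of entropic optimal transport solved with Sinkhorn iterations on a small cost matrix. The abstract does not name a solver or a cost; the Sinkhorn algorithm, the uniform marginals, and the parameters `eps` and `iters` are assumptions, not FlowIt's actual implementation.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], eps: float = 0.1,
             iters: int = 200) -> List[List[float]]:
    """Entropic optimal transport with uniform marginals.

    Returns a transport plan P where P[i][j] is the mass moved from
    source element i to target element j.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel; very large cost/eps ratios can underflow to zero.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m  # uniform source and target marginals
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

In a matching setting, the cost would typically come from feature dissimilarity between pixels of the two frames, and the largest entry in each row of the plan indicates the most likely correspondence.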

2603.27645 2026-06-02 cs.CV 版本更新

OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery

OpenDPR:面向遥感影像的基于视觉中心扩散引导原型检索的开放词汇变化检测

Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong

发表机构 * Wuhan University(武汉大学) Beijing Institute of Technology(北京理工大学)

AI总结 提出OpenDPR框架,通过扩散模型构建原型并检索视觉相似性,解决开放词汇变化检测中类别识别瓶颈,并设计S2C模块增强变化定位能力。

Comments Accepted by CVPR 2026

详情
AI中文摘要

开放词汇变化检测(OVCD)旨在通过泛化到预定义类别集合之外,识别任意感兴趣的变化。我们将OVCD重新表述为两阶段流程:首先使用视觉基础模型(如SAM和DINOv2)生成类别无关的变化提议,然后使用视觉语言模型(如CLIP)进行类别识别。我们发现类别识别错误是OVCD的主要瓶颈,这主要是由于基于图像-文本匹配的VLM在表示细粒度土地覆盖类别方面的能力有限。为了解决这个问题,我们提出了OpenDPR,一个无需训练的、以视觉为中心的扩散引导原型检索框架。OpenDPR利用扩散模型离线为目标类别构建多样化的原型,并在推理时与视觉空间中的变化提议进行相似性检索。次要瓶颈在于变化定位,这是由于VFM固有缺乏变化先验。为弥补这一差距,我们设计了一个名为S2C的空间到变化弱监督变化检测模块,以适应其强大的空间建模能力进行变化定位。将预训练的S2C集成到OpenDPR中,得到一个可选的弱监督变体OpenDPR-W,它通过最小监督进一步改进了OVCD。在四个基准数据集上的实验结果表明,所提出的方法在两种监督模式下均达到了最先进的性能。代码可在https://github.com/guoqi2002/OpenDPR获取。

英文摘要

Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available at https://github.com/guoqi2002/OpenDPR.

2603.27223 2026-06-02 cs.CV cs.AI 版本更新

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

EuraGovExam:来自现实世界公务员考试的多语言多模态基准

Jaeseong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee

发表机构 * School of Computer Science / Data Intelligence Lab(计算机科学学院/数据智能实验室)

AI总结 提出一个包含8000多道真实公务员考试题目的多语言多模态基准EuraGovExam,要求模型直接从图像中进行布局感知的跨语言推理,当前最先进的视觉语言模型准确率仅达86%。

详情
AI中文摘要

我们提出了EuraGovExam,一个多语言和多模态基准,来源于五个代表性欧亚地区(韩国、日本、台湾、印度和欧盟)的现实世界公务员考试。该数据集旨在反映公共部门评估的真实复杂性,包含超过8000道高分辨率扫描选择题,涵盖17个不同的学术和行政领域。与现有基准不同,EuraGovExam将所有题目内容(包括问题陈述、答案选项和视觉元素)嵌入到单个图像中,仅提供最小化的标准答案格式指令。这种设计要求模型直接从视觉输入进行布局感知的跨语言推理。所有题目均来自真实考试文档,保留了丰富的视觉结构,如表格、多语言排版和类似表单的布局。评估结果显示,即使是最先进的视觉语言模型(VLM)也仅达到86%的准确率,突显了该基准的难度及其诊断当前模型局限性的能力。通过强调文化真实性、视觉复杂性和语言多样性,EuraGovExam为在高风险、多语言、图像基础环境中评估VLM建立了新标准。它还支持电子政务、公共部门文档分析和公平考试准备等实际应用。

英文摘要

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content (including problem statements, answer choices, and visual elements) within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.

2603.26779 2026-06-02 cs.CV cs.AI 版本更新

Limits of Spatial Imagery Reasoning in Frontier LLM Models

前沿大语言模型在空间意象推理中的局限性

Sergio Y. Hayashi, Nina S. T. Hirata

发表机构 * Institute of Mathematics and Statistics – University of São Paulo(数学统计研究所 – 圣保罗大学)

AI总结 本研究通过引入外部“意象模块”辅助3D模型旋转任务,发现即使外包整体3D状态维护,前沿模型仍缺乏基础视觉空间原语,导致准确率最高仅62.5%。

Comments 25 pages. v2: Title updated; added a section on object/spatial imagery and propositional reasoning; added new experimental results for the single-object rotation probe

详情
AI中文摘要

大型语言模型(LLMs)展示了令人印象深刻的推理能力,但在需要心理模拟的空间任务(如心理旋转)中表现不佳。本文研究为LLM配备一个外部“意象模块”(一种能够渲染和旋转3D模型的工具)能否弥合这一差距,充当“认知假体”。我们使用双模块架构进行了实验,其中推理模块(MLLM)与意象模块在3D模型旋转任务上进行交互。性能低于预期,准确率最高仅为62.5%。进一步研究表明,即使将维护和操作整体3D状态的负担外包,系统仍然失败。这揭示了当前前沿模型缺乏与意象交互所需的基础视觉空间原语。具体来说,它们缺乏:(1)提取空间信号的低级敏感性,例如(a)深度,(b)运动,以及(c)短时程动态预测;以及(2)对图像进行沉思式推理的能力,即动态转移视觉焦点,并平衡意象与符号和关联信息。

英文摘要

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" (a tool capable of rendering and rotating 3D models) can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.

2603.26028 2026-06-02 cs.CV 版本更新

Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA

学习修剪:基于动态解剖特征库的端到端因果图剪枝用于医学视觉问答

Zibo Xu, Qiang Li, Weizhi Nie, Yuting Su

发表机构 * School of Microelectronics, Tianjin University(天津大学微电子学院) School of Electrical and Information Engineering, Tianjin University(天津大学电气与信息工程学院)

AI总结 提出可学习因果修剪(LCT)框架,通过动态解剖特征库(DAFB)和可微修剪模块,在端到端优化中抑制虚假相关,增强因果信号,提升医学VQA的鲁棒性和泛化性。

详情
AI中文摘要

医学视觉问答(MedVQA)模型通常由于依赖数据集特定的相关性(如重复的解剖模式或问题类型规律)而非真正的诊断证据,表现出有限的泛化能力。现有的因果方法通常实现为静态调整或事后校正。为了解决这个问题,我们提出了一个可学习因果修剪(LCT)框架,将因果修剪集成到端到端优化中。我们引入了一个动态解剖特征库(DAFB),通过动量机制更新,以捕获频繁解剖和语言模式的全局原型,作为数据集级别规律性的近似。我们进一步设计了一个可微修剪模块,估计实例级表示与全局特征库之间的依赖关系。与全局原型高度相关的特征被软抑制,而实例特定证据被强调。这种可学习机制鼓励模型自适应地优先考虑因果信号而非虚假相关。在VQA-RAD、SLAKE、SLAKE-CP和PathVQA上的实验表明,LCT在现有去偏策略上持续提高了鲁棒性和泛化性。

英文摘要

Medical Visual Question Answering (MedVQA) models often exhibit limited generalization due to reliance on dataset-specific correlations, such as recurring anatomical patterns or question-type regularities, rather than genuine diagnostic evidence. Existing causal approaches are typically implemented as static adjustments or post-hoc corrections. To address this issue, we propose a Learnable Causal Trimming (LCT) framework that integrates causal pruning into end-to-end optimization. We introduce a Dynamic Anatomical Feature Bank (DAFB), updated via a momentum mechanism, to capture global prototypes of frequent anatomical and linguistic patterns, serving as an approximation of dataset-level regularities. We further design a differentiable trimming module that estimates the dependency between instance-level representations and the global feature bank. Features highly correlated with global prototypes are softly suppressed, while instance-specific evidence is emphasized. This learnable mechanism encourages the model to prioritize causal signals over spurious correlations adaptively. Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate that LCT consistently improves robustness and generalization over existing debiasing strategies.
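
The two mechanisms named in the abstract, a momentum-updated feature bank and soft suppression of features that correlate with global prototypes, can be sketched as follows. This is a hand-written stand-in under stated assumptions: the exponential-moving-average update, the cosine-similarity gate, and the names `ema_update` and `soft_suppress` are hypothetical, whereas the paper's trimming module is learned end to end.

```python
import math
from typing import List

Vector = List[float]


def ema_update(prototype: Vector, feature: Vector,
               momentum: float = 0.9) -> Vector:
    """Move a global prototype slightly toward a new instance feature."""
    return [momentum * p + (1.0 - momentum) * f
            for p, f in zip(prototype, feature)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def soft_suppress(feature: Vector, prototype: Vector) -> Vector:
    """Scale a feature down in proportion to its similarity to the prototype."""
    # Gate of 1.0 keeps the feature; gate of 0.0 removes it entirely.
    gate = 1.0 - max(0.0, cosine(feature, prototype))
    return [gate * x for x in feature]
```

The intent of the gate is that a feature pointing in the same direction as a frequent dataset-level pattern is damped, while a feature orthogonal to it passes through unchanged.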

2603.20176 2026-06-02 cs.CV 版本更新

LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

LagerNVS:用于全神经实时新视角合成的潜在几何

Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi

发表机构 * Visual Geometry Group, University of Oxford(牛津大学视觉几何组) Meta AI

AI总结 提出LagerNVS,一种基于3D感知潜在特征的编码器-解码器神经网络,通过显式3D监督预训练初始化编码器,结合轻量解码器和光度损失端到端训练,实现实时、泛化的新视角合成,在Re10k上达到31.4 PSNR。

Comments IEEE CVF Conference on Computer Vision and Pattern Recognition 2026. Project page with code, models and examples: szymanowiczs.github.io/lagernvs

详情
AI中文摘要

最近的研究表明,神经网络可以在没有显式3D重建的情况下执行新视角合成(NVS)等3D任务。尽管如此,我们认为强3D归纳偏置在这样网络的设计中仍然有帮助。我们通过引入LagerNVS来证明这一点,这是一种用于NVS的编码器-解码器神经网络,建立在“3D感知”潜在特征之上。编码器从使用显式3D监督预训练的3D重建网络初始化。这与轻量解码器配对,并使用光度损失进行端到端训练。LagerNVS在已知和未知相机情况下均实现了最先进的确定性前馈新视角合成(包括在Re10k上达到31.4 PSNR),实时渲染,泛化到野外数据,并且可以与扩散解码器配对用于生成性外推。

英文摘要

Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We show this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on '3D-aware' latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.

2603.23902 2026-06-02 cs.CV cs.AI 版本更新

Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

知识精炼的双上下文感知网络用于部分相关视频检索

Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu, Shanmin Pang

发表机构 * School of Software Engineering, Xi’an Jiaotong University(西安交通大学软件工程学院) Faculty of Computer Science, Electrical Engineering and Information Technology, Universität Stuttgart(斯图加特大学计算机科学、电子工程和信息学院)

AI总结 针对未修剪视频中部分相关片段检索的信息密度不匹配和注意力机制不足问题,提出KDC-Net网络,通过层次语义聚合、动态时间注意力和基于CLIP的蒸馏策略,显著提升检索性能。

Comments Accepted in ICME 2026

详情
AI中文摘要

从未修剪视频中检索部分相关片段仍然面临两个持续挑战:文本与视频片段之间的信息密度不匹配,以及有限的注意力机制忽略了语义焦点和事件相关性。我们提出了KDC-Net,一个知识精炼的双上下文感知网络,从文本和视觉两个角度解决这些问题。在文本方面,层次语义聚合模块捕获并自适应融合多尺度短语线索以丰富查询语义。在视频方面,动态时间注意力机制采用相对位置编码和自适应时间窗口来突出具有局部时间连贯性的关键事件。此外,一种基于CLIP的动态蒸馏策略,结合时间连续性感知精炼,确保了片段感知和目标对齐的知识迁移。在PRVR基准上的实验表明,KDC-Net始终优于最先进的方法,特别是在低片段-视频比率下。

英文摘要

Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.

2603.23647 2026-06-02 cs.CV cs.AI cs.LG 版本更新

λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy

λSplit: 用于荧光显微镜的自监督内容感知光谱解混

Federico Carrara, Talley Lambert, Mehdi Seifi, Florian Jug

发表机构 * Fondazione Human Technopole(Human Technopole基金会) Harvard Medical School(哈佛医学院) Università Campus Bio-Medico(罗马生物医学大学)

AI总结 提出λSplit,一种基于物理信息的深度生成模型,通过分层变分自编码器学习浓度图的条件分布,结合可微分光谱混合器实现最先进的光谱解混和隐式噪声去除。

Comments 14 pages, 25 pages supplement, 16 figures total, 14 tables total

详情
AI中文摘要

在荧光显微镜中,光谱解混旨在从捕获混合荧光发射的光谱图像中恢复单个荧光团浓度。由于经典方法逐像素操作并依赖最小二乘拟合,其性能随着发射光谱重叠增加和噪声水平升高而下降,这表明能够学习并利用结构先验的数据驱动方法可能会带来改进。基于学习的光谱成像方法确实存在,但它们要么未针对显微镜数据进行优化,要么是为不适用于荧光显微镜设置的非常特定情况而开发的。为了解决这个问题,我们提出了λSplit,一种基于物理信息的深度生成模型,它使用分层变分自编码器学习浓度图上的条件分布。一个完全可微的光谱混合器强制与图像形成过程的一致性,而学习到的结构先验实现了最先进的解混和隐式噪声去除。我们在3个真实世界数据集上展示了λSplit,这些数据集被我们合成为总共66个具有挑战性的光谱解混基准。我们将结果与总共10种基线方法进行比较,包括经典方法和一系列基于学习的方法。我们的结果一致显示出竞争性能和在强噪声、光谱显著重叠或光谱维度降低情况下的改进鲁棒性,使λSplit成为荧光显微镜数据光谱解混的新最先进方法。重要的是,λSplit与标准共聚焦显微镜产生的光谱数据兼容,无需专门的硬件修改即可立即采用。

英文摘要

In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed fluorophore emissions. Since classical methods operate pixel-wise and rely on least-squares fitting, their performance degrades with increasingly overlapping emission spectra and higher levels of noise, suggesting that a data-driven approach that can learn and utilize a structural prior might lead to improved results. Learning-based approaches for spectral imaging do exist, but they are either not optimized for microscopy data or are developed for very specific cases that are not applicable to fluorescence microscopy settings. To address this, we propose λSplit, a physics-informed deep generative model that learns a conditional distribution over concentration maps using a hierarchical Variational Autoencoder. A fully differentiable Spectral Mixer enforces consistency with the image formation process, while the learned structural priors enable state-of-the-art unmixing and implicit noise removal. We demonstrate λSplit on 3 real-world datasets that we synthetically cast into a total of 66 challenging spectral unmixing benchmarks. We compare our results against a total of 10 baseline methods, including classical methods and a range of learning-based methods. Our results consistently show competitive performance and improved robustness in high noise regimes, when spectra overlap considerably, or when the spectral dimensionality is lowered, making λSplit a new state-of-the-art for spectral unmixing of fluorescent microscopy data. Importantly, λSplit is compatible with spectral data produced by standard confocal microscopes, enabling immediate adoption without specialized hardware modifications.

2603.06989 2026-06-02 cs.CV 版本更新

MipSLAM: Alias-Free Gaussian Splatting SLAM

MipSLAM:无混叠高斯泼溅SLAM

Yingzhao Li, Yan Li, Shixiong Tian, Yanjie Liu, Lijun Zhao, Gim Hee Lee

发表机构 * State Key Laboratory of Robotics and Systems (HIT), Harbin Institute of Technology(哈尔滨工业大学机器人技术与系统国家重点实验室) Yangtze River Delta HIT Robot Technology Research Institute(长三角HIT机器人技术研究院) Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系)

AI总结 提出MipSLAM框架,通过椭圆自适应抗混叠算法和频谱感知位姿图优化,实现高保真抗混叠新视角合成与鲁棒位姿估计。

Comments Accepted to ICRA 2026

详情
AI中文摘要

本文介绍了MipSLAM,一种频率感知的3D高斯泼溅(3DGS)SLAM框架,能够在不同相机配置下实现高保真抗混叠新视角合成和鲁棒位姿估计。现有的基于3DGS的SLAM系统常因滤波不足和纯空间优化而遭受混叠伪影和轨迹漂移。为克服这些限制,我们提出椭圆自适应抗混叠(EAA)算法,通过几何感知数值积分近似高斯贡献,避免昂贵的解析计算。此外,我们提出频谱感知位姿图优化(SA-PGO)模块,在频域中重新表述轨迹估计,通过图拉普拉斯分析有效抑制高频噪声和漂移。在Replica和TUM数据集上的广泛评估表明,MipSLAM在多种分辨率下实现了最先进的渲染质量和定位精度。代码可在https://github.com/yzli1998/MipSLAM获取。

英文摘要

This paper introduces MipSLAM, a frequency-aware 3D Gaussian Splatting (3DGS) SLAM framework capable of high-fidelity anti-aliased novel view synthesis and robust pose estimation under varying camera configurations. Existing 3DGS-based SLAM systems often suffer from aliasing artifacts and trajectory drift due to inadequate filtering and purely spatial optimization. To overcome these limitations, we propose an Elliptical Adaptive Anti-aliasing (EAA) algorithm that approximates Gaussian contributions via geometry-aware numerical integration, avoiding costly analytic computation. Furthermore, we present a Spectral-Aware Pose Graph Optimization (SA-PGO) module that reformulates trajectory estimation in the frequency domain, effectively suppressing high-frequency noise and drift through graph Laplacian analysis. Extensive evaluations on Replica and TUM datasets demonstrate that MipSLAM achieves state-of-the-art rendering quality and localization accuracy across multiple resolutions. Code is available at https://github.com/yzli1998/MipSLAM.

2508.09456 2026-06-02 cs.CV cs.CL cs.CR 版本更新

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

IAG: 针对基于VLM的视觉定位的输入感知后门攻击

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学) Columbia University(哥伦比亚大学) Hong Kong Polytechnic University(香港理工大学) Nanyang Technological University(南洋理工大学)

AI总结 提出IAG方法,通过文本条件UNet动态生成输入感知的触发器,实现首个多目标后门攻击VLM视觉定位,在多个模型和基准上达到最佳攻击成功率且不影响正常性能。

Comments Accepted by CVPR 2026; Code is at https://github.com/lijunxian111/IAG

详情
Journal ref
https://openaccess.thecvf.com/content/CVPR2026/papers/Li_IAG_Input-aware_Backdoor_Attack_on_VLM-based_Visual_Grounding_CVPR_2026_paper.pdf
AI中文摘要

近期视觉语言模型(VLM)的进展显著提升了视觉定位任务,该任务涉及根据自然语言查询在图像中定位对象。尽管取得了这些进展,基于VLM的定位系统的安全性尚未得到彻底研究。本文揭示了一个新颖且现实的安全漏洞:首个针对VLM视觉定位的多目标后门攻击。与依赖静态触发器或固定目标的先前攻击不同,我们提出了IAG,一种动态生成输入感知、文本引导触发器的方法,这些触发器以任意指定目标对象描述为条件来执行攻击。这是通过一个文本条件的UNet实现的,该网络将难以察觉的目标语义线索嵌入视觉输入,同时保持对良性样本的正常定位性能。我们进一步开发了一个联合训练目标,平衡语言能力与感知重建,以确保不可察觉性、有效性和隐蔽性。在多个VLM(如LLaVA、InternVL、Ferret)和基准(RefCOCO、RefCOCO+、RefCOCOg、Flickr30k Entities和ShowUI)上的大量实验表明,IAG在几乎所有设置下都取得了优于其他基线的最佳攻击成功率(ASR),同时不损害干净准确率,保持对现有防御的鲁棒性,并展现出跨数据集和模型的迁移性。这些发现强调了具有定位能力的VLM中的关键安全风险,并突出了对可信多模态理解的进一步研究的必要性。

英文摘要

Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.

2510.19496 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CARES: Context-Aware Resolution Selector for VLMs

CARES: 面向视觉语言模型的上下文感知分辨率选择器

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

发表机构 * Technion(以色列理工学院) IBM Research(IBM研究院) Tel-Aviv University(特拉维夫大学) Ben-Gurion University(本-古里安大学)

AI总结 提出CARES轻量级预处理模块,通过紧凑型VLM预测图像-查询对的最小足够分辨率,在保持任务性能的同时最多减少80%计算量。

Comments Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Accepted to ACL 2026 (Oral presentation). Code available at https://github.com/mkimhi/CARES

详情
AI中文摘要

大型视觉语言模型通常以原始或高分辨率处理图像以保持跨任务有效性。这导致视觉令牌通常占总令牌的97-99%,即使低分辨率图像就足够时,也会产生高计算量和延迟。我们引入了CARES——一种上下文感知分辨率选择器,这是一个轻量级预处理模块,给定图像-查询对,预测最小的足够输入分辨率。CARES使用紧凑型VLM(350M)提取特征,并预测目标预训练VLM的响应何时收敛到其正确回答的峰值能力。尽管作为一组可选分辨率上的离散分类器进行训练,但CARES在推理时插值连续分辨率以实现细粒度控制。在涵盖文档和自然图像以及多样化目标VLM的五个多模态基准测试中,CARES在保持任务性能的同时最多减少80%的计算量。

英文摘要

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This often inflates visual tokens to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \emph{CARES}, a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector: a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
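The abstract does not say how a discrete classifier yields a continuous resolution at inference. One plausible reading, offered only as an assumption, is a probability-weighted average over the candidate resolutions. The candidate grid, the patch multiple of 14, and the function names below are hypothetical and are not taken from the CARES code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_resolution(logits, resolutions, multiple=14):
    """Turn class scores over discrete resolutions into one continuous value.

    Assumption: the continuous resolution is the expectation of the candidate
    resolutions under the classifier's softmax distribution, rounded to the
    nearest multiple of a (hypothetical) patch size.
    """
    if len(logits) != len(resolutions):
        raise ValueError("need one logit per candidate resolution")
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, resolutions))
    return int(round(expected / multiple)) * multiple


# Hypothetical candidate set; a confident classifier collapses to one entry,
# an uncertain one lands between neighbouring entries.
RESOLUTIONS = [224, 448, 672, 896]
chosen = interpolate_resolution([0.2, 1.5, 1.4, -2.0], RESOLUTIONS)
```

With this reading, the selector never has to commit to a grid point: ambiguity between two neighbouring classes translates into an intermediate resolution.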

2603.18652 2026-06-02 cs.CV cs.AI cs.IR 版本更新

Beyond String Matching: Semantic Evaluation of PDF Table Extraction

超越字符串匹配:PDF表格提取的语义评估

Pius Horn, Janis Keuper

发表机构 * Institute for Machine Learning and Analytics (IMLA)(机器学习与分析研究所) Offenburg University(奥芬堡大学) University of Mannheim(曼海姆大学)

AI总结 提出基于LLM-as-a-judge的语义评估框架,通过合成PDF和人工验证,显著优于现有规则指标(TEDS、GriTS),并评估了21种PDF解析器。

Comments Submitted to BMVC 2026

详情
AI中文摘要

从PDF中可靠地提取表格对于大规模科学数据挖掘和知识库构建至关重要,然而现有的评估方法依赖于基于规则的指标,无法捕捉表格内容的语义等价性。我们提出了一个基于合成PDF的基准测试框架,这些PDF具有精确的LaTeX真实标注,并使用来自arXiv的表格以确保现实的复杂性和多样性。作为我们的核心方法论贡献,我们将LLM-as-a-judge应用于语义表格评估,并将其集成到一个能够适应解析器输出不一致性的匹配流水线中。通过一项包含对提取表格对的1500多项质量判断的人工验证研究,我们表明基于LLM的评估与人类判断的相关性(Pearson r=0.93)显著高于当前使用的基于树编辑距离的相似度(TEDS, r=0.68)和网格表格相似度(GriTS, r=0.70)。对21个当代PDF解析器在包含451个表格的100个合成文档上的评估揭示了显著的性能差异。我们的结果为选择用于表格数据提取的解析器提供了实用指导,并为这一关键任务建立了一种可重复、可扩展的评估方法。代码和数据:https://github.com/phorn1/pdf-parse-bench 指标研究和人工评估:https://github.com/phorn1/table-metric-study

英文摘要

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to currently used Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
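The comparison above rests on the Pearson correlation between a metric's scores and human judgments. As a minimal sketch of that statistic (the score lists are made-up illustrations, not data from the study):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and standard deviations share the same 1/n factor, so it cancels.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical scores for five extracted tables.
human_scores = [9, 7, 3, 10, 5]
metric_scores = [0.90, 0.75, 0.40, 0.95, 0.50]
r = pearson_r(human_scores, metric_scores)
```

A value near 1 means the automatic metric ranks and spaces the tables much as human raters do, which is the sense in which r=0.93 is reported as better than r=0.68 or r=0.70.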

2504.05033 2026-06-02 cs.RO cs.CV 版本更新

CloSE: A Geometric Shape-Agnostic Cloth State Representation

CloSE: 一种几何形状无关的布料状态表示

Jay Kamat, Júlia Borràs, Carme Torras

发表机构 * Institut de Robòtica i Informàtica Industrial, CSIC-UPC(机器人与工业信息学研究所,CSIC-UPC)

AI总结 提出一种基于拓扑索引的dGLI圆盘表示,并从中抽象出紧凑、连续的CloSE表示,用于预测布料折叠位置并支持语义标注与规划。

Comments Accepted at ICRA 2026 (8 pages, 11 figures, 1 table). Project page: https://close-representation.github.io/

详情
AI中文摘要

布料操作是一个难题,主要是因为布料的非刚性特性,这使得对变形的良好表示至关重要。我们提出了一种新的布料变形状态表示。首先,我们提出了基于拓扑索引的dGLI圆盘表示,这些索引是针对排列在圆形网格上的布料边界边缘段计算的。dGLI圆盘的热力图揭示了与布料状态特征相对应的模式,这些模式对于不同形状、尺寸或方向的布料是一致的。然后,我们将这些重要特征从dGLI圆盘中抽象成一个圆,称为布料状态表示(CloSE)。这种表示紧凑、连续,且适用于不同形状。我们表明,这种表示能够准确预测多个仿真布料数据集中的折叠位置。最后,我们还展示了这种表示在两个相关应用中的优势:语义标注以及高层和低层规划。代码和数据集可从以下网址获取:https://close-representation.github.io/

英文摘要

Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/

2603.04256 2026-06-02 cs.CV 版本更新

A Hypertoroidal Covering for Perfect Color Equivariance

完美颜色等变的超环面覆盖

Yulong Yang, Zhikun Xu, Yaojun Li, Christine Allen-Blanchette

发表机构 * GitHub

AI总结 提出一种通过将区间值提升到圆上的双覆盖来构建真正等变的颜色等变架构,解决了先前方法中近似饱和度和亮度为1D平移带来的伪影问题,在细粒度分类和医学成像等任务上提升了性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

当输入图像的颜色分布在推理时发生变化,传统神经网络架构的性能会显著下降。一些研究者已开始将颜色几何的先验知识融入神经网络设计。这些颜色等变架构将色调变化建模为2D旋转,饱和度和亮度变换建模为1D平移。虽然这种方法在多种情况下提高了神经网络对颜色变化的鲁棒性,但我们发现将饱和度和亮度(区间值量)近似为1D平移会引入明显的伪影。本文提出了一种真正等变的颜色等变架构。我们不再用实直线近似区间,而是将区间上的值提升到圆上的值(双覆盖),并在其上构建等变表示。我们的方法解决了先前方法的近似伪影问题,提高了可解释性和泛化能力,并在细粒度分类和医学成像等任务上取得了优于传统和等变基线的预测性能。超越颜色范畴,我们提出的提升方法还可以扩展到尺度等几何变换。

英文摘要

When the color distribution of input images changes at inference, the performance of conventional neural network architectures drops considerably. A few researchers have begun to incorporate prior knowledge of color geometry in neural network design. These color equivariant architectures have modeled hue variation as 2D rotations, and saturation and luminance transformations as 1D translations. While this approach improves neural network robustness to color variations in a number of contexts, we find that approximating saturation and luminance (interval-valued quantities) as 1D translations introduces appreciable artifacts. In this paper, we introduce a color equivariant architecture that is truly equivariant. Instead of approximating the interval with the real line, we lift values on the interval to values on the circle (a double-cover) and build equivariant representations there. Our approach resolves the approximation artifacts of previous methods, improves interpretability and generalizability, and achieves better predictive performance than conventional and equivariant baselines on tasks such as fine-grained classification and medical imaging. Going beyond the context of color, we show that our proposed lifting can also extend to geometric transformations such as scale.
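To see why translations on an interval are only approximate while rotations on a circle are exact, consider a discretized toy version. The sketch below is an illustration under assumptions, not the paper's construction: it lifts a signal over n interval bins to 2n circle bins by reflection, so every interval bin has two preimages on the circle. All function names are hypothetical.

```python
def lift_to_circle(values):
    """Reflect an interval signal into a periodic one: [a, b, c] -> [a, b, c, c, b, a]."""
    return list(values) + list(reversed(values))


def rotate(circle_values, k):
    """Cyclic shift by k bins; always invertible, so shifts form a group action."""
    k %= len(circle_values)
    if k == 0:
        return list(circle_values)
    return circle_values[-k:] + circle_values[:-k]


def clipped_shift(values, k, fill=0.0):
    """Naive interval translation: entries pushed past a boundary are lost."""
    n = len(values)
    out = [fill] * n
    for i, v in enumerate(values):
        if 0 <= i + k < n:
            out[i + k] = v
    return out


signal = [1.0, 2.0, 3.0]
lifted = lift_to_circle(signal)

# On the circle, shifting forward and then back recovers the input exactly.
round_trip_circle = rotate(rotate(lifted, 2), -2)
# On the interval, the same round trip destroys information at the boundary.
round_trip_interval = clipped_shift(clipped_shift(signal, 2), -2)
```

The clipped shift is the kind of boundary artifact the abstract attributes to treating saturation and luminance as 1D translations; the cyclic shift has no boundary to fall off.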

2603.00171 2026-06-02 cs.CV cs.AI 版本更新

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

LookWise: 让多模态大语言模型在细粒度视觉推理中知道何时看、看哪里

Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, Jie Zhang

发表机构 * Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences(智能机器研究所,合肥物理科学研究院,中国科学院) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) East China Normal University(华东师范大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出LookWise框架,通过置信度模块和语义引导定位模块实现自适应视觉推理,无需额外训练即可提升细粒度推理精度并加速推理。

详情
AI中文摘要

多模态大语言模型正转向通过主动探索图像细节进行“图像思考”。虽然有效,但大规模训练计算成本高昂,这激发了对轻量级、无需训练解决方案的兴趣。然而,现有无需训练方法存在两个缺陷:无差别裁剪导致的感知冗余,增加了计算成本并引入噪声;以及语义意图与空间注意力之间的漂移,阻碍了用户关注区域的准确定位。为应对这些挑战,我们提出LookWise,一个自适应视觉推理框架。LookWise遵循两阶段流程:基于置信度的模块决定何时更仔细地观察,语义引导的定位模块确定观察位置。该设计使MLLM能够自适应获取细粒度视觉证据而无需额外训练。在细粒度和高分辨率视觉推理基准上的实验表明,LookWise在强基线上持续提升准确率,同时相较于基于搜索的方法ZoomEye实现约$4.0\times$的推理加速,展现出稳健的跨模型泛化能力。

英文摘要

Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose LookWise, a framework for adaptive visual reasoning. LookWise follows a two-stage pipeline: a confidence-based module decides when to look more carefully, and a semantic-guided localization module determines where to look. This design enables MLLMs to adaptively acquire fine-grained visual evidence without additional training. Experiments on fine-grained and high-resolution visual reasoning benchmarks show that LookWise consistently improves accuracy over strong baselines while achieving an approximately $4.0\times$ inference speedup over the search-based method ZoomEye, demonstrating robust cross-model generalization.
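The two-stage pipeline described above (decide when to look closer, then decide where) can be sketched as a small control loop. This is a hedged illustration: the callables `answer_fn`, `locate_fn`, and `crop_fn`, the confidence threshold, and the return convention are all hypothetical stand-ins, not the LookWise interface.

```python
from typing import Any, Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def answer_adaptively(
    image: Any,
    question: str,
    answer_fn: Callable[[Any, str], Tuple[str, float]],
    locate_fn: Callable[[Any, str], Box],
    crop_fn: Callable[[Any, Box], Any],
    threshold: float = 0.8,
) -> Tuple[str, bool]:
    """Return (answer, zoomed), where zoomed says whether stage two ran."""
    # Stage 1 ("when"): answer from the full image and read off a confidence.
    answer, confidence = answer_fn(image, question)
    if confidence >= threshold:
        return answer, False
    # Stage 2 ("where"): localize the query-relevant region, then re-answer
    # on the crop only.
    box = locate_fn(image, question)
    refined, _ = answer_fn(crop_fn(image, box), question)
    return refined, True
```

Gating the second stage on confidence is what avoids indiscriminate cropping: confident queries cost a single forward pass, and only uncertain ones pay for localization and a second pass.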

2603.09529 2026-06-02 cs.CV 版本更新

RESBev: Making BEV Perception More Robust

RESBev:使BEV感知更加鲁棒

Lifeng Zhuo, Kefan Jin, Zhe Liu, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出RESBev,一种即插即用的鲁棒BEV感知方法,通过构建潜在世界模型学习时空相关性来预测干净BEV特征,从而在无需修改骨干网络的情况下增强对自然扰动和对抗攻击的鲁棒性。

详情
AI中文摘要

鸟瞰图(BEV)感知已成为自动驾驶系统的基石,为下游规划和控制提供了结构化的、以自我为中心的表示。然而,实际部署面临传感器退化和对抗攻击的挑战,这些可能导致严重的感知异常,最终危及自动驾驶系统的安全性。为了解决这个问题,我们提出了一种弹性且即插即用的BEV感知方法RESBev,它可以轻松应用于现有的BEV感知方法,以增强其对各种扰动的鲁棒性。具体来说,我们将感知鲁棒性重新构建为潜在语义预测问题。构建了一个潜在世界模型来提取连续BEV观测中的时空相关性,从而学习潜在的BEV状态转换,以预测干净的BEV特征来重建被破坏的观测。所提出的框架在Lift-Splat-Shoot管道的语义特征级别上运行,使其能够在无需修改底层骨干网络的情况下,对自然扰动和对抗攻击进行泛化恢复。在nuScenes数据集上的大量实验表明,通过少样本微调,RESBev显著提高了现有BEV感知模型对各种外部扰动和对抗攻击的鲁棒性。

英文摘要

Bird's-eye-view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego-centric representation critical for downstream planning and control. However, real-world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug-and-play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few-shot fine-tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.

2603.09390 2026-06-02 cs.CV 版本更新

Training-Free Coverless Multi-Image Steganography with Access Control

免训练的无载体多图像隐写术与访问控制

Minyeol Bae, Si-Hyeon Lee

发表机构 * Department of Computer Science, Seoul National University(首尔国立大学计算机科学系)

AI总结 提出MIDAS框架,通过潜在级融合实现免训练的多图像隐写与用户特定访问控制,引入随机基机制抑制残差结构信息,并理论分析信息泄露。

Comments Accepted (Poster) at ICML 2026

详情
AI中文摘要

无载体图像隐写术(CIS)在不显式修改载体图像的情况下隐藏信息,提供强不可感知性和对隐写分析的固有鲁棒性。然而,现有的CIS方法大多缺乏鲁棒的访问控制,难以向不同授权用户选择性揭示不同隐藏内容。这种访问控制对于多用户场景中可扩展且隐私敏感的信息隐藏至关重要。我们提出MIDAS(基于多图像扩散的访问控制隐写术),一种免训练的基于扩散的CIS框架,通过潜在级融合实现具有用户特定访问控制的多图像隐藏。MIDAS引入随机基机制以抑制残差结构信息,并附带信息泄露的理论分析,以及一个潜在向量融合模块,该模块重塑聚合的潜在向量以更好地与扩散过程对齐。实验结果表明,MIDAS在访问控制功能、隐写图像质量和多样性、对噪声的鲁棒性以及抵抗隐写分析方面,始终优于现有的免训练CIS基线,为访问控制的无载体隐写术建立了一种实用且可扩展的方法。

英文摘要

Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings. We propose MIDAS (Multi-Image Diffusion-based Access-controlled Steganography), a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information, together with a theoretical analysis of information leakage, and a Latent Vector Fusion module that reshapes aggregated latents to better align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.

2603.09292 2026-06-02 cs.RO cs.CV 版本更新

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

看、规划、回退:面向鲁棒机器人操作的进度感知视觉-语言-动作模型

Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zihao Zhang, Zhihui Li, Salman Khan, Jun Yu, Xiaojun Chang

发表机构 * School of Information Science and Technology, University of Science and Technology of China(信息科学与技术学院,中国科学技术大学) University of Technology Sydney(悉尼科技大学) Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence(计算机视觉系,穆罕默德·本·扎耶德人工智能大学) The University of Hong Kong(香港大学) Institute of AI for Industry, Chinese Academy of Sciences(产业人工智能研究所,中国科学院) School of Intelligent Science and Engineering, Harbin Institute of Technology (Shenzhen)(智能科学与工程学院,哈尔滨工业大学(深圳))

AI总结 提出进度感知的视觉-语言-动作框架SPR,通过动态将语言指令映射为空间子目标序列,并利用闭环进度监控实现错误恢复,在LIBERO基准上提升5%性能,在LIBERO-Plus上展现最先进的鲁棒性。

Comments Suggested to CVPR Findings. https://tingjundai.github.io/SPRVLA/

详情
AI中文摘要

通过明确的、可操作的里程碑来测量任务进度对于鲁棒机器人操作至关重要。这种进度感知使模型能够把握当前任务状态,预期可验证的中间状态,并在进度停滞时检测失败并从中恢复。为体现这一能力,我们引入了\textbf{看}、\textbf{规划}、\textbf{回退}(See, Plan, Rewind,SPR),一个进度感知的视觉-语言-动作框架,它动态地将语言指令锚定为一系列空间子目标。SPR通过一个连续的核心循环运行:观察当前状态和即将到来的里程碑,规划朝向下一个2D航点的轨迹,并对照预期序列监控进度,在失败时回退到可恢复状态。这种闭环方法无需额外训练数据或辅助模型即可实现鲁棒的错误纠正。大量实验证明了该框架的有效性、泛化能力和鲁棒性:SPR在LIBERO基准上比MolmoAct基线高出5%。在具有未见指令和初始状态的挑战性LIBERO-Plus基准上,SPR实现了最先进的鲁棒性,性能下降最小,超越了OpenVLA-OFT和UniVLA,展示了优越的分布外鲁棒性。

英文摘要

Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce \textbf{S}ee, \textbf{P}lan, \textbf{R}ewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5\% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.

2509.25773 2026-06-02 cs.CV cs.AI cs.CL 版本更新

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

v-HUB: 从视觉和声音理解视频幽默的基准

Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Wuhan University(武汉大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院) Independent Researcher(独立研究者)

AI总结 提出v-HUB基准,通过非语言短视频评估多模态大语言模型在仅凭视觉线索理解幽默的能力,并发现音频信息有助于提升幽默理解。

Comments 24 pages, 9 figures

详情
AI中文摘要

能够理解幽默的AI模型具有现实应用前景——例如,增强人机交互中的参与度。为了评估和诊断多模态大语言模型(MLLMs)理解幽默的能力,我们引入了v-HUB,一个新颖的视频幽默理解基准。v-HUB包含一个精心策划的非语言短视频集合,反映了仅通过视觉线索即可欣赏幽默的现实场景。我们将每个视频片段与丰富的标注配对,以支持各种评估任务和分析,包括一项关于增强幽默的环境声音的新研究。为了扩大其适用性,我们构建了一个开放式问答任务,使v-HUB能够轻松集成到现有的视频理解任务套件中。我们评估了多种MLLMs,从专门的Video-LLMs到能够原生处理音频的多功能OmniLLMs,涵盖了开源和专有领域。实验结果揭示了MLLMs在仅凭视觉线索理解幽默时面临的困难。我们的发现还表明,结合音频有助于视频幽默理解,突显了为复杂视频理解任务整合更丰富模态的前景。

英文摘要

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

2603.06741 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Heterogeneous Decentralized Diffusion Models

异构去中心化扩散模型

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy

发表机构 * bagel.com(Bagel公司)

AI总结 提出一种异构去中心化训练框架,通过支持不同专家使用不同目标(DDPM和Flow Matching)并统一推理、预训练检查点转换以及高效架构,大幅降低计算和数据需求,使单GPU(24-48GB VRAM)即可参与训练。

Comments Accepted to CVPR 2026

详情
AI中文摘要

训练前沿规模的扩散模型通常需要大量计算资源集中在紧密耦合的集群中,限制了只有资源充足的机构才能参与。虽然去中心化扩散模型(DDM)能够独立训练多个专家,但现有方法需要1176 GPU天,且所有专家使用同质化训练目标。我们提出了一个高效框架,大幅降低资源需求,同时支持异构训练目标。我们的方法结合了三个关键贡献:(1)一种异构去中心化训练范式,允许专家使用不同的目标(DDPM和Flow Matching),在推理时无需任何重新训练即可统一;(2)从ImageNet-DDPM到Flow Matching目标的预训练检查点转换,加速收敛并无需针对特定目标的预训练即可初始化;(3)PixArt-$α$的高效AdaLN-Single架构,在保持质量的同时减少参数。在LAION-Aesthetics上的实验表明,相对于先前DDM工作报告的训练规模,我们的方法将计算量减少了16倍,数据量减少了14倍。在对齐的推理设置下,我们的异构配置比同质基线获得了更好的FID和更高的提示内多样性。通过消除同步需求并支持混合DDPM/FM目标,我们的框架使贡献者只需单GPU(24-48GB VRAM)即可进行去中心化生成模型训练。

英文摘要

Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that dramatically reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three key contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time without any retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-$α$'s efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces the compute by 16$\times$ and data by 14$\times$. Under aligned inference settings, our heterogeneous configuration achieves better FID and higher intra-prompt diversity than the homogeneous baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework makes decentralized generative model training accessible to contributors with single GPUs requiring only 24--48GB VRAM.
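The abstract does not describe how DDPM and Flow Matching experts are unified at inference. One textbook identity that makes such mixing possible is sketched below under loudly stated assumptions: the DDPM expert predicts the noise in x_t = alpha * x_0 + sigma * eps, and the flow-matching target is the velocity v = eps - x_0 of the linear path (1 - t) * x_0 + t * eps. Aligning the two time and noise schedules is a separate step that this sketch leaves out, and `eps_to_velocity` is a hypothetical name.

```python
def eps_to_velocity(x_t, eps_hat, alpha, sigma):
    """Convert a noise prediction into a linear-path velocity, element-wise.

    Assumptions (not taken from the paper):
      * x_t = alpha * x_0 + sigma * eps with alpha > 0;
      * the velocity target is v = eps - x_0.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive to recover x_0")
    # Invert the forward process to estimate the clean sample.
    x0_hat = [(x - sigma * e) / alpha for x, e in zip(x_t, eps_hat)]
    # Velocity of the straight line from data (t=0) to noise (t=1).
    return [e - x0 for e, x0 in zip(eps_hat, x0_hat)]


# Toy check with a known clean sample and noise draw.
x0, eps = [2.0, -1.0], [-1.0, 0.5]
alpha, sigma = 0.8, 0.6
x_t = [alpha * a + sigma * b for a, b in zip(x0, eps)]
v = eps_to_velocity(x_t, eps, alpha, sigma)
```

When the noise prediction is exact, the recovered velocity equals eps - x_0, so both kinds of expert can be queried in a common output space before their predictions are combined.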

2603.06453 2026-06-02 cs.CV 版本更新

Pinterest Canvas: Large-Scale Image Generation at Pinterest

Pinterest Canvas: Pinterest 的大规模图像生成系统

Yu Wang, Eric Tzeng, Raymond Shiau, Jie Yang, Dmitry Kislyuk, Charles Rosenberg

发表机构 * Pinterest, Inc.(Pinterest公司)

AI总结 本文提出 Pinterest Canvas,一个基于扩散模型的大规模图像生成系统,通过基础模型微调为特定任务(如背景增强和宽高比外扩)生成专用模型,并在线上实验中分别获得18.0%和12.5%的参与度提升。

Comments Accepted by KDD 2026 Applied Data Science Track

详情
AI中文摘要

尽管最近的图像生成模型在处理各种图像生成任务方面表现出色,但这种灵活性使得它们难以仅通过提示或简单的推理适应来控制,因此不适用于具有严格产品要求的场景。在本文中,我们介绍了 Pinterest Canvas,这是我们构建的大规模图像生成系统,用于支持 Pinterest 上的图像编辑和增强用例。Canvas 首先在多样化的多模态数据集上进行训练,以生成具有广泛图像编辑能力的基础扩散模型。然而,我们并不依赖一个通用模型来处理所有下游任务,而是针对特定任务的数据集快速微调该基础模型的变体,从而为各个用例生成专用模型。我们描述了 Canvas 的关键组件,并总结了我们在数据集策划、训练和推理方面的最佳实践。我们还通过背景增强和宽高比外扩的案例研究展示了任务特定的变体,突出了我们如何满足其特定的产品需求。在线 A/B 实验表明,我们的增强图像分别获得了显著的 18.0% 和 12.5% 的参与度提升,与人类评估者的比较进一步验证了我们的模型在这些任务上优于第三方模型。最后,我们展示了其他 Canvas 变体,包括多图像场景合成和图像到视频生成,证明了我们的方法可以推广到各种潜在的下游任务。

英文摘要

While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.

2603.06331 2026-06-02 cs.CV 版本更新

WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

WorldCache: 通过异构令牌缓存免费加速世界模型

Weilun Feng, Guoxin Fan, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu, Chuanguang Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对扩散世界模型中令牌异质性和非均匀时间动态导致的推理慢问题,提出基于曲率引导的异构令牌预测和混沌优先自适应跳过的缓存框架WorldCache,实现高达3.7倍加速并保持98%的推出质量。

Comments Accepted by ICML 2026

详情
AI中文摘要

基于扩散的世界模型在统一世界模拟方面显示出巨大潜力,但迭代去噪对于交互式使用和长程推出而言仍然过于昂贵。虽然特征缓存可以在无需训练的情况下加速推理,但我们发现,为单模态扩散设计的策略由于两个世界模型特有的障碍而难以迁移到世界模型:来自多模态耦合和空间变化的\emph{令牌异质性},以及\emph{非均匀时间动态},其中一小部分困难令牌驱动误差增长,使得均匀跳过要么不稳定,要么过于保守。我们提出了\textbf{WorldCache},一个为扩散世界模型量身定制的缓存框架。我们引入了\textit{曲率引导的异构令牌预测},它使用基于物理的曲率分数来估计令牌可预测性,并对具有突变方向变化的混沌令牌应用Hermite引导的阻尼预测器。我们还设计了\textit{混沌优先的自适应跳过},它累积一个曲率归一化、无量纲的漂移信号,并且仅在瓶颈令牌开始漂移时重新计算。在扩散世界模型上的实验表明,WorldCache实现了高达\textbf{3.7$\times$}的端到端加速,同时保持了\textbf{98\%}的推出质量,展示了WorldCache在资源受限场景中的巨大优势和实用性。我们的代码发布在https://github.com/FofGofx/WorldCache。

英文摘要

Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: \emph{token heterogeneity} from multi-modal coupling and spatial variation, and \emph{non-uniform temporal dynamics} where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose \textbf{WorldCache}, a caching framework tailored to diffusion world models. We introduce \textit{Curvature-guided Heterogeneous Token Prediction}, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design \textit{Chaotic-prioritized Adaptive Skipping}, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to \textbf{3.7$\times$} end-to-end speedups while maintaining \textbf{98\%} rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource-constrained scenarios. Our code is released in https://github.com/FofGofx/WorldCache.
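The abstract names a curvature score and an accumulated drift signal without giving formulas. The sketch below is one hypothetical instantiation, meant only to make the idea concrete: curvature is the second difference of a token's features across denoising steps divided by the first difference (a dimensionless ratio), and a controller recomputes once the worst token's accumulated curvature exceeds a budget. None of the names or formulas come from the WorldCache code.

```python
import math


def norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def curvature(prev2, prev1, curr, eps=1e-8):
    """Dimensionless turn rate of one token's feature trajectory.

    Assumption: second difference over first difference across three
    consecutive denoising steps; near 0 for straight, predictable motion.
    """
    velocity = [c - p for c, p in zip(curr, prev1)]
    accel = [c - 2 * p1 + p2 for c, p1, p2 in zip(curr, prev1, prev2)]
    return norm(accel) / (norm(velocity) + eps)


class SkipController:
    """Accumulate the worst-token curvature and recompute once it exceeds a budget."""

    def __init__(self, budget=1.0):
        self.budget = budget
        self.drift = 0.0

    def should_recompute(self, token_curvatures):
        # The bottleneck (most chaotic) token drives the decision.
        self.drift += max(token_curvatures)
        if self.drift >= self.budget:
            self.drift = 0.0  # a full recompute resets the accumulated drift
            return True
        return False
```

Keying the decision to the maximum rather than the mean reflects the abstract's observation that a small set of hard tokens drives error growth; smooth tokens alone would allow long runs of skipped steps.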

2601.09566 2026-06-02 cs.CV cs.AI 版本更新

Hot-Start Chinese Language Modeling: Visual Glyphs Accelerate Sample-Efficient Learning

热启动中文语言建模:视觉字形加速样本高效学习

Shuyang Xiang, Hao Guan

发表机构 * Independent Researcher(独立研究者) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所)

AI总结 本文通过将汉字渲染为视觉字形图像,研究其对字符级语言建模的归纳偏置,发现视觉输入产生显著的热启动效应,但最终精度与基于索引的方法一致。

Comments 15 pages, 5 figures, submitted to ACL 2026

详情
AI中文摘要

在这项工作中,我们研究了将汉字渲染为视觉字形图像(而非主流LLM使用的离散token ID)是否为字符级语言建模提供归纳偏置。我们的核心发现给出了一个双刃剑的见解:视觉输入产生显著的热启动效应,在第一个epoch内(占总训练步骤的0.4%)将早期准确率提高一倍以上(视觉输入12.3% vs. 基于索引的基线5.8%),但两种方法最终收敛到几乎相同的最终准确率(39%)。这一模式在低至8x8像素的分辨率、高达50%的部分裁剪以及从110M到1.78B参数的模型规模下均成立。我们识别的机制是,字形渲染在训练之前就将基于部首的结构预编码到嵌入空间中(余弦相似度0.27 vs. 随机嵌入的0.002),从而能够更快地对齐,但无法提高最终容量。我们的结果阐明了视觉表示作为中文语言建模归纳偏置的前景和根本局限性。

英文摘要

In this work, we study whether rendering Chinese characters as visual glyph images, rather than as the discrete token IDs that mainstream LLMs use, provides an inductive bias for character-level language modeling. Our central finding gives a double-edged insight: visual inputs produce a pronounced hot-start effect, more than doubling early-stage accuracy within the first epoch, at 0.4% of total training steps (12.3% for visual inputs vs. 5.8% for the index-based baseline), yet both approaches converge to essentially identical final accuracy (39%). This pattern holds across resolutions as low as 8x8 pixels, partial cropping up to 50%, and model scales from 110M to 1.78B parameters. The mechanism we identify is that glyph rendering pre-encodes radical-based structure into embedding space before any training (cosine similarity 0.27 vs. 0.002 for random embeddings), enabling faster alignment but not higher final capacity. Our results clarify both the promise and fundamental limitation of visual representations as inductive biases for Chinese language modeling.
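The cosine-similarity figures above compare how alike two character representations are before any training. A minimal sketch of that measurement on raw glyph bitmaps follows; the 4x4 bitmaps are invented toy patterns (two of them share a left-hand stroke column to mimic a shared radical), not rendered characters from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def flatten(bitmap):
    """Row-major flattening of a 2D bitmap into a float vector."""
    return [float(px) for row in bitmap for px in row]


# Hypothetical glyphs: A and B share their leftmost column, C does not.
GLYPH_A = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 1, 1], [1, 0, 0, 0]]
GLYPH_B = [[1, 0, 0, 1], [1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 1, 0]]
GLYPH_C = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]

shared = cosine_similarity(flatten(GLYPH_A), flatten(GLYPH_B))
unrelated = cosine_similarity(flatten(GLYPH_A), flatten(GLYPH_C))
```

Characters with a common component overlap in pixel space, so their vectors start out closer than those of unrelated characters, whereas randomly initialized index embeddings carry no such structure.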

2603.02288 2026-06-02 cs.CV eess.IV 版本更新

AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning

AutoFFS: 用于面部女性化手术规划的对抗性变形

Paul Friedrich, Florentin Bieder, Florian M. Thieringer, Philippe C. Cattin

发表机构 * Department of Biomedical Engineering, University of Basel(巴塞尔大学生物医学工程系) Department of Oral and Cranio-Maxillofacial Surgery, University Hospital Basel(巴塞尔大学口腔和颅面外科系)

AI总结 提出AutoFFS框架,通过对抗性自由变形生成反事实颅骨形态,为面部女性化手术提供定量规划依据。

Comments Project Page: https://pfriedri.github.io/autoffs-io Code: https://github.com/pfriedri/autoffs

详情
AI中文摘要

面部女性化手术(FFS)是跨性别和性别多样化患者性别确认的关键组成部分,旨在将颅面结构重塑为女性形态。当前的手术规划程序主要依赖主观临床评估,缺乏定量和可重复的解剖学指导。因此,我们提出AutoFFS,一种新颖的数据驱动框架,通过对抗性自由变形生成反事实颅骨形态。我们的方法对一组预训练的、学习了性别二态性的二元性别分类器集成执行基于变形的定向对抗攻击,有效将个体颅骨形状向目标性别转变。生成的反事实颅骨形态为FFS的术前规划提供了定量基础,推动了这一长期被忽视的患者群体的进步。我们通过基于分类器的评估验证了我们的方法,提出了形态学弗雷歇距离(MFD)和形态学核距离(MKD)来评估生成人群与真实人群的分布对齐,并进行了人类感知研究,确认生成的形态表现出目标性别特征。

英文摘要

Facial feminization surgery (FFS) is a key component of gender affirmation for transgender and gender diverse patients, aiming to reshape craniofacial structures toward a female morphology. Current surgical planning procedures largely rely on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance. We therefore propose AutoFFS, a novel data-driven framework that generates counterfactual skull morphologies through adversarial free-form deformations. Our method performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers that learned sexual dimorphism, effectively transforming individual skull shapes toward the target sex. The generated counterfactual skull morphologies provide a quantitative foundation for preoperative planning in FFS, driving advances in this largely overlooked patient group. We validate our approach through classifier-based evaluation, propose Morphological Fréchet Distance (MFD) and Morphological Kernel Distance (MKD) to evaluate distributional alignment of generated and real populations, and perform a human perceptual study, confirming that the generated morphologies exhibit target sex characteristics.

2512.08048 2026-06-02 cs.CV 版本更新

Family Matters: A Systematic Study of Spatial vs. Frequency Masking for Continual Test-Time Adaptation

掩码家族很重要:空间掩码与频率掩码在连续测试时自适应中的系统研究

Chandler Timm C. Doloriel, Yunbei Zhang, Yeonguk Yu, Taki Hasan Rafi, Muhammad Salman Siddiqui, Tor Kristian Stevik, Fadi Al Machot, Kristian Hovde Liland, Habib Ullah

发表机构 * Faculty of Science and Technology (REALTEK), Norwegian University of Life Sciences (NMBU)(科学与技术学院(REALTEK)、挪威生命科学大学) Tulane University(杜兰大学) Gwangju Institute of Science and Technology(光州科学技术院) Hanyang University(汉阳大学)

AI总结 通过控制变量实验,系统研究了空间掩码与频率掩码在连续测试时自适应中的效果,发现空间掩码在补丁标记化架构上积累稳定表示,而频率掩码导致灾难性崩溃,且最优掩码家族取决于架构-任务对齐。

Comments Accepted to TMLR 2026; code at https://github.com/chandlerbing65nm/m2a.git

详情
AI中文摘要

最近的连续测试时自适应(CTTA)方法采用掩码图像建模来稳定分布偏移下的学习,但每种方法都将其掩码家族F视为固定的设计选择,并仅沿着选择策略S进行创新,从而使得家族轴未被充分探索。我们提出了一项系统的实证研究,隔离了这一轴。我们使用一个受控的CTTA实例 Mask to Adapt (M2A),它将S固定为随机选择并使用标准损失;在此基础上我们仅改变F,跨越空间(补丁、像素)和频率(全频带、低频带、高频带)家族,同时保持其他所有组件相同。该研究的贡献在于为我们评估的CTTA设置提取了设计指导:(1)掩码家族决定了自适应是累积有用的结构还是累积错误:在补丁标记化架构上,空间掩码在长流中积累稳定的表示,而频率掩码则灾难性地崩溃。我们通过结构保持解释来表征这种不稳定性,其中空间相干性维持了宽谱冗余,从而避免与图像损坏(corruption)的频谱特征发生致命重叠;(2)最优家族取决于架构-任务对齐:在CNN上,其重叠的感受野稀释了补丁遮挡,家族差距消失,而在具有全局线索的细粒度任务和大容量ViT上,频率掩码变得有竞争力。在存在混杂因素的系统级比较中(基线在损失和辅助组件上也不同),M2A的随机选择与启发式策略表现相当,尽管我们将这一观察视为提示性背景,而非对S相对重要性的受控量化。

英文摘要

Recent continual test-time adaptation (CTTA) methods adopt masked image modeling to stabilize learning under distribution shift, yet each treats its masking family F as a fixed design choice and innovates exclusively along the selection strategy S, leaving the family axis underexplored. We present a systematic empirical study that isolates this axis. Using a controlled CTTA instantiation -- Mask to Adapt (M2A) -- that fixes S = random and standard losses, we vary only F across spatial (patch, pixel) and frequency (all-band, low-band, high-band) families while keeping every other component identical. The study's contributions are the design guidance it extracts for the CTTA settings we evaluated: (1) the masking family determines whether adaptation compounds useful structure or compounds errors -- on patch-tokenized architectures, spatial masking accumulates stable representations over long streams while frequency masking collapses catastrophically. We characterize this instability through a structural-preservation account, where spatial coherence maintains the broad-spectrum redundancy needed to avoid terminally overlapping with a corruption's spectral signature; (2) the optimal family depends on architecture-task alignment -- on CNNs, whose overlapping receptive fields dilute patch occlusion, the family gap vanishes, whereas on fine-grained tasks with global cues and large-capacity ViTs, frequency masking becomes competitive. In confounded system-level comparisons -- where baselines also differ in losses and auxiliary components -- M2A's random selection performs comparably to heuristic strategies, though we treat this observation as suggestive context rather than a controlled quantification of S's relative importance.
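For the spatial patch family with S = random, the masking step can be sketched in a few lines. This is a simplified illustration on a single-channel image stored as nested lists; the patch size, masking ratio, seed handling, and function name are assumptions for the example and are not taken from the M2A code.

```python
import random


def random_patch_mask(image, patch=4, ratio=0.5, seed=0):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    `image` is a list of rows (H x W); H and W must be divisible by `patch`.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    # Top-left corners of every patch on the grid.
    cells = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    rng = random.Random(seed)
    chosen = rng.sample(cells, round(ratio * len(cells)))
    masked = [list(row) for row in image]  # copy so the input is untouched
    for top, left in chosen:
        for r in range(top, top + patch):
            for c in range(left, left + patch):
                masked[r][c] = 0.0
    return masked


image = [[1.0] * 8 for _ in range(8)]
masked = random_patch_mask(image, patch=4, ratio=0.5, seed=0)
```

A frequency-family counterpart would instead zero coefficients in a transform domain, which removes information from every spatial location at once; that contrast is the axis the study isolates.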

2603.00133 2026-06-02 cs.CV cs.AI 版本更新

You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

你不需要所有注意力:文本到图像扩散模型中的外科记忆缓解

Kairan Zhao, Eleni Triantafillou, Peter Triantafillou

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

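AI summary: Proposes GUARD, a framework that adjusts the denoising process through attractive-repulsive dynamics combined with a cross-attention attenuation mechanism, mitigating memorization in text-to-image diffusion models without degrading image quality.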

Comments Accepted at ICML 2026

详情
AI中文摘要

生成模型已被证明会“记忆”某些训练数据,导致生成逐字或近乎逐字的图像,这可能引发隐私问题或版权侵权。我们引入了使用吸引-排斥动力学的引导(GUARD),一种用于文本到图像扩散模型中记忆缓解的新框架。GUARD调整图像去噪过程,引导生成远离原始训练图像,朝向与训练数据不同但仍与提示对齐的图像,防止复制训练数据,同时不损害图像生成质量。我们提出了该框架的一个具体实例,其中我们引导的正向目标由一种新的(交叉)注意力衰减方法给出,该方法基于(i)一种新颖的统计机制,自动识别需要衰减交叉注意力的提示位置,以及(ii)在这些每个提示的位置衰减交叉注意力。由此产生的GUARD提供了一种外科手术式的、动态的、每个提示的推理时方法,我们发现,在两种架构以及逐字和模板记忆方面,它是最稳健的方法,始终产生最先进的记忆缓解结果,同时在图像质量方面也优于或产生可比的结果。

英文摘要

Generative models have been shown to "memorize" certain training data, producing verbatim or near-verbatim copies of training images, which may raise privacy concerns or lead to copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide generation away from an original training image and towards one that is distinct from the training data while remaining aligned with the prompt, guarding against reproduction of training data without hurting image generation quality. We propose a concrete instantiation of this framework, in which the positive target that we steer towards is given by a novel method for cross-attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross-attention must be attenuated and (ii) attenuating cross-attention at these per-prompt positions. The resulting GUARD offers a surgical, dynamic, per-prompt inference-time approach that, we find, is by far the most robust method: it consistently produces state-of-the-art memorization mitigation results across two architectures and for both verbatim and template memorization, while also improving upon or matching prior results in image quality.

2602.23204 2026-06-02 cs.CV cs.RO 版本更新

Motion-aware Event Suppression for Event Cameras

面向事件相机的运动感知事件抑制

Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich, Switzerland(苏黎世大学机器人与感知组,瑞士)

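AI summary: Proposes the first motion-aware event suppression framework, which jointly segments independently moving objects in the current event stream and predicts their future motion to suppress dynamic events before they occur; on the EVIMO benchmark it improves segmentation accuracy by 67% while running at a 53% higher inference rate.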

Comments Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

在这项工作中,我们引入了首个运动感知事件抑制框架,该框架学习实时过滤由独立运动物体和自身运动触发的事件。我们的模型联合分割当前事件流中的独立运动物体,同时预测其未来运动,从而在动态事件发生之前实现预期抑制。我们的轻量级架构在消费级GPU上实现了173 Hz的推理速度,内存使用不到1 GB,在具有挑战性的EVIMO基准上,分割精度比之前最先进的方法提高了67%,同时推理速度提高了53%。此外,我们展示了该方法对下游应用的显著益处:通过令牌剪枝,我们的方法将Vision Transformer推理速度提高了83%,并改进了基于事件的视觉里程计精度,将绝对轨迹误差降低了13%。

英文摘要

In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.

2602.01577 2026-06-02 eess.SP cs.CV 版本更新

Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

基于拉梅曲线LED的可见光定位:一种通用的相机姿态估计方法

Wenxuan Pan, Yang Yang, Dong Wei, Zhiyu Zhu, Jintao Wang, Huan Wu, Yao Nie

发表机构 * Beijing Key Laboratory of Network System Architecture and Convergence, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications(北京网络系统架构与融合重点实验室,信息与通信工程学院,北京邮电大学) Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院) College of Physics and Electronic Engineering, Shanxi University(物理与电子工程学院,山西大学) School of Electronic Information and Artificial Intelligence, West Anhui University(电子信息与人工智能学院,皖西学院)

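AI summary: Proposes LC-VLP, a generic visible light positioning algorithm based on Lamé curve-shaped LEDs, which represents common LED shapes in a unified way and uses the curve parameters in a nonlinear least-squares optimization for accurate camera pose estimation.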

Comments Submitted to an IEEE journal for possible publication

详情
AI中文摘要

基于相机的可见光定位(VLP)是一种有前景的技术,可实现精确且低成本的室内相机姿态估计(CPE)。为减少所需发光二极管(LED)的数量,先进方法通常利用LED形状特征进行定位。尽管有趣,但这些方法通常局限于单一LED几何形状,导致在异构LED形状场景中失效。为应对这一挑战,本文研究拉梅曲线作为常见LED形状的统一表示,并提出一种使用拉梅曲线形状LED的通用VLP算法,称为LC-VLP。在所考虑的系统中,多个天花板安装的拉梅曲线形状LED通过可见光通信定期广播其曲线参数,这些参数由配备相机的接收器捕获。基于接收到的LED图像和曲线参数,接收器可使用LC-VLP估计相机姿态。具体而言,离线构建LED数据库以存储曲线参数,而在线定位则被表述为非线性最小二乘问题并迭代求解。为提供可靠的初始化,进一步开发了一种无需对应点的透视n点(FreePnP)算法,无需任何预校准参考点即可实现近似CPE。通过仿真和实验验证了LC-VLP的性能。仿真表明,在圆形和矩形LED场景中,LC-VLP均优于最先进的方法。与透视弧算法相比,LC-VLP可实现平均位置和旋转误差均降低30%以上。实验进一步表明,LC-VLP可实现小于4厘米的平均位置精度。

英文摘要

Camera-based visible light positioning (VLP) is a promising technique for accurate and low-cost indoor camera pose estimation (CPE). To reduce the number of required light-emitting diodes (LEDs), advanced methods commonly exploit LED shape features for positioning. However, these methods are typically restricted to a single LED geometry and therefore fail in heterogeneous LED-shape scenarios. To address this challenge, this paper investigates Lamé curves as a unified representation of common LED shapes and proposes a generic VLP algorithm using Lamé curve-shaped LEDs, termed LC-VLP. In the considered system, multiple ceiling-mounted Lamé curve-shaped LEDs periodically broadcast their curve parameters via visible light communication, and these signals are captured by a camera-equipped receiver. Based on the received LED images and curve parameters, the receiver can estimate the camera pose using LC-VLP. Specifically, an LED database is constructed offline to store the curve parameters, while online positioning is formulated as a nonlinear least-squares problem and solved iteratively. To provide a reliable initialization, a correspondence-free perspective-n-points (FreePnP) algorithm is further developed, enabling approximate CPE without any pre-calibrated reference points. The performance of LC-VLP is verified by both simulations and experiments. Simulations show that LC-VLP outperforms state-of-the-art methods in both circular- and rectangular-LED scenarios. Compared to a perspective arcs algorithm, LC-VLP reduces both the average position error and the average rotation error by more than 30%. Experiments further show that LC-VLP can achieve an average position error of less than 4 cm.
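The reason a Lamé curve can stand in for several LED shapes is visible from its standard implicit form |x/a|^n + |y/b|^n = 1: n = 2 gives an ellipse (a circle when a = b), and large n approaches a rectangle. The sketch below samples points on such a curve using the usual parametrization. The function names and parameter values are hypothetical and are not taken from the paper.

```python
import math

def lame_point(t, a, b, n):
    """Point on the Lamé curve |x/a|^n + |y/b|^n = 1 at parameter t (radians)."""
    c, s = math.cos(t), math.sin(t)
    x = a * math.copysign(abs(c) ** (2.0 / n), c)
    y = b * math.copysign(abs(s) ** (2.0 / n), s)
    return x, y

def lame_residual(x, y, a, b, n):
    """Zero when (x, y) lies exactly on the curve."""
    return abs(x / a) ** n + abs(y / b) ** n - 1.0

# n = 2 traces a circle of radius 1; n = 50 is close to a 2 x 2 square.
circle_pt = lame_point(math.pi / 4, 1.0, 1.0, 2)
square_pt = lame_point(math.pi / 4, 1.0, 1.0, 50)
```

With a single exponent n acting as the shape parameter, one residual function covers circular and near-rectangular lamps alike. A least-squares pose solver can therefore use the same cost for both.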

2602.20807 2026-06-02 cs.CV cs.RO 版本更新

RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

RU4D-SLAM:面向4D场景重建的高斯溅射SLAM不确定性重加权

Yangfan Zhao, Hanwei Zhang, Ke Huang, Qiufeng Wang, Zhenzhou Shao, Dengyu Wu

发表机构 * Capital Normal University(首都师范大学) Saarland University(萨尔兰大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学) King’s College London(伦敦国王学院)

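AI summary: Proposes RU4D-SLAM, which introduces temporal factors, uncertainty-aware perception, and a semantic-guided reweighting mechanism to address tracking and 4D scene reconstruction for 3D Gaussian splatting SLAM in dynamic environments.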

详情
AI中文摘要

将3D高斯溅射与同时定位与地图构建(SLAM)相结合的方法因其能够在运动过程中实现连续3D环境重建而受到广泛关注。然而,现有方法在动态环境中表现不佳,尤其是移动物体使3D重建复杂化,进而阻碍了可靠的跟踪。4D重建的出现,特别是4D高斯溅射,为解决这些挑战提供了有前景的方向,但其在4D感知SLAM中的潜力尚未得到充分探索。沿着这一方向,我们提出了一种鲁棒且高效的框架,即面向4D场景重建的高斯溅射SLAM不确定性重加权(RU4D-SLAM),该框架将时间因子引入空间3D表示,同时结合了场景变化的不确定性感知、模糊图像合成和动态场景重建。我们通过集成运动模糊渲染增强了动态场景表示,并通过扩展原本为静态场景设计的逐像素不确定性建模来处理模糊图像,从而改进了不确定性感知跟踪。此外,我们提出了一种用于动态场景中逐像素不确定性估计的语义引导重加权机制,并引入可学习的不透明度权重以支持自适应4D映射。在标准基准上的大量实验表明,我们的方法在轨迹精度和4D场景重建方面显著优于最先进的方法,尤其是在存在移动物体和低质量输入的动态环境中。代码地址:https://ru4d-slam.github.io

英文摘要

Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, where moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, Reweighting Uncertainty in Gaussian Splatting SLAM for 4D scene reconstruction (RU4D-SLAM), that introduces temporal factors into the spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code is available at https://ru4d-slam.github.io

2602.19857 2026-06-02 cs.CV 版本更新

Contrastive meta-domain adaptation for robust skin lesion classification across clinical and acquisition conditions

对比元域适应用于跨临床和采集条件的鲁棒皮肤病变分类

Rodrigo Mota, Kelvin Cunha, Emanoel dos Santos, Fábio Papais, Francisco Filho, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

发表机构 * University of São Paulo(圣保罗大学)

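AI summary: Proposes an adaptation strategy built on visual meta-domains that transfers visual representations from large dermoscopic datasets to the clinical image domain, improving the generalization robustness of skin lesion classification.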

Comments 4 pages, 5 figures, 1 table, Published in: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

详情
AI中文摘要

用于皮肤科图像分析的深度学习模型仍然对采集变异性和特定领域的视觉特征敏感,导致在临床部署时性能下降。我们研究了视觉伪影和域偏移如何影响基于深度学习的皮肤病变分类。我们提出了一种基于视觉元域概念的适应策略,将来自较大皮肤镜数据集的视觉表示迁移到临床图像域,从而提高泛化鲁棒性。在多个皮肤科数据集上的实验显示,分类性能持续提升,并且皮肤镜与临床图像之间的差距减小。这些结果强调了域感知训练对于可部署系统的重要性。

英文摘要

Deep learning models for dermatological image analysis remain sensitive to acquisition variability and domain-specific visual characteristics, leading to performance degradation when deployed in clinical settings. We investigate how visual artifacts and domain shifts affect deep learning-based skin lesion classification. We propose an adaptation strategy, grounded in the idea of visual meta-domains, that transfers visual representations from larger dermoscopic datasets into clinical image domains, thereby improving generalization robustness. Experiments across multiple dermatology datasets show consistent gains in classification performance and reduced gaps between dermoscopic and clinical images. These results emphasize the importance of domain-aware training for deployable systems.

2602.19848 2026-06-02 cs.CV 版本更新

DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation

DerMAE: 通过条件潜在扩散和MAE蒸馏改进皮肤病变分类

Francisco Filho, Kelvin Cunha, Fábio Papais, Emanoel dos Santos, Rodrigo Mota, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

发表机构 * Universidade Federal de Pernambuco(伯南布哥联邦大学)

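AI summary: Addresses the class imbalance caused by scarce malignant samples in skin lesion classification by generating synthetic images with a class-conditioned diffusion model, learning robust features through self-supervised MAE pretraining, and distilling the large model into a lightweight ViT student, improving classification performance while enabling efficient on-device inference.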

Comments 4 pages, 2 figures, 1 table, Published in: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

详情
AI中文摘要

皮肤病变分类数据集通常存在严重的类别不平衡问题,恶性病例显著不足,导致深度学习训练过程中决策边界有偏。我们使用类别条件扩散模型生成合成皮肤镜图像,然后通过自监督MAE预训练使大型ViT模型学习鲁棒的领域相关特征。为了支持需要轻量级模型的实际临床环境部署,我们应用知识蒸馏将这些表示迁移到适合移动设备的较小ViT学生模型。我们的结果表明,在合成数据上进行MAE预训练结合蒸馏,能够提高分类性能,同时实现高效的设备端推理,适用于实际临床应用。

英文摘要

Skin lesion classification datasets often suffer from severe class imbalance, with malignant cases significantly underrepresented, leading to biased decision boundaries during deep learning training. We address this challenge using class-conditioned diffusion models to generate synthetic dermatological images, followed by self-supervised MAE pretraining to enable huge ViT models to learn robust, domain-relevant features. To support deployment in practical clinical settings, where lightweight models are required, we apply knowledge distillation to transfer these representations to a smaller ViT student suitable for mobile devices. Our results show that MAE pretraining on synthetic data, combined with distillation, improves classification performance while enabling efficient on-device inference for practical clinical use.

2602.13430 2026-06-02 cs.CV 版本更新

Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning

处理胸部X光分类中的监督稀缺性:长尾与零样本学习

Ha-Hieu Pham, Hai-Dang Nguyen, Thanh-Huy Nguyen, Min Xu, Ulas Bagci, Trung-Nghia Le, Huy-Hieu Pham

发表机构 * University of Technology, Vietnam(越南技术大学) National University of Singapore(新加坡国立大学) University of California, San Diego(加州大学圣地亚哥分校) University of Texas at Austin(德克萨斯大学奥斯汀分校)

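AI summary: For chest X-ray classification with an extreme long-tailed multi-label distribution and missing annotations for rare or unseen findings, proposes imbalance-aware multi-label learning (Task 1) and a zero-shot prediction method that requires no supervised labels (Task 2), achieving leading performance in the CXR-LT 2026 challenge.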

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)
AI中文摘要

临床实践中的胸部X光(CXR)分类常受限于不完美的监督,这源于(i)极端长尾多标签疾病分布和(ii)罕见或先前未见发现的缺失标注。CXR-LT 2026挑战赛在基于PadChest的基准上解决这些问题,其标签空间包含36个类别,分为30个训练集内分布类别和6个用于零样本评估的集外分布(OOD)类别。我们提出了针对不同监督机制的任务特定解决方案。对于任务1(长尾多标签分类),我们采用不平衡感知的多标签学习策略,以提高尾类别的识别能力,同时保持对常见发现的稳定性能。对于任务2(零样本OOD识别),我们提出了一种预测方法,在训练期间不使用任何来自OOD类别的监督标签或示例的情况下,为未见疾病类别生成分数。通过宏平均平均精度(mAP)评估,我们的方法在两个任务上均取得了强劲性能,在开发阶段的公开排行榜上排名第一。代码和预训练模型可在https://github.com/hieuphamha19/CXR_LT获取。

英文摘要

Chest X-Ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at https://github.com/hieuphamha19/CXR_LT.
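The metric named in this abstract, macro-averaged mAP, weights every class equally, which is why it is sensitive to tail classes. The sketch below computes it in plain Python using one common definition of average precision (the mean of the precision values at each positive, ranked by score). The challenge's official scorer may differ in details such as tie handling, so treat this as an illustrative assumption rather than the evaluation code.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0.0 in this sketch.
    return sum(precisions) / len(precisions) if precisions else 0.0

def macro_map(scores_by_class, labels_by_class):
    """Unweighted mean of per-class AP, so rare classes count as much as common ones."""
    aps = [average_precision(s, l) for s, l in zip(scores_by_class, labels_by_class)]
    return sum(aps) / len(aps)

# Two classes, hypothetical scores and binary labels.
example = macro_map([[0.9, 0.8, 0.1], [0.2, 0.7]], [[1, 0, 1], [0, 1]])
```

Because the per-class APs are averaged without weighting by class frequency, a model that ignores the tail classes is penalized even if it does well on frequent findings.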

2602.15278 2026-06-02 cs.CV cs.AI 版本更新

Visual Persuasion: What Influences Decisions of Vision-Language Models?

视觉说服:什么影响了视觉语言模型的决策?

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

发表机构 * Massachusetts Institute of Technology(麻省理工学院) MIT Media Lab(MIT媒体实验室)

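AI summary: Proposes a framework that places vision-language models in controlled image-choice tasks, systematically perturbs their inputs, and uses visual prompt optimization to infer their latent visual utility, revealing the visual preferences that influence model decisions.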

Comments Accepted to ICML 2026

详情
AI中文摘要

网络上充斥着图像,这些图像最初是为人类消费而创建的,现在越来越多地被使用视觉语言模型(VLM)的智能体解释。这些智能体大规模地做出视觉决策,决定点击、推荐或购买什么。然而,我们对它们视觉偏好的结构知之甚少。我们引入了一个框架来研究这一点,通过将VLM置于受控的基于图像的选择任务中,并系统地扰动它们的输入。我们的关键思想是将智能体的决策函数视为一种潜在的视觉效用,可以通过揭示偏好来推断:在系统编辑的图像之间进行选择。从常见图像(如产品照片)开始,我们提出了视觉提示优化的方法,将文本优化方法适应为使用图像生成模型(例如在构图、光照或背景方面)迭代地提出并应用视觉上合理的修改。然后,我们评估哪些编辑增加了选择概率。通过对前沿VLM的大规模实验,我们证明了优化后的编辑在直接比较中显著改变了选择概率。我们开发了一个自动可解释性管道来解释这些偏好,识别出驱动选择的一致视觉主题。我们认为,这种方法提供了一种实用且高效的方式来揭示视觉漏洞和安全问题,否则这些问题可能会在现实世界中隐含地发现,从而支持对基于图像的AI智能体进行更主动的审计和治理。

英文摘要

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.

2602.14134 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DenseMLLM: Standard Multimodal LLMs for Dense Prediction

DenseMLLM:用于密集预测的标准多模态大语言模型

Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding, Yinsong Liu, Deqiang Jiang, Xing Sun, Xiaomeng Li

发表机构 * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学电子与计算机工程系) Tencent, Youtu-Lab, China(腾讯优图实验室)

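AI summary: Proposes DenseMLLM, which uses a standard multimodal LLM architecture and a vision token supervision strategy to perform dense prediction tasks such as semantic segmentation and depth estimation without task-specific decoders, achieving competitive performance on multiple benchmarks.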

Comments ICML 2026

详情
AI中文摘要

多模态大语言模型在高层次视觉理解方面展现出卓越能力。然而,将这些模型扩展到细粒度的密集预测任务(如语义分割和深度估计)通常需要引入复杂的任务特定解码器和其他定制化组件。这种架构碎片化增加了模型复杂度,偏离了多模态大语言模型的通用设计,最终限制了其实用性。在这项工作中,我们挑战了这一范式,通过调整标准多模态大语言模型来执行密集预测,无需额外的任务特定解码器。所提出的模型称为DenseMLLM,基于标准架构,并采用一种新颖的视觉令牌监督策略来处理多个标签和任务。尽管设计极简,我们的模型在广泛的密集预测和视觉语言基准测试中取得了极具竞争力的性能,表明标准的通用多模态大语言模型可以在没有架构专门化的情况下有效支持密集感知。该项目可在github.com/Eli-YiLi/DenseMLLM获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization. This project is available at github.com/Eli-YiLi/DenseMLLM.

2602.13602 2026-06-02 cs.CV cs.LG 版本更新

Towards Sparse Video Understanding and Reasoning

迈向稀疏视频理解与推理

Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

发表机构 * Northwestern University(西北大学) Johns Hopkins University(约翰霍普金斯大学) Dolby Laboratories(杜比实验室)

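AI summary: Proposes a multi-round video question answering agent that uses sparse frame selection, a summary-as-state, and early stopping to improve accuracy while reducing the number of frames and tokens.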

Comments Accepted to CVPR 2026. Project page: https://sparsevideounderstanding.github.io

详情
AI中文摘要

我们提出 REVISE (Reasoning with Video Sparsity),一种用于视频问答 (VQA) 的多轮代理。与均匀采样帧不同,REVISE 选择一小部分信息丰富的帧,跨轮维护摘要作为状态,并在置信时提前停止。它支持专有视觉语言模型 (VLM) 的“即插即用”设置,并允许对开源模型进行强化微调。对于微调,我们引入 EAGER (Evidence-Adjusted Gain for Efficient Reasoning),一种无注释奖励,包含三项:(1) 置信增益:添加新帧后,奖励正确选项与最强替代选项之间对数几率差距的增加;(2) 摘要充分性:在回答时仅使用最后提交的摘要重新提问,并奖励成功;(3) 正确且早期停止:在较小的轮次预算内正确回答即获得奖励。在多个 VQA 基准上,REVISE 在减少帧数、轮数和提示令牌数的同时提高了准确率,展示了实用的稀疏视频推理。

英文摘要

We present REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, REVISE selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, REVISE improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
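The first EAGER term can be illustrated with a small calculation. The sketch below assumes that the "log-odds gap" is the difference between the log-probability of the correct option and that of the strongest competing option, and that the reward is the change in this gap after new frames are added. Both the definition and the function names are assumptions made for illustration; the abstract does not give the exact formula, nor whether negative gains are clipped.

```python
import math

def log_odds_gap(probs, correct):
    """log p(correct option) minus log p(strongest alternative option)."""
    best_other = max(p for i, p in enumerate(probs) if i != correct)
    return math.log(probs[correct]) - math.log(best_other)

def confidence_gain(probs_before, probs_after, correct):
    """Change in the gap after new frames are added (positive means more confident)."""
    return log_odds_gap(probs_after, correct) - log_odds_gap(probs_before, correct)

# Hypothetical option probabilities before and after one round of new frames.
gain = confidence_gain([0.4, 0.4, 0.2], [0.7, 0.2, 0.1], correct=0)
```

In this example the model starts tied between the correct option and a distractor (gap 0) and ends with a gap of log(0.7 / 0.2), so the gain is positive. Under this reading, a round of frames that leaves the distribution unchanged earns nothing.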

2602.11554 2026-06-02 cs.RO cs.CV cs.LG 版本更新

HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds

HyperDet: 基于超4D雷达点云的3D目标检测

Yichun Xiao, Runwei Guan, Jin Jin, Fangqiang Ding

发表机构 * University of Edinburgh(爱丁堡大学) HKUST (GZ)(香港科技大学(广州)) University of Oxford(牛津大学) MIT(麻省理工学院)

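AI summary: Proposes HyperDet, a detector-agnostic framework that builds task-aware hyper 4D radar point clouds through spatio-temporal accumulation, cross-sensor validation, Doppler-guided motion compensation, and foreground generative enhancement, improving radar-only 3D object detection.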

Comments 11 pages, 3 figures, 3 tables

详情
AI中文摘要

仅使用4D雷达进行3D目标检测能达到什么程度?尽管现代4D雷达为自主感知提供了鲁棒天气和速度感知能力,但其点云仍然稀疏、嘈杂且不稳定,限制了仅用雷达的3D检测。我们提出HyperDet,一种与检测器无关的框架,在检测前构建任务感知的超4D雷达点云。HyperDet首先通过时空累积、跨传感器验证和多普勒引导的运动补偿来细化短窗口环视雷达观测,提高返回可靠性和时间一致性。然后,它利用仅在训练时可用的激光雷达引导的伪雷达监督进行前景生成增强,在保留测量雷达背景和雷达原生属性的同时丰富目标几何。在检测器训练期间,雷达感知的目标级增强进一步在几何重定位下保持多普勒一致性。在推理时,HyperDet仅需雷达输入,可直接与标准3D检测器配合使用。在两个公开的环视4D雷达数据集上的实验表明,与原始雷达输入相比,在标准3D检测器上均取得一致改进,验证了输入级雷达增强作为仅用雷达3D检测的有效方法。

英文摘要

How far can 3D object detection go using 4D radar alone? Despite offering weather-robust and velocity-aware sensing for autonomous perception, modern 4D radar still yields sparse, noisy, and unstable point clouds, limiting radar-only 3D detection. We present HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds before detection. HyperDet first refines short-window surround-view radar observations through spatio-temporal accumulation, cross-sensor validation, and Doppler-guided motion compensation, improving return reliability and temporal coherence. It then performs foreground generative enhancement using LiDAR-guided pseudo-radar supervision available only during training, enriching object geometry while preserving measured radar background and radar-native attributes. During detector training, radar-aware object-level augmentation further preserves Doppler consistency under geometric relocation. At inference time, HyperDet requires radar input alone and can be directly paired with standard 3D detectors. Experiments on two public surround-view 4D radar datasets demonstrate consistent improvements over raw radar inputs across standard 3D detectors, validating input-level radar enhancement as an effective approach to radar-only 3D detection.

2602.12819 2026-06-02 cs.IR cs.CV 版本更新

WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

WISE:一种用于视觉场景、音频、物体、人脸、语音和元数据的多模态搜索引擎

Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta

发表机构 * Department of Engineering Science, University of Oxford(牛津大学工程科学系)

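AI summary: Presents WISE, an open-source multimodal search engine that integrates scene-level and object-level natural-language and reverse-image queries, face search, audio event retrieval, search over transcribed speech, and metadata filtering; it supports combining queries across modalities, scales through vector search, and can be deployed locally.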

Comments Software: https://www.robots.ox.ac.uk/~vgg/software/wise/ , Online demos: https://www.robots.ox.ac.uk/~vgg/software/wise/demo/ , Example Queries: https://www.robots.ox.ac.uk/~vgg/software/wise/examples/

详情
Journal ref
International ACM SIGIR Conference on Research and Development in Information Retrieval (2026)
AI中文摘要

在本文中,我们提出WISE,一个开源视听搜索引擎,它将多种多模态检索能力集成到一个单一、实用的工具中,无需机器学习专业知识即可使用。WISE支持图像和视频的场景级(例如空街道)和物体级(例如马)的自然语言和反向图像查询;基于人脸的特定个体搜索;使用文本(例如木头吱吱声)或音频文件的声学事件音频检索;自动转录语音的搜索;以及按用户提供的元数据进行过滤。通过跨模态组合查询可以获得丰富的洞察——例如,通过应用物体查询“火车”和元数据查询“德国”从历史档案中检索德国火车,或在一个地方搜索人脸。通过采用向量搜索技术,WISE可以扩展到支持对数百万张图像或数千小时视频的高效检索。其模块化架构便于集成新模型。WISE可以本地部署用于私有或敏感集合,并已应用于各种实际用例。我们的代码是开源的,可在https://gitlab.com/vgg/wise/wise获取。

英文摘要

In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities -- for example, retrieving German trains from a historical archive by applying the object query "train" and the metadata query "Germany", or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at https://gitlab.com/vgg/wise/wise.
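The retrieval step that the abstract attributes to vector search can be sketched as nearest-neighbour lookup over embeddings. The example below is a brute-force cosine-similarity search in plain Python over a toy index. The item names and two-dimensional vectors are hypothetical, and this is not WISE's implementation; a system operating over millions of images would typically use an approximate nearest-neighbour index rather than an exhaustive scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, index, k=2):
    """Return the ids of the k indexed vectors most similar to the query."""
    scored = [(cosine(query, vec), item_id) for item_id, vec in index.items()]
    return [item_id for _, item_id in sorted(scored, reverse=True)[:k]]

# Toy index: in practice the vectors would come from an embedding model.
index = {"train": [1.0, 0.0], "horse": [0.0, 1.0], "tram": [0.9, 0.1]}
results = top_k([1.0, 0.0], index, k=2)
```

Because text, image, and audio queries can all be embedded into vectors, the same lookup serves every modality. A metadata filter such as "Germany" can then be applied to the ranked ids as a separate step.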

2602.09153 2026-06-02 cs.RO cs.AI cs.CV cs.GR 版本更新

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

SceneSmith: 面向仿真就绪室内场景的智能体生成

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Harvard University(哈佛大学)

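AI summary: Proposes SceneSmith, a hierarchical agentic framework in which VLM agents collaborate to generate simulation-ready indoor scenes from natural language, producing 3-6x more objects than prior methods with an inter-object collision rate below 2%.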

Comments ICML 2026 Spotlight; Project page: https://scenesmith.github.io/

详情
AI中文摘要

仿真已成为大规模训练和评估家用机器人的关键工具,但现有环境未能捕捉真实室内空间的多样性和物理复杂性。当前的场景合成方法生成的房间稀疏布置,缺乏机器人操作所必需的密集杂乱、铰接式家具和物理属性。我们提出了SceneSmith,一个层次化智能体框架,能够从自然语言提示生成仿真就绪的室内环境。SceneSmith通过连续阶段构建场景——从建筑布局到家具放置再到小物体填充——每个阶段都实现为VLM智能体(设计师、评论家和编排者)之间的交互。该框架通过文本到3D合成生成静态物体、数据集检索获取铰接式物体以及物理属性估计,紧密集成了资产生成。SceneSmith生成的物体数量是先前方法的3-6倍,物体间碰撞率低于2%,且96%的物体在物理仿真下保持稳定。在205名参与者参与的用户研究中,与基线相比,平均真实感胜率达到92%,平均提示忠实度胜率达到91%。我们进一步证明了这些环境可用于端到端的自动机器人策略评估流程。

英文摘要

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages (from architectural layout to furniture placement to small object population), each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

2602.08236 2026-06-02 cs.CV cs.AI cs.CL 版本更新

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

何时想象以及想象多少:基于世界模型的自适应测试时缩放用于视觉空间推理

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

发表机构 * University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校) Nanyang Technological University(南洋理工大学)

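AI summary: Proposes AVIC and AVIC-R, adaptive test-time frameworks that selectively invoke and scale visual imagination with world models, balancing accuracy and efficiency in spatial reasoning and surpassing baselines such as GPT-4o.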

Comments The first two authors contributed equally. Project page: https://adaptive-visual-tts.github.io/

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)取得了快速进展,但当正确答案取决于场景在未见或替代视角下的外观时,视觉空间推理仍然不可靠。最近的工作通过使用世界模型进行视觉想象来增强推理,但诸如想象何时真正必要、多少想象有益、以及何时想象有害等问题仍知之甚少。在实践中,无差别的想象可能会增加计算量,甚至通过引入误导性证据而降低性能。在这项工作中,我们深入分析了作为空间推理可控资源的测试时视觉想象。我们首先研究静态视觉证据何时足够,想象何时改进推理,以及过度或不必要的想象如何影响准确性和效率。为了支持这一分析,我们随后引入了AVIC,一个基于世界模型的自适应测试时框架,该框架在选择性调用和缩放视觉想象之前,明确推理当前视觉证据的充分性。最后,为了进一步学习这种门控和规划行为,而无需任何关于何时想象以及想象多少的标注,我们引入了AVIC-R,它通过来自QA正确性奖励和想象成本惩罚的GRPO来训练策略。在空间推理基准(SAT, MMSI)和具身导航基准(R2R)上,我们的结果揭示了想象至关重要、边际或有害的明确场景,并表明选择性控制可以匹配或超越固定想象策略,同时大幅减少世界模型调用和语言标记。我们的AVIC-R超越了包括GPT-4o和GPT-4.1在内的强大专有基线,同时调用世界模型的频率更低。总体而言,我们的发现强调了分析和控制测试时想象对于高效可靠的空间推理的重要性。

英文摘要

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We first study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we then introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Finally, to further learn this gating and planning behavior without any annotation of when and how much to imagine, we introduce AVIC-R, which trains the policy via GRPO from QA-correctness rewards and penalties by imagination cost. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Our AVIC-R surpasses strong proprietary baselines including GPT-4o and GPT-4.1 while invoking the world model less often. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

2602.07955 2026-06-02 cs.CV 版本更新

One-Shot Crowd Counting With Density Guidance For Scene Adaptation

基于密度引导的单次场景自适应人群计数

Jiwei Chen, Qi Wang, Junyu Gao, Jing Zhang, Dingyi Li, Jing-Jia Luo

发表机构 * Jiangsu Key Laboratory of Intelligent Weather Forecasting and Applications Based on Big Data(江苏大数据智能天气预报与应用重点实验室) State Key Laboratory of Climate System Prediction and Risk Management (CPRM)(气候系统预测与风险管理国家重点实验室) ICAR/CIC-FEMD/KLME/ILCEC Nanjing University of Information Science and Technology(南京信息工程大学) School of Artificial Intelligence, OPtics and ElectroNics, Northwestern Polytechnical University(人工智能、光学与电子学学院,西北工业大学) School of Computer Science, Wuhan University(武汉大学计算机学院) School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院)

AI总结 提出利用局部和全局密度特征引导模型适应未见过的监控场景,通过多局部密度学习器学习支持场景中的多原型密度分布,并编码局部密度相似性矩阵进行局部引导,同时提取全局密度特征进行全局引导,在三个监控数据集上优于现有方法。

详情
AI中文摘要

不同位置摄像头拍摄的人群场景差异很大,现有的人群模型对未见过的监控场景泛化能力有限。为了提高模型的泛化能力,我们将不同的监控场景视为不同的类别场景,并引入小样本学习,使模型适应属于给定示例类别场景的未见过的监控场景。为此,我们提出利用局部和全局密度特征来引导模型对未见过的监控场景进行人群计数。具体来说,为了使模型能够适应目标场景中不同的密度变化,我们提出了多局部密度学习器来学习多个原型,这些原型代表支持场景中的不同密度分布。随后,对这些多局部密度相似性矩阵进行编码,并以局部方式利用它们来引导模型。为了进一步适应目标场景中的全局密度,从支持图像中提取全局密度特征,然后以全局方式用于引导模型。在三个监控数据集上的实验表明,所提出的方法能够适应未见过的监控场景,并在小样本人群计数中优于最近的最先进方法。

英文摘要

Crowd scenes captured by cameras at different locations vary greatly, and existing crowd counting models generalize poorly to unseen surveillance scenes. To improve generalization, we regard different surveillance scenes as different scene categories and introduce few-shot learning so that the model adapts to an unseen surveillance scene belonging to the category of a given exemplar scene. To this end, we propose to leverage local and global density characteristics to guide crowd counting in unseen surveillance scenes. Specifically, to enable the model to adapt to the varying densities in the target scene, we propose a multiple local density learner that learns multiple prototypes representing different density distributions in the support scene. The resulting local density similarity matrices are then encoded and used to guide the model in a local way. To further adapt to the global density of the target scene, global density features are extracted from the support image and used to guide the model in a global way. Experiments on three surveillance datasets show that the proposed method adapts to unseen surveillance scenes and outperforms recent state-of-the-art methods in few-shot crowd counting.

2602.06442 2026-06-02 cs.CV 版本更新

ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

ChatUMM: 面向对话式交错生成的鲁棒上下文追踪

Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jianwei Zhang, Linqing Wang, Shiyi Zhang, Yunlong Lin, Runze He, Fellix Song, Wayne Zhuang, Yong Liu, Haoji Zhang, Yansong Tang, Chunyu Wang

发表机构 * Tsinghua University(清华大学) Tencent Hunyuan, Project lead(腾讯混元,项目负责人)

AI总结 提出ChatUMM,一种通过交错多轮训练策略和系统化对话数据合成流水线实现鲁棒上下文追踪的对话式统一多模态模型,在视觉理解和指令引导编辑基准上达到开源模型最优性能。

Comments ChatUMM Project

详情
AI中文摘要

统一多模态模型(UMMs)取得了显著进展,但仍受限于单轮交互范式,实际上作为独立请求的求解器而非连续对话中的助手。为弥合这一差距,我们提出了ChatUMM。作为一种对话式统一模型,它擅长鲁棒的上下文追踪以维持交错多模态生成。ChatUMM的能力源于两个关键创新:一种交错多轮训练策略,将序列化的文本-图像流建模为连续的对话流;以及一个系统化的对话数据合成流水线。该流水线通过三个渐进阶段将多样化的标准单轮数据集转化为流畅的对话:构建基本有状态对话,通过带有历史依赖查询重写的“干扰”轮次强制执行长程依赖解析,以及合成自然交错的多模态响应。大量评估表明,ChatUMM在视觉理解和指令引导编辑基准上达到了开源统一模型中的最先进性能,同时在文本到图像生成中保持了有竞争力的保真度。值得注意的是,ChatUMM在复杂多轮场景中展现出卓越的鲁棒性,确保了流畅、上下文感知的对话。

英文摘要

Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via "distractor" turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.

2602.06136 2026-06-02 cs.LG cs.CV 版本更新

Tempora: Characterising the Time-Contingent Utility of Online Test-Time Adaptation

Tempora: 表征在线测试时适应的时间条件效用

Sudarshan Sreeram, Young D. Kwon, Cecilia Mascolo

发表机构 * University of Bristol(布里斯托大学)

AI总结 提出Tempora框架,通过时间场景、评估协议和时间条件效用指标,系统评估测试时适应方法在延迟约束下的准确性-延迟权衡,揭示传统排名在时间压力下失效。

Comments Accepted to ICML 2026

详情
AI中文摘要

测试时适应(TTA)为在域偏移下性能下降的机器学习模型提供了一种引人注目的补救措施,仅使用未标记样本即可即时改进泛化能力。这种灵活性适合实际部署,但传统评估不切实际地假设无限处理时间,忽略了准确性-延迟权衡。随着机器学习越来越多地支撑延迟敏感和面向用户的应用,时间压力限制了可适应推理的可行性;到达太晚而无法采取行动的预测是徒劳的。我们引入了Tempora,一个在这种压力下评估TTA的框架。它由模拟部署约束的时间场景、实现测量的评估协议以及量化准确性-延迟权衡的时间条件效用指标组成。我们用三个这样的指标实例化该框架:(1)用于具有硬截止时间的异步流的离散效用,(2)用于价值随延迟衰减的交互式设置的连续效用,以及(3)用于预算受限部署的摊销效用。通过将Tempora应用于11种TTA方法,我们发现排名不稳定性在跨越不同数据集、模型和硬件平台的750多次时间评估中持续存在;即,传统排名不能预测时间压力下的排名。最高效用方法随偏移和时间压力而变化,没有明确的赢家。通过首次实现跨不同时间约束的系统评估,Tempora揭示了排名何时以及为何变化,为从业者提供了方法选择的视角,为研究人员提供了可部署适应的目标。代码:https://github.com/sudotensor/tempora。

英文摘要

Test-time adaptation (TTA) offers a compelling remedy for machine learning (ML) models that degrade under domain shifts, improving generalisation on-the-fly with only unlabelled samples. This flexibility suits real deployments, yet conventional evaluations unrealistically assume unbounded processing time, overlooking the accuracy-latency trade-off. As ML increasingly underpins latency-sensitive and user-facing use-cases, temporal pressure constrains the viability of adaptable inference; predictions arriving too late to act on are futile. We introduce Tempora, a framework for evaluating TTA under this pressure. It consists of temporal scenarios that model deployment constraints, evaluation protocols that operationalise measurement, and time-contingent utility metrics that quantify the accuracy-latency trade-off. We instantiate the framework with three such metrics: (1) discrete utility for asynchronous streams with hard deadlines, (2) continuous utility for interactive settings where value decays with latency, and (3) amortised utility for budget-constrained deployments. By applying Tempora to 11 TTA methods, we find that rank instability persists across 750+ temporal evaluations spanning diverse datasets, models, and hardware platforms; i.e., conventional rankings do not predict rankings under temporal pressure. The highest-utility method varies with the shift and temporal pressure, with no clear winner. By enabling systematic evaluation across diverse temporal constraints for the first time, Tempora reveals when and why rankings change, offering practitioners a lens for method selection and researchers a target for deployable adaptation. Code: https://github.com/sudotensor/tempora.
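To make the first two utility notions concrete, here is a minimal sketch. The exact Tempora definitions are not given in the abstract, so the formulas below (a hard-deadline hit rate and a half-life decay) and the names `discrete_utility`, `continuous_utility`, `deadline`, and `half_life` are illustrative assumptions, not the paper's metrics.

```python
def discrete_utility(correct, latencies, deadline):
    """Fraction of predictions that are correct AND arrive before a hard deadline."""
    hits = [c and (t <= deadline) for c, t in zip(correct, latencies)]
    return sum(hits) / len(hits)

def continuous_utility(correct, latencies, half_life):
    """Mean value of predictions, where a correct one loses half its value every `half_life` seconds."""
    values = [(0.5 ** (t / half_life)) if c else 0.0 for c, t in zip(correct, latencies)]
    return sum(values) / len(values)

# Method A is more accurate but slow; method B is less accurate but fast (latencies in seconds).
correct_a, latency_a = [True, True, True, False], [0.30, 0.30, 0.30, 0.30]
correct_b, latency_b = [True, True, False, False], [0.05, 0.05, 0.05, 0.05]

accuracy_a = sum(correct_a) / len(correct_a)
accuracy_b = sum(correct_b) / len(correct_b)
discrete_a = discrete_utility(correct_a, latency_a, deadline=0.1)
discrete_b = discrete_utility(correct_b, latency_b, deadline=0.1)
continuous_a = continuous_utility(correct_a, latency_a, half_life=0.1)
continuous_b = continuous_utility(correct_b, latency_b, half_life=0.1)
```

In this toy setting, method A wins on plain accuracy but loses on both time-contingent utilities, which is the kind of rank flip the abstract reports.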

2602.05951 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

更好的源,更好的流:学习条件依赖的源分布用于流匹配

Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim

发表机构 * New York University(纽约大学) KAIST AI(韩国科学技术院人工智能实验室)

AI总结 本文提出在流匹配框架中学习条件依赖的源分布,通过方差正则化和源-目标方向对齐,显著提升文本到图像生成的速度和质量。

Comments Project Page: https://junwankimm.github.io/CSFM

详情
AI中文摘要

流匹配最近已成为基于扩散的生成模型的有前途的替代方案,特别是在文本到图像生成方面。尽管它在允许任意源分布方面具有灵活性,但大多数现有方法依赖于标准高斯分布(这是从扩散模型继承的选择),并且很少在这种设置中将源分布本身视为优化目标。在这项工作中,我们表明源分布的原则性设计不仅是可行的,而且在现代文本到图像系统的规模上也是有益的。具体来说,我们提出在流匹配目标下学习条件依赖的源分布,以更好地利用丰富的条件信号。我们识别了将条件直接纳入源时出现的关键失败模式,包括分布坍缩和不稳定性,并表明适当的方差正则化以及源和目标之间的方向对齐对于稳定和有效的学习至关重要。我们进一步分析了目标表示空间的选择如何影响具有结构化源的流匹配,揭示了这种设计最有效的场景。在多个文本到图像基准上的大量实验表明了一致且稳健的改进,包括FID收敛速度提高多达3倍,突出了原则性源分布设计对条件流匹配的实际好处。

英文摘要

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective to better exploit rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

2602.05435 2026-06-02 cs.CV 版本更新

Stable Velocity: A Variance Perspective on Flow Matching

稳定速度:流匹配的方差视角

Donglin Yang, Yongxing Zhang, Xin Yu, Liang Hou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Renjie Liao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对流匹配中单样本条件速度导致的高方差训练目标,提出稳定速度框架,通过方差表征识别高低方差区域,并引入无偏方差缩减目标(StableVM)、方差感知表示对齐(VA-REPA)以及免微调加速采样(StableVS),在多个大规模模型上实现训练效率提升和超过2倍采样加速。

Comments ICML 2026

详情
AI中文摘要

虽然流匹配优雅,但其对单样本条件速度的依赖导致高方差训练目标,从而破坏优化稳定性并减慢收敛速度。通过显式表征这一方差,我们识别出:1) 先验附近的高方差区域,优化困难;2) 数据分布附近的低方差区域,条件速度与边际速度几乎一致。基于这一洞察,我们提出稳定速度(Stable Velocity),一个统一框架,改进了训练和采样。对于训练,我们引入稳定速度匹配(StableVM),一个无偏方差缩减目标,以及方差感知表示对齐(VA-REPA),在低方差区域自适应增强辅助监督。对于推理,我们展示了低方差区域中的动力学允许闭式简化,从而实现稳定速度采样(StableVS),一种免微调加速。在ImageNet $256\times256$以及大型预训练文本到图像和文本到视频模型(包括SD3.5、Flux、Qwen-Image和Wan2.2)上的大量实验表明,训练效率持续提升,并且在低方差区域内采样速度提升超过2倍,同时不降低样本质量。我们的代码可在https://github.com/linYDTHU/StableVelocity获取。

英文摘要

While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
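The two variance regimes can be checked numerically on a toy problem. The sketch below is not the authors' code: it assumes the common linear path x_t = (1 - t) x0 + t x1 with a standard Gaussian source and a four-point dataset, and the function names are hypothetical. For a given x_t it computes the posterior over data points and the posterior variance of the conditional velocity v = x1 - x0 = (x1 - x_t) / (1 - t).

```python
import math
import random

def conditional_velocity_variance(x_t, t, data):
    """Posterior variance of the conditional velocity at (x_t, t).

    Assumes x0 ~ N(0, I) and x1 uniform over `data`, so p(x_t | x1) = N(t * x1, (1 - t)^2 I).
    """
    sigma = 1.0 - t
    # Unnormalized log posterior weight of each data point given x_t.
    logw = [-sum((a - t * b) ** 2 for a, b in zip(x_t, x1)) / (2 * sigma ** 2) for x1 in data]
    m = max(logw)
    w = [math.exp(l - m) for l in logw]
    z = sum(w)
    w = [wi / z for wi in w]
    # Conditional velocity towards each candidate endpoint x1.
    vels = [[(b - a) / sigma for a, b in zip(x_t, x1)] for x1 in data]
    dim = len(x_t)
    # The posterior mean of the conditional velocities is the marginal velocity.
    mean = [sum(wi * v[d] for wi, v in zip(w, vels)) for d in range(dim)]
    return sum(wi * sum((v[d] - mean[d]) ** 2 for d in range(dim)) for wi, v in zip(w, vels))

def average_variance(t, data, n_samples=200, seed=0):
    """Average the variance over points x_t sampled from the interpolation path at time t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x1 = rng.choice(data)
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]
        x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
        total += conditional_velocity_variance(x_t, t, data)
    return total / n_samples

data = [(2.0, 2.0), (2.0, -2.0), (-2.0, 2.0), (-2.0, -2.0)]
var_near_prior = average_variance(0.05, data)  # t close to 0: near the Gaussian prior
var_near_data = average_variance(0.95, data)   # t close to 1: near the data
```

Near the prior, many data points remain plausible endpoints for the same x_t, so single-sample targets disagree. Near the data, the posterior collapses onto one point and the conditional and marginal velocities nearly coincide.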

2504.15371 2026-06-02 cs.CV cs.NE 版本更新

Event2Vec: Processing Neuromorphic Events Directly by Representations in Vector Space

Event2Vec: 通过向量空间表示直接处理神经形态事件

Wei Fang, Priyadarshini Panda

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Event2Vec表示法,使Transformer能直接处理稀疏异步事件数据,在多个基准上实现高精度、低延迟和高吞吐量。

Comments Accepted at ICML 2026

详情
AI中文摘要

与传统相机相比,神经形态事件相机具有优越的时间分辨率、能效和动态范围。然而,它们的异步和稀疏数据格式对传统深度学习方法构成了重大挑战。大多数现有方法要么将事件密集化为帧,牺牲其稀疏异步特性,要么使用与GPU加速兼容性较差的非规则模型。受词到向量模型的启发,我们提出了event2vec,一种新颖的表示法,使Transformer能够直接处理事件。我们在DVS Gesture、ASL-DVS和DVS-Lip基准上展示了event2vec的有效性,表明event2vec具有显著的参数效率、高吞吐量和低延迟,即使在极低事件数或低空间分辨率下也能实现高精度。这些结果表明,稀疏异步事件数据可以直接集成到高吞吐量Transformer架构中,为实时神经形态视觉提供了一种高效的范式。代码可在https://github.com/Intelligent-Computing-Lab-Panda/event2vec获取。

英文摘要

Neuromorphic event cameras possess superior temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, their asynchronous and sparse data format poses a significant challenge for conventional deep learning methods. Most existing methods either densify events into frames, sacrificing their sparse asynchronous nature, or use irregular models that are less compatible with GPU acceleration. Inspired by word-to-vector models, we propose event2vec, a novel representation that allows Transformers to process events directly. We demonstrate the effectiveness of event2vec on the DVS Gesture, ASL-DVS, and DVS-Lip benchmarks, showing that event2vec is remarkably parameter-efficient, features high throughput and low latency, and achieves high accuracy even with an extremely low number of events or low spatial resolutions. These results show that sparse asynchronous event data can be directly integrated into high-throughput Transformer architectures, offering an efficient paradigm for real-time neuromorphic vision. The code is provided at https://github.com/Intelligent-Computing-Lab-Panda/event2vec.

2602.05293 2026-06-02 cs.CV 版本更新

Fast-SAM3D: 3Dfy Anything in Images but Faster

Fast-SAM3D: 更快地将图像中的任何物体三维化

Weilun Feng, Mingqiang Wu, Zhiliang Chen, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiaokun Liu, Guoxin Fan, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu, Zhulin An

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出Fast-SAM3D,一种无需训练的三维重建加速框架,通过多级异构性感知机制(模态感知步骤缓存、联合时空令牌雕刻、频谱感知令牌聚合)实现高达2.67倍端到端加速且保真度损失极小。

Comments Accepted by ICML 2026

详情
AI中文摘要

SAM3D实现了从复杂场景中可扩展的开放世界三维重建,但其部署受到高昂推理延迟的阻碍。在这项工作中,我们对其推理动态进行了首次系统研究,揭示了通用加速策略在此背景下是脆弱的。我们证明这些失败源于忽视了流水线固有的多级异构性:形状与布局之间的运动学差异、纹理细化的内在稀疏性以及几何体之间的频谱方差。为了解决这个问题,我们提出了Fast-SAM3D,一个无需训练的框架,动态地将计算与瞬时生成复杂度对齐。我们的方法集成了三种异构性感知机制:(1)模态感知步骤缓存,用于解耦结构演化与敏感布局更新;(2)联合时空令牌雕刻,将细化集中在高熵区域;(3)频谱感知令牌聚合,用于调整解码分辨率。大量实验表明,Fast-SAM3D实现了高达2.67$\times$的端到端加速,且保真度损失极小,为高效单视图三维生成建立了新的帕累托前沿。我们的代码发布在https://github.com/wlfeng0509/Fast-SAM3D。

英文摘要

SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to $2.67\times$ end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released at https://github.com/wlfeng0509/Fast-SAM3D.

2602.05217 2026-06-02 cs.CV 版本更新

Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation

跨域小样本分割:多视角渐进适应

Jiahao Nie, Guanqiao Fu, Wenbin An, Yap-Peng Tan, Alex C. Kot, Shijian Lu

发表机构 * Interdisciplinary Graduate Programme, Nanyang Technological University(南洋理工大学交叉学科研究生项目) Nanyang Technological University(南洋理工大学) Xi’an Jiaotong University(西安交通大学) VinUniversity(Vin大学) SMBU

AI总结 提出多视角渐进适应方法,通过混合渐进增强和双链多视角预测,从数据和策略两方面逐步将小样本能力适应到目标域,显著提升跨域小样本分割性能。

Comments Accepted by CVPR 2026

详情
AI中文摘要

跨域小样本分割旨在基于少量样本对数据稀缺域中的类别进行分割。典型方法首先在大规模源域中建立小样本能力,然后将其适应到目标域。然而,由于目标样本的数量和多样性有限,现有方法仍表现出受限的性能。此外,源训练模型在目标域中初始的小样本能力较弱,加上显著的域差距,严重阻碍了目标样本的有效利用并进一步阻碍了适应。为此,我们提出多视角渐进适应,从数据和策略两方面逐步将小样本能力适应到目标域。(i) 从数据角度,我们引入混合渐进增强,通过累积的强增强逐步生成更多样化和复杂的视图,从而创建越来越具有挑战性的学习场景。(ii) 从策略角度,我们设计双链多视角预测,在广泛监督下通过顺序和并行学习路径充分利用这些渐进复杂的视图。通过联合强制跨多样化和复杂视图的预测一致性,MPA实现了对目标域的鲁棒且准确的适应。大量实验表明,MPA有效地将小样本能力适应到目标域,以较大优势(+7.0%)超越了最先进的方法。

英文摘要

Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation (MPA), which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).

2602.04343 2026-06-02 cs.CV 版本更新

Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception

寻找NeMO:面向少样本感知的模板视图几何感知表示

Sebastian Jung, Leonard Klüpfel, Rudolph Triebel, Maximilian Durner

发表机构 * German Aerospace Center (DLR)(德国航空航天中心(DLR))

AI总结 提出NeMO(神经记忆对象)表示,通过少量RGB模板视图编码生成稀疏点云,实现未见对象的检测、分割和6DoF姿态估计,无需重训练。

Comments 17 pages including supplement, published in 3DV 2026, Project website: https://sebastian-jung.github.io/nemo/

详情
Journal ref
Proceedings of the International Conference on 3D Vision (3DV), 2026
AI中文摘要

我们提出了神经记忆对象(NeMO),一种新颖的以对象为中心的表示,可用于使用RGB图像检测、分割和估计训练中未见对象的6DoF姿态。我们的方法包括一个编码器,该编码器仅需少量描绘对象的RGB模板视图,利用包含语义和几何信息的学到的UDF生成稀疏的对象状点云。接下来,解码器将对象编码与查询图像一起使用,生成各种密集预测。通过大量实验,我们展示了我们的方法可用于少样本对象感知,无需任何相机特定参数或对目标数据的重训练。我们提出的将对象信息外包到NeMO中并使用单个网络执行多个感知任务的概念,增强了对新对象的交互,通过启用快速对象接入而无需重训练或大量预处理,提高了可扩展性和效率。我们在BOP基准测试的各种数据集和感知任务上报告了竞争性和最先进的结果,展示了我们方法的多功能性。https://github.com/DLR-RM/nemo

英文摘要

We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. https://github.com/DLR-RM/nemo

2602.04094 2026-06-02 cs.CV 版本更新

VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

VideoBrain: 学习自适应帧采样以理解长视频

Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen

发表机构 * Stanford University(斯坦福大学)

AI总结 提出VideoBrain框架,通过CLIP和均匀采样双智能体策略,使视觉语言模型自适应获取关键帧,在减少30-40%帧数的同时提升长视频理解准确率3.5%-9.0%。

详情
AI中文摘要

长视频理解对视觉语言模型(VLM)仍然具有挑战性,因为计算约束与捕捉分布在数千帧中的信息之间存在固有的矛盾。现有方法要么均匀采样帧(存在信息丢失风险),要么单次选择关键帧(无法从错误选择中恢复)。我们提出VideoBrain,一个端到端框架,使VLM能够通过学习采样策略自适应地获取视觉信息。我们的方法采用双互补智能体:一个基于CLIP的智能体用于跨视频的语义检索,以及一个均匀智能体用于区间内的密集时间采样。与先前依赖纯文本LLM编排视觉工具的基于智能体的方法不同,我们的VLM直接感知帧并推理信息充分性。为了防止模型不加区分地调用智能体以最大化奖励,我们引入了一个行为感知奖励函数,结合一个数据分类流程,教会模型何时调用智能体真正有益。在四个长视频基准上的实验表明,VideoBrain在比基线少使用30-40%帧的情况下实现了+3.5%至+9.0%的提升,并且对短视频基准具有强大的跨数据集泛化能力。代码可在https://github.com/junbo-zou/VideoBrain获取。

英文摘要

Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks. The code is available at https://github.com/junbo-zou/VideoBrain.

2602.03766 2026-06-02 cs.CV cs.NE q-bio.NC 版本更新

FOVI: A biologically-inspired foveated interface for deep vision models

FOVI:一种受生物启发的深度视觉模型中央凹接口

Nicholas M. Blauch, George A. Alvarez, Talia Konkle

发表机构 * Harvard(哈佛大学) NVIDIA

AI总结 受人类视觉系统启发,提出基于视网膜和V1的中央凹接口FOVI,通过kNN卷积和低秩适应实现高效变分辨率视觉处理,在减少像素和计算成本的同时保持竞争力。

Comments ICML 2026

详情
AI中文摘要

人类视觉是中央凹的,在大视野中心具有可变分辨率峰值;这反映了主动感知的有效权衡,允许眼球运动将世界的不同部分聚焦,同时其他部分处于上下文中。相比之下,大多数计算机视觉系统以均匀分辨率编码视觉世界,给高效处理全视野高分辨率图像带来了挑战。我们提出了一种基于人类视网膜和初级视觉皮层(V1)的中央凹视觉接口(FOVI),它将可变分辨率的视网膜样传感器阵列重新格式化为均匀密集的V1样传感器流形。感受野被定义为传感器流形上的k近邻(kNN),通过一种新颖的核映射技术实现kNN卷积。我们展示了两个用例:(1)端到端的kNN卷积架构,以及(2)利用低秩适应(LoRA)的DINOv3 ViT基础模型的中央凹适应。这些模型在像素和计算成本仅为全分辨率非中央凹基线的一小部分的情况下提供了有竞争力的性能,为高效且可扩展的高分辨率自我中心视觉主动感知开辟了道路。代码(https://github.com/nblauch/fovi)和预训练模型(https://huggingface.co/fovi-pytorch)已公开。

英文摘要

Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye-movements to bring different parts of the world into focus with other parts of the world in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI) based on the human retina and primary visual cortex (V1), that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the DINOv3 ViT foundation model, leveraging low-rank adaptation (LoRA). These models provide competitive performance with a fraction of the pixels and computational cost of full resolution non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code (https://github.com/nblauch/fovi) and pre-trained models (https://huggingface.co/fovi-pytorch) are available.
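The idea of defining receptive fields as k-nearest neighborhoods can be sketched in a few lines. This is a simplified illustration, not the FOVI implementation: it runs kNN directly on toy sensor coordinates, whereas the paper first maps the retina-like array onto a V1-like manifold and uses a kernel mapping technique that is not reproduced here. The function name and the sensor layout are assumptions.

```python
import math

def knn_receptive_fields(points, k):
    """For each sensor location, return the indices of its k nearest sensors (itself included)."""
    fields = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        fields.append(order[:k])
    return fields

# Toy foveated layout: sensors are dense near the center and sparse in the periphery.
sensors = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.5, 0.0), (-0.5, 0.0), (2.0, 0.0), (-2.0, 0.0)]
fields = knn_receptive_fields(sensors, k=3)
```

Every receptive field has the same number of inputs, so a field centered in the sparse periphery spans a much larger region of the visual field than one at the center, mirroring the variable resolution described in the abstract.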

2602.01753 2026-06-02 cs.CV 版本更新

ObjEmbed: Towards Universal Multimodal Object Embeddings

ObjEmbed:迈向通用多模态对象嵌入

Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ObjEmbed模型,通过分解图像为多个区域嵌入(每个对应一个对象)并生成语义和IoU两种互补嵌入,实现细粒度视觉-语言对齐,在视觉定位、局部和全局图像检索等任务中表现优异。

Comments Accepted by ICML 2026

详情
AI中文摘要

将对象与相应的文本描述对齐是视觉-语言理解中的一个基本挑战和现实需求。虽然最近的多模态嵌入模型在全局图像-文本对齐方面表现出色,但它们通常难以处理图像区域与特定短语之间的细粒度对齐。在这项工作中,我们提出了ObjEmbed,一种新颖的MLLM嵌入模型,它将输入图像分解为多个区域嵌入,每个对应一个单独的对象,以及全局嵌入。它支持广泛的视觉理解任务,如视觉定位、局部图像检索和全局图像检索。ObjEmbed具有三个关键特性:(1)面向对象的表示:通过为每个区域生成两个互补嵌入——用于语义匹配的对象嵌入和预测定位质量的IoU嵌入——来捕获对象的语义和空间方面。最终的对象匹配分数结合了语义相似性和预测的IoU,从而实现更准确的检索。(2)多功能性:无缝处理区域级和图像级任务。(3)高效编码:图像中的所有对象以及整个图像在单次前向传递中编码,效率高。在18个不同基准上的优越性能证明了其强大的语义区分能力。

英文摘要

Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
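The object matching score can be illustrated with a small sketch. The abstract says the score combines semantic similarity with the predicted IoU but does not state how, so the product used below is an assumption, as are the function names and the toy embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def matching_score(text_emb, object_emb, predicted_iou):
    """Hypothetical combination rule: semantic similarity weighted by predicted IoU."""
    return cosine_similarity(text_emb, object_emb) * predicted_iou

def retrieve(text_emb, regions):
    """Return regions sorted by descending matching score."""
    return sorted(
        regions,
        key=lambda r: matching_score(text_emb, r["object_emb"], r["predicted_iou"]),
        reverse=True,
    )

query = [1.0, 0.0]
regions = [
    {"name": "loose_box", "object_emb": [1.0, 0.0], "predicted_iou": 0.4},
    {"name": "tight_box", "object_emb": [0.9, 0.1], "predicted_iou": 0.9},
    {"name": "other_object", "object_emb": [0.0, 1.0], "predicted_iou": 0.95},
]
ranked = [r["name"] for r in retrieve(query, regions)]
```

With this rule, a well-localized box with a slightly lower semantic match outranks a poorly localized box with a perfect match, while a well-localized box around the wrong object still ranks last.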

2601.23220 2026-06-02 cs.CV cs.AI 版本更新

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Med-Scout: 通过几何感知的强化学习后训练治愈多模态大语言模型在医学感知中的几何盲点

Anglin Liu, Ruichao Chen, Yi Lu, Hongxia Xu, Jintai Chen

发表机构 * HKUSTGZ-ML4Health-Lab(香港科技大学-ML4Health实验室)

AI总结 提出Med-Scout框架,利用无标注医学图像中的内在几何逻辑,通过强化学习和三种代理任务(层次尺度定位、拓扑拼图重建、异常一致性检测)来缓解多模态大语言模型的几何盲点,并在新基准Med-Scout-Bench上提升超过40%的几何感知性能,同时泛化到更广泛的医学理解任务。

Comments 29 pages, 14 figures. Accepted at ICML 2026

详情
AI中文摘要

尽管最近的多模态大语言模型(MLLMs)在医学诊断中展现出语言能力,但我们发现即使是最先进的MLLMs也存在一个关键的感知缺陷:几何盲点。这种无法将输出基于客观几何约束的问题导致了看似合理但事实错误的幻觉,其根源在于训练范式优先考虑语言流畅性而非几何保真度。本文介绍了Med-Scout,一种新颖的框架,通过强化学习(RL)“治愈”这种盲点,利用未标记医学图像中内在的几何逻辑。Med-Scout不依赖昂贵的人工标注,而是通过受临床医生系统阅读和推理模式启发的三种策略性代理任务推导出可验证的监督信号:层次尺度定位、拓扑拼图重建和异常一致性检测。为了严格量化这一缺陷,我们提出了Med-Scout-Bench,一个专门设计用于评估几何感知的新基准。大量评估表明,Med-Scout显著缓解了几何盲点,在我们的基准上比领先的专有和开源MLLMs提升了超过40%。此外,这种增强的几何感知泛化到更广泛的医学理解,在放射学和综合性医学VQA任务上取得了优异结果。

英文摘要

Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks inspired by the systematic reading and reasoning patterns of clinicians: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.

2601.22276 2026-06-02 cs.LG cs.CV 版本更新

SurrogateSHAP: Training-Free Contributor Attribution for Text-to-Image (T2I) Models

SurrogateSHAP:文本到图像(T2I)模型的无训练贡献者归因

Mingyu Lu, Soham Gadgil, Chris Lin, Chanwoo Kim, Su-In Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对文本到图像扩散模型中数据贡献者公平估值的高计算成本问题,提出基于预训练模型推理的无重训练框架SurrogateSHAP,利用梯度提升树近似效用函数并解析计算Shapley值,在多个任务上以更低开销超越现有方法。

详情
AI中文摘要

随着文本到图像(T2I)扩散模型在现实创意工作流中的广泛应用,一个用于评估提供数据集合的贡献者的原则性框架对于公平补偿和可持续数据市场至关重要。虽然Shapley值提供了理论上有依据的归因方法,但它面临双重计算瓶颈:(i)对每个采样的玩家(即数据贡献者)子集进行穷举模型重训练的高昂成本,以及(ii)由于贡献者交互,估计边际贡献所需的子集组合数量巨大。为此,我们提出了SurrogateSHAP,一个无需重训练的框架,通过从预训练模型进行推理来近似昂贵的重训练博弈。为了进一步提高效率,我们采用梯度提升树来近似效用函数,并基于树模型解析地推导Shapley值。我们在三个不同的归因任务上评估了SurrogateSHAP:(i)CIFAR-20上DDPM-CFG的图像质量,(ii)后印象派艺术品上Stable Diffusion的美学质量,以及(iii)时尚产品数据上FLUX.1的产品多样性。在各种设置下,SurrogateSHAP在显著降低计算开销的同时优于先前方法,一致地在多个效用指标上识别出有影响力的贡献者。最后,我们证明了SurrogateSHAP能够有效定位导致临床图像中虚假相关的数据源,为审计安全关键型生成模型提供了一条可扩展的路径。

英文摘要

As Text-to-Image (T2I) diffusion models are increasingly used in real-world creative workflows, a principled framework for valuing contributors who provide a collection of data is essential for fair compensation and sustainable data marketplaces. While the Shapley value offers a theoretically grounded approach to attribution, it faces a dual computational bottleneck: (i) the prohibitive cost of exhaustive model retraining for each sampled subset of players (i.e., data contributors) and (ii) the combinatorial number of subsets needed to estimate marginal contributions due to contributor interactions. To this end, we propose SurrogateSHAP, a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. To further improve efficiency, we employ a gradient-boosted tree to approximate the utility function and derive Shapley values analytically from the tree-based model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data. Across settings, SurrogateSHAP outperforms prior methods while substantially reducing computational overhead, consistently identifying influential contributors across multiple utility metrics. Finally, we demonstrate that SurrogateSHAP effectively localizes data sources responsible for spurious correlations in clinical images, providing a scalable path toward auditing safety-critical generative models.
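For reference, the quantity being approximated is the textbook Shapley value. The sketch below computes it exactly on a three-contributor toy game by averaging marginal contributions over all orderings. It is not SurrogateSHAP itself: the utility function and contributor names are hypothetical, and in the real setting each utility evaluation would require retraining a model, which is the bottleneck the abstract describes.

```python
from itertools import permutations

def exact_shapley(players, utility):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: v / len(orderings) for p, v in values.items()}

def toy_utility(coalition):
    """Hypothetical utility: contributors add quality, but A and B provide overlapping data."""
    base = {"A": 3.0, "B": 3.0, "C": 1.0}
    total = sum(base[p] for p in coalition)
    if {"A", "B"} <= coalition:
        total -= 2.0  # redundancy penalty when both A and B are present
    return total

shapley = exact_shapley(["A", "B", "C"], toy_utility)
```

The number of orderings grows factorially with the number of contributors, and the interaction between A and B shows why marginal contributions cannot be read off from individual utilities alone.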

2601.21444 2026-06-02 cs.CV cs.AI cs.CL 版本更新

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

APB-V: 通过序列并行感知的近似注意力加速长视频理解

Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu

发表机构 * NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China(清华大学计算机科学与技术系自然语言处理组、人工智能研究院、北京信息科学与技术国家研究中心,中国北京) Department of CS&T, Central South University, Changsha, China(中南大学计算机科学与技术系,中国长沙) BUPT, Beijing, China(北京邮电大学,中国北京) Pattern Recognition Center, WeChat AI, Tencent Inc.(腾讯公司微信AI模式识别中心)

AI总结 提出APB-V,一种序列并行框架,通过分布式近似注意力在多GPU上加速长视频推理,显著提升速度且不损失性能。

Comments ACL 2026 main

详情
AI中文摘要

长视频推理的效率仍然是一个关键瓶颈,主要由于大型多模态模型(LMMs)预填充阶段的密集计算。现有方法要么压缩视觉嵌入,要么在单个GPU上应用稀疏注意力,导致加速有限或性能下降,并限制了LMMs处理更长、更复杂视频的能力。为了克服这些问题,我们提出了APB-V,一种具有优化注意力的序列并行框架,可在多个GPU上加速长视频推理。通过分布近似注意力,APB-V减少了计算量并增加了并行性,使得无需压缩即可高效处理更多视觉嵌入,从而提升任务性能。系统级优化,如负载均衡和融合前向传递,进一步释放了APB-V的潜力,相较于FlashAttn、ZigZagRing和APB,分别实现了12.72倍、1.70倍和1.18倍的加速,且没有明显的性能损失。代码可在https://github.com/thunlp/APB获取。

英文摘要

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose APB-V, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, APB-V reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of APB-V, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB

2601.18340 2026-06-02 cs.CV 版本更新

Beyond Rigid: Benchmarking Non-Rigid Video Editing

超越刚性:非刚性视频编辑基准测试

Bingzheng Qu, Xuefeng Bai, Kehai Chen, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳))

AI总结 提出NRVBench诊断基准,通过物理感知评估框架揭示传统指标在非刚性视频编辑中的不足,并引入VM-Edit基线分析稳定性-可塑性权衡。

详情
AI中文摘要

随着视频生成模型越来越需要处理物理动态,评估必须超越外观保真度和语义对齐。非刚性视频编辑提供了一个独特的揭示性测试平台,其中不同材料施加不同的物理约束。在本文中,我们引入了NRVBench,一个用于非刚性视频编辑的诊断基准,其任务是修改可变形运动,同时保留无关区域并保持材料特定的合理性。NRVBench包含180个精心策划的视频,涵盖六个基于物理的类别,2,340条细粒度编辑指令,360个多项选择题和像素精确的掩码。我们进一步提出了NRVE-Acc,一种基于VLM的结构化协议,将编辑成功分解为指令遵循、材料感知变形合理性和带有运动线索的时间一致性。对代表性推理时视频编辑方法的实验揭示了传统指标与物理感知感知编辑成功之间的明显不匹配:在非刚性动态下,保留外观或实现强全局对齐的方法可能仍然失败。我们还引入了VM-Edit,一个简单的区域条件编辑基线,它释放前景同时锁定背景,暴露了稳定性-可塑性权衡。

英文摘要

As video generation models are increasingly expected to manipulate physical dynamics, there is a growing need to move evaluation beyond appearance fidelity and semantic alignment. Non-rigid video editing offers a uniquely revealing testbed, where distinct materials impose distinct physical constraints. In this paper, we introduce NRVBench, a diagnostic benchmark for non-rigid video editing, where the task is to modify deformable motion while preserving irrelevant regions and maintaining material-specific plausibility. NRVBench contains 180 curated videos across six physics-grounded categories, 2,340 fine-grained editing instructions, 360 multiple-choice questions, and pixel-accurate masks. We further propose NRVE-Acc, a structured VLM-based protocol that decomposes editing success into instruction following, material-aware deformation plausibility, and temporal coherence with motion cues. Experiments on representative inference-time video editing methods reveal a clear mismatch between conventional metrics and physics-aware perceptual editing success: methods that preserve appearance or achieve strong global alignment may still fail under non-rigid dynamics. We additionally introduce VM-Edit, a simple region-conditioned editing baseline that frees the foreground while locking the background, exposing the stability-plasticity trade-off.

2508.06407 2026-06-02 cs.CV cs.AI eess.IV 版本更新

A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

SAR图像中舰船目标的分类感知超分辨率框架

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

发表机构 * University of Malaya(马来亚大学)

AI总结 提出一种将分类目标融入超分辨率过程的算法,通过优化兼顾图像质量和分类性能的损失函数,提升SAR图像分辨率并改善分类精度。

详情
Journal ref
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 6614-6622, 2026
AI中文摘要

高分辨率图像在提升分类、检测和分割等视觉识别任务性能中起着关键作用。在包括遥感和监视在内的许多领域,低分辨率图像可能限制自动分析的准确性。为此,超分辨率(SR)技术被广泛采用,试图从低分辨率输入重建高分辨率图像。相关的传统方法仅基于像素级指标专注于提升图像质量,而超分辨率图像保真度与下游分类性能之间的关系在很大程度上未被探索。这引发了一个关键问题:将分类目标直接集成到超分辨率过程中是否能进一步提高分类精度?在本文中,我们通过部署一种专门的算法策略来研究超分辨率与分类之间的关系,试图回答这一问题。我们提出了一种新颖的方法,通过优化同时考虑图像质量和分类性能的损失函数,提高合成孔径雷达图像的分辨率。我们的方法在提升图像质量(通过科学验证的图像质量指标衡量)的同时,也提高了分类精度。

英文摘要

High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.

2511.06163 2026-06-02 eess.IV cs.CV cs.LG physics.med-ph 版本更新

Cross-Modal Fine-Tuning of 3D Convolutional Foundation Models for ADHD Classification with Low-Rank Adaptation

基于低秩适应的3D卷积基础模型跨模态微调用于ADHD分类

Jyun-Ping Kao, Shinyeong Rho, Shahar Lazarev, Hyun-Hae Cho, Fangxu Xing, Taehoon Shin, C. -C. Jay Kuo, Jonghye Woo

发表机构 * National Institute of Mental Health, National Institutes of Health(国家精神卫生研究所,国立卫生研究院)

AI总结 提出一种参数高效的迁移学习方法,通过3D低秩适应(LoRA)将预训练于CT图像的3D卷积基础模型微调至MRI的ADHD分类任务,在公开扩散MRI数据集上达到71.9%准确率和0.716 AUC,仅需164万可训练参数。

Comments Accepted for presentation at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), pp. 1-4
AI中文摘要

儿童注意缺陷/多动障碍(ADHD)的早期诊断在改善教育和心理健康结果中起着关键作用。然而,由于异质性表现和与其他疾病的重叠症状,使用神经影像数据诊断ADHD仍然具有挑战性。为了解决这一问题,我们提出了一种新颖的参数高效迁移学习方法,将预训练于CT图像的大规模3D卷积基础模型适应于基于MRI的ADHD分类任务。我们的方法通过将3D卷积核分解为2D低秩更新,在3D中引入低秩适应(LoRA),大幅减少可训练参数,同时实现优越性能。在公开扩散MRI数据库上的五折交叉验证评估中,我们的3D LoRA微调策略取得了最先进的结果,一个模型变体达到71.9%的准确率,另一个达到0.716的AUC。两个变体仅使用164万可训练参数(比完全微调的基础模型少113倍以上)。我们的结果代表了神经影像中基础模型首次成功的跨模态(CT到MRI)适应之一,为ADHD分类建立了新的基准,同时大幅提高了效率。

英文摘要

Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.
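
The parameter savings described above can be made concrete with a little arithmetic. The sketch below counts weights for a generic LoRA-style update in which a 3D kernel of shape (c_out, c_in, k, k, k) is flattened to a matrix and updated by a rank-r product B @ A. This is the standard low-rank recipe, used here as an assumption; the abstract does not spell out the paper's exact factorization into 2D updates, and the layer sizes and rank below are hypothetical.

```python
def conv3d_params(c_out, c_in, k):
    """Weights in a dense 3D convolution kernel of shape (c_out, c_in, k, k, k)."""
    return c_out * c_in * k ** 3


def low_rank_update_params(c_out, c_in, k, rank):
    """Weights in a rank-`rank` update B @ A of the kernel flattened to
    (c_out, c_in * k^3): B is (c_out, rank), A is (rank, c_in * k^3)."""
    return rank * (c_out + c_in * k ** 3)


# Hypothetical layer: 256 -> 256 channels, 3x3x3 kernel, rank 4.
full = conv3d_params(256, 256, 3)
update = low_rank_update_params(256, 256, 3, rank=4)
print(full, update, round(full / update, 1))

# Figures quoted in the abstract: 1.64M trainable parameters, "over 113x fewer"
# than full fine-tuning, which implies a backbone of more than this many weights.
implied_full_model = 1_640_000 * 113
print(implied_full_model)
```

For this hypothetical layer the trainable weights drop from 1,769,472 to 28,672. The quoted ratio likewise implies a foundation model with more than roughly 185 million parameters.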

2601.04946 2026-06-02 cs.CV cs.AI 版本更新

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

原型性偏差揭示多模态评估指标中的盲点

Subhadeep Roy, Gagan Bhatia, Steffen Eger

发表机构 * University of Technology Nuremberg(纽伦堡工业大学)

AI总结 本文通过构建受控诊断基准PROTOBIAS,发现并验证了多模态评估指标中存在原型性偏差,即倾向于选择视觉或社会原型性高但语义错误的图像,并提出了轻量级对比训练评估器PROTOSCORE作为缓解基线。

详情
AI中文摘要

自动指标广泛用于评估文生图模型,常常在基准测试、模型选择和大规模数据过滤中取代人类判断。然而,它们可能奖励看起来合理或原型性的图像,而非忠实满足提示的图像。我们识别出原型性偏差是多模态评估中的一个系统性盲点:指标可能偏好语义不正确但在视觉或社会层面具有原型性的图像,而非正确但原型性较弱的图像。我们引入PROTOBIAS,一个跨动物、物体和人口统计的受控诊断基准,其中语义正确的图像与包含单个受控语义违反的合理原型性对抗样本进行对比。基于原型理论和社会类别原型性,PROTOBIAS通过多个提示生成器、图像生成器和独立的VLM过滤器构建,并通过提示质量、人工标注和图像质量控制进行验证。使用PROTOBIAS,我们展示了广泛使用的嵌入、奖励、基于VQA和VLM作为评判的指标经常在这些对比中失败,而人类判断仍然更忠实于语义正确性。我们进一步引入PROTOSCORE,一个轻量级对比训练评估器,作为初始缓解基线。PROTOBIAS为测量原型性驱动的指标失败和开发更语义忠实的T2I评估器提供了一个聚焦基准。

英文摘要

Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and large-scale data filtering. Yet they may reward images that look plausible or prototypical rather than images that faithfully satisfy the prompt. We identify prototypicality bias as a systematic blindspot in multimodal evaluation: metrics can prefer a semantically incorrect but visually or socially prototypical image over a correct but less prototypical one. We introduce PROTOBIAS, a controlled diagnostic benchmark across Animals, Objects, and Demography, where semantically correct images are contrasted with plausible prototypical adversaries containing a single controlled semantic violation. Grounded in prototype theory and social-category prototypicality, PROTOBIAS is constructed with multiple prompt generators, image generators, and independent VLM filters, and validated through prompt-quality, human-annotation, and image-quality controls. Using PROTOBIAS, we show that widely used embedding, reward, VQA-based, and VLM-as-judge metrics frequently fail these contrasts, while human judgments remain more faithful to semantic correctness. We further introduce PROTOSCORE, a lightweight contrastively trained evaluator, as an initial mitigation baseline. PROTOBIAS provides a focused benchmark for measuring prototypicality-driven metric failures and developing more semantically faithful T2I evaluators.

2601.03309 2026-06-02 cs.CV cs.AI 版本更新

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

VLM4VLA:重新审视视觉-语言-动作模型中的视觉-语言模型

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院) Qwen Team, Alibaba Inc.(阿里巴巴公司Qwen团队)

AI总结 本文通过VLM4VLA最小适配管道,系统研究视觉-语言模型(VLM)的选择和能力如何影响下游视觉-语言-动作(VLA)策略性能,发现VLM通用能力无法预测下游任务表现,且视觉模块是性能瓶颈。

详情
AI中文摘要

视觉-语言-动作(VLA)模型将预训练的大型视觉-语言模型(VLM)集成到其策略主干中,因其有前景的泛化能力而受到广泛关注。本文重新审视了一个基本但很少被系统研究的问题:VLM的选择和能力如何转化为下游VLA策略的性能?我们引入了VLM4VLA,一个最小适配管道,仅使用少量新的可学习参数将通用VLM转换为VLA策略,以实现公平高效的比较。尽管简单,VLM4VLA被证明与更复杂的网络设计相比具有惊人的竞争力。通过在三个基准上的各种下游任务进行广泛的实证研究,我们发现虽然VLM初始化比从头训练提供了一致的优势,但VLM的通用能力并不能很好地预测其下游任务性能。这挑战了常见的假设,表明标准VLM能力对于有效的具身控制是必要但不充分的。我们进一步通过微调VLM在七个辅助具身任务(例如,具身问答、视觉指向、深度估计)上研究特定具身能力的影响。与直觉相反,提高VLM在特定具身技能上的性能并不能保证更好的下游控制性能。最后,模态级别的消融实验确定VLM中的视觉模块(而非语言组件)是主要的性能瓶颈。我们证明,即使在下游微调期间编码器保持冻结,向VLM的视觉编码器注入控制相关的监督也能带来一致的收益。这隔离了当前VLM预训练目标与具身动作规划需求之间持续的领域差距。

英文摘要

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.

2601.00664 2026-06-02 cs.LG cs.AI cs.CV cs.HC cs.MM 版本更新

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing:用于自然对话的实时交互式头部化身生成

Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) NTU Singapore(新加坡南洋理工大学) DeepAuto.ai

AI总结 提出Avatar Forcing框架,通过扩散强制实现实时交互式头部化身生成,利用直接偏好优化进行无标签学习,在低延迟(约500ms)下生成富有表现力的反应动作。

Comments CVPR 2026. Project page: https://taekyungki.github.io/AvatarForcing/

详情
AI中文摘要

说话头部生成从静态肖像创建逼真的化身,用于虚拟通信和内容创作。然而,当前的模型尚未传达真正交互式通信的感觉,通常生成缺乏情感投入的单向响应。我们确定了实现真正交互式化身的两个关键挑战:在因果约束下实时生成运动,以及在没有额外标注数据的情况下学习富有表现力、生动的反应。为了解决这些挑战,我们提出了Avatar Forcing,一种新的交互式头部化身生成框架,通过扩散强制建模实时用户-化身交互。该设计允许化身处理实时多模态输入,包括用户的音频和运动,以低延迟即时响应语言和非语言线索,如言语、点头和笑声。此外,我们引入了一种直接偏好优化方法,利用通过丢弃用户条件构建的合成失败样本,实现无标签的富有表现力交互学习。实验结果表明,我们的框架能够实现低延迟(约500ms)的实时交互,相比基线加速6.8倍,并生成反应性和富有表现力的化身运动,在80%以上的情况下优于基线。

英文摘要

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500 ms), achieving a 6.8x speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over the baseline in more than 80% of comparisons.

2601.00212 2026-06-02 cs.CV 版本更新

IntraStyler: Intra-Domain Style Synthesis for Cross-Modality MRI Domain Adaptation

IntraStyler: 跨模态MRI域适应的域内风格合成

Han Liu, Yubo Fan, Hao Li, Dewei Hu, Daniel Moyer, Zhoubing Xu, Benoit M. Dawant, Ipek Oguz

发表机构 * Siemens Healthineers(西门子医疗) Princeton, NJ, USA(新泽西州普林斯顿) Vanderbilt University(范德比尔特大学) Mayo Clinic(梅奥诊所) Johnson & Johnson Innovative Medicine(强生创新医学)

AI总结 针对T2 MRI中前庭神经鞘瘤和耳蜗分割的域适应问题,提出IntraStyler方法,通过对比学习提取与解剖解耦的风格嵌入,自动发现并合成目标域内多样化的风格图像,提升下游分割模型的泛化性。

Comments Extension of our 1st place solution for the CrossMoDA 2023 challenge

详情
AI中文摘要

从T2 MRI中分割前庭神经鞘瘤和耳蜗在临床上很重要,但需要大量标注。域适应(DA)已被广泛用于弥合标记的对比增强T1和未标记的T2数据集之间的差距。现有方法专注于跨域对齐,但目标域内的域内变异性在很大程度上被忽视。同一域的图像可能因不同的扫描仪、场强和采集协议而存在显著差异。忽略这种变异性会产生同质的合成图像,限制了下游分割模型的泛化能力。为了解决这个问题,我们提出了IntraStyler,一种3D非配对图像翻译方法,无需任何预定义的子域即可自动发现细粒度的域内风格,并使用每幅图像的风格参考合成多样化的目标域图像。为此,我们设计了一个3D风格编码器,通过新颖的对比学习目标进行训练,以提取与解剖解耦的纯风格嵌入。IntraStyler基于CrossMoDA挑战赛的第一名解决方案构建并进一步改进,生成更多样化的合成数据,实现更可靠的下游分割。代码可在https://github.com/MedICL-VU/IntraStyler获取。

英文摘要

Segmentation of vestibular schwannoma and cochlea from T2 MRI is clinically important yet annotation-intensive. Domain adaptation (DA) has been widely adopted to bridge the gap between labeled contrast-enhanced T1 and unlabeled T2 datasets. While existing methods focus on cross-domain alignment, intra-domain variability within the target domain remains largely overlooked. Images from the same domain may vary substantially due to different scanners, field strengths, and acquisition protocols. Ignoring this variability produces homogeneous synthetic images that limit the generalizability of downstream segmentation models. To address this, we propose IntraStyler, a 3D unpaired image translation method that automatically discovers fine-grained intra-domain styles without any predefined sub-domains, and synthesizes diverse target domain images using per-image style references. To this end, we design a 3D style encoder trained with a novel contrastive learning objective to extract style-only embeddings disentangled from anatomy. IntraStyler is built upon the 1st place CrossMoDA challenge solution and further advances it, generating more diverse synthetic data and achieving more reliable downstream segmentation. Code is available at https://github.com/MedICL-VU/IntraStyler.

2512.23351 2026-06-02 cs.CV 版本更新

CountGD++: Generalized Prompting for Open-World Counting

CountGD++: 面向开放世界计数的通用提示

Niki Amini-Naieni, Andrew Zisserman

发表机构 * Visual Geometry Group (VGG)(视觉几何组(VGG)) University of Oxford, UK(牛津大学,英国)

AI总结 提出CountGD++模型,通过扩展提示方式(包括负样本描述、伪示例自动标注和外部图像示例)提升开放世界计数的灵活性、准确性和泛化能力。

Comments CVPR 2026

详情
AI中文摘要

图像和视频中物体自动计数方法的灵活性和准确性受限于物体的指定方式。现有方法允许用户通过文本和视觉示例描述目标物体,但视觉示例必须在图像内手动标注,且无法指定不计数对象。为解决这些问题,我们引入了扩展目标物体指定方式的新能力。具体而言,我们扩展了提示,允许通过文本和/或视觉示例描述不计数对象,引入了在推理时自动标注视觉示例的“伪示例”概念,并将计数模型扩展为接受来自自然和合成外部图像的视觉示例。我们还使用新的计数模型CountGD++作为LLM的视觉专家代理。这些贡献共同扩展了多模态开放世界计数的提示灵活性,并在多个数据集上显著提高了准确性、效率和泛化能力。代码见 https://github.com/niki-amini-naieni/CountGDPlusPlus。

英文摘要

The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.

2512.21472 2026-06-02 cs.CV 版本更新

IMA++: ISIC Archive Multi-Annotator Dermoscopic Skin Lesion Segmentation Dataset

IMA++: ISIC档案多标注者皮肤镜皮损分割数据集

Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh

发表机构 * Medical Image Analysis Lab, School of Computing Science, Simon Fraser University(医学影像分析实验室,计算科学学院,西蒙弗雷泽大学) AIP Labs(AIP实验室)

AI总结 提出ISIC MultiAnnot++数据集,包含14,967张皮肤镜图像和17,684个分割掩码,其中2,394张图像有2-5个标注,并附带标注者技能水平和工具元数据,支持多标注者医学图像分割研究。

Comments Published in IEEE Data Descriptions, 12 pages, 7 figures

详情
Journal ref
IEEE Data Descr. 3 (2026) 367-378
AI中文摘要

多标注者医学图像分割是一个重要的研究问题,但需要昂贵收集的标注数据集。皮肤镜皮损成像允许人类专家和AI系统观察在常规临床照片中无法辨别的形态结构。然而,目前没有大规模公开可用的、带有标注者标签的多标注者皮损分割(SLS)数据集用于皮肤镜皮损成像。我们引入了ISIC MultiAnnot++,一个大型公开的多标注者皮损分割数据集,图像来自ISIC档案。最终数据集包含14,967张皮肤镜图像的17,684个分割掩码,其中2,394张皮肤镜图像每张有2-5个分割,使其成为最大的公开SLS数据集。此外,还包括关于分割的元数据,包括标注者的技能水平和分割工具,支持诸如分割的标注者特定偏好建模和标注者元数据分析等研究主题。我们对该数据集的特征、策划的数据分区和共识分割掩码进行了分析。

英文摘要

Multi-annotator medical image segmentation is an important research problem, but requires annotated datasets that are expensive to collect. Dermoscopic skin lesion imaging allows human experts and AI systems to observe morphological structures otherwise not discernable from regular clinical photographs. However, currently there are no large-scale publicly available multi-annotator skin lesion segmentation (SLS) datasets with annotator-labels for dermoscopic skin lesion imaging. We introduce ISIC MultiAnnot++, a large public multi-annotator skin lesion segmentation dataset for images from the ISIC Archive. The final dataset contains 17,684 segmentation masks spanning 14,967 dermoscopic images, where 2,394 dermoscopic images have 2-5 segmentations per image, making it the largest publicly available SLS dataset. Further, metadata about the segmentation, including the annotators' skill level and segmentation tool, is included, enabling research on topics such as annotator-specific preference modeling for segmentation and annotator metadata analysis. We provide an analysis on the characteristics of this dataset, curated data partitions, and consensus segmentation masks.

2512.20251 2026-06-02 cs.CV eess.IV 版本更新

Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

退化感知度量提示用于高光谱图像恢复

Binfeng Wang, Di Wang, Haonan Guo, Ying Fu, Jing Zhang

发表机构 * School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China(北京理工大学计算机科学与技术学院) School of Computer Science, Wuhan University, Wuhan, Hubei, China(武汉大学计算机学院) Zhongguancun Academy, Beijing, China(中关村学院)

AI总结 提出退化感知度量提示(DAMP)框架,通过可解释的空间-光谱度量作为退化提示,结合退化自适应混合专家(DAMoE)模块,实现多维度退化统一恢复,在自然和遥感高光谱数据集上达到最先进性能并展现零样本泛化能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

统一高光谱图像(HSI)恢复旨在单个模型中恢复多种退化。然而,当前方法通常依赖于不切实际的显式先验或过拟合训练分布的不透明黑盒表示,阻碍了对未见场景的泛化。为弥补这一差距,我们提出退化感知度量提示(DAMP),一种新颖框架,通过可解释的空间-光谱度量表征多维退化。这些度量作为退化提示(DP),使模型能够捕捉任务间的共享特征并适应未知损坏。我们框架的核心是退化自适应混合专家(DAMoE),其中空间-光谱自适应模块(SSAM)作为专家,利用可学习的融合系数专门处理不同的退化程度。通过使用DP作为门控路由器,DAMoE动态激活针对特定退化特征定制的专家。在自然和遥感HSI数据集上的大量实验表明,DAMP实现了最先进的性能,并在未见恢复任务上展现出卓越的零样本泛化能力。代码公开于 https://github.com/MiliLab/DAMP。

英文摘要

Unified hyperspectral image (HSI) restoration aims to recover diverse degradations within a single model. However, current methods often rely on impractical explicit priors or opaque black-box representations that overfit to training distributions, hampering generalization to unseen scenarios. To bridge this gap, we propose Degradation-Aware Metric Prompting (DAMP), a novel framework that characterizes multi-dimensional degradations through interpretable spatial-spectral metrics. These metrics serve as Degradation Prompts (DP), enabling the model to capture shared characteristics across tasks and adapt to unknown corruptions. Central to our framework is the Degradation-Adaptive Mixture-of-Experts (DAMoE), where Spatial-Spectral Adaptive Modules (SSAMs) serve as experts that utilize learnable fusion coefficients to specialize in distinct degradation degrees. By using DP as a gating router, DAMoE dynamically activates specialized experts tailored to the specific degradation profile. Extensive experiments on natural and remote sensing HSI datasets demonstrate that DAMP achieves state-of-the-art performance and exhibits exceptional zero-shot generalization on unseen restoration tasks. Code is publicly available at https://github.com/MiliLab/DAMP.

2508.20072 2026-06-02 cs.CV cs.LG cs.RO 版本更新

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

离散扩散VLA:将离散扩散引入视觉-语言-动作策略中的动作解码

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出离散扩散VLA,通过将动作块离散化并在统一Transformer骨干内使用离散扩散模式进行渐进细化,实现自适应解码顺序和错误纠正,在多个基准上取得高性能并保留预训练的视觉-语言先验。

Comments Accepted by ICML 2026. 17 pages

详情
AI中文摘要

视觉-语言-动作(VLA)模型将大型视觉-语言骨干网络适配为将图像和指令映射为机器人动作。然而,当前的VLA要么以固定的从左到右顺序自回归生成动作,性能较差;要么在骨干网络外附加独立的扩散头,这会割裂信息通路并阻碍统一、可扩展的架构。相反,我们提出了离散扩散VLA,它将动作块离散化,并使用离散扩散模式在统一的Transformer骨干内保留渐进细化。我们的方法实现了自适应解码顺序,在解决较难的动作元素之前先解决高置信度的动作元素,并采用二次重掩码来重新审视不确定的预测,从而实现鲁棒的纠错。这种设计保留了预训练的视觉-语言先验,支持并行解码,并提高了效率。离散扩散VLA在LIBERO上达到96.4%的平均成功率,在SimplerEnv-Fractal上达到71.2%的视觉匹配,在SimplerEnv-Bridge上达到54.2%的整体性能。在LIBERO-Goal的分布外测试中,我们的方法仅表现出0.8%的语言退化(相比之下并行解码为8.0%),以及20.4%的视觉退化(相比之下连续扩散为29.0%),表明其很好地保留了预训练的视觉-语言能力。我们还在AgileX Cobot Magic平台上进行了两次真实机器人评估,以展示该方法的有效性。

英文摘要

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order with poor performance or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with a discrete diffusion pattern that retains progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% avg. success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating good retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.
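
The decoding behaviour described in the abstract (commit high-confidence action elements first, then revisit uncertain ones through re-masking) can be sketched as a small loop. This is an illustrative toy and not the authors' implementation: the `predict` interface, the per-step quota, and the `remask_below` threshold are all assumptions, since the abstract does not specify the schedule or the re-masking rule.

```python
MASK = -1  # hypothetical sentinel for a not-yet-decoded action token


def adaptive_decode(predict, length, steps, remask_below=0.0):
    """Iteratively fill `length` action tokens, most confident positions first.

    `predict(tokens)` is an assumed interface returning one (token, confidence)
    pair per position given the partially decoded sequence.
    """
    tokens = [MASK] * length
    for step in range(steps):
        preds = predict(tokens)
        # Secondary re-masking: reopen committed positions that now look uncertain.
        for i, (_, conf) in enumerate(preds):
            if tokens[i] != MASK and conf < remask_below:
                tokens[i] = MASK
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Spread the open positions over the remaining steps (ceil division),
        # so the final step always commits whatever is still masked.
        quota = max(1, -(-len(masked) // (steps - step)))
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:quota]:
            tokens[i] = preds[i][0]
    return tokens


# Toy stand-in for the model: fixed tokens and confidences, recording the
# state of the sequence at each call so the commit order can be inspected.
TARGET = [5, 7, 1, 3]
CONFIDENCE = [0.9, 0.2, 0.6, 0.4]
history = []


def toy_predict(tokens):
    history.append(list(tokens))
    return list(zip(TARGET, CONFIDENCE))


decoded = adaptive_decode(toy_predict, length=4, steps=2)
print(decoded, history)
```

With two steps, positions 0 and 2 (the most confident) are committed first and positions 1 and 3 follow in the second step. A real model would return confidences that change as context fills in, which is what makes re-masking useful.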

2512.15647 2026-06-02 cs.CV 版本更新

Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift

硬标签登场!重新思考硬标签在缓解局部语义漂移中的作用

Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对软标签在稀疏监督下导致的局部语义漂移问题,提出混合硬标签与软标签的HALD训练范式,在数据集蒸馏和大规模分类任务中提升泛化性能。

Comments ICML 2026. Code at: https://github.com/Jiacheng8/HALD

详情
AI中文摘要

来自教师模型的软标签是知识迁移和大规模数据集蒸馏(例如SRe2L、LPLD)的实际做法。然而,当我们限制每张图像的裁剪数量以减少存储预计算软标签的巨大成本时,这些方法会严重遭受局部语义漂移:视觉上模糊的裁剪可能导致软监督偏离图像级别的真实语义,导致持续错误和训练-测试分布不匹配。我们重新审视了硬标签被忽视的作用,并表明当适当整合时,它们可以作为内容不变的语义锚点来校准这种漂移。我们从理论上分析了稀疏软标签监督下漂移的出现,并证明混合硬标签和软标签可以恢复视觉内容与语义监督之间的对齐。基于这一见解,我们提出了一种新的训练范式——用于缓解局部语义漂移的硬标签(HALD),它使用硬标签作为中间校正信号,同时保留软标签的细粒度优势。在数据集蒸馏和大规模分类基准上的大量实验显示了一致的泛化改进。在ImageNet-1K上,我们的方法仅使用285M软标签存储(减少100倍)就达到了42.7%的准确率,优于先前最先进的LPLD 9.0%。

英文摘要

Soft labels from teacher models are a de facto practice for knowledge transfer and large-scale dataset distillation (e.g., SRe2L, LPLD). However, when we limit the number of crops per image to reduce the substantial cost of storing precomputed soft labels, these methods suffer severely from local semantic drift: visually ambiguous crops can cause soft supervision to deviate from the image-level ground-truth semantics, leading to persistent errors and a train-test distribution mismatch. We revisit the overlooked role of hard labels and show that, when properly integrated, they can act as a content-invariant semantic anchor that calibrates such drift. We theoretically analyze the emergence of drift under sparse soft-label supervision and demonstrate that hybridizing hard and soft labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which uses hard labels as intermediate corrective signals while preserving the fine-grained benefits of soft labels. Extensive experiments on dataset distillation and large-scale classification benchmarks show consistent generalization improvements. On ImageNet-1K, our method achieves 42.7% accuracy with only 285M of soft-label storage (a 100x reduction), outperforming the prior state-of-the-art LPLD by 9.0%.
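
One simple way to picture "hybridizing hard and soft labels" is a convex mix of two cross-entropy terms, one against the one-hot ground truth and one against the teacher's soft label for a crop. The sketch below shows only that generic mix. It is an assumption rather than the HALD recipe: the abstract describes hard labels as intermediate corrective signals, which may correspond to a staged schedule rather than a fixed weight, and `alpha` is a hypothetical hyperparameter.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def hybrid_loss(logits, soft_label, hard_index, alpha):
    """alpha weights the hard-label term; (1 - alpha) weights the soft-label term."""
    one_hot = [1.0 if i == hard_index else 0.0 for i in range(len(logits))]
    hard_term = cross_entropy(one_hot, logits)
    soft_term = cross_entropy(soft_label, logits)
    return alpha * hard_term + (1.0 - alpha) * soft_term


# An ambiguous crop: the teacher's soft label favours class 0,
# while the image-level ground truth is class 1.
logits = [2.0, 1.0, 0.1]
soft = [0.6, 0.3, 0.1]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_loss(logits, soft, hard_index=1, alpha=alpha))
```

Setting `alpha` to 0 recovers pure soft-label training and 1 recovers pure hard-label training; intermediate values let the ground-truth class pull the objective back when a crop's soft label has drifted.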

2505.08438 2026-06-02 cs.CV cs.AI 版本更新

A Survey of 3D Reconstruction with Event Cameras

事件相机三维重建综述

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Haodong Chen, Zeke Zexi Hu, Zhicheng Lu, Ying Zhou, Vera Chung, Qiang Qu, Weidong Cai

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文首次全面综述了基于事件相机的三维重建方法,按输入模态(立体、单目、多模态)和重建技术(几何、深度学习、神经渲染如NeRF和3DGS)分类,并讨论了数据集、评估、表示和动态场景重建等挑战。

Comments This survey has been accepted for publication in the Computational Visual Media Journal

详情
AI中文摘要

事件相机正迅速成为用于三维重建的强大视觉传感器,能够异步捕捉每个像素的亮度变化。与传统基于帧的相机相比,事件相机产生稀疏但时间密集的数据流,即使在高速运动、低光照和极端动态范围等挑战性条件下,也能实现鲁棒且准确的三维重建。这些能力为自动驾驶、机器人、空中导航和沉浸式虚拟现实等各个领域的变革性应用提供了巨大前景。在本文中,我们首次专门针对基于事件的三维重建进行了全面综述。现有方法根据输入模态系统地分为立体、单目和多模态系统,并根据重建方法进一步分类,包括基于几何的技术、深度学习方法以及神经渲染技术,如神经辐射场(NeRF)和3D高斯泼溅(3DGS)。在每个类别中,方法按时间顺序组织,以突出关键概念和进展的演变。此外,我们详细总结了专门适用于基于事件重建任务的公开数据集。最后,我们讨论了数据集可用性、标准化评估、有效表示和动态场景重建方面的重大开放挑战,并概述了未来研究的有见地的方向。本综述旨在作为重要参考,并为推进事件驱动三维重建的最新技术提供清晰且激励人心的路线图。

英文摘要

Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel brightness changes. Compared to traditional frame-based cameras, event cameras produce sparse yet temporally dense data streams, enabling robust and accurate 3D reconstruction even under challenging conditions such as high-speed motion, low illumination, and extreme dynamic range scenarios. These capabilities offer substantial promise for transformative applications across various fields, including autonomous driving, robotics, aerial navigation, and immersive virtual reality. In this survey, we present the first comprehensive review exclusively dedicated to event-based 3D reconstruction. Existing approaches are systematically categorised based on input modality into stereo, monocular, and multimodal systems, and further classified according to reconstruction methodologies, including geometry-based techniques, deep learning approaches, and neural rendering techniques such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Within each category, methods are chronologically organised to highlight the evolution of key concepts and advancements. Furthermore, we provide a detailed summary of publicly available datasets specifically suited to event-based reconstruction tasks. Finally, we discuss significant open challenges in dataset availability, standardised evaluation, effective representation, and dynamic scene reconstruction, outlining insightful directions for future research. This survey aims to serve as an essential reference and provides a clear and motivating roadmap toward advancing the state of the art in event-driven 3D reconstruction.

2512.17605 2026-06-02 cs.CV cs.AI 版本更新

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

MGRegBench:一个带有解剖标志的乳腺X线图像配准新型基准数据集

Svetlana Krasnova, Emiliya Starikova, Ilia Naletov, Andrey Krylov, Dmitry Sorokin

发表机构 * MSU(莫斯科国立大学)

AI总结 为解决乳腺X线图像配准中缺乏公开数据集和标准化基准的问题,提出了MGRegBench,包含5000多对图像和100对带手动标注解剖标志的数据集,并评估了多种配准方法。

详情
AI中文摘要

稳健的乳腺X线图像配准对于临床相关应用(如追踪乳腺组织疾病进展)至关重要。然而,由于缺乏透明的公共数据集和可重复的标准化基准,进展受到限制。现有研究通常使用私有数据和不一致的评估框架,因此难以直接比较。为解决这一问题,我们提出了MGRegBench,一个患者独立、无泄漏控制的乳腺X线图像配准评估协议,包含超过5000对图像,每对图像带有乳腺分割掩膜,以及100对带有手动标注解剖标志的图像,此外还有标准化的训练/评估分割和即用基线。利用这一资源,我们对多种配准方法进行了基准测试——包括经典方法(ANTs)、基于学习的方法(VoxelMorph, TransMorph)、隐式神经表示(IDIR)、一种乳腺X线专用方法,以及最近的深度学习方法MammoRegNet,并针对该模态调整了实现,同时在独立数据集SDM-MCs上验证了泛化能力。我们的贡献包括:(1)首个此规模且带有手动标注标志和掩膜的乳腺X线图像配准公共数据集;(2)一个透明、无泄漏控制的基准,首次实现了多种经典和基于机器学习的方法的同类比较;(3)在SDM-MCs上的外部验证,以测试主要趋势是否超越MGRegBench;(4)对基于深度学习的配准进行了广泛分析。我们公开发布代码和数据,为公平、可重复且临床相关的比较建立基础资源,并推动AI驱动医学影像的未来研究。

英文摘要

Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. However, progress has been limited by the absence of transparent public datasets and reproducible standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a patient-disjoint, leakage-controlled evaluation protocol for mammography registration, comprising over 5,000 image pairs, each with a breast segmentation mask, and 100 pairs with manually annotated anatomical landmarks, plus standardized train/evaluation splits and ready-to-run baselines. Using this resource, we benchmark diverse registration methods -- including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a mammography-specific approach, and a recent deep learning method MammoRegNet, with implementations adapted to this modality, and validate generalization on the independent SDM-MCs dataset. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) a transparent, leakage-controlled benchmark enabling the first like-for-like comparison of diverse classical and machine learning-based methods; (3) external validation on SDM-MCs to test whether the main trend transfers beyond MGRegBench; and (4) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair, reproducible, and clinically relevant comparisons and catalyze future research in AI-driven medical imaging.
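The abstract does not say which metric is computed from the annotated landmarks. A common choice for landmark-based registration evaluation is the mean distance between corresponding landmarks after warping (often called target registration error), and the sketch below illustrates that idea only. The function name and all coordinates are made-up toy values, not data or code from MGRegBench.

```python
import math

def target_registration_error(fixed_landmarks, warped_landmarks):
    """Mean Euclidean distance (in pixels) between corresponding landmarks."""
    if len(fixed_landmarks) != len(warped_landmarks):
        raise ValueError("landmark lists must have the same length")
    distances = [math.dist(f, w) for f, w in zip(fixed_landmarks, warped_landmarks)]
    return sum(distances) / len(distances)

# Hypothetical (x, y) landmarks in the fixed image, and the matching landmarks
# of the moving image before and after a registration method is applied.
fixed = [(120.0, 340.0), (210.0, 95.0), (400.0, 512.0)]
before = [(126.0, 348.0), (210.0, 100.0), (397.0, 516.0)]
after = [(121.0, 340.0), (210.0, 96.0), (400.0, 512.0)]

tre_before = target_registration_error(fixed, before)
tre_after = target_registration_error(fixed, after)
print(f"TRE before: {tre_before:.3f} px, after: {tre_after:.3f} px")
```

A lower value after registration indicates that the estimated deformation brings the annotated anatomy closer together; comparing methods on the same landmark pairs is what makes such a comparison like-for-like.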

2512.14364 2026-06-02 cs.CV 版本更新

Unified Semantic Transformer for 3D Scene Understanding

统一语义Transformer用于3D场景理解

Sebastian Koch, Johanna Wald, Hidenobu Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari

发表机构 * Ulm University(乌尔姆大学) Google(谷歌) TU Vienna(维也纳技术大学) TU Munich(慕尼黑技术大学)

AI总结 提出UNITE,一个统一的语义Transformer,通过端到端训练从RGB图像直接预测多种密集语义属性,实现3D场景理解,并在多个任务上达到最先进性能。

Comments Accepted at TMLR. Project page: https://unite-page.github.io/

详情
AI中文摘要

整体3D场景理解涉及捕捉和解析非结构化3D环境。由于现实世界的固有复杂性,现有模型大多是针对特定任务开发的。我们引入UNITE,一个用于3D场景理解的统一语义Transformer,这是一种新颖的前馈神经网络,将多种3D密集语义室内任务统一在单个模型中。我们的模型以完全端到端的方式训练,可直接应用于未见过的场景,仅需几秒钟即可推断完整的3D语义几何。我们的方法能够仅从RGB图像直接预测多个密集语义属性,包括3D场景分割、实例嵌入、开放词汇特征和关节。该方法使用2D蒸馏进行训练,高度依赖自监督,并利用新颖的多视图损失确保3D视图一致性。我们证明UNITE在多个不同的密集室内语义任务上达到了最先进的性能,甚至超越了任务特定模型,在许多情况下还超过了使用真实3D几何的方法。参见项目网站 unite-page.github.io。

英文摘要

Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D dense semantic indoor tasks within a single model. Our model is trained in a fully end-to-end manner, operates on unseen scenes, and takes only a couple of seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple dense semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, and articulations, solely from RGB images. The method is trained using 2D distillation, relies heavily on self-supervision, and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different dense indoor semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io.

2511.07438 2026-06-02 cs.CV cs.NA math.NA stat.ME 版本更新

Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM

两个数据集优于一个:冷冻电镜三维重建的双矩方法

Joe Kileel, Oscar Mickelin, Amit Singer, Sheng Xu

发表机构 * Department of Mathematics and Oden Institute, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系与奥登研究所) Yau Mathematical Sciences Center, Tsinghua University(清华大学丘成桐数学科学中心) Program in Applied and Computational Mathematics and Department of Mathematics, Princeton University(普林斯顿大学应用与计算数学项目及数学系) Program in Applied and Computational Mathematics, Princeton University(普林斯顿大学应用与计算数学项目)

AI总结 提出双矩方法(MoDM),利用均匀和非均匀两种取向分布下的二阶矩数据唯一确定分子结构,并开发基于凸松弛的算法实现高精度重建。

详情
AI中文摘要

冷冻电镜(cryo-EM)是一种强大的成像技术,用于从随机取向粒子的噪声断层投影图像中重建三维分子结构。我们引入了一种新的数据融合框架,称为双矩方法(MoDM),该方法从两种不同取向分布下获得的投影图像的二阶矩实例中重建分子结构:一种均匀分布,另一种非均匀且未知。我们证明这些矩在一般情况下唯一确定底层结构(全局旋转和反射除外),并开发了一种基于凸松弛的算法,仅使用二阶统计量即可实现精确恢复。我们的结果展示了在不同实验条件下收集和建模多个数据集的好处,表明利用数据集多样性可以显著提高计算成像任务中的重建质量。

英文摘要

Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.

2512.10958 2026-06-02 cs.CV 版本更新

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

WorldLens:真实世界中驾驶世界模型的全光谱评估

Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

发表机构 * WorldBench Team(WorldBench团队)

AI总结 提出WorldLens基准,从生成、重建、动作跟随、下游任务和人类偏好五个方面评估生成世界模型在视觉真实性、几何一致性、物理合理性和功能可靠性上的表现,并构建WorldLens-26K数据集和WorldLens-Agent评估模型以实现可扩展的可解释评分。

Comments CVPR 2026 Oral Presentation; 80 pages, 37 figures, 29 tables; Project Page at https://worldbench.github.io/worldlens; GitHub at https://github.com/worldbench/WorldLens

详情
AI中文摘要

生成式世界模型正在重塑具身AI,使智能体能够合成逼真的4D驾驶环境,这些环境看起来令人信服,但通常在物理或行为上失败。尽管进展迅速,该领域仍然缺乏统一的方法来评估生成的世界是否保留了几何结构、遵守物理规律或支持可靠的控制。我们引入了WorldLens,一个全光谱基准,用于评估模型在其生成的世界中构建、理解和行为的能力。它涵盖五个方面——生成、重建、动作跟随、下游任务和人类偏好——共同覆盖视觉真实性、几何一致性、物理合理性和功能可靠性。在这些维度上,没有现有的世界模型能够全面表现出色:纹理强的模型往往违反物理规律,而几何稳定的模型缺乏行为保真度。为了将客观指标与人类判断对齐,我们进一步构建了WorldLens-26K,一个大规模的人类标注视频数据集,包含数值评分和文本理由,并开发了WorldLens-Agent,一个从这些标注中蒸馏出的评估模型,以实现可扩展、可解释的评分。基准、数据集和代理共同形成了一个统一的生态系统,用于衡量世界保真度——标准化未来模型不仅根据它们看起来有多真实,而且根据它们的行为有多真实来评判。

英文摘要

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.

2512.07806 2026-06-02 cs.CV 版本更新

Multi-view Pyramid Transformer: Look Coarser to See Broader

多视角金字塔变换器:看得更粗以看得更广

Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park

发表机构 * Sungkyunkwan University(成均馆大学) Yonsei University(延世大学)

AI总结 提出MVP,一种可扩展的多视角变换器架构,通过局部到全局的视角层次和细到粗的空间层次,实现从数十到数百张图像中单次前向重建大型3D场景,结合3D高斯泼溅达到最先进的可泛化重建质量。

Comments Project page: see https://gynjn.github.io/MVP/

详情
AI中文摘要

我们提出了多视角金字塔变换器(MVP),这是一种可扩展的多视角变换器架构,能够直接从数十到数百张图像中单次前向重建大型3D场景。借鉴“看得更广以见全貌,看得更细以见细节”的思想,MVP建立在两个核心设计原则之上:1)局部到全局的视角间层次,逐渐将模型的视角从局部视图扩展到组,最终到整个场景;2)细到粗的视角内层次,从详细的空间表示开始,逐步聚合为紧凑、信息密集的令牌。这种双重层次结构实现了计算效率和表示丰富性,使得快速重建大型复杂场景成为可能。我们在多个数据集上验证了MVP,并表明当与3D高斯泼溅作为底层3D表示结合时,它在广泛视角配置下实现了最先进的可泛化重建质量,同时保持高效率和可扩展性。

英文摘要

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.

2512.07192 2026-06-02 cs.CV 版本更新

HyperVQ: Enabling Hyperprior Entropy Modeling for VQ-Based Generative Image Compression

HyperVQ: 为基于VQ的生成式图像压缩实现超先验熵建模

Niu Yi, Xu Tianyi, Ma Mingming, Wang Xinkun

发表机构 * Xidian University(西安电子科技大学)

AI总结 提出HyperVQ框架,通过将概率建模转移到连续嵌入空间并利用高斯密度与码本锚点的距离关系,实现端到端率失真优化,平均节省18.5%比特率。

Comments 22 pages, 16 figures, 4 tables

详情
AI中文摘要

基于向量量化(VQ)的生成式图像压缩已取得显著的感知质量。然而,现有的VQ编解码器存在两个基本限制。首先,它们缺乏高效的内容自适应熵建模,依赖静态频率,导致编码效率低下。其次,离散索引与连续先验之间的固有冲突阻碍了真正的端到端联合率失真(RD)优化。为解决这些问题,我们提出了HyperVQ,一个为基于VQ的编解码器建立高性能超先验熵基础的原则性框架。HyperVQ的核心思想是将概率建模完全转移到连续嵌入空间。HyperVQ不直接预测离散符号的概率,而是为连续潜变量预测一个高维连续多元高斯分布。通过将离散码本条目视为该空间中的固定“锚点”,我们基于相对距离将连续高斯密度转换为分类索引概率。这种优雅的公式提供了一个强大的、空间自适应的熵引擎,并使交叉熵率目标完全可微,使网络能够在训练过程中主动动态优化RD权衡。为确保实用性,我们设计了轻量级H块和概率估计引擎(PEE),以实现高度并行的毫秒级推理。实验表明,HyperVQ作为跨多种VQ架构(单尺度、大码本、RVQ)的通用模块,平均节省18.5%的比特率,是传统霍夫曼编码节省量的7.28倍。这为下一代生成式图像压缩建立了稳健的、RD可控的基础。

英文摘要

Vector Quantization (VQ) based generative image compression has achieved remarkable perceptual quality. However, existing VQ codecs suffer from two fundamental limitations. First, they lack efficient content-adaptive entropy modeling and rely on static frequencies, leading to low coding efficiency. Second, the inherent conflict between discrete indices and continuous priors prevents true end-to-end joint Rate-Distortion (RD) optimization. To resolve these issues, we propose HyperVQ, a principled framework that establishes a high-performance hyperprior entropy foundation for VQ-based codecs. The core insight of HyperVQ is to shift probability modeling entirely into the continuous embedding space. Instead of directly predicting probabilities for discrete symbols, HyperVQ predicts a high-dimensional continuous multivariate Gaussian distribution for the continuous latents. By treating the discrete codebook entries as fixed "anchors" in this space, we convert the continuous Gaussian density into categorical index probabilities based on relative distances. This elegant formulation provides a powerful, spatially-adaptive entropy engine and renders the cross-entropy rate objective fully differentiable, empowering the network to actively and dynamically optimize the RD trade-off during training. To ensure practicality, we design the lightweight H Block and the Probability Estimation Engine (PEE) to facilitate highly parallel, millisecond-level inference. Experiments demonstrate that HyperVQ acts as a universal module across diverse VQ architectures (single-scale, large-codebook, RVQ), achieving an average bitrate saving of 18.5%, which is 7.28x the saving achieved by conventional Huffman coding. This establishes a robust, RD-controllable foundation for next-generation generative image compression.
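The conversion from a continuous Gaussian density to categorical index probabilities can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a diagonal Gaussian, evaluates its log-density at each codebook anchor, and normalizes with a softmax. The function names, the toy codebook, and the predicted mean and scale are all hypothetical.

```python
import math

def index_probabilities(mu, sigma, codebook):
    """Softmax over diagonal-Gaussian log-densities evaluated at each anchor.

    The Gaussian normalization constant is identical for every anchor, so it
    cancels in the softmax and only the scaled squared distance matters.
    """
    logits = []
    for anchor in codebook:
        d2 = sum(((a - m) / s) ** 2 for a, m, s in zip(anchor, mu, sigma))
        logits.append(-0.5 * d2)
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def rate_bits(probs, index):
    """Ideal code length, in bits, of transmitting `index` under `probs`."""
    return -math.log2(probs[index])

# Toy 2-D codebook with four anchors; mu and sigma stand in for the
# hyperprior's predicted mean and scale at one spatial position.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
probs = index_probabilities(mu=(0.9, 0.1), sigma=(0.3, 0.3), codebook=codebook)
print([round(p, 4) for p in probs], round(rate_bits(probs, 1), 4))
```

Because every step is a smooth function of `mu` and `sigma`, the resulting code length is differentiable with respect to the predicted distribution, which is the property the abstract relies on for end-to-end rate-distortion training.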

2512.04069 2026-06-02 cs.CV cs.RO 版本更新

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

SpaceTools: 通过双交互强化学习实现工具增强的空间推理

Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay

发表机构 * NVIDIA University of Michigan(密歇根大学)

AI总结 提出双交互强化学习(DIRL)框架,通过两阶段训练让视觉语言模型学会协调多种工具(如深度估计、分割、姿态估计)进行精确空间推理,在多个基准上达到最优性能。

Comments CVPR 2026

详情
AI中文摘要

视觉语言模型(VLM)表现出强大的定性视觉理解能力,但在具身应用所需的度量精确空间推理方面存在困难。智能体范式承诺VLM可以使用多种工具来增强这些能力,例如深度估计器、分割模型和姿态估计器。然而,如何在不完全依赖手工设计的提示策略或强制固定预定义工具管道(限制VLM发现最优工具使用模式的能力)的情况下实现这一愿景仍然是一个开放挑战。强化学习可以克服这一差距,但迄今为止由于多工具推理中的大搜索空间,仅限于使用单个视觉工具进行推理。我们引入了双交互强化学习(DIRL),这是一个两阶段训练框架,其中VLM通过交互式探索和反馈学习协调多个工具。在教学阶段,我们将通过交互式RL训练的单个工具专家的演示与使用所有工具的前沿模型的轨迹相结合。在探索阶段,模型通过持续的RL进一步优化多工具协调。我们的模型SpaceTools具有工具增强的空间推理能力,在空间理解基准(RoboSpatial-Home、BLINK、BOP-ASK)上达到了最先进的性能,并展示了使用7自由度机器人作为工具的可靠现实世界操作。DIRL在普通SFT(RoboSpatial上+12%)和RL(RoboSpatial上+16%)基线上提供了显著改进。项目页面:https://spacetools.github.io/。

英文摘要

Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.

2512.00088 2026-06-02 cs.CV cs.LG 版本更新

Semimage: HSV-Based Semantic Image Encoding for Disentangled Text Representation

Semimage: 基于HSV的语义图像编码用于解缠文本表示

Mohammad Zare

发表机构 * AI Lab at Department of Computer Engineering(计算机工程系人工智能实验室) AriooBarzan Engineering Team and Information Technology(AriooBarzan工程团队和信息技术) Shiraz University of Technology(设拉子理工大学)

AI总结 提出SemImage方法,将文本表示为二维语义图像,利用HSV颜色空间解缠主题、情感和强度特征,通过多任务学习实现,并在文档分类中取得竞争性性能。

详情
Journal ref
2026 12th International Conference on Web Research (ICWR), 253-259
AI中文摘要

我们提出SemImage,一种将文本文档表示为二维语义图像以由卷积神经网络(CNN)处理的新方法。在SemImage中,每个单词表示为二维图像中的一个像素:行对应句子,并在句子之间插入额外的边界行以标记语义转换。每个像素不是典型的RGB值,而是解缠HSV颜色空间中的向量,编码不同的语言特征:色调(具有两个分量H_cos和H_sin以考虑循环性)编码主题,饱和度编码情感,明度编码强度或确定性。我们通过多任务学习框架强制这种解缠:ColorMapper网络将每个词嵌入映射到HSV空间,并对色调和饱和度通道应用辅助监督以预测主题和情感标签,同时执行主要任务目标。在句子之间插入动态计算的边界行,当连续句子在语义上不相似时,会在图像中产生清晰的视觉边界,有效地使段落边界突出。我们将SemImage与标准2D CNN(例如ResNet)集成用于文档分类。在多标签数据集(同时具有主题和情感标注)和单标签基准上的实验表明,SemImage能够达到与强文本分类基线(包括BERT和层次注意力网络)相当或更好的准确性,同时提供增强的可解释性。消融研究证实了多通道HSV表示和动态边界行的重要性。最后,我们展示了SemImage的可视化,定性地揭示了生成图像中与主题转换和情感变化相对应的清晰模式,表明我们的表示使这些语言特征对人类和机器都可见。

英文摘要

We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences, and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: Hue (with two components, H_cos and H_sin, to account for circularity) encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.
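The reason for splitting Hue into H_cos and H_sin can be shown with a small sketch. Hue is an angle, so 350 degrees and 10 degrees are close on the color wheel even though the raw numbers are far apart; encoding the angle as a point on the unit circle removes that discontinuity. The helper names and the hand-picked hue, saturation, and value numbers below are illustrative assumptions; in the paper these values come from a learned ColorMapper network.

```python
import math

def encode_pixel(hue_deg, saturation, value):
    """Return the 4-channel pixel (H_cos, H_sin, S, V) for one word."""
    h = math.radians(hue_deg)
    return (math.cos(h), math.sin(h), saturation, value)

def hue_gap(p, q):
    """Euclidean distance between two pixels in the (H_cos, H_sin) plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Three hypothetical words: a and b have nearby topics (hues 350 and 10),
# while c has a distant topic (hue 180).
a = encode_pixel(350.0, 0.8, 0.5)
b = encode_pixel(10.0, 0.2, 0.9)
c = encode_pixel(180.0, 0.8, 0.5)

print(round(hue_gap(a, b), 4), round(hue_gap(a, c), 4))
```

With this encoding a convolutional filter sees neighboring topics as neighboring values, whereas a single raw hue channel would place 350 and 10 at opposite ends of its range.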

2506.22881 2026-06-02 cs.CV 版本更新

CLIP-like Model as a Foundational Density Ratio Estimator

CLIP-like模型作为基础密度比估计器

Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi, Shota Takashiro, Masahiro Suzuki, Hirokatsu Kataoka, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学) AIST(日本产业技术综合研究所)

AI总结 本文重新解释CLIP类模型为预训练的通用密度比估计器,提出重要性权重学习和KL散度估计两种应用,通过单一提示提升F1分数达7点,并利用KL散度实现数据筛选。

Comments Accepted to CVPR 2026. Code: https://github.com/fumiyauchiyama/CLIP_Density_Ratio

详情
AI中文摘要

密度比估计是统计机器学习中的核心概念,因为它为重要性加权、散度估计和无似然推断等任务提供了统一机制,但其在视觉和语言模型中的潜力尚未被充分探索。现代视觉-语言编码器(如CLIP和SigLIP)通过对比目标进行训练,隐式优化联合图像-文本分布与边缘分布之间的对数密度比,从而学习与对数密度比成比例的相似度分数。然而,先前的工作主要关注其嵌入效用,而对比学习诱导的密度比结构在多模态应用中尚未被系统性地检验或利用。为填补这一空白,我们重新解释CLIP类模型为预训练的通用密度比估计器,并表明这一视角能够实现新的算法能力。我们统一解释了对比目标如何估计密度比,并提出了两种实际应用:重要性权重学习和KL散度估计。我们的重要性权重学习方法仅需单个额外提示,即可将F1分数提升最多7点。我们进一步证明,基于CLIP的密度比支持估计KL散度,该散度量化了以图像或文本为条件如何改变另一模态的分布。通过定性示例和标题的N-gram分析,我们发现这些散度捕捉了多模态数据中的语义多样性和模式结构。利用这一特性,我们引入了一种简单的KL引导数据筛选方法,其性能可与LAION2B筛选相媲美。

英文摘要

Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image-text distributions, and therefore learn similarity scores proportional to log density ratios. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained and general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities. We present a unified explanation of how contrastive objectives estimate density ratios and propose two practical applications: Importance Weight Learning and KL divergence estimation. Our Importance Weight Learning method requires only a single additional prompt and improves F1 scores by up to 7 points. We further show that CLIP-based density ratios support estimation of KL divergences that quantify how conditioning on an image or text alters the distribution of the other modality. Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering.
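One way to read the KL estimation idea is sketched below. The abstract does not give the paper's estimator, so this is an assumption-laden illustration: it supposes that the similarity score s(x, y_j) of one image x against N captions drawn from the marginal caption distribution equals the log density ratio up to an additive constant. Self-normalizing the ratios then gives weights q_j = softmax(s)_j, and KL(p(y|x) || p(y)) is approximated by the sum of q_j * log(N * q_j). The scores used here are invented numbers, not CLIP outputs.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def kl_from_scores(scores):
    """Self-normalized estimate of KL(p(y|x) || p(y)) from similarity scores.

    Assumes each score is a log density ratio up to a shared constant and
    that the candidate captions are samples from the marginal p(y).
    """
    q = softmax(scores)
    n = len(q)
    return sum(qj * math.log(n * qj) for qj in q if qj > 0.0)

flat = kl_from_scores([0.0, 0.0, 0.0, 0.0])    # image says nothing about the caption
mild = kl_from_scores([1.0, 0.0, 0.0, 0.0])    # one caption is somewhat preferred
peaked = kl_from_scores([10.0, 0.0, 0.0, 0.0]) # one caption dominates
print(round(flat, 4), round(mild, 4), round(peaked, 4))
```

The estimate is zero when conditioning on the image leaves the caption distribution unchanged and is bounded above by log N for a pool of N candidates, so a larger pool is needed to measure larger divergences.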

2511.21397 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Understanding the Effects of Distractors on Reasoning Vision-Language Models

理解干扰项对推理视觉语言模型的影响

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(浦项科技大学(POSTECH))

AI总结 本文通过构建包含语义和数值维度干扰项的视觉问答数据集Idis,研究视觉干扰项如何影响视觉语言模型的测试时缩放行为,发现视觉干扰项以与文本干扰项根本不同的方式降低准确率而不增加推理长度,并提出简单提示策略缓解干扰项驱动的预测。

Comments preprint

详情
AI中文摘要

无关信息(即干扰项)如何影响视觉语言模型(VLM)的测试时缩放?先前关于纯文本语言模型的研究表明,文本干扰项可以加剧逆缩放,导致模型推理更长但推理轨迹效率更低。在这项工作中,我们研究了类似现象是否在多模态设置中出现。我们引入了Idis(带干扰项的图像),这是一个视觉问答数据集,系统性地沿着语义和数值维度变化干扰项。我们的分析揭示,视觉干扰项以与文本干扰项根本不同的方式影响推理VLM:尽管逆缩放仍然出现,但视觉干扰项降低了准确率而不增加推理长度。我们进一步展示了从推理轨迹中提取的属性计数为干扰项如何与推理长度和准确率交互提供了关键见解。作为合理性检查,我们提出了一种简单的提示策略,以减轻推理视觉语言模型中干扰项驱动的预测。

英文摘要

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causing models to produce longer but less effective reasoning traces. In this work, we investigate whether similar phenomena arise in multimodal settings. We introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic and numerical dimensions. Our analyses reveal that visual distractors affect reasoning VLMs in a fundamentally different way from textual distractors: although inverse scaling still emerges, visual distractors reduce accuracy without increasing reasoning length. We further show that attribute counts extracted from reasoning traces provide key insights into how distractors interact with reasoning length and accuracy. As a sanity check, we propose a simple prompting strategy that mitigates distractor-driven predictions in reasoning vision-language models.

2511.20615 2026-06-02 cs.CV cs.AI 版本更新

Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

评估深度学习模型在负重活动期间全身动态3D姿态预测中的性能

Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri

发表机构 * Department of Mechanical Engineering, Sharif University of Technology(谢里夫理工大学机械工程系)

AI总结 本研究利用双向长短期记忆和Transformer架构的时间序列模型,通过优化身体段长度约束的代价函数,实现了对动态负重活动中全身3D姿态的高精度预测。

Comments 11 pages, 6 figures, 7 tables, This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

本研究旨在探索深度神经网络在动态负重活动中全身人体姿态预测的应用。使用双向长短期记忆(BLSTM)和Transformer架构训练了两个时间序列模型。数据集包含20名正常体重健康男性个体的3D全身插件步态动态坐标,每人从不同负载位置执行204次负重任务,并采用不同的举升和处理技术。模型输入包括手-负载位置的3D坐标、举升(弯腰、全蹲和半蹲)和处理(单手和双手)技术、体重和身高,以及任务前25%时间的身体姿态3D坐标数据。模型利用这些输入预测任务剩余75%时间内的身体坐标。此外,提出了一种新方法,通过优化新的代价函数强制身体段长度恒定,以提高先前和当前姿态预测网络的准确性。结果表明,新代价函数使手臂和腿部模型的预测误差分别降低了约8%和21%。我们发现,使用Transformer架构(均方根误差为41.4 mm)的长期性能比基于BLSTM的模型准确约58%。本研究证明了利用捕捉时间序列依赖性的神经网络在3D运动帧中的价值,为理解和预测人工物料搬运活动中的运动动力学提供了独特方法。

英文摘要

This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals, each performing 204 load-reaching tasks from different load positions while adopting various lifting and handling techniques. The model inputs consisted of the 3D coordinates of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We found that the transformer architecture, with a root-mean-square error of 41.4 mm, exhibited approximately 58% more accurate long-term performance than the BLSTM-based model. This study demonstrates the merit of using neural networks that capture time-series dependencies in 3D motion frames, providing a unique approach for understanding and predicting motion dynamics during manual material handling activities.
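The idea of enforcing constant body segment lengths through the cost function can be sketched as a position error plus a penalty on segment-length deviation. The abstract does not give the exact form of the new cost function, so the additive penalty, the `weight` parameter, the function name, and the three-marker toy skeleton below are all assumptions made for illustration.

```python
import math

def posture_cost(pred, target, segments, ref_lengths, weight=1.0):
    """Mean squared position error plus a segment-length penalty.

    pred, target: lists of (x, y, z) marker coordinates in metres.
    segments:     (i, j) index pairs naming the markers that bound a segment.
    ref_lengths:  the subject's fixed length for each segment, in metres.
    """
    position = sum(math.dist(p, t) ** 2 for p, t in zip(pred, target)) / len(pred)
    length = sum(
        (math.dist(pred[i], pred[j]) - ref) ** 2
        for (i, j), ref in zip(segments, ref_lengths)
    ) / len(segments)
    return position + weight * length

# Hypothetical arm chain: shoulder, elbow, wrist.
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.30), (0.0, 0.0, 0.55)]
segments = [(0, 1), (1, 2)]
ref_lengths = [0.30, 0.25]

stretched = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.30), (0.0, 0.0, 0.65)]  # forearm too long
cost_exact = posture_cost(target, target, segments, ref_lengths)
cost_stretched = posture_cost(stretched, target, segments, ref_lengths)
print(cost_exact, cost_stretched)
```

The penalty term is zero only when every predicted segment has its reference length, so minimizing it pushes a network away from anatomically impossible poses even when the per-marker error alone is small.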

2511.20295 2026-06-02 cs.CV 版本更新

Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

回归特征:用视频反事实解释来解释视频分类器

Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

发表机构 * Visual Understanding Research Group, Department of Informatics, King’s College London, UK(信息学院视觉理解研究组,伦敦国王学院,英国) Department of Informatics, King’s College London, UK(信息学院,伦敦国王学院,英国)

AI总结 提出BTTF优化框架,通过两阶段优化和渐进式去噪策略生成物理合理、时间连贯的视频反事实解释,揭示视频分类器的决策依据。

Comments Accepted at CVPR2026 main conference

详情
AI中文摘要

反事实解释(CFEs)是对模型输入的最小且语义上有意义的修改,能够改变模型预测。它们突出了模型依赖的决定性特征,为分类器提供对比性解释。最先进的视觉反事实解释方法主要集中于解释图像分类器,而视频模型领域相对未被充分探索。为了使视频CFEs有用,它们必须物理上合理、时间上连贯,并表现出平滑的运动轨迹。现有的基于图像的CFE方法旨在解释图像分类器,缺乏生成时间连贯、平滑且物理合理的视频CFEs的能力。为了解决这个问题,我们提出了回归特征(BTTF),一个生成视频CFEs的优化框架。我们的方法引入了两个新颖的特性:1)一个优化方案,用于检索由输入视频第一帧条件化的初始潜在噪声;2)一个两阶段优化策略,使得能够在输入视频附近搜索反事实视频。两个优化过程仅由目标分类器指导,确保解释的忠实性。为了加速收敛,我们还引入了一种渐进式优化策略,逐步增加去噪步骤的数量。在Shape-Moving(运动分类)、MEAD(情感分类)和NTU RGB+D(动作分类)等视频数据集上的大量实验表明,我们的BTTF有效地生成了有效、视觉相似且逼真的反事实视频,为分类器的决策机制提供了具体见解。

英文摘要

Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods have primarily focused on interpreting image classifiers, leaving the domain of video models relatively underexplored. For video CFEs to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing image-based CFE methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features: 1) an optimization scheme to retrieve the initial latent noise conditioned on the first frame of the input video, and 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.

2507.02792 2026-06-02 cs.CV 版本更新

RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

RichControl: 面向文本到图像生成的、结构和外观丰富的免训练空间控制

Lexi Pang, Liheng Zhang, Hang Ye, Xiaoxuan Ma, Yizhou Wang

发表机构 * Peking University(北京大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出一种免训练框架,通过解耦条件特征的采样调度与去噪过程,并引入重启细化调度和外观丰富提示策略,在复杂条件下实现结构和外观平衡的受控生成。

详情
AI中文摘要

文本到图像(T2I)扩散模型在从文本提示生成高质量图像方面取得了显著成功。最近的研究尝试扩展这些模型以融入条件图像(例如,Canny边缘)实现细粒度空间控制。其中,特征注入方法作为传统基于微调方法的免训练替代方案出现。然而,它们常常遭受结构错位、条件泄露和视觉伪影,特别是当条件图像与自然RGB分布显著偏离时。通过对现有方法的分析,我们识别出一个关键限制:条件特征的采样调度(此前未被探索)未能考虑扩散步骤中结构保持与域对齐之间不断变化的相互作用。受此观察启发,我们提出一个灵活的免训练框架,将条件特征的采样调度与去噪过程解耦,并系统性地研究特征注入调度的谱系,以实现结构对齐与外观质量之间的更好平衡。我们进一步通过引入重启细化调度来增强采样过程,并通过外观丰富的提示策略改善视觉质量。这些设计共同实现了既结构丰富又外观丰富的免训练可控生成。大量实验表明,我们的方法在复杂多样的条件下达到了最先进的性能。由于其通用性,我们的框架自然支持组合条件生成,并以即插即用的方式跨架构泛化,从基于UNet的扩散模型到现代DiT骨干网络(如FLUX)。

英文摘要

Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., Canny edges) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout diffusion steps. Inspired by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and systematically investigate the spectrum of feature injection schedules to achieve a better balance between structural alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and improve the visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free controllable generation that is both structure-rich and appearance-rich. Extensive experiments demonstrate that our method achieves state-of-the-art performance under complex and diverse conditions. Owing to its generality, our framework naturally supports compositional conditional generation and generalizes across architectures in a plug-and-play manner, from UNet-based diffusion models to modern DiT backbones such as FLUX.

2511.10367 2026-06-02 cs.CV cs.AI 版本更新

DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

DermAI:通过质量驱动的图像采集实现移动端AI分类的临床皮肤病学

Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fábio Papais, Francisco Mauro, Natália Lopes, Érico Medeiros, Jéssica Guido, Shirley Cruz, Paulo Borba, Tsang Ing Ren

发表机构 * Centro de Informática, Universidade Federal de Pernambuco, Brazil(巴西伯南布哥联邦大学信息学中心) Hospital das Clínicas, Universidade Federal de Pernambuco, Brazil(巴西伯南布哥联邦大学临床医院)

AI总结 提出DermAI智能手机应用,通过实时质量检查、本地模型适应和多样化数据集收集,解决AI皮肤病学中数据集偏差、图像质量差异和验证不足的问题。

Comments 4 pages, 2 figures, 1 table, submitted to ISBI

详情
AI中文摘要

基于AI的皮肤病学应用仍然受到数据集偏差、图像质量变化和验证有限的限制。我们介绍了DermAI,一个轻量级的基于智能手机的应用,能够在常规咨询期间实时捕获、标注和分类皮肤病变。与以往专注于皮肤镜的工具不同,DermAI在设备上进行质量检查和本地模型适应。DermAI临床数据集涵盖了广泛的肤色、种族和源设备。初步实验中,在公共数据集上训练的模型无法泛化到我们的样本,而使用本地数据进行微调则提高了性能。这些结果强调了标准化、多样化数据收集的重要性,这些数据应与医疗需求一致并面向机器学习开发。

英文摘要

AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks and local model adaptation. The DermAI clinical dataset encompasses a wide range of skin tones, ethnicities, and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.

2511.10806 2026-06-02 eess.IV cs.CV 版本更新

From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring

从注意力到频率:融合Vision Transformer与FFT-ReLU的图像去模糊增强方法

Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali, Md. Mosaddek Khan

发表机构 * Department of Computer Science and Engineering, University of Dhaka(达卡大学计算机科学与工程系)

AI总结 提出一种双域架构,将Vision Transformer与频域FFT-ReLU模块结合,通过空间注意力建模和频率稀疏性抑制模糊伪影并保留细节,在基准数据集上取得优于现有方法的PSNR、SSIM和感知质量。

详情
Journal ref
Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 2, Marbella, Spain, March 5-7, 2026, pp. 1810-1820. SCITEPRESS
AI中文摘要

图像去模糊在计算机视觉中至关重要,旨在从由运动或相机抖动引起的模糊图像中恢复清晰图像。尽管诸如CNN和Vision Transformers(ViTs)等深度学习方法推动了该领域的发展,但它们往往难以处理复杂或高分辨率的模糊以及计算需求。我们提出了一种新的双域架构,将Vision Transformers与频域FFT-ReLU模块统一起来,明确桥接了空间注意力建模和频率稀疏性。在该结构中,ViT骨干网络捕获局部和全局依赖关系,而FFT-ReLU组件则强制执行频域稀疏性以抑制与模糊相关的伪影并保留精细细节。在基准数据集上的大量实验表明,与最先进模型相比,该架构实现了优越的PSNR、SSIM和感知质量。定量指标、定性比较和人类偏好评估均证实了其有效性,为真实世界图像恢复建立了一个实用且可泛化的范式。

英文摘要

Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Quantitative metrics, qualitative comparisons, and human preference evaluations all confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.

2509.22689 2026-06-02 eess.IV cs.CV 版本更新

Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation

基于图论一致性的鲁棒且拓扑感知的半监督组织病理学分割

Ha-Hieu Pham, Minh Le, Han Huynh, Nguyen Quoc Khanh Le, Huy-Hieu Pham

发表机构 * Student(学生)

AI总结 提出拓扑图一致性(TGC)框架,通过对齐预测图与参考图的拉普拉斯谱、组件计数和邻接统计,在仅5-10%标注下实现最先进的半监督分割性能。

Comments Accepted to the AAAI 2026 Student Abstract and Poster Program

详情
Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence 2026
AI中文摘要

半监督语义分割(SSSS)在计算病理学中至关重要,因为密集标注成本高昂且有限。现有方法通常依赖像素级一致性,这会传播噪声伪标签并产生碎片化或拓扑无效的掩膜。我们提出拓扑图一致性(TGC),一个通过对齐预测图与参考图的拉普拉斯谱、组件计数和邻接统计来整合图论约束的框架。这强制执行全局拓扑并提高分割精度。在GlaS和CRAG上的实验表明,TGC在5-10%监督下实现了最先进的性能,并显著缩小了与全监督的差距。

英文摘要

Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision.
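One of the graph-theoretic quantities named in the abstract, the component count, can be illustrated with a short sketch. This is not the authors' TGC implementation: the function names, the 4-connectivity choice, and the absolute-difference penalty are assumptions made purely for illustration.

```python
from collections import deque


def count_components(mask):
    """Count 4-connected foreground regions in a binary mask (list of rows of 0/1)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # New region found: flood-fill it with breadth-first search.
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def component_count_penalty(pred_mask, ref_mask):
    """Hypothetical penalty: absolute difference in region counts (0 means the counts agree)."""
    return abs(count_components(pred_mask) - count_components(ref_mask))


# A reference gland predicted as two fragments yields a penalty of 1.
reference = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
fragmented = [[1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 0, 0]]
penalty = component_count_penalty(fragmented, reference)
```

A pixel-level loss would treat the missing column in `fragmented` as just two wrong pixels, whereas the component count flags that one object has been split in two.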

2511.06720 2026-06-02 cs.CV 版本更新

Relative Energy Learning for LiDAR Out-of-Distribution Detection

面向激光雷达分布外检测的相对能量学习

Zizhao Li, Zhengkang Xiang, Jiayang Ao, Joseph West, Kourosh Khoshelham

发表机构 * The University of Melbourne(墨尔本大学)

AI总结 提出相对能量学习(REL)框架,利用正负logits之间的能量差距作为相对评分函数,并结合轻量级数据合成策略Point Raise,有效提升激光雷达点云的分布外(OOD)检测性能。

Comments Project Page: https://github.com/343gltysprk/rel

详情
AI中文摘要

分布外(OOD)检测是可靠自动驾驶的关键要求,其安全性依赖于识别超出训练分布的道路障碍物和意外物体。尽管针对2D图像的OOD检测已有大量研究,但将其直接迁移到3D激光雷达点云已被证明效果不佳。当前的激光雷达OOD检测方法难以区分罕见异常与常见类别,导致较高的误报率,并在安全关键场景中产生过度自信的错误。我们提出相对能量学习(REL),一个简单而有效的激光雷达点云OOD检测框架。REL利用正(分布内)logits与负logits之间的能量差距作为相对评分函数,缓解原始能量值的校准问题,并提高跨场景的鲁棒性。为解决训练中缺乏OOD样本的问题,我们提出一种轻量级数据合成策略Point Raise,通过扰动现有点云生成辅助异常,同时不改变内点语义。在SemanticKITTI和Spotting the Unexpected(STU)基准上,REL以较大优势持续优于现有方法。我们的结果表明,相对能量建模与简单的合成离群样本相结合,为开放世界自动驾驶中的可靠OOD检测提供了一种有原则且可扩展的解决方案。

英文摘要

Out-of-distribution (OOD) detection is a critical requirement for reliable autonomous driving, where safety depends on recognizing road obstacles and unexpected objects beyond the training distribution. Despite extensive research on OOD detection in 2D images, direct transfer to 3D LiDAR point clouds has been proven ineffective. Current LiDAR OOD methods struggle to distinguish rare anomalies from common classes, leading to high false-positive rates and overconfident errors in safety-critical settings. We propose Relative Energy Learning (REL), a simple yet effective framework for OOD detection in LiDAR point clouds. REL leverages the energy gap between positive (in-distribution) and negative logits as a relative scoring function, mitigating calibration issues in raw energy values and improving robustness across various scenes. To address the absence of OOD samples during training, we propose a lightweight data synthesis strategy called Point Raise, which perturbs existing point clouds to generate auxiliary anomalies without altering the inlier semantics. Evaluated on SemanticKITTI and the Spotting the Unexpected (STU) benchmark, REL consistently outperforms existing methods by a large margin. Our results highlight that modeling relative energy, combined with simple synthetic outliers, provides a principled and scalable solution for reliable OOD detection in open-world autonomous driving.
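The idea of scoring with an energy gap instead of a raw energy can be sketched as follows. The free-energy definition (negative log-sum-exp of the logits) is the standard one from energy-based OOD detection; how REL actually splits logits into positive and negative groups is not stated in the abstract, so the two-group split and the sign convention below are assumptions for illustration only.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def free_energy(logits):
    """Standard energy score: E = -logsumexp(logits). Lower means more confident."""
    return -logsumexp(logits)


def relative_energy_score(positive_logits, negative_logits):
    """Hypothetical relative score: energy of the in-distribution logits minus
    energy of the negative logits. Larger values suggest an OOD point."""
    return free_energy(positive_logits) - free_energy(negative_logits)


# Confident in-distribution point: one positive logit dominates, negative logit is low.
inlier_score = relative_energy_score([8.0, 1.0, 0.0], [-2.0])
# Ambiguous point: flat positive logits, high negative logit.
outlier_score = relative_energy_score([0.0, 0.0, 0.0], [5.0])
```

Because the score is a difference of two energies, a constant shift applied to all logits cancels out, which is one plausible reading of why a relative score would be less sensitive to calibration than a raw energy value.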

2511.02591 2026-06-02 cs.CV 版本更新

Zero-Shot Multi-Animal Tracking in the Wild

野外零样本多动物跟踪

Jan Frederik Meier, Timo Lüddecke

发表机构 * Institute of Computer Science and Campus Institute Data Science(计算机科学研究所和校园数据科学研究院)

AI总结 本文基于视觉基础模型,结合Grounding DINO与SAM 2,通过三项针对性修改实现无需重新训练或调参的零样本多动物跟踪,在多个数据集上取得最优结果。

Comments CV4Animals Workshop at CVPR26

详情
AI中文摘要

多动物跟踪对于理解动物生态和行为至关重要,但由于栖息地、运动模式和物种外观的差异,仍然具有挑战性。传统方法通常需要针对每个新场景进行大量的微调和启发式设计。在这项工作中,我们探索了用于零样本多动物跟踪的视觉基础模型。基于SAM2MOT,我们将Grounding DINO与Segment Anything Model 2(SAM 2)相结合,并引入了三项针对性修改,以使框架适应动物的外观和行为,而无需在数据集之间进行任何重新训练或超参数调整。我们还评估了最新的SAM 3模型,但发现了限制其在野外多动物跟踪中适用性的实际局限性。我们的方法在Chimp-Act、Bird Flock Tracking、AnimalTrack和GMOT-40的一个子集上取得了最先进的结果,展示了跨不同物种和环境的强大泛化能力。代码可在https://github.com/ecker-lab/SAM2-Animal-Tracking获取。

英文摘要

Multi-animal tracking is crucial for understanding animal ecology and behavior, yet remains challenging due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive fine-tuning and heuristic design for each new scenario. In this work, we explore vision foundation models for zero-shot multi-animal tracking. Building on SAM2MOT, we combine Grounding DINO with the Segment Anything Model 2 (SAM 2) and introduce three targeted modifications to adapt the framework to animal appearance and behavior without any retraining or hyperparameter tuning between datasets. We also evaluate the recent SAM 3 model, but identify practical limitations that restrict its applicability to multi-animal tracking in the wild. Our method achieves state-of-the-art results across Chimp-Act, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40, demonstrating robust generalization across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.

2511.02086 2026-06-02 cs.CV 版本更新

Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study

用于手术引导的无标记增强现实配准:一项多解剖结构临床精度研究

Yue Yang, Fabian Necker, Christoph Leuze, Michelle Chen, Andrey Finegersh, Jake Lee, Vasu Divi, Bruce Daniel, Brian Hargreaves, Jie Ying Wu, Fred M Baik

发表机构 * School of Medicine, Stanford University(斯坦福大学医学院) Vanderbilt Institute of Surgery and Engineering(范德比尔特手术与工程研究院)

AI总结 本文开发并临床评估了一种基于深度相机的无标记增强现实配准方法,在头戴式显示器上实现多解剖结构(足、耳、小腿)的手术引导,中位误差约3-4 mm,接近临床可接受阈值。

详情
AI中文摘要

目的:在本文中,我们开发并在临床中评估了一种基于深度相机的无标记增强现实(AR)配准流程,应用于头戴式显示器,并在真实手术环境中评估小或低曲率解剖结构的精度。方法:在HoloLens 2上,我们通过(i)深度偏差校正,(ii)简短的闭环初始化,(iii)全局和局部配准,将Articulated HAnd Tracking(AHAT)深度与CT衍生的皮肤网格对齐。我们通过比较AR追踪工具在腿和足模型上的“皮肤到骨骼”相对距离与CT真实值,验证了表面追踪误差度量。随后,在腓骨游离皮瓣切取和下颌骨重建手术的初始阶段,我们进行了七次术中目标试验(足x2,耳x3,腿x2),每次试验收集500多个数据点。结果:临床前验证显示AR追踪距离与CT距离高度一致(腿:中位|Δd| 0.78 mm,RMSE 0.97 mm;足:0.80 mm,1.20 mm)。临床中,每点误差中位数为3.9 mm。按解剖结构划分的中位误差分别为足3.2 mm、耳4.3 mm、小腿5.3 mm,5 mm覆盖率分别为92-95%、84-90%和72-86%。足与小腿之间存在显著差异(中位差约1.1 mm;p < 0.001)。结论:在无基准标记的情况下,基于深度相机的无标记AR流程在活体手术环境中对足、耳和小腿实现了约3-4 mm的中位误差,接近中等风险任务的典型临床误差阈值。人工引导初始化结合全局到局部配准使得在小或低曲率目标上实现精确对齐,提高了无标记AR引导的临床准备度。

英文摘要

Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing "skin-to-bone" relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data points per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Delta d| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Delta median ~1.1 mm; p < 0.001). Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.
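The two summary statistics reported per trial, median error and 5 mm coverage, can be computed as in the sketch below. It assumes that "5 mm coverage" means the fraction of per-point errors at or below 5 mm, which the abstract does not spell out; the function name and the sample values are illustrative and are not data from the study.

```python
from statistics import median


def summarize_errors(errors_mm, threshold_mm=5.0):
    """Return the median error and the fraction of points within the threshold."""
    if not errors_mm:
        raise ValueError("errors_mm must contain at least one measurement")
    within = sum(1 for e in errors_mm if e <= threshold_mm)
    return {
        "median_mm": median(errors_mm),
        "coverage": within / len(errors_mm),
    }


# Made-up per-point errors in millimetres, for illustration only.
summary = summarize_errors([1.0, 2.0, 3.0, 4.0, 6.0])
```

Reporting both numbers is useful because the median describes typical accuracy while the coverage describes how often the error stays under a clinically motivated bound.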

2510.27249 2026-06-02 cs.CV 版本更新

C-LEAD: Contrastive Learning for Enhanced Adversarial Defense

C-LEAD:用于增强对抗防御的对比学习

Suklav Ghosh, Sonal Kumar, Arijit Sur

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology(计算机科学与工程系,印度理工学院)

AI总结 提出利用对比学习增强分类模型对对抗攻击的鲁棒性,通过联合优化模型参数和扰动,学习鲁棒特征表示。

Comments Published in SN Computer Science, May 2026

详情
AI中文摘要

深度神经网络(DNN)在图像分类、分割和目标检测等计算机视觉任务中取得了显著成功。然而,它们容易受到对抗攻击,输入图像中的微小扰动可能导致错误预测。解决这个问题对于部署鲁棒的深度学习系统至关重要。本文提出了一种新颖的方法,利用对比学习进行对抗防御,这是一个先前未探索的领域。我们的方法利用对比损失函数,通过使用干净图像和对抗扰动图像训练分类模型,增强其鲁棒性。通过联合优化模型参数和扰动,我们的方法使网络能够学习鲁棒的特征表示,从而不易受到对抗攻击。实验结果表明,模型对各种对抗扰动的鲁棒性有显著提升。这表明对比损失有助于提取更具信息性和鲁棒性的特征,为深度学习中的对抗鲁棒性领域做出了贡献。代码已在GitHub上公开:https://github.com/suklav/C_Lead。

英文摘要

Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model's parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model's robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning. The code is publicly available on GitHub at https://github.com/suklav/C_Lead.

2510.14904 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

CaptionFormer:时空对象的统一分割、跟踪与描述

Gabriel Fiastre, Antoine Yang, Cordelia Schmid

发表机构 * Inria, École Normale Supérieure, CNRS, PSL Research University(法国国家信息与自动化研究所、巴黎高等师范学院、法国国家科学研究中心、巴黎文理研究大学) Google DeepMind(谷歌DeepMind)

AI总结 提出 CaptionFormer 模型,通过利用 VLM 生成合成描述并扩展数据集,实现视频中对象轨迹的联合检测、分割、跟踪与描述,在三个基准上达到最优。

Comments 17 pages, 10 figures

详情
AI中文摘要

密集视频对象描述(DVOC)是联合检测、跟踪和描述视频中对象轨迹的任务,需要理解时空细节并用自然语言描述。由于任务复杂性和手动标注的高成本,先前方法采用有限数据的训练策略,可能导致次优性能。为解决此问题,我们提出利用最先进的 VLM 生成关于时空定位实体的描述,并用我们的合成描述(LVISCap 和 LV-VISCap)扩展 LVIS 和 LV-VIS 数据集。此外,我们引入端到端模型 CaptionFormer,能够联合检测、分割、跟踪和描述对象轨迹。CaptionFormer 在三个现有基准(VidSTG、VLN 和 BenSMOT)上取得了最先进的 DVOC 结果。数据集和代码可在 https://www.gabriel.fiastre.fr/captionformer/ 获取。

英文摘要

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM, and extend the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap). Moreover, we introduce an end-to-end model, CaptionFormer, capable of jointly detecting, segmenting, tracking and captioning object trajectories. CaptionFormer achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/captionformer/.

2510.24870 2026-06-02 cs.CL cs.CV cs.IR 版本更新

Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

看穿MiRAGE:评估多模态检索增强生成

Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Human Language Technology Center of Excellence(人类语言技术卓越中心)

AI总结 提出MiRAGE框架,通过InfoF1和CiteF1指标评估多模态RAG的事实性和引用支持,并验证其与人工判断的一致性。

Comments https://github.com/alexmartin1722/mirage

详情
AI中文摘要

我们介绍了MiRAGE,一个用于从多模态源进行检索增强生成(RAG)的评估框架。随着视听媒体成为在线信息的普遍来源,RAG系统必须将这些来源的信息整合到生成中。然而,现有的RAG评估以文本为中心,限制了它们在多模态环境中的适用性。MiRAGE是一种以声明为中心的多模态RAG评估方法,包括评估事实性和信息覆盖率的InfoF1,以及评估引用支持和完整性的CiteF1。我们表明,当由人类应用时,MiRAGE与输出质量的外部判断高度一致。我们还介绍了MiRAGE的自动实现,以及三种著名的基于文本的RAG指标——ALCE、ARGUE和RAGAS——的多模态变体,展示了以文本为中心的工作的局限性,并为自动评估奠定了基础。我们发布开源实现,并概述了多模态RAG的评估方法。

英文摘要

We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal settings. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, which assesses factuality and information coverage, and CiteF1, which assesses citation support and completeness. We show that, when applied by humans, MiRAGE strongly aligns with extrinsic judgments of output quality. We additionally introduce an automatic implementation of MiRAGE as well as multimodal variants of three prominent text-based RAG metrics -- ALCE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline evaluation methods for multimodal RAG.
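The abstract does not define InfoF1 or CiteF1 formally. The sketch below assumes, from the names alone, that each is a harmonic mean of a precision-like term (share of generated claims that are supported) and a recall-like term (share of reference claims that are covered). The function names and the claim counts are hypothetical and do not come from the MiRAGE release.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def claim_f1(num_generated, num_generated_supported,
             num_reference, num_reference_covered):
    """Claim-level F1 under the assumptions stated above."""
    precision = num_generated_supported / num_generated if num_generated else 0.0
    recall = num_reference_covered / num_reference if num_reference else 0.0
    return f1(precision, recall)


# 3 of 4 generated claims are supported; 3 of 6 reference claims are covered.
score = claim_f1(4, 3, 6, 3)
```

Working at the level of claims rather than tokens is what lets the same scoring apply whether the supporting evidence is text, audio, or video.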

2510.24078 2026-06-02 cs.CV 版本更新

Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification

超越对象:面向细粒度分类的上下文合成数据生成

William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky

发表机构 * Princeton University(普林斯顿大学) Google DeepMind(谷歌DeepMind)

AI总结 提出BOB微调策略,通过提取并条件化类不可知属性(如场景背景和物体姿态)来缓解过拟合,提升细粒度分类中合成训练数据的质量,在多个数据集上达到最优性能。

Comments CVPR 2026

详情
AI中文摘要

文本到图像(T2I)模型越来越多地用于合成数据集生成,但为分类生成有效的合成训练数据仍然具有挑战性。用少量真实样本微调T2I模型有助于提高合成训练数据的质量,但可能导致过拟合并降低生成样本的多样性。我们提出了一种微调策略BOB(BeyondOBjects)来缓解细粒度分类中的这些问题。给定少量真实样本,我们首先提取类不可知属性,如场景背景和物体姿态。然后在T2I模型微调过程中显式地以这些属性为条件,并在生成过程中将其边缘化。这种设计减轻了过拟合,保留了T2I模型的生成先验,减少了估计误差,并进一步最小化了意外的类间关联。在多个T2I模型、骨干网络和数据集上的大量实验表明,当使用合成数据增强时,我们的方法在低样本细粒度分类中达到了最先进的性能。具体来说,在Aircraft数据集上,BOB比DataDream提高了7.4%(从50.0%到57.4%,当使用5张真实图像和100张合成图像微调CLIP分类器时)。在四个基准测试中的三个中,使用5张真实图像和BOB增强的合成数据微调下游模型,其性能优于使用10张真实图像微调。总体而言,在24个实验设置中,BOB在18个设置中优于先前技术,其中14个设置的准确率提升超过2%。

英文摘要

Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.

2510.23057 2026-06-02 cs.RO cs.CV cs.SY eess.IV eess.SY 版本更新

Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation

Seq-DeepIPC:足式机器人导航中用于端到端控制的顺序感知

Oskar Natan, Jun Miura

发表机构 * Department of Computer Science and Electronics, Universitas Gadjah Mada(计算机科学与电子系,加查马达大学) Department of Computer Science and Engineering, Toyohashi University of Technology(计算机科学与工程系,丰桥技术科学大学)

AI总结 提出Seq-DeepIPC模型,通过融合多模态感知(RGB-D+GNSS)与时间序列,实现足式机器人在真实环境中的端到端导航控制,并在机器人狗上验证了其有效性。

Comments This work has been accepted for publication in the IEEE Sensors Journal. https://ieeexplore.ieee.org/document/11373257/

详情
AI中文摘要

我们提出了Seq-DeepIPC,一种用于足式机器人在真实环境中导航的顺序端到端感知到控制模型。Seq-DeepIPC通过将多模态感知(RGB-D+GNSS)与时间融合和控制紧密结合,推进了自主足式导航的智能感知。该模型联合预测语义分割和深度估计,为规划和控制提供更丰富的空间特征。为了在边缘设备上高效部署,我们使用轻量级模型作为编码器,在保持精度的同时减少计算量。通过移除噪声较大的IMU,转而通过顺序GNSS坐标的差分分析推导全局航向,简化了航向估计。我们收集了一个更大且更多样化的数据集,包括道路和草地地形,并在机器人狗上验证了Seq-DeepIPC。对比和消融研究表明,顺序输入改善了我们的模型中的感知和控制,而其他基线则没有受益。Seq-DeepIPC以合理的模型大小取得了具有竞争力或更好的结果;尽管仅使用GNSS的航向在高大建筑物附近可靠性较低,但在开阔区域是鲁棒的。总体而言,Seq-DeepIPC将端到端导航从轮式机器人扩展到更通用和具有时间感知能力的系统。为了支持未来的研究,我们将在GitHub仓库https://github.com/oskarnatan/Seq-DeepIPC发布代码。

英文摘要

We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use a lightweight model as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead deriving global heading via differential analysis of sequential GNSS coordinates. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the code in our GitHub repository at https://github.com/oskarnatan/Seq-DeepIPC.
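Deriving a global heading from two consecutive GNSS fixes can be sketched with the standard initial-bearing formula. This is a generic illustration and not the Seq-DeepIPC code: the function name and the use of the great-circle bearing are assumptions, and the authors' "differential analysis" may differ in detail.

```python
import math


def gnss_heading_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from fix 1 to fix 2, in degrees clockwise from north.

    Inputs are latitude/longitude in decimal degrees. The result is not
    meaningful when the two fixes coincide (the robot has not moved).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2 returns (-180, 180]; wrap into [0, 360).
    return math.degrees(math.atan2(y, x)) % 360.0


north = gnss_heading_deg(0.0, 0.0, 1.0, 0.0)
east = gnss_heading_deg(0.0, 0.0, 0.0, 1.0)
west = gnss_heading_deg(0.0, 0.0, 0.0, -1.0)
```

Since the heading comes entirely from the displacement between fixes, any position noise (for example from multipath near tall buildings) feeds directly into the estimate, which is consistent with the limitation the abstract reports.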

2510.17045 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Video Reasoning without Training

无需训练的视频推理

Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

发表机构 * Qualcomm AI Research(高通AI研究) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出V-Reason方法,利用输出分布熵作为信号,通过轻量级控制器在推理时自适应调整值缓存,无需强化学习或微调即可提升视频推理性能。

Comments CVPR Findings 2026. Project Page https://deepaksridhar.github.io/vreason.github.io/

详情
AI中文摘要

使用大型多模态模型(LMM)进行视频推理依赖于昂贵的强化学习(RL)和冗长的思维链,导致训练和推理过程中产生大量计算开销。此外,这些推理模型中控制思维过程的机制非常有限。在本文中,我们利用模型输出分布的熵作为信号来研究和指导推理行为。我们发现高质量模型表现出微探索和微利用循环的特征模式,随后出现后期熵峰值(即更长的思考)和较低的最终熵,表明更谨慎的探索和自信的收敛(即当模型探索或思考答案时避免过度随机性)。然后,我们利用这些新颖的、有理论基础的见解,引入了V-Reason(Video-Reason),一种推理时优化方法,通过轻量级、可训练的控制器自适应调整LMM的值缓存。我们提出的控制器由基于熵的目标引导,直接在推理时调整模型行为,无需使用任何RL或监督微调。我们的实验表明,V-Reason在许多视频推理数据集上显著优于基础指令调优模型,将与RL模型的差距平均缩小到0.6%的准确率以内。我们在无需任何训练的情况下实现了这一点,同时提供了效率优势:V-Reason使用的token比RL模型少58.6%。项目页面:https://deepaksridhar.github.io/vreason.github.io/

英文摘要

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, we use the entropy of the model's output distribution as a signal to study and guide reasoning behavior. We discover that high-quality models exhibit a characteristic pattern of micro-exploration and micro-exploitation cycles, followed by a later entropy peak (i.e., longer thinking) and a lower final entropy, indicating more deliberate exploration and confident convergence (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We then use these novel, theoretically-grounded insights to introduce V-Reason (Video-Reason), an inference-time optimization method that adapts the value cache of the LMM through a lightweight, trainable controller. Our proposed controller is guided by an entropy-based objective, to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Our experiments show that V-Reason significantly outperforms the base instruction-tuned models on many video reasoning datasets, narrowing the gap with RL models to within 0.6% accuracy on average. We achieve this without any training, while offering efficiency benefits: V-Reason uses 58.6% fewer tokens than the RL model. Project Page https://deepaksridhar.github.io/vreason.github.io/
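The signal the abstract relies on, the entropy of the model's output distribution, is the Shannon entropy of the softmax over next-token logits. The sketch below shows only that computation; it does not reproduce V-Reason's controller or its objective, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a non-empty list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Flat logits give maximal entropy (log of the vocabulary size);
# a sharply peaked distribution gives entropy close to zero.
exploring = output_entropy([0.0, 0.0, 0.0, 0.0])
converged = output_entropy([50.0, 0.0, 0.0, 0.0])
```

Tracking this value across decoding steps gives the kind of entropy curve the abstract describes, with high values during exploration and low values once the model has settled on an answer.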

2510.16660 2026-06-02 cs.CV cs.LG physics.med-ph 版本更新

Universal and Transferable Attacks on Pathology Foundation Models

病理基础模型的通用与可迁移攻击

Yuntian Wang, Xilin Yang, Che-Yung Shen, Nir Pillar, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校电子与计算机工程系) Bioengineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校生物工程系) California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校加州纳米系统研究所) Department of Pathology, Hadassah Hebrew University Medical Center, Jerusalem, 91120, Israel(哈达萨希伯来大学医学中心病理科) Department of Surgery, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校外科系)

AI总结 提出通用可迁移对抗扰动(UTAP),通过固定弱噪声模式破坏多个病理基础模型的特征表示能力,导致下游任务性能下降,并展示其跨数据集通用性和跨模型可迁移性。

Comments 38 Pages, 8 Figures

详情
Journal ref
Light: Science & Applications (2026)
AI中文摘要

我们为病理基础模型引入了通用可迁移对抗扰动(UTAP),揭示了其能力中的关键脆弱性。UTAP 使用深度学习优化,由一个固定的弱噪声模式组成,当添加到病理图像时,会系统地破坏多个病理基础模型的特征表示能力。因此,UTAP 会导致利用基础模型的下游任务性能下降,包括在广泛的未见数据分布上的错误分类。除了损害模型性能,我们展示了 UTAP 的两个关键特征:(1)通用性:其扰动可应用于不同的视野,与开发 UTAP 的数据集无关;(2)可迁移性:其扰动能成功降低各种外部、黑盒病理基础模型(从未见过)的性能。这两个特征表明 UTAP 不是针对特定基础模型或图像数据集的专用攻击,而是对多种新兴病理基础模型及其应用构成广泛威胁。我们在多个数据集上系统评估了 UTAP 对各种最先进病理基础模型的影响,通过使用固定噪声模式对输入图像进行视觉上不可察觉的修改,导致其性能显著下降。这些强大攻击的开发为模型鲁棒性评估建立了一个关键的高标准基准,凸显了推进防御机制的需求,并可能为对抗训练提供必要资产,以确保 AI 在病理学中的安全可靠部署。

英文摘要

We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse fields of view independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models that it has never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern. The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.

2507.23277 2026-06-02 cs.CV 版本更新

iLRM: An Iterative Large 3D Reconstruction Model

iLRM:一种迭代式大型3D重建模型

Gyeongjin Kang, Seungtae Nam, Seungkwon Yang, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park

发表机构 * Sungkyunkwan University(成均馆大学) Yonsei University(延世大学) Rembrand Meta

AI总结 提出一种迭代式大型3D重建模型iLRM,通过解耦场景表示、分解多视图交互和注入高分辨率信息,实现高效、可扩展的前馈3D重建,在RE10K和DL3DV数据集上优于现有方法。

Comments Project page: https://gynjn.github.io/iLRM/

详情
AI中文摘要

前馈3D建模已成为快速高质量3D重建的一种有前景的方法。特别是直接生成显式3D表示(如3D高斯泼溅)因其快速高质量的渲染而备受关注。然而,许多基于Transformer架构的最先进方法存在严重的可扩展性问题,因为它们依赖于跨多个输入视图的图像令牌的全注意力,导致随着视图数量或图像分辨率的增加,计算成本变得难以承受。为了实现可扩展且高效的前馈3D重建,我们引入了一种迭代式大型3D重建模型(iLRM),该模型通过迭代细化机制生成3D高斯表示,并遵循三个核心原则:(1)将场景表示与输入图像解耦,以实现紧凑的3D表示;(2)将全局多视图交互分解为两阶段注意力方案,以降低计算成本;(3)在每一层注入高分辨率信息,以实现高保真重建。在广泛使用的数据集(如RE10K和DL3DV)上的实验结果表明,iLRM在重建质量和速度上均优于现有方法。

英文摘要

Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enable compact 3D representations; (2) decomposing global multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.

2510.14025 2026-06-02 cs.CV 版本更新

NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

NAPPure: 针对非加性扰动的鲁棒图像分类的对抗净化

Junjie Nan, Jianing Li, Wei Chen, Mingkun Zhang, Xueqi Cheng

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(人工智能安全国家重点实验室,计算技术研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出NAPPure框架,通过似然最大化解耦干净图像与扰动参数,有效提升图像分类模型对非加性扰动(如模糊、遮挡、失真)的鲁棒性。

详情
AI中文摘要

对抗净化在对抗图像扰动方面取得了巨大成功,这些扰动通常被假定为加性的。然而,非加性对抗扰动,如模糊、遮挡和失真,在现实世界中也很常见。在这种扰动下,现有的对抗净化方法效果较差,因为它们是为适应加性性质而设计的。在本文中,我们提出了一个扩展的对抗净化框架,名为NAPPure,它可以进一步处理非加性扰动。具体来说,我们首先建立对抗图像的生成过程,然后通过似然最大化来解耦潜在的干净图像和扰动参数。在GTSRB和CIFAR-10数据集上的实验表明,NAPPure显著提升了图像分类模型对非加性扰动的鲁棒性。

英文摘要

Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.

2510.13774 2026-06-02 cs.LG cs.CV 版本更新

UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

UrbanFusion: 用于鲁棒空间表示对比学习的随机多模态融合

Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出UrbanFusion模型,通过随机多模态融合(SMF)和Transformer模块整合街景、遥感、地图和POI数据,在56个城市41项任务中优于现有GeoAI模型。

详情
Journal ref
International Conference on Machine Learning (ICML), 2026
AI中文摘要

预测房价和公共卫生指标等城市现象需要有效整合各种地理空间数据。当前方法主要使用特定任务模型,而近期用于空间表示的通用模型通常仅支持有限模态且缺乏多模态融合能力。为克服这些挑战,我们提出UrbanFusion,一种具有随机多模态融合(SMF)的空间表示模型。该框架采用模态特定编码器处理不同类型输入,包括街景图像、遥感数据、制图地图和兴趣点(POI)数据。这些多模态输入通过基于Transformer的融合模块进行集成,学习统一表示。在全世界56个城市的41项任务上的广泛评估表明,与最先进的GeoAI模型相比,UrbanFusion具有强大的泛化和预测性能。具体而言,它1)在位置编码上优于先前模型,2)允许推理时多模态输入,3)能很好地泛化到训练中未见过的区域。UrbanFusion在预训练和推理过程中均可灵活利用给定位置的任何可用模态子集,从而在多样化的数据可用性场景中实现广泛适用性。

英文摘要

Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent generic models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a spatial representation model that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios.

2505.16915 2026-06-02 cs.CV cs.AI 版本更新

DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

DetailMaster:你的文本到图像模型能处理长提示吗?

Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li

发表机构 * Sun Yat-Sen University(中山大学) Alibaba Group(阿里巴巴集团) Worcester Polytechnic Institute(伍斯特理工学院) Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology(广东省火灾科学与智能应急技术重点实验室)

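AI summary: Proposes the DetailMaster benchmark, which uses an automated data construction and evaluation pipeline to systematically assess text-to-image models on long prompts; it identifies limitations of encoders and diffusion models under detail-intensive conditions and shows that high-fidelity generation requires combining expanded prompt limits with long-prompt training.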

Comments 36 pages, 10 figures, 21 tables, accepted by ICML2026

详情
AI中文摘要

尽管最近的文本到图像(T2I)模型在从简短描述合成图像方面表现出令人印象深刻的能力,但它们在专业应用所需的冗长、详细提示上存在困难。我们提出了DetailMaster,一个全面的基准,用于评估T2I模型在具有复杂组合要求的长提示上的能力,并附有自动数据构建流程和评估工作流。我们的基准包含专家验证的提示,平均长度为284.89个标记,引入了四个关键评估维度:角色属性、结构化角色位置、多维场景属性以及空间/交互关系。对各种通用和长提示优化模型的评估揭示了关键的性能限制,表明弱编码器难以保留提示中的句法依赖关系,并且扩散模型在细节密集条件下遭受属性泄漏。通过在不同约束下的受控消融研究,我们进一步表明高保真生成需要扩展提示限制和长提示训练的协同组合。我们开源了数据集和代码,以促进长提示驱动的T2I生成的发展。

英文摘要

While recent Text-to-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the long, detailed prompts required for professional applications. We present DetailMaster, a comprehensive benchmark for evaluating T2I capabilities on long prompts with complex compositional requirements, accompanied by an automated data construction pipeline and an evaluation workflow. Comprising expert-validated prompts averaging 284.89 tokens, our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. Evaluations on various general-purpose and long-prompt-optimized models reveal critical performance limitations, showing that weak encoders struggle to preserve syntactic dependencies within prompts and diffusion models suffer from attribute leakage under detail-intensive conditions. Through a controlled ablation study under varying constraints, we further show that high-fidelity generation requires a synergistic combination of expanded prompt limits and long-prompt training. We open-source our dataset and code to foster progress in long-prompt-driven T2I generation.

2510.09608 2026-06-02 cs.CV cs.AI cs.CL 版本更新

StreamingVLM: Real-Time Understanding for Infinite Video Streams

StreamingVLM:无限视频流的实时理解

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Yao Lu, Song Han

发表机构 * MIT(麻省理工学院) NVIDIA(英伟达)

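AI summary: Proposes StreamingVLM, a unified framework that aligns training with streaming inference and uses attention-sink state reuse plus sliding windows for stable real-time understanding of infinite video streams; it reaches a 66.18% win rate on the Inf-Streams-Eval benchmark at up to 8 FPS and also improves general VQA ability.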

Comments Published as a conference paper at ICLR 2026. The first two authors contributed equally to this work

详情
AI中文摘要

视觉语言模型(VLM)可以为实时助手和自主代理提供动力,但它们面临一个关键挑战:理解近乎无限的视频流而不增加延迟和内存使用。对整个视频进行全注意力处理会导致二次计算成本和在长视频上性能不佳。同时,简单的滑动窗口方法也存在缺陷,它们要么破坏连贯性,要么由于冗余重计算而遭受高延迟。在本文中,我们介绍了StreamingVLM,一种专为实时、稳定理解无限视觉输入而设计的模型。我们的方法是一个统一框架,将训练与流推理对齐。在推理过程中,我们通过重用注意力汇点状态、最近视觉令牌的短窗口和最近文本令牌的长窗口来维护一个紧凑的KV缓存。这种流式能力通过一个简单的监督微调(SFT)策略灌输,该策略在短的重叠视频块上应用全注意力,有效地模拟了推理时的注意力模式,而无需在过长的上下文中进行训练。为了评估,我们构建了Inf-Streams-Eval,一个新的基准,包含平均超过两小时的视频,需要帧与文本之间的密集、每秒对齐。在Inf-Streams-Eval上,StreamingVLM对GPT-4O mini实现了66.18%的胜率,并在单个NVIDIA H100上以高达8 FPS的速度保持稳定、实时的性能。值得注意的是,我们的SFT策略还增强了通用的VQA能力,无需任何VQA特定的微调,在LongVideoBench上提高了+4.30,在OVOBench Realtime上提高了+5.96。代码可在https://github.com/mit-han-lab/streaming-vlm获取。

英文摘要

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
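The cache policy described in the abstract (attention sinks, a short window of recent vision tokens, and a long window of recent text tokens) can be illustrated with a minimal bookkeeping sketch. This is not the authors' implementation: the class name `StreamingKVCache`, the window sizes, and the choice to track token positions rather than key/value tensors are all assumptions made for illustration.

```python
from collections import deque


class StreamingKVCache:
    """Toy sink + sliding-window cache that tracks token positions only.

    Hypothetical sketch: a real cache would hold key/value tensors per layer.
    """

    def __init__(self, num_sink=4, vision_window=8, text_window=16):
        self.num_sink = num_sink                   # assumed number of sink tokens
        self.sinks = []                            # earliest tokens, never evicted
        self.vision = deque(maxlen=vision_window)  # short window of vision tokens
        self.text = deque(maxlen=text_window)      # long window of text tokens

    def append(self, position, modality):
        if modality not in ("vision", "text"):
            raise ValueError(f"unknown modality: {modality!r}")
        if len(self.sinks) < self.num_sink:
            self.sinks.append(position)            # first tokens become sinks
        elif modality == "vision":
            self.vision.append(position)           # deque drops the oldest entry
        else:
            self.text.append(position)

    def retained(self):
        """Positions still cached, in stream order."""
        return self.sinks + sorted(list(self.vision) + list(self.text))

    def __len__(self):
        return len(self.sinks) + len(self.vision) + len(self.text)


if __name__ == "__main__":
    cache = StreamingKVCache()
    for pos in range(1000):
        cache.append(pos, "text" if pos % 4 == 0 else "vision")
    print(len(cache), cache.retained()[:6])
```

The point of the sketch is that the cache size is bounded by `num_sink + vision_window + text_window` regardless of stream length, which is what keeps latency and memory from growing with the video.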

2510.03938 2026-06-02 physics.optics cs.CV cs.NE physics.app-ph 版本更新

Super-resolution image projection over an extended depth of field using a diffractive decoder

使用衍射解码器实现扩展景深上的超分辨率图像投影

Hanlong Chen, Cagatay Isil, Tianyi Gan, Mona Jarrahi, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles, California 90095, USA(加州大学洛杉矶分校电子与计算机工程系) Bioengineering Department, University of California, Los Angeles, California 90095, USA(加州大学洛杉矶分校生物工程系) California NanoSystems Institute (CNSI), University of California, Los Angeles, California 90095, USA(加州大学洛杉矶分校加州纳米系统研究所)

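AI summary: Proposes a hybrid image projection system that combines a CNN encoder with an all-optical diffractive decoder to achieve an extended depth of field and pixel super-resolution, increasing the space-bandwidth product.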

Comments 18 Pages, 6 Figures

详情
Journal ref
Light: Science & Applications (2026)
AI中文摘要

图像投影系统必须在数据存储、计算和传输方面高效,同时保持输出的大空间带宽积(SBP)。本文介绍了一种混合图像投影系统,该系统结合了基于卷积神经网络(CNN)的数字编码器和全光学衍射解码器,实现了具有改进分辨率的扩展景深(DOF)。基于CNN的编码器将输入图像压缩为紧凑的相位表示,随后由低分辨率(LR)投影仪显示,并由模拟衍射解码器进行全光学图像重建。该光学解码器完全被动,设计用于合成像素超分辨图像投影,具有扩展景深,同时无需额外功耗即可实现超分辨图像重建。我们的像素超分辨率(PSR)图像投影系统在约267倍波长(W)的扩展景深内展示了高保真图像合成,同时在每个横向平面上提供高达约16倍的SBP改进。通过太赫兹波段的实验验证了该概念,并且该系统可扩展到电磁波谱的不同部分。这种图像投影架构可以减少显示系统的数据存储和传输需求,而不会对光学解码器施加额外的功率限制。除了扩展景深PSR图像投影外,该方法的基本原理还可扩展到各种应用,包括光学计量和显微镜。

英文摘要

Image projection systems must be efficient in data storage, computation and transmission while maintaining a large space-bandwidth product (SBP) at their output. Here, we introduce a hybrid image projection system that achieves extended depth-of-field (DOF) with improved resolution, combining a convolutional neural network (CNN)-based digital encoder with an all-optical diffractive decoder. A CNN-based encoder compresses input images into compact phase representations, which are subsequently displayed by a low-resolution (LR) projector and processed by an analog diffractive decoder for all-optical image reconstruction. This optical decoder is completely passive, designed to synthesize pixel super-resolved image projections that feature an extended DOF while eliminating the need for additional power consumption for super-resolved image reconstruction. Our pixel super-resolution (PSR) image projection system demonstrates high-fidelity image synthesis over an extended DOF of ~267×W, where W is the illumination wavelength, concurrently offering up to ~16-fold SBP improvement at each lateral plane. The proof of concept of this approach is validated through an experiment conducted in the THz spectrum, and the system is scalable across different parts of the electromagnetic spectrum. This image projection architecture can reduce data storage and transmission requirements for display systems without imposing additional power constraints on the optical decoder. Beyond extended DOF PSR image projection, the underlying principles of this approach can be extended to various applications, including optical metrology and microscopy.

2510.00053 2026-06-02 eess.IV cs.CV cs.LG 版本更新

DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction

DPsurv: 双原型证据融合用于不确定性感知和可解释的全切片图像生存预测

Yucheng Xing, Ling Huang, Jingying Ma, Ruping Hong, Jiangdong Qiu, Pei Liu, Kai He, Huazhu Fu, Mengling Feng

发表机构 * National University of Singapore; National University of Singapore Guangzhou Research Translation Innovation Institute; Imperial College London; Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College; Hunan University; Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR)

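AI summary: Proposes DPsurv, a dual-prototype evidential fusion network that predicts uncertainty-aware survival intervals and offers interpretability through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation; across five public datasets it achieves the highest mean concordance index and the lowest mean integrated Brier score.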

详情
AI中文摘要

病理全切片图像(WSIs)因其在细胞和组织水平上全面的组织病理学信息而被广泛用于癌症生存分析,能够进行定量、大规模且预后丰富的肿瘤特征分析。然而,现有大多数WSI生存分析方法可解释性有限,且常常忽略异质性切片图像中的预测不确定性。本文提出DPsurv,一种双原型全切片图像证据融合网络,输出不确定性感知的生存区间,同时通过补丁原型分配图、组件原型和组件级相对风险聚合实现预测的解释。在五个公开数据集上的实验取得了最高的平均一致性指数和最低的平均积分Brier分数,验证了DPsurv的有效性和可靠性。预测结果的解释在特征、推理和决策层面提供了透明度,从而增强了DPsurv的可信度和可解释性。

英文摘要

Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. Experiments on five publicly available datasets achieve the highest mean concordance index and the lowest mean integrated Brier score, validating the effectiveness and reliability of DPsurv. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.
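For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of the standard pairwise (Harrell-style) definition. It is a generic illustration, not the paper's evaluation code; the function name and the toy data are assumptions.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs ranked correctly by predicted risk.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] is truthy) and a strictly shorter time than subject j.
    The pair is concordant when the model gives i the higher risk;
    tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data (hypothetical): survival times, event indicators, predicted risks.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.4, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ranking, and censored subjects contribute only as the longer-surviving member of a pair.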

2408.01653 2026-06-02 cs.CV 版本更新

MCPDepth: Omnidirectional Depth Estimation via Stereo Matching from Multi-Cylindrical Panoramas

MCPDepth:基于多圆柱全景图的立体匹配全方位深度估计

Feng Qiao, Zhexiao Xiong, Xinge Zhu, Yuexin Ma, Qiumeng He, Nathan Jacobs

发表机构 * Washington University in St. Louis(华盛顿大学圣路易斯分校) The Chinese University of Hong Kong(香港中文大学) ShanghaiTech University(上海科技大学) University of California, Los Angeles(加州大学洛杉矶分校)

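AI summary: Proposes MCPDepth, a two-stage framework that performs stereo matching on cylindrical panoramas and then fuses the resulting depth maps, using a circular attention module to handle vertical distortion; built from standard network components, it reduces MAE by 18.8% on Deep360 and 19.9% on 3D60.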

Comments Accepted at the OmniCV Workshop, CVPR 2026

详情
AI中文摘要

全方位深度估计由于全景图像固有的畸变而面临重大挑战。尽管取得了显著进展,但投影方法的影响仍未得到充分探索。我们引入了多圆柱全景深度估计(MCPDepth),这是一种新颖的两阶段框架,旨在通过多个圆柱全景图之间的立体匹配来增强全方位深度估计。MCPDepth首先使用圆柱全景图进行立体匹配,然后对不同视图得到的深度图进行鲁棒融合。与现有方法依赖定制内核来处理畸变不同,MCPDepth利用标准网络组件,便于在嵌入式设备上无缝部署,同时提供卓越的性能。为了有效处理圆柱全景图中的垂直畸变,MCPDepth结合了循环注意力模块,显著扩展了传统卷积的感受野。我们对常见的全景投影——球面、圆柱和立方体——进行了全面的理论和实验分析,证明了圆柱投影的优越性。我们的方法在室外数据集Deep360上将平均绝对误差(MAE)降低了18.8%,在真实数据集3D60上降低了19.9%。这项工作为其他任务和实际应用提供了实用见解,建立了全方位深度估计的新范式。代码可在https://github.com/Qjizhi/MCPDepth获取。

英文摘要

Omnidirectional depth estimation presents a significant challenge due to the inherent distortions in panoramic images. Despite notable advancements, the impact of projection methods remains underexplored. We introduce Multi-Cylindrical Panoramic Depth Estimation (MCPDepth), a novel two-stage framework designed to enhance omnidirectional depth estimation through stereo matching across multiple cylindrical panoramas. MCPDepth initially performs stereo matching using cylindrical panoramas, followed by a robust fusion of the resulting depth maps from different views. Unlike existing methods that rely on customized kernels to address distortions, MCPDepth utilizes standard network components, facilitating seamless deployment on embedded devices while delivering exceptional performance. To effectively address vertical distortions in cylindrical panoramas, MCPDepth incorporates a circular attention module, significantly expanding the receptive field beyond traditional convolutions. We provide a comprehensive theoretical and experimental analysis of common panoramic projections (spherical, cylindrical, and cubic), demonstrating the superior efficacy of cylindrical projection. Our method reduces the mean absolute error (MAE) by 18.8% on the outdoor dataset Deep360 and by 19.9% on the real dataset 3D60. This work offers practical insights for other tasks and real-world applications, establishing a new paradigm in omnidirectional depth estimation. The code is available at https://github.com/Qjizhi/MCPDepth.

2504.10552 2026-06-02 cs.LG cs.AI cs.CV cs.DL 版本更新

LEMUR Neural Network Dataset: Towards Seamless AutoML

LEMUR 神经网络数据集:迈向无缝 AutoML

Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg(计算机视觉实验室,CAIDAS,维尔茨堡大学)

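AI summary: Proposes LEMUR, an open-source dataset and framework that standardizes neural network implementation and evaluation through a unified template, structured storage, and automated hyperparameter optimization, aiming to accelerate AutoML research and enable fair benchmarking.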

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3291-3300, 2026
AI中文摘要

神经网络是现代人工智能的支柱,但设计、评估和比较它们仍然劳动密集。尽管存在许多用于训练的数据集,但模型本身的标准化集合很少。我们介绍 LEMUR,一个开源数据集和框架,它提供了大量基于 PyTorch 的神经网络集合,涵盖分类、分割、检测和自然语言处理等任务。每个模型遵循统一模板,配置和结果存储在结构化数据库中,以确保一致性和可重复性。LEMUR 通过 Optuna 集成自动超参数优化,包括统计分析和可视化工具,并提供 API 以无缝访问性能数据。该框架是可扩展的,允许研究人员添加新模型、数据集或指标而不破坏兼容性。通过标准化实现和统一评估,LEMUR 旨在加速 AutoML 研究,实现公平基准测试,并降低大规模神经网络实验的障碍。为支持采用和协作,LEMUR 及其插件在 MIT 许可下发布,网址为:https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

英文摘要

Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

2509.16635 2026-06-02 cs.CV 版本更新

Towards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification

面向任意时间检索:任意时间行人重识别基准

Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Qinhong Yang, Tao Gong, Qi Chu, Mang Ye, Nenghai Yu

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络空间安全学院) Anhui Province Key Laboratory of Digital Security(安徽省数字安全重点实验室) The Chinese University of Hong Kong(香港中文大学) School of Computer Science, Wuhan University, China(武汉大学计算机科学学院)

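AI summary: Proposes the Anytime Person Re-identification (AT-ReID) task, builds the large-scale multi-scenario dataset AT-USTC, and designs a unified model, Uni-AT, for effective retrieval across scenarios at any time of day.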

Comments Accepted by IJCAI 2025 (oral)

详情
AI中文摘要

在实际应用中,行人重识别(ReID)需要能够在任何时间(包括白天和夜晚,从短期到长期)检索目标行人。然而,现有的ReID任务和数据集无法满足这一需求,因为它们受限于可用时间,仅提供特定场景的训练和评估。因此,我们研究了一项名为任意时间行人重识别(AT-ReID)的新任务,旨在基于时间变化在多个场景中实现有效检索。为了解决AT-ReID问题,我们收集了首个大规模数据集AT-USTC,其中包含由RGB和IR相机拍摄的403k张穿着多件衣服的个体图像。我们的数据收集跨越21个月,270名志愿者在不同日期或场景下平均被拍摄29.1次,比现有数据集多4-15倍,为AT-ReID的后续研究提供了条件。此外,为了应对多场景检索的新挑战,我们提出了一个统一模型Uni-AT,该模型包括一个用于场景特定特征学习的多场景ReID(MS-ReID)框架、一个减轻场景间干扰的属性专家混合(MoAE)模块,以及一个确保所有场景平衡训练的分层动态加权(HDW)策略。大量实验表明,我们的模型取得了令人满意的结果,并在所有场景中表现出优异的泛化能力。

英文摘要

In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, including both daytime and nighttime, ranging from short-term to long-term. However, existing ReID tasks and datasets cannot meet this requirement, as they are constrained by available time and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 403k images of individuals wearing multiple outfits captured by RGB and IR cameras. Our data collection spans 21 months, and 270 volunteers were photographed on average 29.1 times across different dates or scenes, 4-15 times more than current datasets, providing conditions for follow-up investigations in AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific feature learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model leads to satisfactory results and exhibits excellent generalization to all scenarios.

2509.15234 2026-06-02 cs.CV 版本更新

Exploring the Capabilities of Large Language Model Encoders for Image-Text Retrieval in Chest X-rays

探索大语言模型编码器在胸部X光片图像-文本检索中的能力

Hanbin Ko, Gihun Cho, Inhyeok Baek, Donguk Kim, Joonbeom Koo, Changi Kim, Dongheon Lee, Chang Min Park

发表机构 * Interdisciplinary Program in Bioengineering, Seoul National University Graduate School(生物工程跨学科项目,首尔国立大学研究生院) Integrated Major in Innovative Medical Science, Seoul National University Graduate School(创新医学科学整合专业,首尔国立大学研究生院) Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine(浙江大学医学院第一附属医院放射科) Seoul National University College of Medicine(首尔国立大学医学院) Department of Radiology, Seoul National University College of Medicine, Seoul National University Hospital(首尔国立大学医学院放射科,首尔国立大学医院) Institute of Medical and Biological Engineering, Seoul National University Medical Research Center(医学与生物工程研究所,首尔国立大学医学研究中心) Institute of Radiation Medicine, Seoul National University Medical Research Center(放射医学研究所,首尔国立大学医学研究中心)

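AI summary: Proposes a domain-adapted bidirectional LLM text encoder, trained with masked token prediction and supervised contrastive learning, and integrates it into a parameter-efficient dual-tower contrastive vision-language framework to improve image-text alignment and retrieval for chest radiographs.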

Comments 12 pages, 2 figures, under review

详情
AI中文摘要

从配对的医学图像和临床文本中进行多模态学习是医学数据驱动信息学中的核心挑战,其中有效的跨模态对齐对于可扩展的分析和检索至关重要。在胸部放射学中,视觉语言预训练受到异质性放射学报告的制约,这些报告包含缩写、仅印象笔记和机构特定的写作风格。与通用领域不同,当报告风格差异显著时,简单聚合大量噪声报告可能会使多模态学习停滞甚至退化。我们提出了一种针对胸部放射学报告的领域自适应双向大语言模型文本编码器,通过在风格多样但临床等效的报告变体上进行掩码标记预测和监督对比学习训练,以生成鲁棒、可泛化的文本嵌入。然后,我们使用参数高效适配将该编码器集成到双塔对比视觉语言框架中,以改善图像-文本对齐。在来自公共数据集和去标识化医院队列的160万对配对研究上,所提出的模型提高了双向检索准确性和外部泛化能力,在MIMIC-CXR上达到0.308的GREEN分数,在Open-I上达到0.618,同时减少了在训练中添加富含缩写、仅印象的医院报告时观察到的退化。

英文摘要

Multimodal learning from paired medical images and clinical text is a central challenge in medical data-driven informatics, where effective cross-modal alignment is critical for scalable analysis and retrieval. In chest radiography, vision-language pretraining is constrained by heterogeneous radiology reports that contain abbreviations, impression-only notes, and institution-specific writing styles. Unlike general-domain settings, naively aggregating large collections of noisy reports can plateau or even degrade multimodal learning when reporting styles differ substantially. We propose a domain-adapted bidirectional large language model text encoder for chest radiograph reports, trained with masked token prediction and supervised contrastive learning on stylistically diverse but clinically equivalent report variants to produce robust, generalizable text embeddings. We then integrate this encoder into a dual-tower contrastive vision-language framework using parameter-efficient adaptation to improve image-text alignment. Across 1.6 million paired studies from public datasets and a de-identified hospital cohort, the proposed models improve bidirectional retrieval accuracy and external generalization, achieving GREEN scores of 0.308 on MIMIC-CXR and 0.618 on Open-I, while reducing the degradation observed when abbreviation-rich, impression-only hospital reports are added to training.

2205.02071 2026-06-02 cs.CV 版本更新

Representation-Centric Survey of Supervised Skeletal Action Recognition and the New Benchmark

以表示为中心的监督式骨骼动作识别综述与新基准

Yang Liu, Jiyao Yang, Madhawa Perera, Pan Ji, Dongwoo Kim, Min Xu, Tianyang Wang, Saeed Anwar, Tom Gedeon, Lei Wang, Zhenyue Qin

发表机构 * School of Computing, Australian National University(澳大利亚国立大学计算学院) University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) OPPO US Research Center(OPPO美国研究中心) Carnegie Mellon University(卡内基梅隆大学) University of Western Australia(西澳大利亚大学) Curtin University(科廷大学) School of Engineering and Built Environment, Griffith University(格里菲斯大学工程与建筑环境学院) School of Medicine, Yale University(耶鲁大学医学院)

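AI summary: Surveys supervised 3D skeletal action recognition methods organized by input representation type (joint coordinates, bone vectors, motion flows, and extended representations), and introduces ANUBIS, a large-scale dataset featuring multi-view recordings and complex multi-person interactions; experiments reveal action-feature dependencies and the limitations of multi-representation fusion.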

Comments Accepted for publication in Pattern Recognition

详情
AI中文摘要

3D骨骼动作识别已成为传统RGB和基于深度的方法的有力替代方案,具有对环境变化的鲁棒性、计算效率和增强的隐私性。尽管取得了显著进展,当前研究仍因输入表示多样而碎片化,且缺乏反映现实挑战场景的评估。本文以表示为中心,对监督式骨骼动作识别进行了综述,根据输入特征类型(关节坐标、骨骼向量、运动流和扩展表示)系统地对最先进方法进行分类,并分析这些选择如何影响时空建模策略。基于综述的见解,我们提出了ANUBIS,这是一个大规模、具有挑战性的数据集,旨在填补现有基准的关键空白。ANUBIS包含多视角记录(包括背面视角)、复杂的多人交互、细粒度和暴力动作以及当代社会行为。我们在ANUBIS上对多种最先进模型进行了基准测试,并深入分析了不同特征类型如何影响102个动作类别的识别性能。我们的结果显示了强烈的动作-特征依赖性,突出了朴素多表示融合的局限性,并指出了对任务感知、语义对齐的集成策略的需求。这项工作既提供了全面的基础,也提供了实用的基准资源,旨在指导下一代针对复杂现实场景的鲁棒、可泛化的基于骨骼的动作识别系统。数据集、基准框架和代码可在 https://yliu1082.github.io/ANUBIS/ 获取。

英文摘要

3D skeletal action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches, offering robustness to environmental variations, computational efficiency, and enhanced privacy. Despite remarkable progress, current research remains fragmented across diverse input representations and lacks evaluation under scenarios that reflect real-world challenges. This paper presents a representation-centric review of supervised skeletal action recognition, systematically categorizing state-of-the-art methods by their input feature types: joint coordinates, bone vectors, motion flows, and extended representations, and analyzing how these choices influence spatiotemporal modeling strategies. Building on the insights from this review, we introduce ANUBIS, a large-scale, challenging dataset designed to address critical gaps in existing benchmarks. ANUBIS incorporates multi-view recordings with back-view perspectives, complex multi-person interactions, fine-grained and violent actions, and contemporary social behaviors. We benchmark a diverse set of state-of-the-art models on ANUBIS and conduct an in-depth analysis of how different feature types affect recognition performance across 102 action categories. Our results show strong action-feature dependencies, highlight the limitations of naive multi-representational fusion, and point toward the need for task-aware, semantically aligned integration strategies. This work offers both a comprehensive foundation and a practical benchmarking resource, aiming to guide the next generation of robust, generalizable skeleton-based action recognition systems for complex real-world scenarios. The dataset, benchmarking framework, and code are available at https://yliu1082.github.io/ANUBIS/.

2507.19881 2026-06-02 cs.CV cs.AI 版本更新

FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving

FedS2R: 面向自动驾驶中合成到真实语义分割的一次性联邦域泛化

Tao Lian, Jose L. Gómez, Antonio M. López

发表机构 * Computer Vision Center (CVC) Univ. Autònoma de Barcelona (UAB) Barcelona, Spain(计算机视觉中心(CVC)巴塞罗那自治大学(UAB)巴塞罗那,西班牙)

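AI summary: Proposes FedS2R, which uses inconsistency-driven data augmentation and multi-client knowledge distillation to achieve one-shot federated domain generalization for synthetic-to-real semantic segmentation in autonomous driving, coming close to centralized training on five real-world datasets.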

Comments Accepted by IEEE Intelligent Vehicles Symposium (IV) 2026

详情
AI中文摘要

联邦域泛化在图像分类中通过多客户端协作训练而不共享原始数据已显示出有希望的进展。然而,其在自动驾驶语义分割中的潜力尚未被充分探索。本文提出FedS2R,这是第一个用于自动驾驶中合成到真实语义分割的一次性联邦域泛化框架。FedS2R包含两个组件:一种不一致性驱动的数据增强策略,用于生成不稳定类别的图像;以及一种具有特征融合的多客户端知识蒸馏方案,从多个客户端模型中蒸馏出全局模型。在五个真实数据集Cityscapes、BDD100K、Mapillary、IDD和ACDC上的实验表明,全局模型显著优于单个客户端模型,并且仅比同时访问所有客户端数据训练的模型落后2个mIoU点。这些结果证明了FedS2R在联邦学习下自动驾驶合成到真实语义分割中的有效性。

英文摘要

Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple clients without sharing raw data. However, its potential in semantic segmentation for autonomous driving remains underexplored. In this paper, we propose FedS2R, the first one-shot federated domain generalization framework for synthetic-to-real semantic segmentation in autonomous driving. FedS2R comprises two components: an inconsistency-driven data augmentation strategy that generates images for unstable classes, and a multi-client knowledge distillation scheme with feature fusion that distills a global model from multiple client models. Experiments on five real-world datasets, Cityscapes, BDD100K, Mapillary, IDD, and ACDC, show that the global model significantly outperforms individual client models and is only 2 mIoU points behind the model trained with simultaneous access to all client data. These results demonstrate the effectiveness of FedS2R in synthetic-to-real semantic segmentation for autonomous driving under federated learning.

2507.18863 2026-06-02 cs.CV cs.CL 版本更新

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

基于点视觉融合与语言模型重建的音素级视觉语音识别

Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

发表机构 * Kyushu Institute of Technology(九州工业大学)

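AI summary: Proposes a two-stage phoneme-level visual speech recognition framework that fuses visual and facial landmark motion features and reconstructs words with an LLM, achieving word error rates of 17.4% on LRS2 and 21.0% on LRS3.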

Comments Accepted at ICASSP 2026. This version corresponds to the camera-ready manuscript

详情
AI中文摘要

视觉自动语音识别(V-ASR)是一项具有挑战性的任务,涉及仅从视觉信息(如唇部运动和面部表情)解释口语。由于缺乏听觉线索以及音素(表现出相似视位——在唇部运动中看起来相同的不同声音)的视觉模糊性,该任务尤为困难。现有方法通常旨在直接从视觉线索预测单词或字符,但由于视位模糊性,它们通常遭受高错误率,并且需要大量预训练数据。我们提出了一种新颖的基于音素的两阶段框架,融合视觉和地标运动特征,随后使用LLM模型进行单词重建以应对这些挑战。第一阶段包括V-ASR,输出预测的音素,从而降低训练复杂度。同时,面部地标特征处理说话者特定的面部特征。第二阶段包括一个编码器-解码器LLM模型NLLB,将输出的音素重建回单词。除了使用大型视觉数据集进行深度学习微调外,我们的PV-ASR方法在LRS2数据集上实现了17.4%的词错误率,在LRS3数据集上实现了21.0%的词错误率,展现出优越性能。

英文摘要

Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging due to the absence of auditory cues and the visual ambiguity of phonemes that exhibit similar visemes (distinct sounds that appear identical in lip motions). Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM for word reconstruction to address these challenges. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity. Meanwhile, the facial landmark features address speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder LLM, NLLB, that reconstructs the output phonemes back to words. Besides using a large visual dataset for deep learning fine-tuning, our PV-ASR method demonstrates superior performance by achieving 17.4% WER on the LRS2 and 21.0% WER on the LRS3 dataset.
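The word error rate (WER) quoted above is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. A minimal sketch of that standard computation follows; the function name and example sentences are illustrative assumptions, and the paper's own scoring script may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.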

2503.06520 2026-06-02 cs.CV cs.MM 版本更新

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Seg-Zero: 通过认知强化学习的推理链引导分割

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia

发表机构 * The Chinese University of Hong Kong(香港中文大学) The Hong Kong University of Science and Technology(香港科技大学) Renmin University of China(中国人民大学)

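AI summary: Proposes Seg-Zero, a framework with decoupled reasoning and segmentation models, trained using GRPO reinforcement learning with format and accuracy rewards, that achieves zero-shot reasoning segmentation and surpasses LISA-7B by 18% on the ReasonSeg benchmark.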

详情
AI中文摘要

传统的推理分割方法依赖于使用类别标签和简单描述进行监督微调,限制了其域外泛化能力且缺乏显式推理过程。为解决这些限制,我们提出了Seg-Zero,一种新颖的框架,通过认知强化学习展现出显著的泛化能力并推导出显式的思维链推理。Seg-Zero引入了一个解耦架构,包含一个推理模型和一个分割模型。推理模型解释用户意图,生成显式推理链,并产生位置提示,随后分割模型利用这些提示生成精确的像素级掩码。我们设计了一个复杂的奖励机制,整合了格式奖励和精度奖励,以有效指导优化方向。仅通过GRPO的强化学习训练,无需显式推理数据,Seg-Zero实现了鲁棒的零样本泛化,并展现出涌现的测试时推理能力。实验表明,Seg-Zero-7B在ReasonSeg基准上达到了57.5的零样本性能,超越了之前的LISA-7B 18%。这一显著提升突显了Seg-Zero跨域泛化的能力,同时呈现了显式的推理过程。

英文摘要

Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process.
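To make the idea of combining a format reward with an accuracy reward concrete, here is a rough sketch. The abstract does not specify the reward details, so everything below is assumed for illustration: the `<think>`/`<answer>` tag layout, the use of a bounding box as the positional prompt, the IoU threshold of 0.5, and the equal weighting of the two terms.

```python
import re

# Hypothetical output layout: reasoning in <think>, a box in <answer>.
RESPONSE_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def total_reward(response, gt_box, iou_threshold=0.5):
    """Format reward (1.0) plus accuracy reward (1.0); both are assumptions."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    reward = 1.0    # format reward: required tags are present and ordered
    numbers = re.findall(r"-?\d+(?:\.\d+)?", match.group(1))
    if len(numbers) == 4:
        pred_box = tuple(float(n) for n in numbers)
        if box_iou(pred_box, gt_box) >= iou_threshold:
            reward += 1.0  # accuracy reward: predicted box overlaps the target
    return reward


if __name__ == "__main__":
    reply = "<think>The cup is on the left.</think><answer>[10, 10, 50, 50]</answer>"
    print(total_reward(reply, (10, 10, 50, 50)))
```

In a GRPO-style setup, a scalar like this would be computed for each sampled response in a group and used to rank responses relative to one another; the sketch covers only the reward, not the policy update.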

2502.08884 2026-06-02 cs.CV cs.AI cs.GR 版本更新

ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models

ShapeLib: 利用大型语言模型设计程序化3D形状抽象库

R. Kenny Jones, Paul Guerrero, Niloy J. Mitra, Daniel Ritchie

发表机构 * Stanford University(斯坦福大学) Adobe Research(Adobe研究) University College London(伦敦大学学院) Brown University(布朗大学)

AI总结 提出ShapeLib方法,利用大型语言模型的先验知识,通过引导式工作流自动设计可泛化的程序化3D形状抽象库,并支持下游形状编辑与生成。

详情
AI中文摘要

我们提出ShapeLib,这是第一个利用大型语言模型(LLM)的先验知识来设计程序化3D形状抽象库的方法。我们的系统接受两种形式的用户提供的设计意图:输出库中应包含的功能的高级文本描述,以及一小部分示例形状的种子集。我们通过引导式LLM工作流发现与设计意图匹配的抽象库,该工作流首先提出应用和实现功能的不同方式,然后验证这些功能有助于表示种子集形状。为了扩展到种子集之外,我们开发了特定于库的识别网络,将形状(表示为基元、体素或点云)映射到使用这些新发现的抽象的程序。跨多个建模领域(按形状类别划分),我们发现,当LLM与几何推理深思熟虑地结合时,可以引导它们编写出能跨形状分布泛化的抽象函数库。我们的框架朝着实现长期以来的形状分析愿望迈出了一步,即发现可重用的、程序化的形状抽象,同时暴露可解释的、语义对齐的接口。我们的广泛评估表明,ShapeLib在泛化性、可用性和在操作下保持合理性方面,优于先前的替代抽象发现方法。最后,我们展示了ShapeLib的抽象函数解锁了多个下游应用,将LLM对形状程序的推理与几何处理工具相结合,以支持形状编辑和生成工作流。

英文摘要

We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abstractions. Our system accepts two forms of user-provided design intent: high-level text descriptions of functions to include in the output library and a small seed set of exemplar shapes. We discover a library of abstractions that matches this design intent with a guided LLM workflow that first proposes different ways of applying and implementing functions, and then validates these functions are helpful in representing seed set shapes. To extend beyond the seed set, we develop library-specific recognition networks that map shapes (represented as primitives, voxels, or point clouds) to programs that use these newly discovered abstractions. Across multiple modeling domains (split by shape category), we find that LLMs, when thoughtfully combined with geometric reasoning, can be guided to author libraries of abstraction functions that generalize across shape distributions. Our framework takes a step towards realizing the long-standing shape analysis aspiration of discovering reusable, programmatic shape abstractions while exposing interpretable, semantically aligned interfaces. Our extensive evaluation demonstrates that ShapeLib provides distinct advantages over prior alternative abstraction discovery works in terms of generalization, usability, and maintaining plausibility under manipulation. Finally, we demonstrate that ShapeLib's abstraction functions unlock a number of downstream applications, combining LLM reasoning over shape programs with geometry processing tools to support shape editing and generation workflows.

2506.10858 2026-06-02 eess.IV cs.CV 版本更新

Med-URWKV†: Toward Enhanced Pretrained Pure VRWKV Models for Medical Image Segmentation

Med-URWKV†:面向医学图像分割的增强型预训练纯VRWKV模型

Zhenhuan Zhou, Yining Li, Yanlin Wu, Haohan Zou, Yan Wang, Tao Li

发表机构 * College of Computer Science, Nankai University(南开大学计算机科学学院) Key Laboratory of Data and Intelligent System Security, Ministry of Education(教育部数据与智能系统安全重点实验室) School of Medicine, Nankai University(南开大学医学院) Nankai University Eye Institute, Nankai University(南开大学眼科研究院) Tianjin Eye Hospital(天津眼科医院) Haihe Lab of ITAI(海河ITAI实验室)

AI总结 本文提出Med-URWKV模型,通过重用预训练VRWKV编码器并设计FAWA和MSCF模块,在五个数据集上取得与SOTA相当或更优的性能,其中Med-URWKV†仅用Med-URWKV-S一半的参数量即达到最高平均Dice 88.00%。

Comments Under Review Since 2026-1-22, 12 pages. Copyright: College of Computer Science, Nankai University. All rights reserved

详情
AI中文摘要

医学图像分割是计算机辅助诊断和治疗中的基本任务。基于CNN、ViT、Mamba和混合模型的现有方法仍存在感受野受限、计算成本高或精度不足等问题。最近,视觉感受野加权键值(VRWKV)模型作为一种有前景的替代方案出现,为视觉任务提供了强大的长距离依赖建模能力。然而,当前基于VRWKV的医学图像分割研究主要集中于从头训练的混合架构,而大规模预训练纯VRWKV模型的潜力尚未被探索。在这项工作中,我们系统研究了纯VRWKV架构在医学图像分割中的有效性。通过在不同尺度上重用预训练VRWKV编码器并搭配纯VRWKV解码器,我们构建了Med-URWKV-T和Med-URWKV-S,从而对该领域中的预训练纯VRWKV模型进行全面评估。为进一步提升性能,我们提出了两个VRWKV兼容模块:频率感知小波注意力(FAWA)模块,利用小波变换捕捉边缘细节和结构特征;以及多尺度通道融合(MSCF)模块,整合多尺度特征以增强信息性通道表示。通过将它们集成到Med-URWKV-T中,我们得到了增强模型Med-URWKV†。在五个医学图像分割数据集上的大量实验表明,Med-URWKV取得了与最先进方法及精心设计的混合VRWKV架构相当或更优的性能。此外,Med-URWKV†进一步提升了分割精度,在仅使用一半参数量的情况下超越了Med-URWKV-S,并达到了最高的平均Dice相似系数88.00%。代码将公开发布。

英文摘要

Medical image segmentation is a fundamental task in computer-aided diagnosis and treatment. Existing approaches based on CNNs, ViTs, Mamba, and hybrid models still suffer from limitations such as restricted receptive fields, high computational cost, or insufficient accuracy. Recently, Vision Receptive-field Weighted Key-Value (VRWKV) models have emerged as a promising alternative, delivering strong long-range dependency modeling for visual tasks. However, current studies on VRWKV-based medical image segmentation mainly focus on hybrid architectures trained from scratch, while the potential of large-scale pretrained pure VRWKV models remains unexplored. In this work, we systematically investigate the effectiveness of pure VRWKV architectures for medical image segmentation. We construct Med-URWKV-T and Med-URWKV-S by reusing pretrained VRWKV encoders at different scales and pairing them with pure VRWKV decoders, enabling a comprehensive evaluation of pretrained pure VRWKV models in this domain. To further enhance performance, we propose two VRWKV-compatible modules: a Frequency-Aware Wavelet Attention (FAWA) module, which exploits wavelet transforms to capture edge details and structural characteristics, and a Multi-Scale Channel Fusion (MSCF) module, which integrates multi-scale features to strengthen informative channel representations. By incorporating them into Med-URWKV-T, we obtain the enhanced model Med-URWKV†. Extensive experiments on five medical image segmentation datasets demonstrate that Med-URWKV achieves performance comparable to or superior to state-of-the-art methods and carefully designed hybrid VRWKV architectures. Moreover, Med-URWKV† further improves segmentation accuracy, surpassing Med-URWKV-S while using only half of its parameter count, and achieves the highest average Dice similarity coefficient of 88.00%. The code will be released.
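
For readers unfamiliar with the reported metric, the Dice similarity coefficient between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flattened binary masks; the function name and the convention that two empty masks score 1.0 are illustrative assumptions, not details from the paper.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

An average Dice of 88.00% therefore means that, averaged over the evaluated cases, the overlap term is 0.88 of the combined mask size.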

2506.08137 2026-06-02 cs.CV cs.AI 版本更新

IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation

IGraSS: 通过迭代图约束语义分割从卫星图像中识别基础设施网络

Oishee Bintey Hoque, Abhijin Adiga, Aniruddha Adiga, Siddharth Chaudhary, Madhav V. Marathe, S. S. Ravi, Kirti Rajagopalan, Amanda Wilson, Samarth Swarup

发表机构 * Biocomplexity Institute, University of Virginia(弗吉尼亚大学生物复杂性研究所) Department of Computer Science, University of Virginia(弗吉尼亚大学计算机科学系) Department Biomedical Systems Engineering, Washington State University(华盛顿州立大学生物医学系统工程系) Earth System Science Center, University of Alabama in Huntsville(阿拉巴马大学亨茨维尔分校地球系统科学中心)

AI总结 提出IGraSS迭代框架,结合语义分割与图约束优化,将不可达运河段从18%降至3%,并提升道路网络完整性。

详情
AI中文摘要

精确的运河网络制图对于水资源管理(包括灌溉规划和基础设施维护)至关重要。最先进的基础设施制图语义分割模型(如道路)依赖于大规模、良好标注的遥感数据集。然而,不完整或不充分的真实标注会阻碍这些学习方法。许多基础设施网络具有图级属性,如可达性(运河)或连通性(道路),可用于改进现有真实标注。本文开发了一种新颖的迭代框架IGraSS,将结合RGB和额外模态(NDWI、DEM)的语义分割模块与基于图的真实标注精化模块相结合。分割模块处理卫星图像块,而精化模块将基础设施网络视为图,在整个数据上运行。实验表明,IGraSS将不可达运河段从约18%降至3%,并且使用精化后的真实标注进行训练显著改善了运河识别。IGraSS是一个鲁棒的框架,既可用于精化噪声真实标注,也可用于从遥感影像中绘制运河网络。我们还以道路网络为例,应用不同的图论约束来完善道路网络,证明了IGraSS的有效性和泛化能力。

英文摘要

Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State-of-the-art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well-annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph-level properties, such as reachability to a source (canals) or connectivity (roads), that can be leveraged to improve the existing ground truth. This paper develops a novel iterative framework, IGraSS, combining a semantic segmentation module, which incorporates RGB and additional modalities (NDWI, DEM), with a graph-based ground-truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire data, viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph-theoretic constraint to complete road networks.
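
The reachability property mentioned above can be made concrete with a small sketch: treat canal segments as edges of a graph and measure what fraction of them cannot be reached from any water source. This is only an illustration of the constraint being checked, not the IGraSS refinement procedure. The function name, the undirected treatment of segments, and the edge-list input format are assumptions.

```python
from collections import deque


def unreachable_fraction(edges, sources) -> float:
    """Fraction of segments (edges) not reachable from any source node."""
    adjacency = {}
    for u, v in edges:
        # Assumption: segments are treated as undirected.
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = {s for s in sources if s in adjacency}
    queue = deque(seen)
    while queue:  # breadth-first search from all sources at once
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)

    if not edges:
        return 0.0
    # In an undirected graph, one reachable endpoint implies the other.
    unreachable = sum(1 for u, _ in edges if u not in seen)
    return unreachable / len(edges)
```

A refinement step would try to lower this fraction, for example by proposing missing segments that reconnect isolated components to a source.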

2506.09035 2026-06-02 cs.CV 版本更新

Princeton365: A Diverse Dataset with Accurate Camera Pose

Princeton365: 一个具有精确相机位姿的多样化数据集

Karhan Kayan, Stamatis Alexandropoulos, Rishabh Jain, Yiming Zuo, Erich Liang, Jia Deng

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出Princeton365数据集,包含365个视频和精确相机位姿,通过校准板和360度相机的新颖真值采集框架弥合精度与多样性差距,并引入基于光流的尺度感知评估指标及新颖视图合成基准。

Comments Update v2: Match the ICCV 2025 camera-ready version. Fix typos

详情
AI中文摘要

我们介绍了Princeton365,一个包含365个视频的大规模多样化数据集,具有精确的相机位姿。我们的数据集通过引入一种新颖的真值采集框架,利用校准板和360度相机,弥合了当前SLAM基准中精度与数据多样性之间的差距。我们收集了室内、室外和物体扫描视频,并同步输出单目和立体RGB视频以及IMU数据。我们进一步提出了一种基于相机位姿估计误差引起的光流的新场景尺度感知SLAM评估指标。与当前指标相比,我们的新指标允许跨场景比较SLAM方法的性能,而现有指标如平均轨迹误差(ATE)则不能,从而使研究人员能够分析其方法的失败模式。我们还提出了一个具有挑战性的新颖视图合成基准,涵盖了当前NVS基准未覆盖的情况,例如具有360度相机轨迹的完全非朗伯场景。请访问 https://princeton365.cs.princeton.edu 获取数据集、代码、视频和提交信息。

英文摘要

We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. Unlike existing metrics such as Average Trajectory Error (ATE), our new metric allows the performance of SLAM methods to be compared across scenes, enabling researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit https://princeton365.cs.princeton.edu for the dataset, code, videos, and submission.

2503.06473 2026-06-02 cs.CV cs.AI 版本更新

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

通过剪枝冗余检索增强层注意力效率

Hanze Li, Yaosong Du, Zhibo Yao, Mengyao Zeng, Xiuqi Ge, Xiande Huang

发表机构 * De Artificial Intelligence Lab(德人工智能实验室)

AI总结 针对层注意力机制中相邻层权重冗余导致特征重复和训练效率低的问题,提出基于KL散度量化冗余并利用增强Beta分位数映射(EBQM)跳过冗余层的高效层注意力(ELA)架构,在图像分类和目标检测任务中训练时间减少30%且性能提升。

Comments 5 pages

详情
AI中文摘要

越来越多的证据表明,层注意力机制增强了深度神经网络中层间的交互,显著推进了网络架构的发展。然而,现有的层注意力方法存在冗余问题,因为相邻层学习的注意力权重往往变得高度相似。这种冗余导致多个层提取几乎相同的特征,降低了模型的表示能力并增加了训练时间。为了解决这个问题,我们提出了一种新颖的方法,利用相邻层之间的Kullback-Leibler(KL)散度来量化冗余。此外,我们引入了一种增强Beta分位数映射(EBQM)方法,能够准确识别并跳过冗余层,从而保持模型稳定性。我们提出的高效层注意力(ELA)架构提高了训练效率和整体性能,在图像分类和目标检测等任务中实现了30%的训练时间减少,同时提升了性能。

英文摘要

Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.
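
The idea of quantifying redundancy with KL divergence between adjacent layers can be sketched as follows. The KL formula itself is standard; everything else is an assumption for illustration: the direction of the divergence, the representation of each layer's attention weights as a normalized list, and above all the fixed `threshold`, which stands in for the paper's EBQM selection rule and does not reproduce it.

```python
import math


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def redundant_layers(attention, threshold: float):
    """Indices i >= 1 whose weights are nearly identical to layer i - 1.

    `attention[i]` is the normalized attention-weight vector of layer i.
    A plain threshold is used here as a simplified stand-in for EBQM.
    """
    return [
        i
        for i in range(1, len(attention))
        if kl_divergence(attention[i], attention[i - 1]) < threshold
    ]
```

Layers flagged this way are candidates for skipping, since their attention pattern adds little beyond the preceding layer.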

2411.15076 2026-06-02 eess.IV cs.CV q-bio.QM 版本更新

RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency

RankByGene: 通过跨模态排序一致性实现基因引导的组织病理学表示学习

Wentao Huang, Meilong Xu, Xiaoling Hu, Shahira Abousamra, Aniruddha Ganguly, Saarthak Kapse, Alisa Yurovsky, Prateek Prasanna, Tahsin Kurc, Joel Saltz, Michael L. Miller, Chao Chen

发表机构 * Stony Brook University(石溪大学) Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School(阿提诺拉A.马丁努斯生物医学影像中心,麻省总医院和哈佛医学院) Department of Biomedical Data Science, Stanford University(生物医学数据科学系,斯坦福大学) Department of Pathology and Cell Biology, Columbia University(病理学与细胞生物学系,哥伦比亚大学)

AI总结 提出基于排序对齐损失的框架,利用教师-学生网络自监督知识蒸馏,解决空间转录组学与组织学图像的对齐问题,在基因表达预测、切片分类和生存分析任务中表现优异。

Comments 18 pages, 9 figures

详情
AI中文摘要

空间转录组学(ST)通过映射组织内的基因表达提供必要的空间背景,从而能够详细研究细胞异质性和组织结构。然而,由于固有的空间扭曲和模态特异性变化,将ST数据与组织学图像对齐面临挑战。现有方法主要依赖直接对齐,通常无法捕捉复杂的跨模态关系。为解决这些限制,我们提出一种新颖框架,使用基于排序的对齐损失来对齐基因和图像特征,保留跨模态的相对相似性,并实现稳健的多尺度对齐。为进一步增强对齐的稳定性,我们采用教师-学生网络架构的自监督知识蒸馏,有效减轻基因表达数据中高维性、稀疏性和噪声带来的干扰。在涵盖基因表达预测、切片级分类和生存分析的七个公共数据集上的大量实验证明了我们方法的有效性,显示出比现有方法更好的对齐和预测性能。

英文摘要

Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further enhance the alignment's stability, we employ self-supervised knowledge distillation with a teacher-student network architecture, effectively mitigating disruptions from high dimensionality, sparsity, and noise in gene expression data. Extensive experiments on seven public datasets that encompass gene expression prediction, slide-level classification, and survival analysis demonstrate the efficacy of our method, showing improved alignment and predictive performance over existing methods.
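
To make "preserving relative similarity across modalities" concrete, the sketch below implements a generic pairwise ranking hinge loss: whenever sample j is more similar to anchor i than sample k is in gene space, the same ordering is encouraged in image space. This is a textbook-style formulation chosen for illustration, not the loss defined in the paper. The function name, the precomputed similarity matrices, and the `margin` parameter are all assumptions.

```python
def ranking_consistency_loss(gene_sim, image_sim, margin: float = 0.0) -> float:
    """Mean hinge penalty over triplets whose gene-space order is violated.

    gene_sim[i][j] and image_sim[i][j] hold the similarity between
    samples i and j in the gene and image feature spaces respectively.
    """
    n = len(gene_sim)
    loss, count = 0.0, 0
    for i in range(n):          # anchor
        for j in range(n):      # should rank higher
            for k in range(n):  # should rank lower
                if len({i, j, k}) < 3:
                    continue
                if gene_sim[i][j] > gene_sim[i][k]:
                    gap = image_sim[i][j] - image_sim[i][k]
                    loss += max(0.0, margin - gap)
                    count += 1
    return loss / count if count else 0.0
```

Because only the ordering is constrained, the two modalities do not need to share a common similarity scale, which is the usual motivation for ranking losses over direct feature matching.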

2503.15639 2026-06-02 cs.CV cs.AI 版本更新

A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition

一种轻量级上下文驱动的免训练网络用于场景文本分割与识别

Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu

发表机构 * CVPR Unit, Indian Statistical Institute, Kolkata, India(印度加尔各答印度统计研究所CVPR研究组) Manipal University Jaipur, India(印度斋浦尔马尼帕尔大学) University of Salford, UK(英国萨尔福德大学) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出一种基于上下文理解、无需训练的即插即用框架,通过注意力分割和语义评估实现高效场景文本识别,性能与SOTA相当且资源消耗更低。

Comments Accepted at ICDAR 2025 (ORAL) 21 pages, 8 figures, 7 tables

详情
AI中文摘要

现代场景文本识别系统通常依赖于大型端到端架构,这些架构需要大量训练,并且对于实时场景来说成本过高。在这种情况下,由于内存、计算资源和延迟的限制,部署重型模型变得不切实际。为了应对这些挑战,我们提出了一种新颖的、无需训练的即插即用框架,该框架利用预训练文本识别器的优势,同时最小化冗余计算。我们的方法使用基于上下文的理解,并引入了一个基于注意力的分割阶段,该阶段在像素级别细化候选文本区域,从而改进下游识别。我们不执行传统的文本检测(即特征图与源图像之间的块级比较),而是利用预训练的标题生成器来利用上下文信息,使框架能够直接从场景上下文生成单词预测。候选文本经过语义和词汇评估以获得最终分数。达到或超过预定义置信度阈值的预测绕过更重的端到端文本STR(场景文本识别)流程,确保更快的推理并减少不必要的计算。在公共基准上的实验表明,我们的范式实现了与最先进系统相当的性能,但所需资源大大减少。我们的代码可在此处找到:https://ritabrata04.github.io/Context-driven-STR/。

英文摘要

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection, which follows a block-level comparison between the feature map and the source image, our approach harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context. Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end scene text recognition (STR) profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources. Our code can be found here: https://ritabrata04.github.io/Context-driven-STR/.

2503.06136 2026-06-02 cs.CV cs.AI 版本更新

GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

GSV3D: 基于高斯溅射的几何蒸馏与稳定视频扩散用于单图像3D物体生成

Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou

发表机构 * State Key Laboratory of Virtual Reality Technology and Systems, Beihang University(虚拟现实技术与系统国家重点实验室,北京航空航天大学) SenseTime Research(商汤研究) PBVR

AI总结 提出一种结合2D扩散模型隐式3D推理能力与高斯溅射几何蒸馏的方法,通过高斯溅射解码器将SV3D潜变量输出转换为显式3D表示,实现多视图一致性和高质量3D生成。

详情
AI中文摘要

基于图像的3D生成在机器人和游戏领域有广泛应用,其中高质量、多样化的输出和一致的3D表示至关重要。然而,现有方法存在局限性:3D扩散模型受限于数据集稀缺和缺乏强大的预训练先验,而基于2D扩散的方法则难以保证几何一致性。我们提出了一种方法,利用2D扩散模型的隐式3D推理能力,同时通过基于高斯溅射的几何蒸馏确保3D一致性。具体来说,所提出的高斯溅射解码器通过将SV3D潜变量输出转换为显式3D表示来强制3D一致性。与仅依赖隐式2D表示进行视频生成的SV3D不同,高斯溅射显式编码空间和外观属性,通过几何约束实现多视图一致性。这些约束纠正了视图不一致性,确保了稳健的几何一致性。因此,我们的方法同时生成高质量、多视图一致的图像和精确的3D模型,为基于单图像的3D生成提供了可扩展的解决方案,并弥合了2D扩散多样性与3D结构一致性之间的差距。实验结果表明,该方法在多个数据集上实现了最先进的多视图一致性和强泛化能力。代码将在接收后公开。

英文摘要

Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D Diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

2502.07617 2026-06-02 cs.CV 版本更新

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

将视觉语言模型的预训练扩展到一千亿数据

Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 本文通过实验探究将视觉语言模型预训练数据扩展到一千亿规模的效果,发现传统基准性能饱和,但文化多样性任务和低资源语言受益显著,并指出质量过滤可能减少文化多样性。

Comments v2: CVPR Findings'26

详情
AI中文摘要

我们提供了一个关于将视觉语言模型预训练扩展到前所未有规模——一千亿样本——潜力的实证研究。我们发现,在许多常见的西方中心分类和检索基准(如COCO Captions)上,模型性能在此规模下趋于饱和。然而,文化多样性任务从一千亿规模的网络数据中获得了更实质性的提升,这得益于其对长尾概念的覆盖。此外,我们分析了模型的多语言能力,并展示了在低资源语言上的提升。另外,我们观察到,通过使用如CLIP等质量过滤器减少预训练数据集的大小(通常用于提升性能)可能会无意中减少大规模数据集中所代表的文化多样性。我们的结果强调,虽然传统基准可能不会从将噪声原始网络数据扩展到一千亿样本中显著受益,但这一数据规模对于构建真正包容的多模态系统至关重要。

英文摘要

We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks of cultural diversity achieve more substantial gains from the 100-billion scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset via quality filters such as CLIP, a step typically used to enhance performance, may inadvertently reduce the cultural diversity represented in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.

2111.03861 2026-06-02 cs.CV cs.AI cs.LG 版本更新

What augmentations are sensitive to hyper-parameters and why?

哪些数据增强对超参数敏感以及为什么?

Ch Muhammad Awais, Imad Eddine Ibrahim Bekkouch

发表机构 * Knowledge Representation Lab Innopolis University(知识表示实验室 因诺波利斯大学) Sorbonne Center for Artificial Intelligence - SCAI Sorbonne University(索邦人工智能中心 - SCAI 索邦大学)

AI总结 本研究通过局部代理(LIME)解释和线性回归系数评估不同数据增强对模型超参数的敏感性、一致性和影响,发现某些增强对超参数高度敏感,而另一些则更稳健可靠。

Comments 10 pages, 17 figures

详情
Journal ref
Intelligent Computing: Proceedings of the 2022 Computing Conference
AI中文摘要

我们对数据集应用增强以提高预测质量,并使最终模型对噪声数据和领域漂移更具鲁棒性。然而,问题仍然存在:这些增强在不同的超参数下表现如何?在本研究中,我们通过执行局部代理(LIME)解释来评估增强对模型超参数的敏感性、一致性和影响,当不同增强应用于机器学习模型时,解释超参数的影响。我们利用线性回归系数来加权每个增强。我们的研究证明,有些增强对超参数高度敏感,而其他增强则更具鲁棒性和可靠性。

英文摘要

We apply augmentations to our dataset to enhance the quality of our predictions and make our final models more resilient to noisy data and domain drifts. Yet the question remains: how are these augmentations going to perform with different hyper-parameters? In this study we evaluate the sensitivity of augmentations with regard to the model's hyper-parameters, along with their consistency and influence, by performing a Local Surrogate (LIME) interpretation of the impact of hyper-parameters when different augmentations are applied to a machine learning model. We have utilized linear regression coefficients for weighting each augmentation. Our research has shown that some augmentations are highly sensitive to hyper-parameters, while others are more resilient and reliable.

2501.12178 2026-06-02 cs.CV 版本更新

Visualizing definitional divergence in high-dimensional data by manifold alignment: Application to 3D right ventricular strain computations

通过流形对齐可视化高维数据中的定义差异:应用于3D右心室应变计算

Maxime Di Folco, Gabriel Bernardino, Patrick Clarysse, Nicolas Duchateau

发表机构 * Univ Lyon, Université Claude Bernard Lyon 1, INSA-Lyon, CNRS, Inserm, CREATIS UMR 5220, U1294(里昂大学,克劳德·贝尔纳里昂第一大学,INSA-里昂,CNRS,Inserm,CREATIS UMR 5220,U1294) Institute of Machine Learning in Biomedical Imaging, Helmholtz Center Munich, Germany(德国慕尼黑亥姆霍兹中心生物医学成像机器学习研究所) LTCI, Telecom Paris, Institut Polytechnique de Paris(LTCI,电信巴黎,巴黎理工学院) DTIC, Universitat Pompeu Fabra, Barcelona, Spain(DTIC,庞培法布拉大学,巴塞罗那,西班牙) Institut Universitaire de France (IUF)(法国大学研究所(IUF))

AI总结 提出一种基于表示学习的策略,通过流形对齐匹配不同定义的高维数据,并重建参数图以可视化定义差异,应用于右心室应变分析。

Comments Accepted for publication in IEEE Transactions on Medical Imaging, DOI: 10.1109/TMI.2026.3698240 © 2026 IEEE. Personal use is permitted. For all other uses, permission must be obtained from IEEE

详情
AI中文摘要

医学影像研究通常依赖于每个受试者的单个样本,假设其能代表生理特征。然而,输入描述符定义或计算方式的变化(例如由于科学领域缺乏共识)可能对分析产生关键影响,但在实践中很少被考虑。本文提出一种基于表示学习的原创策略,用于估计反映这种定义差异对先前从医学图像中提取的特定生理描述符影响的参数图。我们将这些生理描述符的不同定义或计算视为不同的高维数据,可能具有异构类型。我们特别关注心肌变形(应变),其定义尚未达成共识。我们首先使用流形对齐来匹配与该描述符不同定义相关的潜在表示。然后,我们在潜在空间中制定合理的分布来表示描述符之间的定义差异,并从中重建高维参数图以可视化这种定义差异。由于缺乏针对该特定临床应用的适当真实数据,我们首先在玩具实验上演示该方法,然后扩展到从3D超声心动图图像序列获得的受试者右心室应变数据的评估,其中右心室内膜表面网格的每个点都有不同类型的应变可用。除了这一说明性应用外,我们的方法具有推广到其他考虑异构高维描述符的人群分析的潜力。

英文摘要

Medical imaging studies often rely on a single sample per subject, assuming it is representative of their physiological traits. However, variations in how input descriptors are defined or computed (e.g. due to a lack of consensus in the scientific field) may have a crucial impact on the analysis, and are hardly considered in practice. In this paper, we propose an original strategy based on representation learning to estimate a parametric map reflecting the impact of such definitional differences on a given physiological descriptor, previously extracted from medical images. We consider the different definitions or computations of such physiological descriptors as different high-dimensional data, potentially of heterogeneous types. We specifically focus on myocardial deformation (strain), for which there is limited agreement on its definition. We first use manifold alignment to match the latent representations associated with the different definitions of this descriptor. Then, we formulate plausible distributions in the latent space to represent definitional divergence across descriptors, from which we reconstruct a high-dimensional parametric map to visualize such definitional divergence. Due to the lack of proper ground truth for this specific clinical application, we first demonstrate this methodology on toy experiments and then expand the evaluation on right ventricular strain data from subjects obtained from 3D echocardiographic image sequences, for which different types of strain are available at each point of the right ventricle endocardial surface mesh. Beyond this illustrative application, our methodology has the potential to be generalised to many other population analyses considering heterogeneous high-dimensional descriptors.

2412.10362 2026-06-02 cs.LG cs.CV 版本更新

OP-LoRA: The Blessing of Dimensionality

OP-LoRA:维度的祝福

Piotr Teterwak, Kate Saenko, Bryan A. Plummer, Ser-Nam Lim

发表机构 * Boston University(波士顿大学) University of Central Florida(中央佛罗里达大学)

AI总结 提出OP-LoRA方法,通过额外MLP预测LoRA适配器权重以改善优化,训练后丢弃MLP,在零额外推理成本下提升性能并降低对学习率的敏感性。

详情
AI中文摘要

低秩适配器(LoRA)使得仅用少量参数即可微调大模型。然而,它们常常面临病态的损失景观,导致优化困难。先前的工作通过自定义优化器将适配器更新与全微调梯度对齐来解决这些挑战,但这些方法缺乏适应新适配器架构的灵活性,且计算成本高。我们引入了OP-LoRA,一种新颖的方法,它用额外的MLP预测的权重替换每个LoRA适配器,该MLP在训练后被丢弃。这允许在训练期间临时增加额外参数以改善优化,但比自定义优化器需要更少的墙钟时间,并且在推理时零额外成本,因为MLP被丢弃。关键的是,将OP-LoRA扩展到其他适配器只需修改每个新适配器类型的预测头大小。我们表明,OP-LoRA允许优化自适应地增加或减少步长,从而提高性能并降低对学习率的敏感性。在小型和大型LoRA微调任务中,我们观察到OP-LoRA相对于LoRA及其变体的一致性能提升。我们在图像生成中取得了特别显著的改进,OP-LoRA的CMMD分数相对于LoRA提高了多达15分。这使得OP-LoRA能够在推理参数减半的情况下达到LoRA的性能。

英文摘要

Low-rank adapters (LoRA) enable finetuning of large models with only a small number of parameters. However, they often suffer from an ill-conditioned loss landscape, leading to difficult optimization. Prior work addresses these challenges by aligning adapter updates with full finetuning gradients via custom optimizers, but these methods lack the flexibility to accommodate new adapter architectures and are computationally expensive. We instead introduce OP-LoRA, a novel method which replaces each LoRA adapter with weights predicted by an extra MLP, which is discarded after training. This temporarily allows additional parameters during training to improve optimization, yet requires less wall time than custom optimizers and zero extra cost at inference time because the MLP is discarded. Crucially, extending OP-LoRA to other adapters is as simple as modifying the size of the prediction head for each new adapter type. We show that OP-LoRA allows the optimization to adaptively increase or decrease step size, improving performance and decreasing sensitivity to learning rate. On both small and large-scale LoRA tuning tasks, we observe consistent performance gains of OP-LoRA relative to LoRA and its variants. We achieve especially notable improvements in image generation, with OP-LoRA CMMD scores improving by up to 15 points relative to LoRA. This allows OP-LoRA to achieve the performance of LoRA with half of the inference parameters.
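
The central mechanism, predicting the LoRA matrices with an auxiliary MLP during training and then discarding that MLP, can be sketched in plain Python. The sketch covers only the forward path and the export step; training is omitted. The architecture details are assumptions made for illustration and are not confirmed by the abstract: a learned input vector `z`, a single ReLU hidden layer, and all sizes and names (`AdapterPredictor`, `lora_forward`).

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class AdapterPredictor:
    """Training-time MLP (assumed form): z -> hidden -> flattened (A, B)."""

    def __init__(self, d_in, d_out, rank, z_dim=4, hidden=16, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.z = [rng.uniform(-1, 1) for _ in range(z_dim)]
        self.w1 = rand_matrix(hidden, z_dim, rng)
        self.w2 = rand_matrix(rank * (d_in + d_out), hidden, rng)

    def predict(self):
        """Return A (rank x d_in) and B (d_out x rank) as nested lists."""
        hidden = [max(0.0, h) for h in matvec(self.w1, self.z)]  # ReLU
        flat = matvec(self.w2, hidden)
        split = self.rank * self.d_in
        a = [flat[r * self.d_in:(r + 1) * self.d_in] for r in range(self.rank)]
        b = [
            flat[split + o * self.rank: split + (o + 1) * self.rank]
            for o in range(self.d_out)
        ]
        return a, b


def lora_forward(w, a, b, x):
    """y = W x + B (A x): the usual LoRA-adapted linear layer."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + d for y, d in zip(base, delta)]
```

After training, calling `predict()` once and keeping only the returned `A` and `B` is what makes the inference cost identical to ordinary LoRA: the predictor's own parameters are never needed again.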

2411.17790 2026-06-02 cs.CV cs.AI 版本更新

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

基于潜在先验的自监督单目内窥镜深度与姿态估计

Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher

发表机构 * University of Oxford(牛津大学) University of Leeds(利兹大学)

AI总结 提出一种结合生成潜在库和变分自编码器的自监督框架,通过自然图像深度先验和姿态潜在变量正则化,实现内窥镜复杂场景下的高精度深度与姿态估计。

详情
AI中文摘要

内窥镜中的精确3D映射能够实现胃肠道(GI)内定量、整体的病变表征,这需要可靠的深度和姿态估计。然而,内窥镜系统是单目的,现有依赖合成数据集或复杂模型的方法在具有挑战性的内窥镜条件下往往缺乏泛化能力。我们提出了一种鲁棒的自监督单目深度和姿态估计框架,该框架结合了生成潜在库(Generative Latent Bank)和变分自编码器(VAE)。生成潜在库利用自然图像中的广泛深度场景来调节深度网络,通过潜在特征先验增强深度预测的真实感和鲁棒性。对于姿态估计,我们将其重新构建在VAE框架内,将姿态转换视为潜在变量以正则化尺度、稳定z轴突出性并提高x-y灵敏度。这种双重精炼流程能够实现精确的深度和姿态预测,有效应对胃肠道复杂的纹理和光照。在SimCol和EndoSLAM数据集上的广泛评估证实,我们的框架在内窥镜深度和姿态估计方面优于已发表的自监督方法。

英文摘要

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.

2411.12321 2026-06-02 cs.CV 版本更新

Enhancing Blind Source Separation with Dissociative Principal Component Analysis

增强盲源分离的解离主成分分析

Muhammad Usman Khalid

发表机构 * College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University(伊玛目穆罕默德·本·沙特伊斯兰大学计算机与信息科学学院)

AI总结 提出解离主成分分析(DPCA),通过联合估计主成分和载荷向量并显式建模其相互依赖关系,克服传统稀疏PCA在源重叠时性能下降的问题,在模拟fMRI源恢复、前景背景分离等任务中优于经典sPCA。

Comments 13 pages with 6 figures, this work has not been published before

详情
AI中文摘要

主成分分析(PCA)及其稀疏变体(sPCA)被广泛用作独立成分分析(ICA)的前置步骤,用于盲源分离(BSS)。然而,sPCA通常依赖于一种逐次提取成分并在它们之间施加正交性的缩减策略。当底层源重叠时,这会丢弃ICA所依赖的跨成分结构,从而降低分离效果。本文提出解离PCA(DPCA),它联合估计成分而非通过缩减。DPCA在基于SVD的分解中引入左、右解离矩阵,以显式建模主成分(PC)和载荷向量(LV)之间的相互依赖关系,同时通过稀疏约束保持可解释性。我们开发了三种算法,称为DPCA1a、DPCA1b和DPCA2,采用自适应软阈值与梯度下降和坐标下降相结合,并辅以二次firm阈值(firm thresholding)步骤,以保持稀疏性并抑制恢复的载荷向量中的背景噪声。该方法在四个设置上进行了评估,即模拟fMRI源恢复、前景与背景分离、图像重建和图像修复,在这些设置中,它比基于经典sPCA的流程更可靠地恢复源结构,在显著空间重叠下增益最大。当稀疏参数为零时,DPCA退化为普通PCA。所提出算法的MATLAB实现可在https://github.com/usmankhalid06/DPCA公开获取。

英文摘要

Principal component analysis (PCA) and its sparse variants (sPCA) are widely used as a precursor to independent component analysis (ICA) for blind source separation (BSS). However, sPCA typically relies on a deflation strategy that extracts components sequentially and imposes orthogonality between them. When the underlying sources overlap, this discards the cross-component structure that ICA depends on, degrading separation. This paper proposes dissociative PCA (DPCA), which estimates components jointly rather than by deflation. DPCA introduces left and right dissociation matrices into the SVD-based decomposition to explicitly model the interdependencies among principal components (PCs) and loading vectors (LVs), while sparsity constraints maintain interpretability. We develop three algorithms, called DPCA1a, DPCA1b, and DPCA2, using adaptive soft thresholding with gradient and coordinate descent, together with a secondary firm thresholding step that preserves sparsity and suppresses background noise in the recovered loading vectors. The method is evaluated on four settings, namely simulated fMRI source retrieval, foreground and background separation, image reconstruction, and image inpainting, where it recovers source structure more reliably than classical sPCA-based pipelines, with the largest gains under significant spatial overlap. DPCA reduces to ordinary PCA when the sparsity parameter is zero. A MATLAB implementation of the proposed algorithms is publicly available at https://github.com/usmankhalid06/DPCA.
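
The two thresholding operators named in the abstract have standard scalar definitions, sketched below. Soft thresholding shrinks every value toward zero by `lam`; firm thresholding zeroes small values, leaves large values untouched, and interpolates linearly in between. How DPCA chooses its thresholds adaptively is not described in the abstract, so the fixed parameters and function names here are illustrative assumptions.

```python
def soft_threshold(x: float, lam: float) -> float:
    """sign(x) * max(|x| - lam, 0)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def firm_threshold(x: float, lam1: float, lam2: float) -> float:
    """Firm thresholding with lower threshold lam1 and upper threshold lam2.

    |x| <= lam1        -> 0
    lam1 < |x| <= lam2 -> sign(x) * lam2 * (|x| - lam1) / (lam2 - lam1)
    |x| > lam2         -> x (no shrinkage)
    """
    if not lam2 > lam1 >= 0:
        raise ValueError("require 0 <= lam1 < lam2")
    magnitude = abs(x)
    if magnitude <= lam1:
        return 0.0
    if magnitude > lam2:
        return x
    sign = 1.0 if x > 0 else -1.0
    return sign * lam2 * (magnitude - lam1) / (lam2 - lam1)
```

Setting the soft threshold to zero returns every coefficient unchanged, which matches the abstract's remark that DPCA reduces to ordinary PCA when the sparsity parameter is zero.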

2411.05359 2026-06-02 cs.CV cs.AI cs.CY version update

Agricultural Landscape Understanding At Country-Scale

国家级农业景观理解

Radhika Dua, Aditi Agarwal, Aishwarya Jayagopal, Depanshu Sani, Alex Wilson, Hoang Tran, Ishan Deshpande, Bogdan Floristean, Neelabh Goyal, Ramya Cheruvu, Vishal Batchu, Yan Mayster, Gaurav Aggarwal, Alok Talekar, Vaibhav Rajan

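Affiliations: Google DeepMind; Google

AI summary: Presents the first national-scale agricultural mapping system, which uses novel post-processing heuristics to achieve instance segmentation of fields, trees, and water bodies, and is deployed and validated nationwide.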

Comments 32 pages, 11 tables, 22 figs

Details
AI Chinese abstract

全面的农业景观理解对于应对粮食安全、气候变化和资源管理等全球挑战至关重要。这不仅需要绘制农田地图,还需要绘制树木和水体等重要特征,这些特征在主导全球南方的复杂小农户系统中形成了错综复杂的镶嵌结构。以往开发此类土地利用地图的努力受到限制,仅专注于田地划界的方法,并且没有开发出实际部署所必需的稳健后处理步骤。此外,据我们所知,之前没有针对小农户农场的系统在国家范围内进行部署和评估。本文通过提出首个国家级农业制图系统来解决这些局限性,该系统超越了简单的田地划界,能够对田地、树木和水体等农业实例进行分割。我们的系统通过新颖的后处理启发式方法进行了优化,以确保地图的一致性和准确性,并通过严格、多方面的评估过程进行了验证。我们系统生成的精细土地利用地图可通过API在 http://agri.withgoogle.com 公开访问,支持从精准农业和政策制定到推进全球可持续发展目标的各种应用。

English abstract

Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resource management. This requires mapping not just crop fields, but also vital features like trees and water bodies, which form an intricate mosaic in the complex smallholder systems dominating the Global South. Previous efforts to develop such land use maps have been limited by a narrow focus on methods for field delineation only, and have not developed the robust post-processing steps essential for real-world deployment. Further, to our knowledge, no prior system for smallholder farms has been deployed and evaluated at a national scale. This work addresses these limitations by presenting the first national-scale agricultural mapping system that moves beyond simple field delineation to enable segmentation of agricultural instances like fields, trees and water bodies. Our system is refined for real-world application using novel post-processing heuristics to ensure map consistency and accuracy, and is validated through a rigorous, multi-faceted evaluation process. Fine-grained land use maps generated by our system are publicly accessible via an API at http://agri.withgoogle.com, enabling a wide range of applications from precision agriculture and policy-making to advancing global sustainability development goals.

2410.21361 2026-06-02 cs.CV cs.LG version update

Domain Adaptation with a Single Vision-Language Embedding

基于单一视觉-语言嵌入的域适应

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

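Affiliations: Inria; Kyutai

AI summary: Proposes a domain adaptation framework that relies on a single vision-language (VL) embedding. Prompt/photo-driven instance normalization (PIN) mines multiple visual styles, enabling zero-shot and one-shot unsupervised domain adaptation that outperforms baselines on semantic segmentation.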

Comments International Journal of Computer Vision (IJCV 2026)

Details
AI Chinese abstract

域适应在计算机视觉中已被广泛研究,但仍需要在训练时访问目标数据,这在现实世界的自动驾驶场景中可能难以获得,尤其是在罕见或恶劣条件下。本文提出了一种新的域适应框架,该框架依赖于单一的视觉-语言(VL)潜在嵌入,而不是完整的目标数据。首先,利用对比语言-图像预训练模型(CLIP),我们提出了提示/照片驱动的实例归一化(PIN)。PIN是一种特征增强方法,通过优化低级源特征的仿射变换,使用单一的目标VL潜在嵌入挖掘多种视觉风格。VL嵌入可以来自描述目标域的语言提示、部分优化的语言提示或单一未标记的目标图像。其次,我们表明这些挖掘的风格(即增强)可用于零样本(即无目标)和单样本无监督域适应。在真实世界驾驶数据集(包括Cityscapes和ACDC(恶劣条件))上的语义分割实验证明了所提出方法的有效性,在实用的零样本和单样本设置中优于相关基线。

English abstract

Domain adaptation has been extensively investigated in computer vision but still requires access to target data at training time, which might be difficult to obtain in real-world autonomous driving scenarios, especially under rare or adverse conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language-image pre-training model (CLIP), we propose prompt/photo-driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low-level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation. Experiments on semantic segmentation in real-world driving datasets, including Cityscapes and ACDC (adverse conditions), demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the practical zero-shot and one-shot settings.

2404.13621 2026-06-02 cs.CV cs.LG cs.MM version update

Attack on Scene Flow using Point Clouds

使用点云对场景流进行攻击

Haniyeh Ehsani Oskouie, Mohammad-Shahram Moin, Shohreh Kasaei

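Affiliations: Sharif University of Technology; ICT Research Institute

AI summary: Proposes white-box adversarial attacks on scene flow networks that cause up to 33.7% relative degradation in average end-point error on the KITTI and FlyingThings3D datasets, and reveals the impact of attacks restricted to a single dimension or color channel.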

Details
AI Chinese abstract

深度神经网络在使用点云准确估计场景流方面取得了显著进展,这对于视频分析、动作识别和导航等许多应用至关重要。然而,这些技术的鲁棒性仍然令人担忧,特别是在面对已被证明能在许多领域欺骗最先进深度神经网络的对抗攻击时。令人惊讶的是,场景流网络对此类攻击的鲁棒性尚未得到彻底研究。为解决这一问题,本文提出了一种专门针对场景流网络的白盒对抗攻击方法。实验结果表明,生成的对抗样本在KITTI和FlyingThings3D数据集上使平均端点误差相对下降高达33.7%。研究还揭示了仅针对点云的一个维度或颜色通道的攻击对平均端点误差的显著影响。通过分析这些攻击在场景流网络及其2D光流网络变体上的成功与失败,发现光流网络具有更高的脆弱性。代码可在https://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git获取。

English abstract

Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To bridge this gap, this work introduces adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples obtain up to 33.7% relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks. Code is available at https://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git.
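
For readers unfamiliar with the metric, here is a short Python sketch of average end-point error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors. The relative_degradation helper assumes the change is measured relative to the clean EPE. The abstract does not spell out its formula, so treat that definition, the function names, and the toy vectors as assumptions rather than the paper's evaluation code.

```python
import math


def average_epe(pred_flow, gt_flow):
    """Mean Euclidean distance between paired flow vectors."""
    if len(pred_flow) != len(gt_flow) or not gt_flow:
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_flow, gt_flow))
    return total / len(gt_flow)


def relative_degradation(epe_clean, epe_adv):
    """Assumed definition: increase in EPE as a fraction of the clean EPE."""
    return (epe_adv - epe_clean) / epe_clean


# Toy 3D flow vectors for two points (hypothetical values).
gt = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
clean_pred = [(1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]  # errors 0 and 1
adv_pred = [(1.0, 0.0, 2.0), (0.0, 2.0, 1.0)]    # errors 2 and 1

epe_clean = average_epe(clean_pred, gt)
epe_adv = average_epe(adv_pred, gt)
degradation = relative_degradation(epe_clean, epe_adv)
```

In this toy case the perturbed prediction changes only one coordinate of one point, which loosely mirrors the abstract's single-dimension attacks, though no actual attack is performed here.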

2012.01494 2026-06-02 cs.CV version update

Braille to Text Translation for Bengali Language: A Geometric Approach

孟加拉语盲文到文本翻译:一种几何方法

Minhas Kamal, Amin Ahsan Ali, Muhammad Asif Hossain Khan, Mohammad Shoyaib

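Affiliations: Institute of Information Technology; University of Dhaka

AI summary: Addresses the lack of Braille translation tools for Bengali by proposing a Braille-to-text translation method based on image processing and geometric structure analysis, reaching 97.25% recognition accuracy.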

Comments GitHub Repo.: https://github.com/MinhasKamal/BrailleToTextTranslator

Details
Journal ref
Jahangirnagar University Journal of Information Technology (JJIT), vol. 7, pp. 93-111, June, 2018
AI Chinese abstract

盲文是视障人士阅读和书写的唯一系统。然而,普通人无法阅读盲文。因此,教师和亲属在帮助他们学习时遇到困难。几乎所有主要语言都有用于此翻译目的的软件解决方案。然而,在孟加拉语中缺乏这一有用的工具。在这里,我们提出盲文到文本翻译器,它获取这些触觉字母的图像,并将其翻译为纯文本。图像退化、扫描时页面旋转和盲文点变形是该方案中的主要问题。所有这些挑战都通过特殊的图像处理和几何结构分析直接检查。该技术在识别盲文字符方面达到了97.25%的准确率。

English abstract

Braille is the only reading and writing system for visually impaired people. However, sighted people generally cannot read Braille, so teachers and relatives find it hard to assist them with learning. Almost every major language has software solutions for this translation purpose, but no such tool exists for Bengali. Here, we propose Braille to Text Translator, which takes an image of these tactile characters and translates them to plain text. Image deterioration, scan-time page rotation, and Braille dot deformation are the principal issues in this scheme. All of these challenges are addressed directly using special image processing and geometric structure analysis. The technique yields 97.25% accuracy in recognizing Braille characters.

2404.11326 2026-06-02 cs.CV version update

SCL: Towards Domain Generalization via Single-Temporal Multimodal Contrastive Learning for Remote Sensing Change Detection

SCL:面向遥感变化检测的单时相多模态对比学习域泛化方法

Qiangang Du, Jinlong Peng, Xu Chen, Qingdong He, Liren He, Qiang Nie, Mingmin Chi

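Affiliations: Fudan University; Tencent YouTu Lab

AI summary: Proposes a Single-temporal multimodal Contrastive Learning (SCL) foundation model built on a vision-language pre-trained model. Combined with Dynamic Text-vision Context Optimization (DTCO) and a controllable generation and single-temporal training strategy (SAIN), it generalizes across remote sensing change detection datasets without training on the target dataset.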

Comments CVPRW 2026

Details
AI Chinese abstract

近年来,基于CNN和Transformer的变化检测与异常检测模型在基于配对数据的多个数据集上取得了显著成功。然而,由于领域特定的设计,大多数此类方法表现出有限的跨数据集泛化能力,并且通常依赖于大量配对的标注数据。本文基于视觉-语言预训练模型,引入了一种单时相多模态对比学习(SCL)基础模型,用于变化检测,无需在目标数据集上进行训练。为了进一步提高模型学习文本和视觉信息上下文的能力,我们提出了一种动态文本-视觉上下文优化(DTCO)模块用于提示学习。同时,为了解决现有方法的数据依赖性问题,我们引入了一种可控生成和单时相训练策略(SAIN)。这使得我们能够利用大量现有的单时相图像训练模型,而无需配对标签。在各种真实世界变化检测数据集上的大量实验表明,SCL具有优越的性能和泛化能力,在评估设置下优于最先进的方法。代码可在https://github.com/Kane-Du/scl-cd.git获取。

English abstract

In recent years, change detection and anomaly detection models based on CNNs and transformers have achieved remarkable success across various datasets based on paired data. However, most such methods exhibit limited cross-dataset generalization due to domain-specific designs and typically rely on large amounts of paired labeled data. In this paper, building on a vision-language pre-trained model, we introduce a Single-temporal multimodal Contrastive Learning (SCL) foundation model for change detection that requires no training on the target dataset. To further improve the model's ability to learn the context of textual and visual information, we propose a Dynamic Text-vision Context Optimization (DTCO) module for prompt learning. Meanwhile, to address the data dependency issue of existing methods, we introduce a controllable generation and Single-temporal trAINing strategy (SAIN). This allows us to train the model using a large number of existing single-temporal images without the need for paired labels. Extensive experiments on various real-world change detection datasets demonstrate the superior performance and generalization of SCL, which outperforms state-of-the-art methods under the evaluated settings. Code is available at https://github.com/Kane-Du/scl-cd.git.

2307.06647 2026-06-02 cs.RO cs.AI cs.CV version update

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

DeepIPCv2: 基于LiDAR的鲁棒环境感知与自动驾驶导航控制

Oskar Natan, Jun Miura

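Affiliations: Department of Computer Science and Electronics, Universitas Gadjah Mada; Department of Computer Science and Engineering, Toyohashi University of Technology

AI summary: Proposes DeepIPCv2, an end-to-end autonomous driving framework that builds a robust scene representation from LiDAR point cloud segmentation and multi-view projection, and jointly estimates waypoints and navigational control commands with gated recurrent units, command-specific multi-layer perceptrons, and PID controllers. It achieves the lowest total metric error and the fewest driving interventions under varying illumination.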

Comments This work has been accepted for publication in IEEE Access. https://ieeexplore.ieee.org/document/11313052

Details
AI Chinese abstract

我们提出DeepIPCv2,一个端到端的自动驾驶框架,它集成了基于LiDAR的环境感知与命令特定的控制学习。与先前依赖摄像头的模型不同,DeepIPCv2采用点云分割和多视图投影来构建鲁棒的场景表示。这些特征通过门控循环单元、命令特定的多层感知器和PID控制器的组合进行融合和解码,以估计路径点和导航控制命令。这种设计增强了机动性并解决了驾驶数据集中的动作不平衡问题。为了验证模型,我们构建了一个覆盖不同光照条件的数据集,并进行了消融研究和与包括TransFuser在内的最新方法的对比测试。结果表明,DeepIPCv2实现了最低的总指标误差和最少的驾驶干预,突显了其对光照变化的鲁棒性和改进的控制精度。通过稍后在https://github.com/oskarnatan/DeepIPCv2发布代码,我们旨在支持端到端自动驾驶研究的可重复性和未来进展。

English abstract

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific control learning. Unlike prior camera-reliant models, DeepIPCv2 employs point cloud segmentation and multi-view projection to construct robust scene representations. These features are fused and decoded through a combination of gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate both waypoints and navigational control commands. This design enhances maneuverability and addresses action imbalance in driving datasets. To validate the model, we constructed a dataset covering diverse illumination conditions and conducted ablation studies and comparative tests against recent methods, including TransFuser. Results demonstrate that DeepIPCv2 achieves the lowest total metric error and the fewest driving interventions, highlighting both its robustness to illumination changes and its improved control accuracy. We plan to release the code at https://github.com/oskarnatan/DeepIPCv2 to support reproducibility and future advancements in end-to-end autonomous driving research.

2310.15676 2026-06-02 cs.CV cs.AI version update

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

多模态3D智能的最新进展:综合调查与评估

Yinjie Lei, Zixuan Wang, Feng Chen, Guoqing Wang, Peng Wang, Yang Yang

Affiliations: College of Electronics and Information Engineering, Sichuan University; School of Computer Science, University of Adelaide; School of Computer Science and Engineering, University of Electronic Science and Technology of China

AI summary: Systematically surveys multi-modal 3D intelligence methods, proposes a new taxonomy organized by modality and task, compares results on benchmark datasets, and discusses future research directions.

Details
AI Chinese abstract

多模态3D智能因其在自动驾驶和世界模拟等领域的广泛应用而受到广泛关注。与传统的单模态3D理解相比,引入额外模态不仅提升了场景解释的丰富性和精确性,还为更高层次的物理世界交互奠定了基础。在仅依赖3D数据可能不足的多样化和挑战性环境中,这一点变得尤为关键。尽管过去六年中多模态3D方法的发展激增,特别是那些整合多相机图像(3D+2D)和文本描述(3D+语言)的方法,但缺乏全面深入的综述。在本文中,我们通过系统调查最新进展来弥补这一空白。我们首先简要总结了各种3D多模态任务中的独特挑战。之后,我们提出了一种新的分类法,根据模态和任务对现有方法进行彻底分类,探讨它们各自的优势和局限性。此外,我们提供了近期方法在几个基准数据集上的比较结果及深入分析。最后,我们讨论了未解决的问题,并提出了未来研究的几个潜在方向。

English abstract

Multi-modal 3D intelligence has gained considerable attention due to its wide applications in areas such as autonomous driving and world simulation. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also provides a foundation for higher-level physical world interaction. This becomes especially crucial in varied and challenging environments where solely relying on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over the past six years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this paper, we present a systematic survey of recent progress to bridge this gap. We begin by briefly summarizing the unique challenges among various 3D multi-modal tasks. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, we offer comparative results of recent approaches on several benchmark datasets, together with insightful analysis. Finally, we discuss the unresolved issues and provide several potential avenues for future research.

2208.00967 2026-06-02 cs.CV version update

Counterfactual Intervention Feature Transfer for Visible-Infrared Person Re-identification

反事实干预特征迁移用于可见光-红外行人重识别

Xulin Li, Yan Lu, Bin Liu, Yating Liu, Guojun Yin, Qi Chu, Jinyang Huang, Feng Zhu, Rui Zhao, Nenghai Yu

Affiliations: School of Information Science and Technology, University of Science and Technology of China; Key Laboratory of Electromagnetic Space Information, Chinese Academy of Science; School of Data Science, University of Science and Technology of China; SenseTime Research; Qing Yuan Research Institute, Shanghai Jiao Tong University

AI summary: To address the poor generalization of graph-based models in visible-infrared person re-identification, proposes Counterfactual Intervention Feature Transfer, which reduces modality imbalance through homogeneous and heterogeneous feature transfer and uses counterfactual relation intervention to make the graph topology more reliable.

Comments Accepted by ECCV 2022

Details
AI Chinese abstract

基于图模型的方法最近在行人重识别任务中取得了巨大成功,该方法首先计算不同行人之间的图拓扑结构(亲和度),然后跨行人传递信息以获得更强的特征。但我们发现,现有的基于图模型的方法在可见光-红外行人重识别任务(VI-ReID)中存在泛化性差的问题,原因有二:1)训练-测试模态平衡差距,这是VI-ReID任务的一个特性。训练阶段两种模态的数据量是平衡的,但在推理时极度不平衡,导致基于图的VI-ReID方法泛化性低。2)图模块的端到端学习方式导致次优的拓扑结构。我们分析认为,训练良好的输入特征削弱了图拓扑的学习,使其在推理过程中不够泛化。在本文中,我们提出了一种反事实干预特征迁移(CIFT)方法来解决这些问题。具体而言,设计了同质与异质特征迁移(H2FT),通过两种独立设计的图模块和不平衡场景模拟来减少训练-测试模态平衡差距。此外,提出了反事实关系干预(CRI),利用反事实干预和因果效应工具来突出拓扑结构在整个训练过程中的作用,使图拓扑结构更加可靠。在标准VI-ReID基准上的大量实验表明,CIFT在各种设置下均优于最先进的方法。

English abstract

Graph-based models have recently achieved great success in person re-identification tasks. They first compute the graph topology structure (affinities) among different people and then pass information across them to obtain stronger features. However, we find that existing graph-based methods in the visible-infrared person re-identification task (VI-ReID) generalize poorly because of two issues: 1) a train-test modality balance gap, which is a property of the VI-ReID task. The amounts of data from the two modalities are balanced in the training stage but extremely unbalanced in inference, causing the low generalization of graph-based VI-ReID methods. 2) a sub-optimal topology structure caused by end-to-end learning of the graph module. Our analysis suggests that well-trained input features weaken the learning of graph topology, so that it does not generalize well during inference. In this paper, we propose a Counterfactual Intervention Feature Transfer (CIFT) method to tackle these problems. Specifically, a Homogeneous and Heterogeneous Feature Transfer (H2FT) is designed to reduce the train-test modality balance gap using two independent types of well-designed graph modules and an unbalanced scenario simulation. In addition, a Counterfactual Relation Intervention (CRI) is proposed that uses counterfactual intervention and causal effect tools to highlight the role of topology structure in the whole training process, which makes the graph topology structure more reliable. Extensive experiments on standard VI-ReID benchmarks demonstrate that CIFT outperforms the state-of-the-art methods under various settings.

2203.03768 2026-06-02 cs.CV version update

CrowdFormer: Weakly-supervised Crowd counting with Improved Generalizability

CrowdFormer: 改进泛化性的弱监督人群计数

Siddharth Singh Savner, Vivek Kanhangad

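Affiliations: Department of Electrical Engineering, Indian Institute of Technology Indore, India

AI summary: Proposes a weakly-supervised crowd counting method based on a pyramid vision transformer, which models global context, performs comparably to existing methods, and shows remarkable generalizability.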

Details
Journal ref
Journal of Visual Communication and Image Representation, vol. 94, article 103853, 2023
AI Chinese abstract

卷积神经网络(CNN)由于其强大的局部特征学习能力,在计算机视觉领域主导了近十年。然而,由于感受野有限,CNN无法建模全局上下文。另一方面,基于注意力的变换器可以轻松建模全局上下文。尽管如此,目前关于变换器在人群计数中有效性的研究仍然有限。此外,现有的大多数人群计数方法基于密度图回归,这需要对场景中每个人进行点级标注。这种标注任务既费力又容易出错。这导致了对仅需要计数级标注的弱监督人群计数方法的关注增加。在本文中,我们提出了一种使用金字塔视觉变换器的弱监督人群计数方法。我们进行了广泛评估以验证所提出方法的有效性。我们的方法在基准人群数据集上与最先进方法相当。更重要的是,它表现出显著的泛化性。

English abstract

Convolutional neural networks (CNNs) have dominated the field of computer vision for nearly a decade due to their strong ability to learn local features. However, due to their limited receptive field, CNNs fail to model the global context. In contrast, the transformer, an attention-based architecture, can model the global context easily. Despite this, few studies have investigated the effectiveness of transformers in crowd counting. In addition, the majority of existing crowd counting methods are based on the regression of density maps, which requires point-level annotation of each person present in the scene. This annotation task is laborious and error-prone. This has led to increased focus on weakly-supervised crowd counting methods, which require only count-level annotations. In this paper, we propose a weakly-supervised method for crowd counting using a pyramid vision transformer. We have conducted extensive evaluations to validate the effectiveness of the proposed method. Our method is comparable to the state-of-the-art on the benchmark crowd datasets. More importantly, it shows remarkable generalizability.