arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.28820 2026-05-28 cs.CV 版本更新

From Pixels to Words -- Towards Native One-Vision Models at Scale

从像素到文字——迈向原生单视觉大规模模型

Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu, Yue Zhu, Zhongang Cai, Weichen Fan, Linjun Dai, Silei Wu, Xuanyu Zheng, Mingxuan Li, Yuanhan Zhang, Bo Li, Hanming Deng, Huchuan Lu, Quan Wang, Lei Yang, Lewei Lu, Dahua Lin, Ziwei Liu

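Affiliations: S-Lab, Nanyang Technological University (NTU); SenseTime Research; Dalian University of Technology (DLUT)

AI summary: Proposes NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end without external encoders or adapters. It largely narrows the gap to modular models while excelling at fine-grained visual perception, supporting the feasibility and competitiveness of native one-vision architectures.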

Comments 13 pages, 6 figures

AI-generated Chinese abstract

当前的视觉语言模型(VLM)通常通过多阶段对齐将独立的图像编码器和语言解码器拼接在一起,这种模块化框架不可避免地碎片化跨帧的像素级信号,并分散了早期的像素-文字交互。与此同时,原生VLM尽管在单图像上表现令人印象深刻,但在多图像、视频理解和空间智能方面仍鲜有探索。因此,我们引入了NEO-ov,一个原生基础模型,它端到端地学习跨帧和像素-文字对应,无需任何外部编码器、辅助适配器或事后融合。通过完全消除模块边界,NEO-ov使得细粒度且统一的时空建模能够在模型内部原生地涌现。值得注意的是,NEO-ov在缩小与模块化模型差距的同时,在细粒度视觉感知方面表现出色,验证了原生“单视觉”架构不仅可行,而且在大规模上具有竞争力。除了实证性能,我们还揭示了系统的架构分析和详细的训练配方,以促进后续的原生多模态建模。我们的代码和模型可在 https://github.com/EvolvingLMMs-Lab/NEO 公开获取。

English abstract

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

2605.28816 2026-05-28 cs.CV 版本更新

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Gamma-World: 超越双玩家的生成式多智能体世界建模

Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler, Yueqi Duan, Jun Gao, Igor Gilitschenski, Zian Wang, Xuanchi Ren

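Affiliations: NVIDIA; Tsinghua University; University of Toronto; Vector Institute

AI summary: Proposes a generative multi-agent world model that achieves agent permutation equivalence through Simplex Rotary Agent Encoding and reduces cross-agent attention cost with Sparse Hub Attention, supporting multi-player interactive video generation.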

Comments Project Page: https://research.nvidia.com/labs/sil/projects/gamma-world

AI-generated Chinese abstract

用于交互式视频生成的世界模型主要集中在单智能体设置中,其中未来观测由单个控制信号生成。然而,许多生成环境需要多智能体交互:多个玩家、机器人或具身智能体在共享空间中同时行动。将世界模型扩展到此类设置需要原则性的多智能体设计:智能体应保持独立可控、置换对称,并在保持时间和视角一致性的同时支持高效推理。在本文中,我们提出了用于交互式模拟的生成式多智能体世界模型。它引入了Simplex Rotary Agent Encoding,这是3D RoPE的一种无参数扩展,将智能体表示为旋转角度空间中的正则单纯形顶点。这为每个智能体赋予不同的相位,同时使所有智能体置换等价,从而无需学习每个槽位的身份或固定的智能体顺序即可实现可扩展的智能体身份。为了避免跨智能体的密集全连接注意力,我们进一步提出了Sparse Hub Attention,其中可学习的中心令牌调解跨智能体的令牌交互,将跨智能体注意力成本从智能体数量的二次方降低到线性。为了实现实时展开,我们将全上下文扩散教师模型蒸馏为因果学生模型,该模型通过KV缓存顺序生成时间块,实现24 FPS的动作响应生成。在多玩家虚拟环境中的实验表明,与基于槽位和密集注意力的基线相比,我们的模型提高了视频保真度、动作可控性和智能体间一致性,并且无需额外训练即可从两个玩家泛化到四个玩家。

English abstract

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi-agent design: agents should remain independently controllable, permutation-symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.
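The two ideas named in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not say how the simplex is embedded in rotary angle space, so `simplex_vertices` is a hypothetical helper that only shows why regular-simplex vertices treat every agent alike (all pairwise distances are equal). `hub_vs_dense_pairs` is an assumed counting model for agent-level attention links, not a measured cost.

```python
import math
from itertools import combinations


def simplex_vertices(n_agents):
    """Return n_agents unit vectors forming a regular simplex (hypothetical helper)."""
    if n_agents < 2:
        raise ValueError("need at least two agents")
    # Centre the standard basis vectors, then rescale them to unit length.
    scale = math.sqrt(1.0 - 1.0 / n_agents)
    return [
        [((1.0 if i == j else 0.0) - 1.0 / n_agents) / scale for j in range(n_agents)]
        for i in range(n_agents)
    ]


def pairwise_distances(vertices):
    """Euclidean distance between every unordered pair of vertices."""
    return [math.dist(a, b) for a, b in combinations(vertices, 2)]


def hub_vs_dense_pairs(n_agents, n_hubs=1):
    """Count agent-level attention links under an assumed, simplified cost model."""
    dense = n_agents * (n_agents - 1)  # every agent attends to every other agent
    hub = 2 * n_agents * n_hubs        # each agent writes to and reads from the hubs
    return dense, hub


vertices = simplex_vertices(4)
distances = pairwise_distances(vertices)  # six equal values for four agents
```

Because every pair of vertices is equally far apart, relabelling the agents changes nothing about the geometry, which is the permutation-equivalence property the abstract describes. Under the counting model, the dense link count grows with the square of the number of agents while the hub count grows linearly.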

2605.28811 2026-05-28 cs.CV 版本更新

HarmoVid: Relightful Video Portrait Harmonization

HarmoVid: 可重光照的视频人像协调

Jun Myeong Choi, Jae Shin Yoon, Luchao Qi, Roni Sengupta, Joon-Young Lee

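Affiliations: University of North Carolina at Chapel Hill; Adobe Research

AI summary: Proposes a method built on a video diffusion model and a lighting deflickering model that harmonizes a foreground video with a target background scene in shadows, color tone, and illumination intensity, addressing temporal jitter in video.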

Comments CVPR 2026

AI-generated Chinese abstract

我们提出了一种方法,用于协调前景视频的光照以匹配目标背景场景,调整阴影、色调和光照强度(可重光照协调)。与图像不同,获取视频的标注数据(即相同运动在不同光照条件下记录)实际上不可行且不可扩展。虽然创建这种配对数据的一种方法是将现有的基于图像的协调模型逐帧应用于视频,但得到的输出常常遭受显著的时域抖动。我们通过引入一种新颖的光照去闪烁模型来克服这个问题,该模型可以稳定全局和局部的光照闪烁伪影。我们的视频扩散模型从这些升级后的去闪烁数据中学习,结合大量真实和合成视频,生成高质量的视频协调结果。我们进一步提出了一种非对称alpha掩膜调节技术,以从真实视频中学习干净的边界。实验表明,与先前的基于图像和基于视频的协调方法相比,我们的模型实现了强时域一致性、自然性、更干净的边界和物理上有意义的光照行为,同时保持了强大的重光照表现力。

English abstract

We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitters. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.

2605.28809 2026-05-28 cs.CV cs.LG 版本更新

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

AREA: 基于CLIP的类增量学习中的属性提取与聚合

Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

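Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China

AI summary: Proposes AREA, which stabilizes attribute extraction with principal geodesic analysis, stabilizes attribute aggregation with lightweight task-specific experts regularized by a variational information bottleneck, and routes at inference via optimal transport, to address catastrophic forgetting in CLIP-based class-incremental learning.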

Comments Accepted to ICML 2026. Code is available at https://github.com/LAMDA-CL/ICML2026-AREA

AI-generated Chinese abstract

类增量学习(CIL)在构建真实世界学习系统中至关重要。在基于CLIP的CIL中,模型通过比较从模板提示(例如,“一张[类别]的照片”)获得的视觉和文本嵌入之间的相似性来执行分类。这种看似单一的匹配过程可以分解为两个概念上不同的阶段:属性提取和属性聚合。例如,模型可能通过毛皮纹理和胡须等属性识别猫。当学习新类别(如汽车)时,模型必须提取额外属性(如轮子),并调整它们在共享表示空间中的聚合方式。然而,由于只有当前任务的数据可用,增量更新可能使属性提取和聚合偏向新类别,导致灾难性遗忘。因此,我们提出了AREA,用于基于CLIP的CIL中的属性提取和聚合。为了稳定提取,我们通过主测地线分析将类别级视觉和文本属性锚定在超球面嵌入空间上。为了稳定聚合,我们学习轻量级任务特定专家,并带有评分和残差细化,通过变分信息瓶颈目标进行正则化。在推理时,我们通过最优传输在任务属性流形上进行路由,以实现更简洁的预测。实验表明,AREA持续优于最先进的方法。代码可在https://github.com/LAMDA-CL/ICML2026-AREA获取。

English abstract

Class-Incremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing similarity between visual and textual embeddings obtained from template prompts, e.g., "a photo of a [CLASS]". This seemingly monolithic matching process can be decomposed into two conceptually distinct stages: attribute extraction and attribute aggregation. For example, a model may recognize cat using attributes such as fur texture and whiskers. When learning a new class like car, the model must extract additional attributes like wheels and adjust how they are aggregated in the shared representation space. However, since only data from the current task is available, incremental updates can bias both attribute extraction and aggregation toward new classes, leading to catastrophic forgetting. Therefore, we propose AREA for attribute extraction and aggregation in CLIP-based CIL. To stabilize extraction, we anchor class-level visual and textual attributes on the hyperspherical embedding space via principal geodesic analysis. To stabilize aggregation, we learn lightweight task-specific experts with scoring and residual refinement, regularized by a variational information bottleneck objective. During inference, we perform routing over task attribute manifolds via optimal transport for more concise prediction. Experiments show that AREA consistently outperforms SOTA methods. Code is available at https://github.com/LAMDA-CL/ICML2026-AREA.
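The prompt-matching step that the abstract starts from can be sketched as follows. This is only the generic CLIP-style classification rule (pick the class whose prompt embedding has the highest cosine similarity to the image embedding), not AREA itself. The three-dimensional vectors are made-up stand-ins for encoder outputs, and `classify` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_embedding, text_embeddings):
    """Pick the class whose prompt embedding is most similar to the image embedding."""
    scores = {name: cosine(image_embedding, emb) for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy vectors standing in for real encoder outputs (made up for illustration).
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],  # stands in for the embedding of "a photo of a cat"
    "car": [0.0, 0.2, 0.9],  # stands in for the embedding of "a photo of a car"
}
label, scores = classify([0.8, 0.3, 0.1], text_embeddings)
```

In a class-incremental setting, new entries are added to `text_embeddings` over time, and any update to the encoders that shifts the old entries is what produces the forgetting the abstract describes.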

2605.28806 2026-05-28 cs.CV cs.CL cs.IR 版本更新

Personal Visual Memory from Explicit and Implicit Evidence

来自显式和隐式证据的个人视觉记忆

Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

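Affiliations: Johns Hopkins University; University of Wisconsin-Madison; Adobe Research

AI summary: Introduces a personal visual memory benchmark and VisualMem, a hybrid visual-text architecture that strengthens the long-term memory of AI agents with explicit and implicit visual evidence, substantially outperforming prior memory systems on the benchmark.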

Comments Project Page: https://viettmab.github.io/visualmem-page/

AI-generated Chinese abstract

长期记忆对于个性化AI代理越来越重要,然而现有的基准和方法仍然主要以文本为中心。即使包含图像,后续问题所需的用户特定信息通常仅从文本中即可恢复,并且大多数记忆系统将图像轮次简化为通用描述。然而,图像通常携带文本很少陈述的个人信息——包括显式证据(如重复出现的用户相关实体)和隐式证据(如从视觉或多模态线索推断出的潜在用户事实)。我们引入了一个针对这两种证据形式的个人视觉记忆基准,并提出了VisualMem,一种混合视觉-文本架构,通过结构化个人视觉记忆模块增强文本记忆后端。VisualMem不是将图像压缩为描述,而是利用对话上下文来解析身份、所有权和持久的用户事实。实验表明,VisualMem在我们的基准上显著优于先前的记忆系统,同时在标准文本记忆基准上保持竞争力,这表明个人视觉记忆是个性化AI代理长期记忆中一个独特且重要的组成部分。

English abstract

Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states: both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual-text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.

2605.28805 2026-05-28 cs.CL cs.AI cs.CV cs.LG 版本更新

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

OmniVerifier-M1: 具有显式结构化重校准的多模态元验证器

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

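Affiliations: Tsinghua University; Pennsylvania State University; University of Southern California; Microcyto; Princeton University

AI summary: Proposes OmniVerifier-M1, which uses symbolic meta-verification (e.g., bounding boxes) and decoupled reinforcement learning to provide reliable, fine-grained verification for multimodal large language models and to enable dynamic region-level self-correction.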

Comments ICML 2026. Project: https://github.com/Cominclip/OmniVerifier

AI-generated Chinese abstract

视觉结果日益成为多模态大语言模型的核心,因此可靠且细粒度的验证对于扩展通用基础模型至关重要。在这项工作中,我们研究了多模态元验证,它利用验证器生成的推理过程而非仅决策信号,并探索如何有效地将元验证反馈纳入多模态验证器训练。我们发现了两个关键发现。首先,符号化验证器输出(例如边界框)作为元验证推理过程优于文本解释,能够实现高效的基于规则的强化学习奖励,同时避免依赖来自辅助评判模型的基于模型的奖励。其次,解耦二元判断和元验证的强化学习目标显著优于联合奖励优化,这是由于输出结构和学习动态的内在差异。基于这些见解,我们训练了OmniVerifier-M1,一个利用符号化元验证和解耦强化学习的通用视觉验证器。OmniVerifier-M1提供稳健的验证和细粒度的错误定位,并进一步实现了M1-TTS,一个由验证器驱动的智能体生成系统,实现动态区域级自校正。这种方法为更可靠、可解释和细粒度的多模态验证铺平了道路,支持更安全、更可控的基础模型部署。

English abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
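The abstract argues that bounding-box outputs allow rule-based rewards without an auxiliary judge model. A minimal sketch of what such a rule could look like is given below. The intersection-over-union computation is standard; the thresholded `box_reward` and its default of 0.5 are assumptions for illustration, since the abstract does not state the reward used in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def box_reward(predicted, reference, threshold=0.5):
    """Hypothetical rule-based reward: 1.0 when the predicted box overlaps enough."""
    return 1.0 if iou(predicted, reference) >= threshold else 0.0
```

A rule of this kind is cheap to evaluate and deterministic, which is the practical advantage over scoring free-text explanations with another model.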

2605.28803 2026-05-28 cs.CV cs.LG 版本更新

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

Ω-QVLA: 通过复合旋转和逐步缩放实现视觉-语言-动作模型的鲁棒量化

Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu, Kaicheng Yang, Ziyu Zhao, Yufei Cui, Xiao-Wen Chang, Peng Lu

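Affiliations: McGill University; Université de Montréal; Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University; Mila – Quebec AI Institute

AI summary: Proposes Ω-QVLA, described as the first training-free post-training quantization framework that compresses both the language backbone and the diffusion action head of a VLA model to uniform W4A4 precision, using a composite SVD-Hadamard rotation and per-step DiT activation scaling. On LIBERO it matches or exceeds FP16 performance while reducing the static memory footprint by 71.3%.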

AI-generated Chinese abstract

视觉-语言-动作(VLA)模型将感知、推理和控制统一在单个策略中,但其数十亿参数的骨干网络和基于扩散的动作头使得设备端部署成本过高。先前的量化工作仅提供部分解决方案,压缩LLM骨干网络而保留DiT动作头为全精度,或采用混合精度方案,原因是认为统一量化动作头本质上不稳定。我们通过Ω-QVLA挑战了这一假设,这是首个无需训练的后训练量化框架,将VLA模型的语言骨干和整个扩散动作头统一压缩至W4A4精度,无需混合精度分配。Ω-QVLA结合了复合SVD-Hadamard旋转(均衡每通道权重能量并扩散残差激活异常值)与逐步DiT激活缩放量化(吸收去噪步骤中的动态范围漂移)。在LIBERO上,Ω-QVLA将Pi 0.5和GR00T N1.5压缩至W4A4,任务成功率分别为98.0%和87.8%,匹配或超越其FP16参考值(97.1%和87.0%),同时静态内存占用减少71.3%。真实世界操作实验进一步证实了平滑、精确的操作,而先前方法失败。代码可在https://github.com/UCMP13753/Omega-QVLA获取。

English abstract

Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solutions, compressing the LLM backbone while leaving the DiT action head at full precision, or resorting to mixed-precision schemes, driven by the belief that uniformly quantizing the action head is inherently unstable. We challenge this assumption with Omega-QVLA, the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation. Omega-QVLA combines a composite SVD-Hadamard rotation that equalizes per-channel weight energy while diffusing residual activation outliers with per-step DiT activation scaling quantization that absorbs dynamic-range drift across denoising steps. On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world manipulation experiments further confirm smooth, accurate manipulation where prior methods fail. Code is available at https://github.com/UCMP13753/Omega-QVLA.
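Why a rotation helps 4-bit quantization can be seen in a toy example. The sketch below covers only a plain orthonormal Walsh-Hadamard rotation followed by symmetric per-tensor int4 quantization. It leaves out the SVD component and the per-step DiT scaling described in the abstract, and the activation values are made up.

```python
import math


def hadamard_rotate(x):
    """Orthonormal Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly step
        h *= 2
    return [v / math.sqrt(n) for v in y]


def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantization; returns integer codes and the scale."""
    scale = (max(abs(v) for v in x) / 7.0) or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-8, min(7, round(v / scale))) for v in x]
    return codes, scale


activations = [8.0, 0.1, -0.2, 0.1, 0.05, -0.1, 0.2, -0.05]  # one outlier channel
rotated = hadamard_rotate(activations)
codes, scale = quantize_int4(rotated)
```

The rotation preserves the vector's norm but spreads the outlier's energy across all channels, so the peak magnitude that sets the quantization scale drops. Since the orthonormal transform is its own inverse, applying `hadamard_rotate` again to dequantized values maps them back to the original basis.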

2605.28780 2026-05-28 cs.CV cs.LG 版本更新

Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

偏差留下梯度痕迹:基于概念分解的梯度探针实现无标签偏差识别

Thomas Vitry, Kieran Edgeworth, Stefan Wermter, Jae Hee Lee

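Affiliations: University of Hamburg; Ecole Normale Superieure de Rennes

AI summary: Proposes a bias-label-free, post-hoc method that extracts concept vectors with non-negative matrix factorization and uses gradient signals from misclassified examples to identify spurious correlations in vision models, improving worst-group accuracy without retraining.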

Comments Accepted to the 49th German Conference on Artificial Intelligence (KI2026)

AI-generated Chinese abstract

视觉分类器可能利用虚假关联,在分布内取得高准确率但在分布偏移下失败。现有的偏差缓解和分析方法通常依赖于精心策划的数据集、虚假属性或组标签,或重新训练,这在模型部署后或相关偏差未知时可能不可行。我们提出一种无需偏差标签的后处理方法,用于识别冻结视觉模型中的虚假概念,仅依赖于来自保留审计数据集的标准类标签。对于每个目标类,我们从预测为该类的输入中收集补丁,并对中间激活应用非负矩阵分解,以获得可解释的概念向量库。然后,通过从误分类示例的反向传播梯度与这些概念的相互作用导出的偏差估计器对候选概念进行排序:偏差概念在纠正假阴性时倾向于被激活,而在纠正假阳性时被抑制。在Colored MNIST和Waterbirds上,该方法恢复了与已知虚假线索一致的概念;在CelebA上,它揭示了仅部分与注释性别属性重合的决策相关方向;在推理时抑制排名靠前的概念,无需任何重新训练或参数更新,即可将Waterbirds的最差组准确率提高最多17.9个百分点,CelebA提高10.4个百分点。我们的方法识别出不一定与注释属性重合的决策相关虚假方向,为冻结视觉模型提供了可解释的审计工具和可操作的去偏处理。代码可在https://github.com/vitryt/label-free-bias-identification获取。

English abstract

Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models, relying only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to get activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute; suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Our method identifies decision-relevant spurious directions that need not coincide with annotated ones, providing both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at https://github.com/vitryt/label-free-bias-identification.
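The headline numbers are reported as worst-group accuracy, which is the accuracy of the group on which the model does worst. The sketch below shows how that metric is computed. The records are made-up predictions over the four Waterbirds-style (bird type, background) groups, not results from the paper.

```python
from collections import defaultdict


def worst_group_accuracy(records):
    """records: iterable of (group, is_correct). Returns (worst accuracy, per-group dict)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    per_group = {g: hits[g] / totals[g] for g in totals}
    return min(per_group.values()), per_group


# Made-up predictions for illustration only.
records = (
    [(("waterbird", "water"), True)] * 9 + [(("waterbird", "water"), False)] * 1
    + [(("waterbird", "land"), True)] * 3 + [(("waterbird", "land"), False)] * 2
    + [(("landbird", "land"), True)] * 10
    + [(("landbird", "water"), True)] * 4 + [(("landbird", "water"), False)] * 1
)
worst, per_group = worst_group_accuracy(records)
```

In this toy data the overall accuracy is 26 out of 30, yet the minority group of waterbirds on land backgrounds reaches only 3 out of 5, which is the kind of gap that average accuracy hides.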

2605.28779 2026-05-28 cs.CL cs.CV 版本更新

The Abstraction Gap in Vision-Language Causal Reasoning

视觉-语言因果推理中的抽象差距

Chinh Hoang, Mohammad Rashedul Hasan

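Affiliations: Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA

AI summary: Addresses the conflation of linguistic fluency with faithful causal reasoning in the causal explanations of vision-language models (VLMs). Introduces a dual-probe methodology and the Abstraction Gap (AG) metric; evaluation on the CAGE benchmark finds a large AG in seven of eight models, while one model reaches near-zero AG, which the authors attribute to pretraining and architectural choices.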

AI-generated Chinese abstract

视觉-语言模型(VLM)能生成流畅的因果解释,但当前的评估无法区分语言合理性与忠实因果推理。我们提出一种双探针方法来分离这些属性。文本探针测量语言质量。链式文本探针要求模型首先生成显式因果链。抽象差距(AG)指标量化归一化的性能差异。在CAGE(因果抽象差距评估)基准上评估八个VLM,该基准包含跨越Pearl因果层次的5,500张图像上的49,500个问题,我们发现七个模型的AG超过0.50,文本得分为6-8,但链式得分低于2.5。在45,000个链式标注样本上进行微调未能缩小差距。然而,一个模型实现了接近零的AG。该能力存在于当前VLM架构中,并取决于预训练和架构选择。CAGE为评估VLM中的忠实因果推理提供了诊断工具。

English abstract

Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6-8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.

2605.28741 2026-05-28 cs.CV 版本更新

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

自预言解码以解锁LVLM中的视觉搜索

Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang

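Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; ShanghaiTech University

AI summary: Proposes SeProD, a self-prophetic decoding framework that draws on the intrinsic single-step capabilities of the pre-training model to strengthen coherent multi-step reasoning in LVLM visual search in a training-free, plug-and-play manner, with consistent gains across all 12 splits of 4 visual search benchmarks.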

Comments Accepted at ICML 2026

AI-generated Chinese abstract

大型视觉语言模型(LVLM)正迅速向真正的多模态推理发展,视觉搜索代表了“用图像思考”范式的具体实例。然而,LVLM视觉搜索面临两个关键挑战:后训练后内在能力之间的不兼容性,以及长多步推理上下文中的干扰。为解决这些问题,我们提出了两个新颖的见解。首先,预训练和后训练LVLM之间的自我调节利用了预训练模型的内在单步能力,以减轻能力退化和长上下文干扰。其次,基于概率的预言采样取代了简单的提示,提供了一个概率接口,其中预训练模型充当预言家,后训练模型在其输出分布下选择性地接受预言令牌,从而保持连贯的多步推理。基于这些见解,我们引入了SeProD,一个自预言解码框架,它利用内在的单步能力以无训练、即插即用的方式实现连贯的多步推理。实验表明,由于并行的预言接受机制,SeProD在4个视觉搜索基准的所有12个分割以及通用VQA基准上一致地提升了多个视觉搜索LVLM的性能,且没有增加计算开销。

English abstract

Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.

2605.28735 2026-05-28 cs.CV 版本更新

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

SeeGroup: 通过自确定分组的透明表面多层深度估计

Hongyu Wen, Jia Deng

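Affiliations: Department of Computer Science, Princeton University

AI summary: Proposes SeeGroup, which models multi-layer depth as a point process and uses a permutation-invariant loss so the model can group surfaces into layers adaptively, raising quadruplet relative depth accuracy on the LayeredDepth benchmark from 61.34% to 70.09%.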

AI-generated Chinese abstract

透明物体在日常生活中很常见,理解其多层深度(包括透明表面及其背后的物体)非常重要。现有的多层深度方法通常扩展单层预测,通过3D点的前后顺序定义层并顺序预测。然而,由于分层几何允许将3D点分组为多个有效层,预定义的分组策略本质上是受限的。在这项工作中,我们提出了SeeGroup,一种避免施加预定义分组并允许模型自适应地将表面分配到深度图的多层深度估计方法。我们将逐像素多层深度公式化为一个点过程,将深度层视为沿每条相机射线的无序事件。这引出了观测深度层上的置换不变似然,产生了一个自然支持任意层分组的损失函数。实验表明,我们的方法显著推进了多层深度估计的最新水平,在LayeredDepth基准上将四重相对深度准确率从61.34%提升至70.09%。代码可在https://github.com/princeton-vl/SeeGroup获取。

English abstract

Transparent objects are common in daily life, and it is important to understand their multilayer depth, including the transparent surface and the objects behind it. Existing methods for multilayer depth typically extend single-layer prediction. They define layers by the front-to-back ordering of 3D points and predict the layers sequentially. However, as layered geometry can admit multiple valid groupings of 3D points into layers, a predefined grouping strategy is inherently restrictive. In this work, we propose SeeGroup, a multi-layer depth estimation method that avoids imposing a predefined grouping and allows the model itself to adaptively assign surfaces to depth maps. We formulate per-pixel multi-layer depth as a point process, treating depth layers as unordered events along each camera ray. This induces a permutation-invariant likelihood over the observed depth layers, yielding a loss that naturally supports arbitrary layer groupings. Experiments demonstrate that our method significantly advances the state of the art of multi-layer depth estimation, improving quadruplet relative depth accuracy on LayeredDepth benchmark from 61.34% to 70.09%. Code is available at https://github.com/princeton-vl/SeeGroup.
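The key property claimed above is that the loss does not depend on the order in which depth layers are listed along a ray. The sketch below shows that property with a simpler stand-in: a brute-force minimum over all assignments of predicted depths to observed depths. It is not the point-process likelihood from the paper, and `permutation_invariant_l1` is a hypothetical helper name.

```python
from itertools import permutations


def permutation_invariant_l1(predicted, observed):
    """Smallest mean absolute error over all assignments of predictions to layers."""
    if len(predicted) != len(observed):
        raise ValueError("layer counts must match")
    if not predicted:
        return 0.0
    return min(
        sum(abs(p - o) for p, o in zip(perm, observed)) / len(observed)
        for perm in permutations(predicted)
    )


# Depths along one camera ray, listed in two different orders (made-up values).
loss_swapped = permutation_invariant_l1([3.0, 1.0], [1.0, 3.0])
loss_three = permutation_invariant_l1([1.0, 2.5, 4.0], [4.0, 1.0, 2.0])
```

The brute-force search is factorial in the number of layers, so it is only practical for the handful of layers a single ray usually crosses.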

2605.28697 2026-05-28 eess.IV cs.AI cs.CV 版本更新

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

深度学习应变估计:基于物理的模拟是解决方案吗?

Thierry Judge, Nicolas Duchateau, Andreas Østvik, Khuram Faraz, Anders Austlid Taskén, Sigve Karlsen, Thor Edvardsen, Harald Brunvand, Md Abulkalam Azad, Havard Dalen, Bjørnar Grenne, Gabriel Kiss, Pierre-Yves Courand, Lasse Lovstakken, Pierre-Marc Jodoin, Olivier Bernard

Affiliations: Dept. of Computer Science, University of Sherbrooke; INSA, Université Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS; Institut Universitaire de France (IUF); Cardiology Dept., Hôpital Croix-Rousse, Hospices Civils de Lyon; Cardiology Dept., Hôpital Lyon Sud, Hospices Civils de Lyon; Dept. of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology (NTNU); Dept. of Circulation and Medical Imaging, NTNU; Department of Medicine, Hospital of Southern Norway, Arendal, Norway; Dept. of Cardiology and Cardiothoracic Surgery, St. Olavs Hospital, Trondheim, Norway; Dept. of Health Research, SINTEF Digital, Trondheim, Norway; Dept. of Medicine, Levanger Hospital, Nord-Trøndelag Hospital Trust, Levanger, Norway; Dept. of Cardiology, Oslo University Hospital, Rikshospitalet and the Faculty of Medicine, University of Oslo, Norway

AI summary: Addresses the lack of reliable motion references for strain estimation in echocardiography with a simulation strategy that combines speckle decorrelation measures from real videos with iterative refinement. The resulting realistic dataset is used to train a motion estimation algorithm that reaches a GLS variability of 1.42% in an inter-expert setting, against 1.78% for the clinical reference.

Comments 10 pages

AI-generated Chinese abstract

斑点追踪超声心动图(STE)是心肌应变估计的临床标准。尽管在全局应变(GLS)上表现良好,但其区域应变的准确性仍然有限,尽管这一生物标志物对于早期诊断和表征细微异常高度相关。深度学习是一种有前景的替代方案,但其发展受到缺乏可靠运动参考的限制。现有解决方案要么依赖于STE衍生的标签,要么依赖于基于物理模型生成的模拟,但这些合成序列与临床数据相比仍缺乏足够的真实性。在本文中,我们提出了一种新的模拟策略,该策略结合了来自真实视频的散斑去相关测量,并使用迭代细化过程来改善模拟中的运动真实性。我们创建了一个包含1,478个视频及其参考运动的开源逼真数据集,用于训练超声心动图运动估计算法。所提出的方法在全局和区域应变上实现了无与伦比的性能,特别是在专家间设置中,GLS变异性达到1.42%,而临床参考为1.78%。

English abstract

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.
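For readers unfamiliar with the quantity being estimated, strain is the relative change in length of a myocardial segment with respect to a reference frame. The sketch below computes that standard Lagrangian definition from a sequence of contour lengths. The lengths are made up, and the motion-estimation step that would produce them from video is outside its scope.

```python
def lagrangian_strain_percent(length, reference_length):
    """Strain (%) of a segment relative to its reference (e.g. end-diastolic) length."""
    if reference_length <= 0:
        raise ValueError("reference length must be positive")
    return 100.0 * (length - reference_length) / reference_length


def peak_strain_percent(lengths):
    """Most negative strain over a cycle, using the first frame as the reference."""
    reference = lengths[0]
    return min(lagrangian_strain_percent(value, reference) for value in lengths)


# Made-up contour lengths (mm) over one cardiac cycle, starting at end-diastole.
contour_lengths = [100.0, 92.0, 81.0, 88.0, 99.0]
peak = peak_strain_percent(contour_lengths)
```

Shortening gives negative values, so the peak is taken as the minimum. Applying the same function to the whole contour gives a global value, and applying it per segment gives the regional values the abstract refers to.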

2605.28691 2026-05-28 cs.CV 版本更新

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

OSP-Next: 结合稀疏序列并行、HiF8量化和强化学习的高效高质量视频生成

Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan

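Affiliations: Peking University; Nanyang Technological University, Singapore

AI summary: Proposes OSP-Next, a text-to-video generation model that combines a hybrid full-sparse attention architecture, Sparse Sequence Parallelism (SSP), HiF8 quantization, and Mix-GRPO post-training to improve efficiency while preserving quality, with reported speedups of up to 1.64× on a single NVIDIA H200 GPU and of 1.69× and 2.27× on a single Ascend 950PR.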

AI-generated Chinese abstract

扩散Transformer在视频生成中取得了高质量,但全注意力的二次成本限制了效率。我们提出OSP-Next,一种高效的文本到视频生成模型,集成了稀疏注意力、并行、量化和强化学习。OSP-Next采用混合全稀疏注意力架构,其中稀疏组件通过Skiparse-2D注意力实现。这种固定模式机制沿空间维度应用逐token和逐组的稀疏注意力,利用局部性同时保持与FlashAttention内核的原生兼容性。基于Skiparse-2D注意力中重排的局部等价性,我们进一步提出稀疏序列并行(SSP),它将子序列划分到多个rank,并通过一次All-to-All通信切换稀疏模式。与Ulysses序列并行(SP)相比,SSP为稀疏注意力提供了原生并行策略,并将通信量减少了75%。OSP-Next还引入了HiF8量化,以实现8位量化和稀疏微调的稳定联合训练,并应用Mix-GRPO后训练来提升稀疏模型的性能。实验表明,OSP-Next的VBench总得分为83.73%,超过了Wan2.1基线。在5秒720P和5秒768P设置下,OSP-Next在NVIDIA H200 GPU上实现了高达1.64倍的单GPU加速和超过1.52倍的八GPU加速。此外,在VBench总分仅下降0.4%的情况下,OSP-Next-HiF8在单个Ascend 950PR上分别实现了1.69倍和2.27倍的加速,展示了OSP-Next跨硬件平台的效率和性能。

English abstract

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

2605.28630 2026-05-28 cs.CV cs.MM 版本更新

EntroAD: Structural Entropy-Guided Prompt Adaptation for Zero-Shot Anomaly Detection

EntroAD: 结构熵引导的提示自适应用于零样本异常检测

Xinyu Zhao, Qingyun Sun, Jiayi Luo, Jianxin Li

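Affiliations: Beihang University

AI summary: Proposes EntroAD, a zero-shot anomaly detection framework that combines structural-entropy-guided dynamic token routing with confidence-aware dual-branch prompt adaptation, reporting state-of-the-art performance in cross-dataset settings.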

AI-generated Chinese abstract

零样本异常检测(ZSAD)旨在无需目标域适应的情况下检测未见域中的异常。最近的基于CLIP的方法通过利用提示学习和视觉-文本对齐展示了有前景的性能。然而,大多数现有方法依赖于单一适应路径,这可能不足以处理跨域的异质异常模式。在实践中,异常表现出截然不同的特征,从显著、局部的结构破坏到微妙、扩散且不规则的变异。为了解决这一挑战,我们提出了EntroAD,一种结构熵引导的零样本异常检测框架。与以往方法不同,EntroAD引入了一种动态路由机制,通过专门的适应策略处理不同类型的异常。具体地,我们从自注意力诱导的补丁关系中估计补丁级结构熵,并将其作为关系不确定性的代理来指导异常感知的令牌路由。基于该路由信号,我们构建异常感知的路由令牌,以更好地捕捉具有不同结构特征的异常线索。我们进一步引入了一个置信度感知的双分支提示自适应模块,以稳定视觉-文本对齐,同时保留CLIP的可迁移先验。在10个工业和医学基准上的大量实验表明,EntroAD在具有挑战性的跨数据集ZSAD设置中达到了最先进的性能。

英文摘要

Zero-Shot Anomaly Detection (ZSAD) aims to detect anomalies in unseen domains without target-domain adaptation. Recent CLIP-based methods have shown promising performance by leveraging prompt learning and visual-text alignment. However, most existing approaches rely on a single adaptation pathway, which may be insufficient for heterogeneous anomaly patterns across domains. In practice, anomalies exhibit vastly different characteristics, ranging from salient, localized structural disruptions to subtle, diffuse, and irregular variations. To address this challenge, we propose EntroAD, a structural entropy-guided zero-shot anomaly detection framework. Unlike previous methods, EntroAD introduces a dynamic routing mechanism to process different types of anomalies with specialized adaptation strategies. Specifically, we estimate patch-level structural entropy from self-attention-induced patch relations and use it as a proxy for relational uncertainty to guide anomaly-aware token routing. Based on this routing signal, we construct anomaly-aware routed tokens to better capture anomaly cues with different structural characteristics. We further introduce a confidence-aware dual-branch prompt adaptation module to stabilize visual-text alignment while preserving CLIP's transferable prior. Extensive experiments on 10 industrial and medical benchmarks show that EntroAD achieves state-of-the-art performance in challenging cross-dataset ZSAD settings.
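The routing idea in this abstract (score each patch by how uncertain its relations to other patches are, then send it down a matching pathway) can be illustrated with a small sketch. This is not EntroAD's method: the abstract does not define its structural entropy, so the Shannon entropy of each patch's attention row is used here as an assumed stand-in, and the function names, the two route labels, and the threshold are all hypothetical.

```python
import math

def row_softmax(scores):
    """Turn one row of raw patch-to-patch scores into attention weights."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def patch_entropy(attention_row):
    """Shannon entropy (in nats) of one patch's attention distribution."""
    return -sum(p * math.log(p) for p in attention_row if p > 0.0)

def route_patches(score_matrix, threshold):
    """Label each patch 'diffuse' (high entropy) or 'focused' (low entropy).

    The labels and the threshold are illustrative assumptions only.
    """
    routes = []
    for scores in score_matrix:
        entropy = patch_entropy(row_softmax(scores))
        routes.append("diffuse" if entropy > threshold else "focused")
    return routes

# Toy 3-patch example: patch 0 attends almost only to itself,
# while patches 1 and 2 spread their attention evenly.
toy_scores = [
    [8.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.5, 0.5, 0.5],
]
toy_routes = route_patches(toy_scores, threshold=0.5)
```

A uniform attention row reaches the maximum entropy, log(N) for N patches, so the threshold would in practice need to scale with the number of patches.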

2605.28619 2026-05-28 cs.CV nlin.AO 版本更新

A Multiscale Kinetic Framework for Image Segmentation: From Particle Systems to Continuum Models

图像分割的多尺度动力学框架:从粒子系统到连续模型

Horacio Tettamanti, Giulia Guicciardi, Mattia Zanella

AI总结 提出一种基于共识的多尺度动力学框架,通过将图像视为粒子系统并推导动力学方程与宏观模型,结合粒子优化实现图像分割。

Comments 26 pages, 34 figures

详情
AI中文摘要

在这项工作中,我们提出了一种用于基于共识的图像分割的多尺度动力学框架。通过将图像解释为相互作用的粒子系统,每个像素由其空间位置和编码颜色信息的内部特征来表征。我们引入了一个耦合相互作用方案,控制粒子在位置和特征空间中的演化,由此推导出空间-特征域中粒子密度的动力学公式,结合了输运、聚集和扩散效应。此外,通过适当的缩放,我们获得了一个一阶宏观模型,描述携带关于具有特定特征的像素分数信息的像素分数的演化。基于这个简化复杂度的模型,我们提出了一种数据导向的方法,利用基于粒子的优化技术进行精确的图像分割。数值测试显示了所提出框架的有效性及其在不同噪声条件下的鲁棒性。

英文摘要

In this work, we present a multiscale kinetic framework for consensus-based image segmentation. By interpreting an image as a system of interacting particles, each pixel is characterised by its spatial position and an internal feature encoding color information. We introduce a coupled interaction scheme governing the evolution of particles in both position and feature spaces, from which we derive a kinetic formulation for the particle density in the space-feature domain combining transport, aggregation, and diffusion effects. Furthermore, through a suitable scaling, we obtain a first-order macroscopic model describing the evolution of the fraction of pixels carrying information on the fraction of pixels having a certain feature. Based on this reduced-complexity model, we present a data-oriented approach where we make use of particle-based optimisation techniques for the accurate segmentation of images. Numerical tests show the effectiveness of the proposed framework and its robustness under different noise conditions.

2605.28615 2026-05-28 cs.CV 版本更新

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

基于区域感知双模态直接偏好优化的组合式文本到图像生成

Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu

发表机构 * Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University(上海市智能信息处理重点实验室,复旦大学计算机学院) Shanghai Collaborative Innovation Center of Intelligent Visual Computing(上海智能视觉计算协同创新中心)

AI总结 提出BiDPO框架,通过构建大规模偏好数据集BiComp和扩展Diffusion DPO联合优化图像与文本偏好,结合区域级引导方法,提升文本到图像模型对复杂组合提示的生成保真度。

详情
AI中文摘要

尽管文本到图像(T2I)模型取得了快速进展,但生成准确反映复杂组合提示(涵盖属性绑定、对象关系、计数)的图像仍然具有挑战性。为了解决这个问题,我们提出了BiDPO,一个增强T2I模型组合式文本到图像生成能力的框架。我们首先引入一个精心设计的流程,构建大规模偏好数据集BiComp,并进行严格的质量控制。然后,我们将Diffusion DPO扩展到联合优化图像和文本偏好,这在提高模型遵循复杂文本提示生成方面被证明非常有效。为了进一步增强模型的细粒度对齐,我们采用区域级引导方法,聚焦于与组合概念相关的区域。实验结果表明,我们的BiDPO显著提高了组合保真度,在多个基准测试中持续优于先前方法。我们的方法突显了基于偏好微调在复杂文本到图像任务中的潜力,为现有技术提供了一种灵活且可扩展的替代方案。

英文摘要

Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, and counting) remains challenging. To address this, we propose BiDPO, a framework that enhances T2I models' capability for compositional text-to-image generation. We begin by introducing a carefully designed pipeline to construct a large-scale preference dataset, BiComp, with strict quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which proves highly effective at improving the models' ability to follow complex text prompts during generation. To further enhance the models' fine-grained alignment, we employ a region-level guidance method that focuses on regions relevant to compositional concepts. Experimental results demonstrate that BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.

2605.28609 2026-05-28 cs.CV 版本更新

JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models

JECA^2: 面向取证视觉语言模型的判断-解释一致对抗攻击

Jiachen Qian

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 针对取证视觉语言模型,提出一种白盒对抗攻击方法JECA^2,通过Grad-CAM引导的视觉扰动和令牌邻近约束的文本嵌入优化,实现判断与解释的一致性,实验表明攻击成功率和一致性优于基线。

Comments 37 pages, 6 figures. Includes supplementary material

详情
AI中文摘要

取证视觉语言模型(VLM)最近被开发用于检测图像篡改并提供自然语言解释。然而,它们对抗对抗性操纵的鲁棒性仍未得到充分探索。现有的对抗攻击通常旨在翻转模型的二元判断,而伴随的解释可能仍然揭示取证线索并与被攻击的判断相矛盾。在本文中,我们研究了针对取证VLM的判断-解释一致对抗攻击,并提出了JECA^2,一种受控的白盒红队诊断方法,它联合重定向视觉注意力并将文本解释与目标判断对齐。在视觉方面,JECA^2使用Grad-CAM引导的扰动将注意力从篡改区域转移到良性区域。在文本方面,它在令牌邻近约束下优化提示嵌入,使其朝向真实性肯定的语义。在取证VLM基准上的实验表明,在白盒威胁设置下,JECA^2比实现的基线实现了更高的攻击成功率和自动判断-解释一致性,而迁移到闭源VLM仍然可测量但有限。我们的结果突显了基于解释的取证VLM中的一致性失败模式,并激励了超越二元检测准确性的未来鲁棒性评估。

英文摘要

Forensic vision-language models (VLMs) have recently been developed to detect image tampering and provide natural-language explanations. However, their robustness against adversarial manipulation remains underexplored. Existing adversarial attacks typically aim to flip the model's binary judgment, while the accompanying explanation may still reveal forensic cues and contradict the attacked judgment. In this paper, we study judgment-explanation consistent adversarial attacks against forensic VLMs and propose JECA^2, a controlled white-box red-team diagnostic that jointly redirects visual attribution and aligns textual explanations with the target judgment. On the visual side, JECA^2 uses Grad-CAM-guided perturbations to divert attribution from tampered regions toward benign regions. On the textual side, it optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint. Experiments on forensic VLM benchmarks show that JECA^2 achieves higher attack success and automated judgment-explanation consistency than implemented baselines under white-box threat settings, while transfer to closed-source VLMs remains measurable but limited. Our results highlight a consistency failure mode in explanation-based forensic VLMs and motivate future robustness evaluation beyond binary detection accuracy.

2605.28605 2026-05-28 cs.CV 版本更新

Internally Referenced Low-Light Enhancement

内部参考的低光增强

Peiyuan He, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

发表机构 * College of Intelligence and Computing, Tianjin University, Tianjin 300350, China(天津大学智能与计算学部,天津 300350,中国)

AI总结 提出一种内部参考低光增强框架,通过从退化输入中提取物理和结构参考,结合局部曝光模拟、双域保持策略和增益自适应特征调制,实现自监督低光图像增强,在噪声抑制和纹理保真度上达到最优性能。

详情
AI中文摘要

自监督低光图像增强(LLIE)因其消除了对外部配对数据的依赖而极具吸引力。然而,缺乏外部参考导致网络难以解耦纠缠的照明、精细纹理和放大的噪声。为解决这一挑战,我们提出了一种内部参考的LLIE框架,该框架从退化输入图像本身提取可靠的物理和结构参考。首先,我们引入了一种局部曝光模拟方案来提取低频伪真值。这作为内部物理参考,用于指导全局照明估计和校正色偏。其次,我们提出了一种具有空间和光谱约束的双域保持策略来构建内部结构参考。具体来说,照明对齐感知损失在照明变化下保留全局结构,而平移不变光谱相关损失捕获细粒度局部结构并抑制高频噪声。最后,我们提出了一种增益自适应特征调制(GAFM)机制来处理高度空间变化的残余噪声。通过将自估计的照明图转换为内部空间增益先验,GAFM动态引导盲点网络进行空间感知去噪。大量实验表明,我们的方法实现了最先进的性能,提供了卓越的噪声抑制和纹理保真度。代码将在https://visonj.github.io/IRLE/公开。

英文摘要

Self-supervised low-light image enhancement (LLIE) is highly appealing as it eliminates the reliance on external paired data. However, the lack of external references causes networks to struggle with decoupling entangled illumination, delicate textures, and amplified noise. To resolve this challenge, we propose an Internally Referenced LLIE framework that extracts reliable physical and structural references from the degraded input image itself. First, we introduce a local exposure-simulated scheme to extract a low-frequency pseudo ground-truth. This serves as an internal physical reference to guide global illumination estimation and correct color casts. Second, we propose a dual-domain preservation strategy with spatial and spectral constraints to construct internal structural references. Specifically, an Illumination-Aligned Perceptual loss preserves global structures under illumination shifts, while a Shift-Invariant Spectral Correlation loss captures fine-grained local structures and suppresses high-frequency noise. Finally, we propose a Gain-Adaptive Feature Modulation (GAFM) mechanism to address highly spatially-variant residual noise. By transforming the self-estimated illumination map into an internal spatial gain prior, GAFM dynamically guides a blind-spot network for spatially-aware denoising. Extensive experiments demonstrate that our method achieves state-of-the-art performance, delivering superior noise suppression and textural fidelity. Code will be publicly released at https://visonj.github.io/IRLE/.

2605.28604 2026-05-28 cs.CV cs.AI 版本更新

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

挖掘多模态时空线索用于视频重要人物识别

Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang, Zheng Wang, Xin Xu, Mang Ye

发表机构 * School of Computer Science and Technology, Wuhan University of Science and Technology(武汉科技大学计算机科学与技术学院) Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology(湖北省智能信息处理与实时工业系统重点实验室) School of Computer Science, National Engineering Research Center for Multimedia Software, Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University(计算机科学学院,国家多媒体软件工程技术研究中心,湖北省多媒体与网络通信工程重点实验室,武汉大学) College of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学)

AI总结 针对视频中人物重要性随时间变化的问题,提出VIP-Net框架,通过多模态时空线索融合与时间重要性矫正,在Temporal-VIP数据集上达到67.3%准确率。

详情
AI中文摘要

识别视频场景中的关键人物对于自动视频编辑和智能监控等应用至关重要。当前方法主要关注静态图像和即时视觉线索,忽略了视频中丰富的时空信息。这导致了时间重要性转移(TIS)现象,即早期帧中被认为重要的人物在考虑整个时间上下文后可能被降级。为了解决这一问题,我们引入了视频重要人物(VIP)识别任务,旨在自动识别视频中最具影响力的人物,同时提供文本理由。我们提出了Temporal-VIP,一个大规模的理由标注数据集,包含11个类别的9,249个视频片段,并附有对齐的重要性理由。为了缓解TIS,我们开发了VIP-Net框架,包括用于提取多模态时空线索的社会线索编码器(SCE)、用于层次化线索融合和跨模态对齐的时间重要性矫正器(TIR),以及用于人物排序的VIP推理。实验结果表明,VIP-Net达到了67.3%的准确率,显著优于最先进的模型(37.5%-53.9%),并通过特征引导的LLM优化,平均理由相似度达到0.63。数据集和代码可在https://huggingface.co/datasets/yml2002/Temporal-VIP获取。

英文摘要

Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at https://huggingface.co/datasets/yml2002/Temporal-VIP.

2605.28587 2026-05-28 cs.CV 版本更新

Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation

可变形高斯占据:通过分解蒸馏解耦刚性与非刚性运动

Yang Gao, Wuyang Li, Po-Chien Luan, Alexandre Alahi

发表机构 * École Polytechnique Fédérale de Lausanne(瑞士联邦理工学院洛桑校区)

AI总结 提出DeGO框架,通过解耦高斯变形和分解式4D基础模型蒸馏,在弱监督下实现动态场景中刚性与非刚性运动的分离,显著提升人体实例占据预测性能。

Comments CVPR 2026

详情
AI中文摘要

理解动态3D环境对于安全自动驾驶至关重要,特别是推理以人为中心的非刚性智能体时。然而,现有的弱监督占据预测框架主要假设刚体运动并依赖简单的帧间偏移,限制了其捕捉细粒度变形和保持时间一致性的能力。为解决此问题,我们提出DeGO,一个可变形高斯占据框架,统一了解耦高斯变形与分解式4D基础模型蒸馏。DeGO解耦刚性和非刚性运动,使每个高斯基元通过变形和偏移更新共同演化。同时,分解式4D蒸馏策略从VGGT基础模型迁移跨相机和跨帧知识,产生对齐基础模型的特征,增强时间一致性。在Occ3D-NuScenes基准上的实验表明,我们的方法在弱监督下达到了最先进性能,在人体实例上获得13.5%的提升,整体提升10.9%。这些结果凸显了变形感知和基础模型引导的占据建模对动态场景理解的有效性。代码已公开:https://github.com/vita-epfl/DeGO

英文摘要

Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing weakly supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under weak supervision, delivering 13.5% gains on human-centric instances and 10.9% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding. The code is publicly available: https://github.com/vita-epfl/DeGO

2605.28548 2026-05-28 cs.CV 版本更新

GEM: Generative Supervision Helps Embodied Intelligence

GEM: 生成式监督助力具身智能

Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu

发表机构 * Tsinghua University(清华大学) Tencent Hunyuan(腾讯混元)

AI总结 提出GEM模型,通过在视觉语言模型预训练中引入深度图生成任务,联合训练以提升具身智能的语义理解与物理操作能力,并发布大规模数据集GEM-4M,在多个基准上取得最优结果。

Comments Project Page: https://zhaorw02.github.io/GEM/

详情
AI中文摘要

具身视觉语言模型(VLMs)在机器人领域,特别是在视觉-语言-动作框架中,展示了令人印象深刻的性能和泛化能力。然而,标准文本引导预训练范式的高层语义焦点与具身环境中执行所需的关键低层空间和物理知识之间仍存在显著差距。在本文中,我们介绍了GEM,一种生成式监督的具身视觉语言模型,旨在弥合这一鸿沟。我们提出将深度图生成任务直接集成到VLM预训练阶段。通过将这一生成目标与主模型联合训练,我们观察到具身智能的显著提升,同时增强了语义理解和物理操作能力。为了支持这一范式,我们整理并发布了GEM-4M,一个包含基础、推理和规划数据与高质量深度监督配对的大规模综合数据集。大量实验表明,GEM在多个具身基准上取得了最先进的结果。此外,我们部署的动作模型GEM-VLA在模拟环境和真实世界评估中均表现出卓越的任务执行能力。代码、模型和数据集可在https://zhaorw02.github.io/GEM/获取。

英文摘要

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/

2605.28544 2026-05-28 cs.CV 版本更新

DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

DriveWAM: 视频生成先验实现自动驾驶的可扩展世界-动作建模

Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Voyager Research, Didi Chuxing(Voyager Research,滴滴出行)

AI总结 提出DriveWAM,通过将预训练视频扩散Transformer适配为自回归视频-动作策略,并引入场景演化驾驶引导和选择性KV记忆,实现可扩展的世界-动作建模,在NAVSIM和PhysicalAI基准上取得强规划性能。

详情
AI中文摘要

预训练基础模型已成为端到端自动驾驶的重要基础。与主要在静态图像-文本对上预训练的视觉-语言模型相比,视频生成模型捕获了自然适合驾驶的时间动态和运动先验。我们提出DriveWAM,一种驾驶世界-动作模型,它将预训练的视频扩散Transformer适配为自回归视频-动作策略。DriveWAM将视频和动作流组织成统一的时序token序列,并在联合流匹配目标下训练它们,保留预训练的视频生成架构,同时将其大规模视频先验适应于动作生成。为了融入高层场景理解,我们引入了场景演化驾驶引导,其中冻结的VLM生成块特定的语义意图以指导视频-动作生成。为了保持长时域推演有界,我们进一步引入了选择性KV记忆,通过推理时的相关性-冗余性缓存选择来维护有界的模态感知视频和动作记忆池。在NAVSIM和PhysicalAI-Autonomous-Vehicles基准上的实验表明,DriveWAM实现了强大的规划性能,从4k到100k驾驶片段的数据缩放研究进一步证实了世界-动作建模在端到端自动驾驶中的扩展潜力。

英文摘要

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.
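The abstract describes selective KV memory only as relevance-redundancy cache selection under a bounded budget. One generic way to realize such a trade-off is greedy selection in the style of maximal marginal relevance, sketched below. This is an assumption for illustration, not DriveWAM's procedure: the function name, the dot-product scores, and the `redundancy_weight` parameter are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def select_memory(keys, query, budget, redundancy_weight=0.5):
    """Greedily keep at most `budget` cache entries.

    Each candidate is scored by its relevance to the query minus a penalty
    for its similarity to entries that were already kept.
    """
    selected = []
    remaining = list(range(len(keys)))
    while remaining and len(selected) < budget:
        def score(i):
            relevance = dot(keys[i], query)
            redundancy = max((dot(keys[i], keys[j]) for j in selected), default=0.0)
            return relevance - redundancy_weight * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy cache: entry 1 duplicates entry 0, so it should lose its slot
# to a less relevant but non-redundant entry.
keys = [
    [1.0, 0.0],  # entry 0
    [1.0, 0.0],  # entry 1: exact duplicate of entry 0
    [0.6, 0.8],  # entry 2: partially overlaps entry 0
    [0.0, 1.0],  # entry 3: orthogonal to entry 0
]
query = [1.0, 0.2]
kept = select_memory(keys, query, budget=2, redundancy_weight=1.0)
```

In the toy run the duplicate entry is skipped even though it ties for the highest relevance, which is the behavior a bounded memory pool needs to avoid filling up with near-identical frames.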

2605.28495 2026-05-28 cs.CV 版本更新

Janus-LoRA: A Balanced Low-Rank Adaptation for Continual Learning

Janus-LoRA:面向持续学习的平衡低秩适配

Cheng Chen, Pengpeng Zeng, Yuyu Guo, Lianli Gao, Hengtao Shen, Jingkuan Song

发表机构 * School of Computer Science and Technology, Tongji University, Shanghai, China(同济大学计算机科学与技术学院,上海,中国) School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China(电子科技大学计算机科学与工程学院,成都,中国) Shanghai Innovation Institute, Shanghai, China(上海创新研究院,上海,中国) Independent Researcher(独立研究者)

AI总结 提出Janus-LoRA框架,通过梯度修正实现参数级正交性以克服灾难性遗忘,并利用解耦边际损失增强特征级分离,从而在持续学习中平衡稳定性与可塑性。

Comments 9 pages, International Conference on Machine Learning

详情
AI中文摘要

低秩适配(LoRA)已成为持续学习的一种有前景的范式。它独立更新其低秩因子($A$和$B$),通过它们的相互作用对完整权重矩阵产生复合更新。为了防止灾难性遗忘,该更新应保持与包含先前学习知识的任务特定子空间正交。然而,我们发现这种复合更新系统性地违反了这种正交性,重新引入了干扰并破坏了稳定性。此外,天真地强制执行这种正交性会损害可塑性,破坏微妙的稳定性-可塑性权衡。为了解决这些问题,我们提出了Janus-LoRA框架,通过两个新颖的组件恢复这种平衡。具体来说,我们首先引入梯度修正,这是一种闭式解,数学上解耦LoRA的因子更新,针对通过高效在线估计识别的历史知识子空间强制执行正交性。接下来,为了增强可塑性,我们引入解耦边际损失,通过将新特征表示推离旧特征表示来促进特征级分离,从而为新学习创建独特、低干扰的区域。在具有挑战性的基准上的全面实验表明,通过协调参数级正交性与特征级分离,Janus-LoRA实现了优越的平衡,并建立了新的最先进性能。

英文摘要

Low-Rank Adaptation (LoRA) has emerged as a promising paradigm for Continual Learning. It independently updates its low-rank factors ($A$ and $B$), creating a composite update to the full weight matrix through their interaction. To prevent catastrophic forgetting, this update should remain orthogonal to the task-specific subspace that contains previously learned knowledge. However, we identify that this composite update systematically violates this orthogonality, reintroducing interference and undermining stability. Furthermore, naively enforcing this orthogonality compromises plasticity, disrupting the delicate stability-plasticity trade-off. To resolve these issues, we propose Janus-LoRA, a framework that restores this balance through two novel components. Specifically, we first introduce Gradient Rectification, a closed-form solution that mathematically decouples LoRA's factor updates, enforcing orthogonality against the historical knowledge subspace identified by an efficient Online Estimation. Next, to enhance plasticity, we introduce a Decoupled Margin Loss that promotes feature-level separation by pushing new feature representations away from old ones, thus creating distinct, low-interference regions for new learning. Comprehensive experiments on challenging benchmarks demonstrate that by harmonizing parameter-level orthogonality with feature-level separation, Janus-LoRA achieves a superior balance and establishes new state-of-the-art performance.
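The "composite update" mentioned in this abstract follows from plain algebra: if the adapted weight is the product B·A and each factor takes its own step, the full-matrix change is dB·A + B·dA + dB·dA, so the factors interact through coupled terms. The sketch below only checks that identity on toy rank-1 factors. It assumes the common convention that the LoRA update is B·A, uses made-up numbers, and does not implement Gradient Rectification or any other part of Janus-LoRA.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def sub(x, y):
    """Element-wise difference of two same-shaped matrices."""
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

# Toy rank-1 factors for a 2x2 weight: B is 2x1, A is 1x2 (values are arbitrary).
B = [[1.0], [2.0]]
A = [[3.0, -1.0]]
dB = [[0.5], [-0.5]]   # independent step on B
dA = [[0.25, 0.75]]    # independent step on A

# Change of the full weight matrix after both factors move.
composite = sub(matmul(add(B, dB), add(A, dA)), matmul(B, A))

# The same change written out term by term.
expanded = add(add(matmul(dB, A), matmul(B, dA)), matmul(dB, dA))
```

Because every term mixes one factor's step with the other factor's value or step, constraining dA and dB separately does not by itself constrain the composite matrix, which is the gap the abstract says the method targets.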

2605.28491 2026-05-28 cs.CV 版本更新

DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

DiscoForcing:基于扩散强制的实时音频驱动角色控制统一框架

Kaiyang Ji, Bingsheng Qian, Binghuan Wu, Kangyi Chen, Ye Shi, Jingya Wang

发表机构 * ShanghaiTech University(上海科技大学)

AI总结 针对实时音频响应角色控制问题,提出DiscoForcing框架,结合因果音乐编码器和扩散强制序列模型,在严格因果、有限延迟的流式生成中实现音频与全身运动的稳定对齐。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们研究实时音频响应角色控制作为一个部署忠实性问题:严格因果、有限延迟的流式生成,必须在交互帧率下生成连贯的全身运动,同时音频条件可能突然变化,包括节奏变化、音频丢失或用户编辑。先前的音乐到运动系统主要针对具有全局上下文的离线生成进行优化,在流式部署中,当条件历史变得过时或不可靠时,性能会下降。我们引入了DiscoForcing,一个流式音频驱动扩散框架,它将捕获节奏结构和相位动态的因果音乐编码器与在时间范围内以异构噪声水平训练的扩散强制序列模型相结合。在此基础上,我们设计了一个混合时间调度和一个历史引导的流式采样器,以明确权衡响应性与非平稳音频下的长期一致性。在端到端实时交互系统中实现,包括在线虚拟角色回放和人形部署工作流,DiscoForcing在匹配因果性和延迟约束下,比先前基线提供更稳定的长期展开和更清晰的音频-运动对齐,同时保持实时吞吐量。

英文摘要

We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly, including tempo shifts, drops, or user edits. Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.

2605.28490 2026-05-28 cs.CV cs.AI 版本更新

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

SSR3D-LLM: 通过潜在步骤实现结构化空间推理以实现统一3D-LLM中的细粒度定位

Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Soochow University(苏州大学)

AI总结 针对统一3D-LLM中细粒度查询的脆弱性,提出SSR3D-LLM,通过潜在空间推理步骤和几何感知评分器逐步精炼候选排名,在多个基准上取得最优结果。

详情
AI中文摘要

3D物体定位从自然语言中定位3D场景中的所指对象。统一的以实例为中心的3D-LLM旨在同时解决定位、对话、问答和描述任务,但许多方法依赖于单一的指针式定位决策,将关系指令压缩为一个选择。这对于需要根据上下文对象和空间关系排除多个同类候选的细粒度查询来说是脆弱的。我们提出结构化空间推理3D-LLM(SSR3D-LLM),一种用于统一3D-LLM的结构化定位接口。给定固定的Mask3D物体提议,LLM从查询中写出一系列潜在的空间推理步骤和记忆令牌,然后一个几何感知评分器读取这些潜在步骤,通过逐步长度掩码逐步精炼候选排名。潜在步骤从标准基准目标监督和训练期间的辅助指代线索监督中学习,而推理仅使用输入查询和Mask3D提议。在ReferIt3D、ScanRefer和Multi3DRef上,SSR3D-LLM在统一3D-LLM基线中取得了最强结果,在细粒度定位上相比单指针QPG基线有显著提升,并相比先前的统一3D-LLM有一致改进,同时保留了默认的语言任务路径。

英文摘要

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.

2605.28459 2026-05-28 cs.CV 版本更新

REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection

REVEAL:基于参考依据的多模态篡改检测推理

Jun Zhou, Bingwen Hu, Yaxiong Wang, Zhedong Zheng, Yongzhen Wang, Yuchen Zhang, Ping Liu

AI总结 提出REVEAL框架,通过参考依据验证和差异感知融合机制,结合任务解耦的混合专家架构,实现多模态篡改检测与定位,并支持无训练域适应。

Comments 11 pages, 3 figures

详情
AI中文摘要

多模态篡改检测旨在同时识别伪造的图像-文本对并定位被篡改区域,然而现有方法通常依赖于记忆孤立伪影,难以应对难以察觉的篡改痕迹或域偏移。受人类比较推理启发,我们将此任务重新表述为基于参考依据的验证问题,通过将查询与检索到的真实证据进行比较来评估真实性。我们提出REVEAL(参考支持的证据分析与定位验证),一个专门为此比较范式设计的框架。为支持该范式,我们构建了一个包含17万对真实新闻图像-文本对的大规模参考库,涵盖超过4万名公众人物。在技术上,REVEAL采用差异感知融合机制来捕捉查询与检索证据之间的细粒度差异。此外,我们引入任务解耦的混合专家(MoE)架构,以联合执行实例级检测和细粒度定位,有效缓解这些异构目标之间的优化冲突。大量实验表明,REVEAL显著优于最先进方法,并且通过简单更新参考库即可实现无训练域适应,为检测不断演变的虚假信息提供了稳健且实用的解决方案。代码可在 https://anonymous.4open.science/r/REVEAL-Reference-A006 获取。

英文摘要

Multimodal manipulation detection aims to simultaneously identify forged image-text pairs and localize tampered regions, yet existing methods typically rely on memorizing isolated artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, we reformulate this task as a reference-grounded verification problem, where authenticity is assessed by comparing a query against retrieved authentic evidence. We propose REVEAL (Reference-Enabled Verification for Evidence Analysis and Localization), a framework explicitly designed for this comparative paradigm. To support this paradigm, we construct a large-scale reference library comprising 170K authentic news image-text pairs featuring over 40K public figures. Technically, REVEAL employs a difference-aware fusion mechanism to capture fine-grained discrepancies between the query and retrieved evidence. Furthermore, we introduce a task-decoupled Mixture-of-Experts (MoE) architecture to jointly execute instance-level detection and fine-grained grounding, effectively mitigating optimization conflicts between these heterogeneous objectives. Extensive experiments demonstrate that REVEAL significantly outperforms state-of-the-art methods, and notably enables training-free domain adaptation by simply updating the reference library, offering a robust and practical solution for detecting evolving misinformation. Code is available at https://anonymous.4open.science/r/REVEAL-Reference-A006.

2605.28456 2026-05-28 cs.AI cs.CV eess.AS 版本更新

Diffusion Large Language Models for Visual Speech Recognition

用于视觉语音识别的扩散大语言模型

Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

发表机构 * Integrated Vision Language Lab, KAIST, South Korea(韩国科学技术院集成视觉语言实验室)

AI总结 提出首个基于扩散大语言模型(DLLM)的视觉语音识别框架DLLM-VSR,通过迭代掩码去噪和灵活顺序解码,结合置信度引导的解掩码策略及两阶段训练,并引入长度引导候选解码以降低目标长度不确定性,在LRS3上取得19.5%的词错误率。

Comments Code: https://github.com/JeongHun0716/dllm-vsr

详情
AI中文摘要

现有的视觉语音识别(VSR)系统通常依赖于从左到右的自回归解码,这可能在获得足够上下文之前,迫使对视觉模糊的令牌做出过早决策。我们提出DLLM-VSR,据我们所知,这是首个基于扩散大语言模型(DLLM)的VSR框架,将转录过程表述为具有灵活顺序解码的迭代掩码去噪。通过基于置信度的解掩码,DLLM-VSR早期提交高置信度位置,并利用已提交的令牌作为双向上下文来细化模糊令牌。为了使DLLM适应VSR,我们引入了一种两阶段掩码去噪训练策略,将视觉到文本的内容对齐与长度建模分离。我们进一步观察到,在假设知道真实转录长度的oracle长度解码下存在性能差距,这表明减少目标长度不确定性可以改善基于DLLM的VSR。为了缩小这一差距,我们开发了长度引导的候选解码,利用视频时长构建合理的转录长度假设,在多个假设下解码,并使用长度合理性和解码置信度对候选进行重新排序。所提出的方法仅使用LRS3的标注训练数据,就实现了19.5%的词错误率(WER),达到了最先进水平。

英文摘要

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion Large Language Model (DLLM)-based VSR framework, formulating transcription as iterative masked denoising with flexible-order decoding. With confidence-based unmasking, DLLM-VSR commits high-confidence positions early and uses the committed tokens as bidirectional context to refine ambiguous ones. To adapt DLLMs to VSR, we introduce a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. We further observe a performance gap with oracle-length decoding, which assumes access to the true transcript length, indicating that reducing target-length uncertainty can improve DLLM-based VSR. To reduce this gap, we develop length-guided candidate decoding, which uses video duration to construct plausible transcript-length hypotheses, decodes under multiple hypotheses, and reranks candidates using length plausibility and decoding confidence. The proposed method achieves a state-of-the-art WER of 19.5% on LRS3 using only its labeled training data.
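Confidence-based unmasking, as described in this abstract, can be pictured as a loop that starts from a fully masked sequence and commits the most confident position at each step, so that later predictions see context on both sides. The sketch below is a toy illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the model with hard-coded tokens and confidences, and the one-position-per-step schedule is chosen for simplicity rather than taken from DLLM-VSR.

```python
MASK = None  # placeholder for a position that has not been committed yet

def confidence_unmask(length, predict, per_step=1):
    """Fill a fully masked sequence, committing the most confident positions first.

    `predict(tokens)` must return {position: (token, confidence)} for every
    masked position. Returns the decoded tokens and the commit order.
    """
    tokens = [MASK] * length
    order = []
    while any(t is MASK for t in tokens):
        proposals = predict(tokens)
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:per_step]:
            tokens[pos] = tok
            order.append(pos)
    return tokens, order

def toy_predict(tokens):
    """Hypothetical model: fixed guesses whose confidence rises with context."""
    target = ["see", "you", "later"]
    base = [0.9, 0.3, 0.6]
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok is MASK:
            # Committed neighbours act as bidirectional context.
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
            bonus = 0.2 * sum(tokens[j] is not MASK for j in neighbours)
            proposals[i] = (target[i], min(1.0, base[i] + bonus))
    return proposals

decoded, commit_order = confidence_unmask(3, toy_predict)
```

In the toy run the ambiguous middle token is committed last, after both neighbours are fixed, whereas left-to-right decoding would have had to decide it with only one token of context.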

2605.28450 2026-05-28 cs.CV cs.AI 版本更新

BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers

BiasEdit: 一种无需训练的偏差检测与编辑框架,用于学习公平的视觉分类器

Jungwook Seo, Yoonsik Park, Changmin Lee, Sungyong Baik

发表机构 * Hanyang University, Department of Artificial Intelligence, BAIK Lab, Seoul, South Korea(汉阳大学人工智能系BAIK实验室,韩国首尔) Hanyang University, Department of Data Science, BAIK Lab, Seoul, South Korea(汉阳大学数据科学系BAIK实验室,韩国首尔) Hanyang University, Department of Data Science, Department of Artificial Intelligence, BAIK Lab, Seoul, South Korea(汉阳大学数据科学系、人工智能系BAIK实验室,韩国首尔) Hanyang University(汉阳大学)

AI总结 提出BiasEdit框架,通过统计依赖和互信息分析自动检测偏差属性,并利用文本引导的图像编辑生成无偏样本,无需手动标注即可实现公平分类。

Comments Accepted to The Web Conference 2026 (formerly WWW) as an Oral presentation

详情
AI中文摘要

来自网络的视觉数据为图像分类器提供动力,这些分类器通常支撑着许多网络服务,如推荐和内容审核。然而,原始网络数据常常包含虚假关联和社会偏见,而神经网络以其倾向于学习数据中存在的偏见而闻名。这可能会加剧网络服务和网络数据中的不公平性,导致恶性循环。在图像分类的背景下,当大多数图像仅针对给定类包含相同属性时,网络会学习该类别的偏差属性。因此,从有偏数据集中训练公平且去偏的分类器需要处理多数具有偏差属性的图像(偏差对齐样本)与少数没有偏差属性的图像(偏差冲突样本)之间的不平衡问题。在这项工作中,我们引入了BiasEdit,一个模块化框架,能够自动从原始数据集中检测偏差属性并对其进行编辑,以构建去偏数据集。具体来说,BiasEdit首先通过视觉-语言表示的统计依赖性和互信息分析检测未知的偏差属性,然后使用文本引导的图像编辑显式编辑这些属性,以生成逼真的偏差冲突样本。与先前假设已知偏差属性或依赖合成混合的工作不同,我们的方法无需手动标注,并且可以利用现成的视觉-语言和编辑模型。BiasEdit解决了网络来源视觉AI中的一个基本挑战,减轻了数据集引起的偏差,并在训练数据完全有偏的情况下实现了最先进的去偏性能。

英文摘要

Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. However, raw Web data often contain spurious correlations and social biases, and neural networks are known for their tendency to learn biases present in data. This can reinforce unfairness in web services and web data, leading to a vicious cycle. In the context of image classification, networks learn bias attributes for a specific class when a majority of images contain the same attribute only for a given class. Hence, training a fair and debiased classifier from a biased dataset requires handling the imbalance between a majority of images with bias attributes (bias-aligned samples) and a minority without (bias-conflict samples). In this work, we introduce BiasEdit, a modular framework that automatically detects bias attributes from the original dataset and edits them to construct a debiased dataset. Specifically, BiasEdit first detects unknown bias attributes via statistical dependence and mutual information analysis of visual-linguistic representations, and then explicitly edits those attributes using text-guided image editing to generate realistic bias-conflict samples. Unlike prior works that assume known bias attributes or rely on synthetic mixing, our method operates without manual annotations and can leverage off-the-shelf vision-language and editing models. BiasEdit addresses a fundamental challenge in Web-sourced visual AI, mitigating dataset-induced bias and achieving state-of-the-art debiasing performance even when training data are fully biased.
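The detection step relies on statistical dependence between an attribute and the class label. As a minimal sketch of that signal, the snippet below computes the standard mutual information between two discrete variables from co-occurrence counts. It assumes attributes have already been reduced to discrete tags, and the bird and background labels are hypothetical. BiasEdit itself works on visual-linguistic representations, which this example does not model.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (in bits) between two discrete variables.

    `pairs` is a list of (class_label, attribute) observations.
    """
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi

# Fully biased toy set: the background attribute determines the class.
biased = [("waterbird", "water")] * 4 + [("landbird", "land")] * 4

# Balanced toy set: every class appears equally often with every background.
balanced = [
    (cls, bg)
    for cls in ("waterbird", "landbird")
    for bg in ("water", "land")
    for _ in range(2)
]

mi_biased = mutual_information(biased)
mi_balanced = mutual_information(balanced)
```

An attribute with high mutual information with the label is a candidate bias attribute, while a value near zero means the attribute carries no information about the class in this dataset.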

2605.28441 2026-05-28 cs.CV cs.AI 版本更新

Bayesian Gated Non-Negative Contrastive Learning

贝叶斯门控非负对比学习

Peng Cui, Jiahao Zhang, Lijie Hu

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对对比学习中表示纠缠问题,提出贝叶斯门控非负对比学习,通过概率门控机制动态过滤无关特征,在Imagenet-100上语义一致性提升142.1%。

Comments Accepted by ICML 2026

详情
AI中文摘要

虽然对比学习(CL)已经革新了自监督表示学习,但其潜在表示仍然高度纠缠且不透明,限制了在安全关键应用中的可解释性。我们发现这种纠缠的一个根本原因是对确定性相似度量的依赖,该度量平等地对待所有特征维度。在组合场景中,这会产生优化冲突:常见的背景特征(如“蓝天”)被鼓励在正对中对齐,但同时又在负对中排斥,导致梯度振荡,阻碍精确的语义解缠。为了解决这个问题,我们提出了BayesNCL(贝叶斯门控非负对比学习)。与标准方法不同,BayesNCL引入了一种概率门控机制,动态过滤掉与任务无关的高频常见特征,同时选择性地保留判别性语义。通过将特征选择形式化为具有稀疏伯努利先验的变分推理问题,我们的方法有效解决了优化冲突。在Imagenet-100上的实验结果表明,与最先进的基线相比,BayesNCL在语义一致性上实现了142.1%的显著提升,在不影响下游任务性能的情况下产生了高度可解释的表示。代码可在 https://github.com/Cui-Peng-624/BayesNCL 获取。

英文摘要

While Contrastive Learning (CL) has revolutionized self-supervised representation learning, its latent representations remain highly entangled and opaque, limiting their interpretability in safety-critical applications. We identify that a fundamental cause of this entanglement is the reliance on deterministic similarity measures, which treat all feature dimensions equally. In compositional scenes, this creates an Optimization Conflict: common background features, such as "blue sky", are encouraged to align in positive pairs but simultaneously repelled in negative pairs, causing gradient oscillations that hinder precise semantic disentanglement. To address this, we propose BayesNCL (Bayesian Gated Non-Negative Contrastive Learning). Unlike standard approaches, BayesNCL introduces a probabilistic gating mechanism that dynamically filters out task-irrelevant, high-frequency common features while selectively retaining discriminative semantics. By formalizing feature selection as a variational inference problem with a sparse Bernoulli prior, our method effectively resolves the optimization conflict. Experimental results on ImageNet-100 demonstrate that BayesNCL achieves a remarkable 142.1% improvement in semantic consistency compared to state-of-the-art baselines, yielding highly interpretable representations without compromising downstream task performance. Code is available at https://github.com/Cui-Peng-624/BayesNCL.

2605.28428 2026-05-28 cs.CV cs.AI 版本更新

Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

通过无训练图拉普拉斯能量最小化的非一致性异常检测

Jungwook Seo, Minjeong Kim, Younkwan Lee, Seungho Shin, Sungyong Baik

发表机构 * Dept. of Artificial Intelligence, Hanyang University(人工智能系,汉阳大学) Dept. of Data Science, Hanyang University(数据科学系,汉阳大学) Global Technology Research, Samsung Electronics(三星电子全球技术研究)

AI总结 提出一种无训练图拉普拉斯能量优化方法ANoCo,通过查询补丁与正常流形对齐所需的更新幅度来度量异常,无需学习参数或采样,在标准基准上取得强图像级AUROC和稳定定位图。

Comments Accepted to CVPR 2026

详情
AI中文摘要

检测图像中的细微视觉异常仍然具有挑战性,特别是当仅预先提供正常样本时。这种无监督异常检测通常通过测量查询补丁与正常补丁记忆库的特征相似性来解决。然而,仅凭相似性无法揭示查询补丁在多大程度上违反了正常特征流形的结构。我们提出了一种无训练的拉普拉斯图能量优化公式,名为ANoCo,它通过查询补丁与固定正常流形对齐所需的非一致性成本来评分异常。对于每个查询补丁,我们构建一个由余弦亲和性加权的二分查询-正常图,明确移除查询-查询和正常-正常边以防止证据稀释。我们将异常评分公式化为带有锚定正常节点的凸拉普拉斯能量,并以闭式求解。特别地,我们不使用优化后的特征本身——异常分数是满足正常性约束所需的更新幅度,将图拉普拉斯重新定义为非一致性算子而非平滑先验。所提出的方法不引入可学习参数、消息传递或采样,其复杂度与单次线性求解相当。在标准基准上,它实现了强大的图像级AUROC、稳定的定位图以及相比先前方法更强的鲁棒性,证明了使用优化诱导的特征漂移作为异常度量的有效性。

英文摘要

Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring the feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo, that scores Anomaly by the cost of Non-Conformity of a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query-to-normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution. We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve it in closed form. In particular, we do not use the optimized features themselves: the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior. The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve. Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as an anomaly measure.
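
As a rough illustration of scoring by update magnitude, the sketch below handles a single query patch. The energy form, the affinity mapping (1 + cos) / 2, the parameter `lam`, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nonconformity_score(query, normals, lam=1.0):
    """Norm of the update that pulls `query` toward anchored normal patches.

    Assumed energy (hypothetical, not taken from the paper):
        E(z) = ||z - query||^2 + lam * sum_j w_j * ||z - normal_j||^2
    with the normal features held fixed. Setting the gradient to zero gives
        z* = (query + lam * sum_j w_j * normal_j) / (1 + lam * sum_j w_j).
    """
    # Map cosine similarity from [-1, 1] to a non-negative affinity in [0, 1].
    weights = [(1.0 + cosine(query, n)) / 2.0 for n in normals]
    total = sum(weights)
    z_star = [
        (query[d] + lam * sum(w * n[d] for w, n in zip(weights, normals)))
        / (1.0 + lam * total)
        for d in range(len(query))
    ]
    # The score is the size of the update, not the updated feature itself.
    return math.sqrt(sum((z - q) ** 2 for z, q in zip(z_star, query)))


normal_bank = [[1.0, 0.0], [1.0, 0.0]]
print(nonconformity_score([1.0, 0.0], normal_bank))  # conforming patch
print(nonconformity_score([0.0, 1.0], normal_bank))  # off-manifold patch
```

A patch that already lies on the toy "manifold" needs no update, while an off-manifold patch needs a large one, which is the intuition the abstract describes.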

2605.28422 2026-05-28 cs.CV cs.AI 版本更新

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

VITAL: 视觉-语义双重监督增强可解释的医学多模态大语言模型潜在推理

Qiaoru Li, Shaotian Liang, Jintao Chen, Haoran Sun, Yuxiang Cai, Jianwei Yin, Yankai Jiang

发表机构 * Zhejiang University(浙江大学) Shanghai AI Laboratory(上海人工智能实验室) Tencent(腾讯) Ningbo Global Innovation Center, Zhejiang University(宁波全球创新中心,浙江大学) Zhejiang Key Laboratory of Digital-Intelligence Service Technology(浙江省数字智能服务技术重点实验室)

AI总结 提出VITAL框架,通过视觉-语义双重监督(文本解码器重构推理链、视觉投影器回归ROI特征)实现医学MLLM的可解释潜在推理,在7个基准上达到SOTA。

详情
AI中文摘要

潜在推理能够对连续隐藏状态而非显式token进行推理,避免了医学VQA中思维链的语言瓶颈和推理开销。然而,现有方法存在模态崩溃、视觉监督不足以及训练-推理不匹配的问题。此外,其不透明的潜在状态缺乏可解释性,而这在临床应用中至关重要。我们提出VITAL,一个用于医学MLLM的潜在空间推理框架,具有视觉-语义双重监督:一个辅助文本解码器从潜在状态重建推理链,同时一个视觉投影器从冻结的独立医学视觉编码器回归ROI特征。两个模块在推理时被丢弃,零开销,但可以在事后重新附加以实现双重可解释性,在不牺牲效率的情况下提供推理过程的文本和视觉解释。我们构建了一个涵盖9种成像模态的61K数据集,比之前的医学视觉潜在推理数据集大一个数量级。在7个基准上的实验表明,VITAL一致且显著优于骨干模型、所有潜在推理基线以及在更大数据上训练的医学MLLM,达到了与万亿参数专有模型竞争的最先进结果。

英文摘要

Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.

2605.28401 2026-05-28 cs.CV 版本更新

EgoRelight: Egocentric Human Capture and Illumination Recovery for Relightable and Photoreal Avatar Rendering

EgoRelight: 基于自我中心的人体捕捉与光照恢复实现可重光照和逼真化身渲染

Jianchun Chen, Yinda Zhang, Rohit Pandey, Thabo Beeler, Marc Habermann, Christian Theobalt

发表机构 * MPI for Informatics, SIC & VIA Research Center, Saarbrücken, Germany; Google, Mountain View, USA; Google, Zürich, Switzerland; MPI for Informatics & VIA Research Center, Saarbrücken, Germany

AI总结 提出EgoRelight框架,通过头戴显示器上的立体下视相机提取深度图驱动网格化身,并利用神经外观模型分别合成视角相关镜面反射和视角无关漫反射,结合测试时逆渲染恢复HDR环境图,实现从单一HMD进行全身性能捕捉、逼真可重光照外观合成和环境光照估计。

详情
AI中文摘要

混合现实(MR)头戴显示器承诺了一个沉浸式远程呈现的未来,其中虚拟人无缝地融入真实或虚拟环境。实现这一愿景需要一种方法,能够从头戴显示器(HMD)的受限视角捕捉用户的运动、估计新光照下的外观并理解环境。现有方法将这些视为孤立问题:它们要么专注于驱动具有固定光照的化身,要么依赖工作室设置进行重光照。在本文中,我们提出了EgoRelight,一个用于自我中心远程呈现的整体框架,它同时捕捉全身人体性能、合成逼真且可重光照的外观,并从单个HMD估计高动态范围(HDR)环境图。首先,为了确保运动和表面重建,我们提出了一个自我中心感知模块,利用立体下视相机提取密集深度图,作为几何控制信号驱动基于网格的化身。其次,我们引入了一种新颖的神经外观模型,该模型学习分别合成视角相关的镜面反射和视角无关的漫反射。通过采用专门的射线采样策略,我们的模型能够泛化到未见过的光照,而不依赖限制性的解析BRDF先验。第三,我们通过测试时逆渲染过程实现化身无缝集成到物理世界,该过程通过将预训练化身的外观与实时自我中心相机观测匹配来恢复HDR环境图。我们通过一个社交远程呈现应用演示了我们的系统,其中远程用户根据其物理环境被一致地重光照。大量实验表明,我们的组件和集成系统在几何精度、渲染以及重光照保真度方面显著优于最先进的基线方法。

英文摘要

Mixed Reality (MR) headsets promise a future of immersive telepresence where virtual humans blend indistinguishably into real or virtual surroundings. Achieving this vision requires a method for capturing a user's motion, estimating appearance under novel lighting, and understanding the environment - all from the constrained viewpoint of a head-mounted display (HMD). Existing approaches treat these as isolated problems: they either focus on driving avatars with baked-in lighting or rely on studio setups for relighting. In this paper, we present EgoRelight, a holistic framework for egocentric telepresence that simultaneously captures full-body human performance, synthesizes photorealistic and relightable appearance, and estimates high dynamic range (HDR) environment maps from a single HMD. First, to ensure motion and surface reconstruction, we propose an egocentric perception module that leverages stereo down-facing cameras to extract dense depth maps, which serve as geometric control signals to drive a mesh-based avatar. Second, we introduce a novel neural appearance model that learns to synthesize view-dependent specular and view-independent diffuse shading separately. By employing a specialized ray-sampling strategy, our model generalizes to unseen illumination without relying on restrictive analytical BRDF priors. Third, we enable seamless avatar integration into the physical world via a test-time inverse rendering process, which recovers an HDR environment map by matching the pre-trained avatar's appearance to live egocentric camera observations. We demonstrate our system through a social telepresence application, where remote users are coherently relit according to their physical environment. Extensive experiments show that our components and the integrated system significantly outperform state-of-the-art baselines in geometric accuracy and rendering as well as relighting fidelity.

2605.28397 2026-05-28 cs.CV 版本更新

Adaptive Temporal Gating of Longitudinal Magnetic Resonance Imaging for Alzheimer's Prediction

用于阿尔茨海默病预测的纵向磁共振成像自适应时间门控

Alireza Moayedikia, Sara Fin, Alicia Troncoso Lora, Uffe Kock Wiil

发表机构 * School of Business, Law and Entrepreneurship, Swinburne University of Technology, Melbourne, VIC, Australia; Australian Regenerative Medicine Institute, Monash University, Melbourne, VIC, Australia; Data Science & Big Data Lab, Universidad Pablo de Olavide, Seville, Spain; The Maersk Mc-Kinney Møller Institute, University of Southern Denmark, Odense, Denmark

AI总结 提出TAF-Net混合CNN-Transformer架构,通过自适应时间门控融合纵向3D MRI的时空表示,在MCI-to-AD转化预测中仅用结构MRI即达到最优性能,接近需多模态数据的方法。

详情
AI中文摘要

从轻度认知障碍(MCI)到阿尔茨海默病(AD)的转化预测对于早期干预至关重要。当前的深度学习范式主要依赖于横截面结构MRI,忽略了患者特定解剖轨迹中的预后价值。我们引入了时间自适应融合网络(TAF-Net),这是一种混合CNN-Transformer架构,用于建模配对的纵向3D MRI扫描。TAF-Net的核心是由自适应时间门控的时间融合模块,该模块学习患者特定的权重以合成三种时空表示:显式结构变化、区域间时间交叉注意力和双侧特征拼接。在阿尔茨海默病神经影像学倡议队列上进行的三年MCI-to-AD转化预测评估中,TAF-Net仅使用结构MRI就在所有评估方法中取得了最高的判别性能,显著优于最强基线,并接近需要PET、CSF或遗传数据的多模态方法。该架构表现出卓越的数据效率,仅用少量训练数据即可匹配基线性能。消融研究表明,纵向融合提高了判别能力,同时与单时间点评估相比,预测方差降低了48%。可解释性分析显示,空间注意力与内侧颞叶和脑室中已建立的AD病理学一致,而门控机制优先考虑与转化风险强正相关的显式体积变化。

英文摘要

Predicting conversion from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is critical for early intervention. Current deep learning paradigms predominantly rely on cross-sectional structural MRI, neglecting prognostic value in patient-specific anatomical trajectories. We introduce the Temporal Adaptive Fusion Network (TAF-Net), a hybrid CNN-Transformer architecture that models paired longitudinal 3D MRI scans. Central to TAF-Net is a Temporal Fusion Module governed by an Adaptive Temporal Gate, which learns patient-specific weightings to synthesize three spatiotemporal representations: explicit structural change, region-to-region temporal cross-attention, and bilateral feature concatenation. Evaluated on the Alzheimer's Disease Neuroimaging Initiative cohort for three-year MCI-to-AD conversion prediction, TAF-Net achieved the highest discriminative performance among all evaluated methods using only structural MRI, significantly outperforming the strongest baseline and approaching multimodal methods requiring PET, CSF, or genetic data. The architecture exhibited exceptional data efficiency, matching baseline performance with a fraction of training data. Ablation studies demonstrate that longitudinal fusion improves discrimination while reducing predictive variance by 48% compared to single-timepoint evaluation. Interpretability analyses reveal spatial attention aligned with established AD pathology in the medial temporal lobe and ventricles, while the gating mechanism prioritizes explicit volumetric change with strong positive correlation to conversion risk.

2605.28394 2026-05-28 cs.CV cs.GR 版本更新

Sketch2Motion: Text-driven 2D Sketch to 3D Animation via Diffusion-guided Skeleton Optimization

Sketch2Motion: 文本驱动的二维草图到三维动画的扩散引导骨架优化

Gaurav Rai, Ojaswa Sharma

发表机构 * Graphics Research Group, IIIT Delhi(IIIT德里图形研究组)

AI总结 提出Sketch2Motion框架,结合扩散模型和骨架优化,将二维草图转化为三维动画,无需配对运动数据,支持多种角色类型。

详情
AI中文摘要

二维手绘草图的动画化提供了一种有效的视觉交流媒介。然而,这些草图带来了挑战,特别是在处理遮挡和准确映射运动方面。虽然三维动画自然地解决了这些挑战,但估计三维运动仍然是一项非常复杂的任务。最近将二维草图转换为三维动画的方法主要集中在特定类型的运动上,例如双足运动和面部表情。我们提出了Sketch2Motion,一个基于扩散引导的骨架运动合成框架,它将经典的角色动画流程与深度生成先验相结合。我们的方法使用骨架变换来表示运动,通过线性混合蒙皮将其传播到网格变形。为了引导生成的动画朝向真实且语义上有意义的运动,我们通过运动感知分数蒸馏采样(MoSDS)集成了文本到视频扩散模型,从而无需配对运动数据即可进行优化。此外,我们应用物理启发的平滑性、拓扑和接触约束来稳定优化并保持运动合理性。进一步地,我们集成了一个弹簧-质量模拟器来引入次级运动效果。所提出的框架是通用的、完全可微的、模块化的,并且兼容双足、四足和非生命体铰接角色。实验表明,我们的方法生成了时间上连贯、与文本对齐的动画,其性能优于缺乏生成先验或显式物理约束的基线运动迁移方法。我们将公开我们的代码和数据集。

英文摘要

Animation of 2D hand-drawn sketches provides an effective medium for visual communication. However, these sketches pose challenges, particularly in handling occlusions and accurately mapping motion. While 3D animation naturally addresses these challenges, estimating 3D motion remains a very complex task. Recent approaches to converting 2D sketches to 3D animations have mainly focused on specific types of motion, such as bipedal movements and facial expressions. We propose Sketch2Motion, a diffusion-guided framework for skeleton-based motion synthesis that combines classical character animation pipelines with deep generative priors. Our method represents motion using skeletal transformations, which are propagated to mesh deformations via linear blend skinning. To guide the resulting animation toward realistic and semantically meaningful motion, we integrate a text-to-video diffusion model via motion-aware score-distillation sampling (MoSDS), enabling optimization without paired motion data. Additionally, we apply physics-inspired smoothness, topological, and contact constraints to stabilize optimization and preserve motion plausibility. Further, we integrate a spring-mass simulator to introduce secondary motion effects. The proposed framework is generalized, fully differentiable, modular, and compatible with biped, quadruped, and non-living articulated characters. Experiments demonstrate that our approach produces temporally coherent, text-aligned animations that outperform baseline motion transfer methods that lack generative priors or explicit physical constraints. We will make our code and dataset publicly available.
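
Linear blend skinning, the step that propagates skeletal transformations to the mesh, can be sketched in a few lines. The version below is a minimal 2D illustration of the standard formula v' = sum_i w_i (R_i v + t_i); the function names and the 2D simplification are assumptions for illustration and are not taken from the Sketch2Motion code.

```python
import math


def rigid_transform(angle, tx, ty):
    """Return a 2D rotation matrix and translation for one bone."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s), (s, c)), (tx, ty)


def skin_vertex(vertex, weights, bone_transforms):
    """Blend the per-bone transformed positions of one vertex.

    `weights` are the skinning weights of the vertex (expected to sum to 1),
    and `bone_transforms` holds one (rotation, translation) pair per bone.
    """
    x, y = vertex
    out_x = out_y = 0.0
    for w, (rot, (tx, ty)) in zip(weights, bone_transforms):
        out_x += w * (rot[0][0] * x + rot[0][1] * y + tx)
        out_y += w * (rot[1][0] * x + rot[1][1] * y + ty)
    return out_x, out_y


bones = [rigid_transform(0.0, 0.0, 0.0), rigid_transform(0.0, 2.0, 0.0)]
print(skin_vertex((1.0, 1.0), [0.5, 0.5], bones))
```

Because every operation is a weighted sum of affine maps, the deformed vertex is differentiable with respect to the bone parameters, which is what makes gradient-based skeleton optimization possible.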

2605.28392 2026-05-28 cs.CV 版本更新

Bound-Constrained Sparse Representation for Electrical Impedance Tomography

边界约束稀疏表示用于电阻抗成像

Chun Zhang, Dong Liu

发表机构 * School of Biomedical Engineering(生物医学工程学院) Suzhou Institute for Advanced Research(苏州高等研究院) University of Science and Technology of China(中国科学技术大学) Laboratory of Spin Magnetic Resonance(自旋磁共振实验室) Anhui Province Key Laboratory of Scientific Instrument Development and Application(安徽省科学仪器开发与应用重点实验室) Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology(江苏省多模态数字孪生技术重点实验室) Institute of Quantum Sensing of WuXi(无锡量子传感研究所)

AI总结 提出一种边界约束稀疏表示框架,通过隐式复合参数化从低维潜变量生成电导率,无需显式正则化即可改善电阻抗成像中的电导率估计。

详情
AI中文摘要

本研究提出了一种用于电阻抗成像(EIT)的边界约束稀疏表示(BC-SR)框架,旨在无需显式正则化的情况下改善电导率估计。BC-SR采用表示驱动策略,通过隐式复合参数化从低维潜变量生成电导率。利用截断图拉普拉斯基嵌入结构先验,同时通过边界保持非线性映射强制电导率处于允许范围内,并通过隐式梯度调制改善条件。该方法即使在噪声或不完整数据下也能确保鲁棒收敛。在2D/3D模拟、水箱实验和体内肺部数据上的广泛验证表明,BC-SR提高了物理一致性和结构保真度,与传统方法相比具有更强的鲁棒性。此外,BC-SR能够实现3D时差EIT重建,提供更好的空间分辨率和更连贯的3D电导率分布表示,尤其对于体内肺部数据。这表明其在EIT中具有改进性能的潜力,特别是在呼吸监测的临床应用中。

英文摘要

This study proposes a bound-constrained sparse representation (BC-SR) framework for electrical impedance tomography (EIT), aimed at improving conductivity estimation without explicit regularization. BC-SR adopts a representation-driven strategy, generating conductivity from low-dimensional latent variables via an implicit composite parameterization. Structural priors are embedded using a truncated graph-Laplacian basis, while a bound-preserving nonlinear mapping enforces admissible conductivity ranges and improves conditioning through implicit gradient modulation. The approach ensures robust convergence, even under noisy or incomplete data. Extensive validation on 2D/3D simulations, tank experiments, and in-vivo lung data shows that BC-SR improves physical consistency and structural fidelity, offering enhanced robustness compared to traditional methods. Additionally, BC-SR enables 3D time-difference EIT reconstruction, offering improved spatial resolution and a more coherent representation of 3D conductivity distributions, particularly for in-vivo lung data. This suggests potential for improved performance in EIT, particularly in clinical applications for respiratory monitoring.
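
One common way to keep a reconstructed quantity inside admissible bounds is to pass an unconstrained latent value through a scaled sigmoid. The sketch below shows that idea only; the abstract does not say which bound-preserving mapping BC-SR uses, so the sigmoid choice, the function names, and the bounds are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bounded_conductivity(latent, lower, upper):
    """Map an unconstrained latent value into [lower, upper]."""
    return lower + (upper - lower) * sigmoid(latent)


def conductivity_gradient(latent, lower, upper):
    """Derivative of the mapping with respect to the latent value.

    The factor s * (1 - s) shrinks updates near the bounds, which is one
    plausible reading of the "implicit gradient modulation" in the abstract.
    """
    s = sigmoid(latent)
    return (upper - lower) * s * (1.0 - s)


for latent in (-5.0, 0.0, 5.0):
    print(latent, bounded_conductivity(latent, 0.5, 2.5),
          conductivity_gradient(latent, 0.5, 2.5))
```

With this parameterization no explicit projection or penalty is needed to respect the bounds, since every latent value maps to an admissible conductivity.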

2605.28348 2026-05-28 cs.CV 版本更新

Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

面向语义无关和形状感知的视觉-语言分割模型

Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud

发表机构 * Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, France(波尔多大学,法国国家科学研究中心,波尔多理工学院,LaBRI实验室,UMR 5800,法国) Univ. Bordeaux, CNRS, Bordeaux INP, IMS, UMR 5218, France(波尔多大学,法国国家科学研究中心,波尔多理工学院,IMS,UMR 5218,法国)

AI总结 提出语义无关且形状感知(SANSA)分割范式,通过非语义文本描述微调模型,在保持语义提示性能的同时,在新任务上提升高达20% mIoU。

Comments Accepted at the 2026 IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

视觉-语言分割模型最近通过利用自然语言表达的高层语义对象类别取得了强大性能。然而,这种语义依赖性限制了它们对形状、几何或纹理等内在视觉属性的推理能力,而这些属性在许多实际应用中至关重要。在这项工作中,我们引入了语义无关且形状感知(SANSA)分割,这是一种新的范式,要求分割模型仅从非语义文本描述中运行。为此,我们提出了两种基于字典约束或示例指导生成SANSA分割提示的策略,两者都生成语义无关的文本描述。然后使用这些提示在语义无关监督下微调分割模型。实验表明,与预训练的最先进模型相比,在此新分割任务上对SANSA提示进行微调可带来高达20%的mIoU改进,同时在标准语义提示上保持强劲性能。这些结果强调了低层和中层视觉推理对于提高视觉-语言分割模型的泛化性和可控性的重要性。

英文摘要

Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.

2605.28331 2026-05-28 cs.CV 版本更新

Transfer learning RGB models to hyperspectral images with trainable tensor decompositions

使用可训练张量分解将RGB模型迁移到高光谱图像

Mariette Schönfeld, Laurens Devos, Wannes Meert, Hendrik Blockeel

发表机构 * KU Leuven, Dept. of Computer Science(鲁汶大学计算机科学系) Leuven.AI - KU Leuven Institute for AI(Leuven.AI - 鲁汶大学人工智能研究所)

AI总结 提出一种通过可训练张量分解将预训练RGB模型的卷积滤波器分解为空间和光谱成分,并替换光谱成分以适应高光谱图像通道数的方法,实现高光谱图像迁移学习,实验表明该方法比其他方法更准确和鲁棒。

详情
AI中文摘要

迁移学习使得大型视觉网络能够通过将模型的通用滤波器专门化到新任务,从而应用于各种领域。然而,这些网络假设输入图像具有3个输入通道,使其与多光谱或高光谱图像不兼容。当前缓解这种不兼容性的方法要么牺牲图像信息,要么牺牲模型信息。本文提出了一种新颖的方法,通过使用部分可训练的张量分解来保留图像和模型中的空间信息。我们创建预训练卷积滤波器的这种分解,将滤波器分离为空间和光谱成分。然后,将光谱成分替换为具有更高通道维度的可训练成分。这创建了能够专门化到新数据集的高光谱滤波器,同时保留原始滤波器的空间模式。在各种高光谱数据集上的实验表明,我们的方法比其他高光谱迁移学习方法更准确和鲁棒。

英文摘要

Transfer learning makes it possible to use large vision networks on a variety of domains by specializing the models' general filters to new tasks. However, these networks assume that input images have 3 channels, making them incompatible with multi- or hyperspectral images. Current approaches that mitigate this incompatibility sacrifice information in either the image or the model. This work proposes a novel approach that preserves both the image and the spatial information present in the model by using partially trainable tensor decompositions. We create such decompositions of pretrained convolutional filters, separating the filters into spatial and spectral components. The spectral components are then replaced with trainable components of higher channel dimensionality. This creates hyperspectral filters that can specialize to new datasets while retaining the spatial patterns of the original filters. Experiments on a variety of hyperspectral datasets show that our approach is more accurate and robust than other hyperspectral transfer learning methods.

2605.28324 2026-05-28 cs.CV 版本更新

Inpainting-Style Conditional Diffusion for Multivariable Time Series Forecasting

基于修复风格条件扩散的多变量时间序列预测

Kourosh Kiani, S. M. Muyeen

发表机构 * Electrical and Computer Engineering, Semnan University(电气与计算机工程系,塞姆南大学) Electrical Engineering Department, Qatar University(电气工程系,卡塔尔大学)

AI总结 提出一种将多变量时间序列预测重构为图像修复问题的条件扩散框架,通过掩码条件扩散机制和零填充策略实现高精度短期太阳能功率预测。

详情
AI中文摘要

本文提出了一种新颖的基于条件扩散的框架,用于多变量时间序列太阳能功率预测。该方法通过滑动窗口补丁构建将时间序列光伏数据重新表述为结构化二维表示(图像),从而在统一的时空学习范式内应用去噪扩散概率模型(DDPM)。本文的一个关键贡献是将太阳能预测表述为修复问题,其中未来时间步被视为待重建的缺失区域。这是通过基于掩码的条件扩散机制实现的,其中历史观测作为条件上下文保留,而目标(未来)区域则逐步被破坏并通过反向扩散恢复。该模型学习基于观测数据生成连贯的未来序列,有效执行时间序列修复。为了充分利用所有可用特征并确保与U-Net架构约束兼容,引入了一种零填充策略来构建固定大小的输入。模型使用监督去噪目标进行训练,以预测注入的噪声,从而在反向过程中实现准确的迭代重建。在包括GEFCom2014在内的基准光伏数据集上进行的大量实验表明,所提出的方法实现了高预测精度,特别是在短期预测中。结果凸显了将基于扩散的生成建模与修复公式相结合用于稳健、灵活和高保真太阳能功率预测的有效性。

英文摘要

In this paper, we propose a novel conditional diffusion-based framework for multivariable time-series solar power forecasting. The proposed method reformulates temporal PV data as structured two-dimensional representations (images) using a sliding-window patch construction, enabling the application of Denoising Diffusion Probabilistic Models (DDPM) within a unified spatiotemporal learning paradigm. A key contribution of this work is the formulation of solar forecasting as an inpainting problem, where future time steps are treated as missing regions to be reconstructed. This is achieved through a mask-based conditional diffusion mechanism, in which historical observations are preserved as conditioning context while the target (future) region is progressively corrupted and subsequently recovered via reverse diffusion. The model learns to generate coherent future sequences conditioned on observed data, effectively performing time-series inpainting. To fully utilize all available features and ensure compatibility with U-Net architectural constraints, a zero-padding strategy is introduced to construct fixed-size inputs. The model is trained using a supervised denoising objective to predict injected noise, enabling accurate iterative reconstruction during the reverse process. Extensive experiments conducted on benchmark PV datasets, including GEFCom2014, demonstrate that the proposed approach achieves high forecasting accuracy, particularly for short-term horizons. The results highlight the effectiveness of integrating diffusion-based generative modeling with an inpainting formulation for robust, flexible, and high-fidelity solar power forecasting.
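
The masked corruption step described above can be sketched on a one-dimensional window. The code applies the standard DDPM forward formula x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps to the forecast region only and leaves the history untouched. The 1D simplification, the function name, and the parameter names are assumptions for illustration; the paper works on 2D patches with a U-Net.

```python
import math
import random


def noise_future(window, horizon, alpha_bar, rng):
    """Corrupt only the last `horizon` steps of a time-series window.

    Returns the partially noised window and the injected noise, which is
    the regression target of a noise-prediction (denoising) objective.
    """
    split = len(window) - horizon
    noisy = list(window[:split])  # history is kept as conditioning context
    injected = []
    for value in window[split:]:
        eps = rng.gauss(0.0, 1.0)
        injected.append(eps)
        noisy.append(math.sqrt(alpha_bar) * value
                     + math.sqrt(1.0 - alpha_bar) * eps)
    return noisy, injected


rng = random.Random(0)
history_and_future = [0.2, 0.4, 0.6, 0.8, 0.7, 0.5]
noisy, eps = noise_future(history_and_future, horizon=2, alpha_bar=0.5, rng=rng)
print(noisy)
```

At sampling time the same mask would be used in reverse: the model denoises the future region step by step while the observed history is held fixed.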

2605.28312 2026-05-28 cs.RO cs.CV 版本更新

EventShiftFlow: Towards Hardware-efficient FPGA-based Flow Estimation

EventShiftFlow:面向硬件高效的基于FPGA的流估计

Arianna Alonso Bizzi, Fernando Cladera, C. J. Taylor

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出一种基于事件相机的流估计方法,通过离散化事件、构建1位空间占用网格并并行评估速度假设,仅使用固定宽度整数逻辑实现,无需帧重建、浮点运算或迭代优化,适用于低延迟机器人感知。

Comments 10 pages, 5 figures. Accepted to the IEEE ICRA 2026 Workshop on Challenges and Opportunities of Neuromorphic Field Robotics and Automation

详情
AI中文摘要

基于事件的视觉传感器提供异步、高时间分辨率的测量,适用于低延迟机器人感知,但许多基于事件的运动估计方法计算密集且难以映射到FPGA硬件。我们提出一种流式速度估计器,将异步事件离散化为固定持续时间的时间片,构建1位空间占用网格,并并行评估多个速度假设,仅使用固定宽度整数逻辑——移位寄存器、计数器、比较器和小型LUT映射乘法——无除法器且无DSP模块。它不需要帧重建、浮点运算或迭代优化。该方法有意将密集亚像素光流替换为每个活动像素的稀疏量化速度估计,适用于尺寸、重量和功率受限平台上的反应式避障等低延迟任务。在具有已知真实速度的噪声合成数据上,该方法恢复了幅度和方向,其中当不同速度的物体相交时幅度估计最具挑战性。在真实事件相机序列上,所有四个评估运动段的方向准确率达到99.5%,在10-40%的占用密度范围内性能保持稳健。我们表征了算法的密度依赖行为,进行了参数敏感性分析,表明所提出的数据路径需要小于2 kB的存储,并在低成本Xilinx Artix-7上实现了单轴原型。

英文摘要

Event-based vision sensors offer asynchronous, high-temporal-resolution measurements that are attractive for low-latency robotic perception, but many event-based motion estimation methods are computationally intensive and difficult to map to FPGA hardware. We present a streaming velocity estimator that discretizes asynchronous events into fixed-duration time bins, constructs a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using only fixed-width integer logic - shift registers, counters, comparators, and small LUT-mapped multiplies - with no dividers and no DSP blocks. It requires no frame reconstruction, no floating-point arithmetic, and no iterative optimization. The method deliberately trades dense sub-pixel optical flow for a sparse, quantized velocity estimate at each active pixel, suited to low-latency tasks such as reactive obstacle avoidance on size-, weight-, and power-constrained platforms. On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimates being most challenged when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across all four evaluated motion segments, with performance remaining robust across occupancy densities in the 10-40% range. We characterize the algorithm's density-dependent behavior, present a parameter sensitivity analysis, show that the proposed datapath requires less than 2 kB of storage, and implement a single-axis prototype on a low-cost Xilinx Artix-7.
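
The integer-only hypothesis test can be illustrated on a single axis. In the sketch below, each time bin is a 1-bit occupancy row stored as an integer, each velocity hypothesis is a bit shift of the previous row, and the score is the number of overlapping set bits. The overlap-count scoring rule, the per-row (rather than per-pixel) estimate, and all names are assumptions for illustration; the abstract does not detail the actual datapath.

```python
def occupancy_row(event_xs, width):
    """Pack event x-coordinates from one time bin into a 1-bit-per-pixel row."""
    row = 0
    for x in event_xs:
        if 0 <= x < width:
            row |= 1 << x
    return row


def best_velocity(prev_row, curr_row, width, max_speed):
    """Pick the pixel-per-bin shift that best aligns two occupancy rows.

    Uses only shifts, ANDs, counts, and comparisons, mirroring the kind of
    logic that maps to shift registers, counters, and comparators.
    """
    mask = (1 << width) - 1
    best_v, best_score = 0, -1
    for v in range(-max_speed, max_speed + 1):
        shifted = (prev_row << v) & mask if v >= 0 else prev_row >> -v
        score = bin(shifted & curr_row).count("1")  # population count
        if score > best_score:
            best_v, best_score = v, score
    return best_v, best_score


prev_row = occupancy_row([3, 4, 10], width=16)
curr_row = occupancy_row([5, 6, 12], width=16)
print(best_velocity(prev_row, curr_row, width=16, max_speed=3))
```

In hardware the loop over hypotheses would be unrolled so that all shifts are evaluated in parallel, which is consistent with the parallel evaluation the abstract describes.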

2605.28272 2026-05-28 cs.CV 版本更新

EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

EchoAvatar: 从音频流实时生成动画虚拟化身

Bohong Chen, Yumeng Li, Yinglin Xu, Youyi Zheng, Yanlin Weng, Kun Zhou

发表机构 * State Key Lab of CAD\&CG, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室)

AI总结 提出统一流式架构,从流式语音和音乐中低延迟生成连续全身运动,通过强化学习优化在线生成质量,并利用工具调用接口实现意图驱动控制。

Comments SIGGRAPH 2026; Project Page: https://robinwitch.github.io/EchoAvatar-Page

详情
AI中文摘要

从音频实时合成高保真3D角色运动是下一代交互式虚拟化身和虚拟助手的关键组成部分。然而,大多数现有方法仅限于离线处理完整音频序列,或局限于特定领域,很少能有效处理语音和音乐。本文提出了一种新颖框架,旨在从流式语音和音乐中低延迟生成连续、连贯的全身运动。我们方法的核心是一种统一的流式架构,能够从增量音频输入中合成连续运动。我们采用鲁棒的训练策略,强制音频依赖性,使模型能够无缝泛化到对话式语音和节奏性音乐,无需显式领域标签或模式切换。此外,我们探索了强化学习以优化在线生成质量。进一步地,我们通过工具调用接口将反应式动画与意图驱动行为连接起来,允许上游大型语言模型注入显式语义控制。通过将这种可控性与流式音频驱动合成相结合,我们的框架可作为即插即用解决方案,将语音代理转化为交互式人形虚拟化身。大量实验表明,我们的方法在运动质量和同步性上优于最先进的实时基线,同时保持了实时部署所需的灵活性。我们的代码、预训练模型和视频可在 https://robinwitch.github.io/EchoAvatar-Page 获取。

英文摘要

Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explore reinforcement learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with streaming audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art real-time baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/EchoAvatar-Page.

2605.28271 2026-05-28 cs.CV 版本更新

LV-OSD: Language-Vision-Complementary Open-Set Object Detection

LV-OSD: 语言-视觉互补的开放集目标检测

Yupeng Zhang, Ruize Han, Wei Feng, Song Wang, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University(智能与计算学院,天津大学) Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology(计算机科学与人工智能学院,深圳先进技术大学)

AI总结 提出语言-视觉互补开放集目标检测问题,设计双分支检测框架LVDor,通过目标引导提示动态加权模块和提示随机掩码机制实现文本与图像提示的灵活组合与语义对齐。

详情
AI中文摘要

目标检测是计算机视觉中的重要任务,旨在通过给定的类别列表或查询图像检测感兴趣的目标。在这项工作中,我们提出了一个语言-视觉互补开放集目标检测(LV-OSD)的新问题,即使用灵活的基于文本和/或基于图像的提示来指定所需的目标类别。这种设置在现实应用中更为常见和实用。为此,我们设计了一个双分支检测框架LVDor,它可以同时接受文本和图像提示。具体来说,我们首先为每个类别构建包含多种文本描述和图像样本的多模态提示(MPr)。随后,为了弥合输入图像、文本提示和图像提示之间的语义差距,我们设计了一个目标引导提示动态加权(TPDW)模块。在该模块中,通过目标图像的先验信息,动态生成与目标语义最对齐的文本和图像提示,实现精确对齐并有效减少两种模态之间的差异,从而适应LV-OSD设置。我们还提出了一种简单的训练时提示随机掩码(PRM)机制,以模拟测试时文本和/或图像提示的任意组合。大量的实验结果验证了我们问题表述的合理性和方法的有效性。提示和代码将公开发布。

英文摘要

Object detection is an important task in computer vision that aims to detect objects of interest specified by a given category list or by query images. In this work, we propose a new problem, language-vision-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate arbitrary combinations of text and/or image prompts at test time. Extensive experimental results verify the soundness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.

2605.28270 2026-05-28 cs.CV 版本更新

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

Every9D-21M:日常物体的大规模真实世界9D规范化

Leonhard Sommer, Emil Akopyan, Adam Kortylewski

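Affiliations: University of Freiburg; CISPA Helmholtz Center for Information Security

AI summary: To address the scarcity of real-world 9D pose data, the paper presents Every9D-21M, a dataset of 21.8M images covering 700 object categories. Annotation at this scale is achieved by reconstructing point clouds via multi-view geometry and aligning them across instances, and the dataset is shown to improve performance on several benchmarks.

Details
AI Chinese abstract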

从单张真实世界图像估计日常物体的9D姿态仍然具有挑战性。这很大程度上是由于缺乏大规模监督。大多数现有数据集要么严重依赖合成渲染,要么对真实世界物体的覆盖有限:迄今为止最大的真实世界9D姿态数据集仅包含9个类别的17K个标注物体。我们通过Every9D-21M数据集填补了这一空白,该数据集包含来自109K个以物体为中心的视频的2180万张真实世界图像的9D姿态标注,涵盖700个日常物体类别——在图像和类别数量上比之前的真实世界9D姿态基准大两个数量级。为了实现这一规模,我们利用以物体为中心的视频,通过多视图几何重建物体级点云,并将相似实例对齐到共享的规范坐标系中。仅对一小部分参考物体(少于所有图像的0.01%)手动标注规范姿态,并通过跨实例对齐传播到其余实例。然后从多个视角验证所有传播的规范姿态。我们进一步引入了跨类别方向规则,以诱导类别级对称性,从而实现对称感知评估。除了建立专用的训练和评估划分作为9D姿态基础模型的基准外,我们还表明,在Every9D-21M上训练可提高在ImageNet3D和PASCAL3D+上的性能,并且比在ImageNet3D上训练更好地泛化到HANDAL。数据和代码可在https://github.com/GenIntel/Every9D获取。

English abstract

Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories, two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at https://github.com/GenIntel/Every9D.

2605.28261 2026-05-28 cs.CV 版本更新

MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations

MORI-Seg: 无需实例标注的形态学几何学习用于实例分割

Leiyue Zhao, Tianyu Shi, Daniel Reisenbuchler, Xinzi He, Junchao Zhu, Tianyuan Yao, Yuechen Yang, Yanfan Zhu, Junlin Guo, Gelei Xu, Haichun Yang, Yuankai Huo, Mert R. Sabuncu, Yihe Yang, Ruining Deng

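Affiliations: Southern University of Science and Technology; Sichuan University; University of Regensburg; Cornell University; Vanderbilt University; University of Notre Dame; Vanderbilt University Medical Center; Cornell Tech; Weill Medical College of Cornell University

AI summary: Proposes MORI-Seg, a framework that learns morphology-aware geometric representations (object-centric distance fields and boundary-band representations) from semantic masks and adds a class-conditioned feature disentanglement module. It performs end-to-end instance segmentation under semantic-only supervision and improves instance separation in crowded and adherent regions.

Details
AI Chinese abstract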

肾脏功能单元的实例级量化对于形态测量分析至关重要,然而大多数公开可用的病理数据集仅提供语义分割标注,其中同一类别的相邻结构被合并为单个区域。这阻碍了可靠的实例级分析,并限制了后续的定量研究。现有的启发式后处理方法在拥挤和粘连区域往往产生次优的实例分离,而基于深度学习的实例分割方法通常需要密集的实例级标注,这些标注成本高昂且劳动密集。我们提出MORI-Seg,一个无需实例级标注即可实现实例分割的深度学习框架。MORI-Seg不依赖启发式分割或实例监督,而是通过联合建模对象中心距离场和边界带表示,直接从语义掩码学习形态感知的几何表示,以编码内部结构和接触界面。类条件特征解耦模块进一步促进实例内一致性和实例间分离。在仅语义监督下,MORI-Seg以端到端的方式将连接的语义区域分解为不同的实例掩码。实验表明,与经典的后处理流程和代表性的语义到实例学习方法相比,MORI-Seg在实例分离准确性和更可靠的形态测量量化方面表现更优。官方实现已在 https://github.com/ddrrnn123/MORI-Seg 公开。

English abstract

Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at https://github.com/ddrrnn123/MORI-Seg.

2605.28258 2026-05-28 cs.SE cs.AI cs.CV cs.HC 版本更新

GUI Agents for Continual Game Generation

面向持续游戏生成的GUI智能体

Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

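Affiliations: Fudan University; Xiaohongshu Inc.; Tongji University; University of California, Santa Barbara

AI summary: Proposes using GUI agents as both objective evaluators and subjective playtesters. The PlaytestArena environment and the Play2Code framework enable continual game generation and substantially improve playability.

Details
AI Chinese abstract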

生成一个游戏与制作一个可玩的游戏不同。尽管代码生成取得了进展,现有方法将游戏生成视为从提示到产物的单次翻译,导致交互层面的失败未被检测。我们认为评估和改进游戏生成需要一个玩家,并研究了图形用户界面(GUI)智能体在此过程中的两个角色:(1)作为客观评估者,为此我们引入了PlaytestArena,这是一个新的评估环境,将8个游戏类型的200个基于浏览器的游戏生成任务与预期的游戏行为准则配对,由GUI智能体在浏览器中加载每个构建并玩它来裁决;(2)作为主观测试者,为此我们提出了Play2Code,其中游戏智能体和GUI智能体在共享内存的持续循环中运行,将游戏生成转化为编码和游戏之间的对话。我们的实验表明,即使是前沿模型也难以直接生成可玩的游戏,而Play2Code达到了66.8%的准则通过率,分别比单次传递和智能体编码基线提高了37.1和14.6个百分点。进一步分析表明,GUI测试者的反馈比人类报告更可追溯,但在某些方面具有类似人类测试者的特质,将游戏测试确立为交互式代码生成的关键测试平台。我们的项目网站位于https://continual-game-generation.vercel.app/。

English abstract

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points, respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.

2605.28257 2026-05-28 cs.CV 版本更新

Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

基于可变形对象先验的相机空间类别级3D对应

Leonhard Sommer, Artur Jesslen, Basavaraj Sunagad, Adam Kortylewski

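Affiliations: University of Freiburg, Germany; CISPA Helmholtz Center for Information Security, Germany

AI summary: By learning a shared morphable object prior, the method predicts, from a single image, 3D locations that are consistent across instances within a category, without explicit correspondence supervision, and achieves the best results on the new HouseCorr3D benchmark.

Comments 14 pages, 4 figures. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D

Details
AI Chinese abstract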

从图像理解3D对象是机器人和AR/VR应用的基础。尽管近期工作在类别级姿态估计方面取得了进展,但当前的表示未能捕捉到推理对象部件、功能和交互所需的细粒度语义。在这项工作中,我们研究了相机空间中的类别级3D对应——从单张图像预测在类别内实例间保持一致的3D位置——并展示了通过学习共享的可变形对象先验,无需显式对应监督即可涌现出这种对应。为了推动这一方向的研究,我们引入了HouseCorr3D,这是首个大规模的单目类别级3D对应基准,包含50个家庭对象类别的178k张图像、280个独特实例以及直接标注在CAD模型上的3D关键点。关键的是,HouseCorr3D提供了遮挡区域的模态补全对应标签和显式对称标注,解决了现有数据集的主要局限性。我们进一步提出了Morpheus,一种通过学习解耦规范形状、形变和对象姿态来学习可变形类别级形状先验的方法。通过这种共享的规范基础,相机空间中有语义意义的3D对应隐式地涌现出来。这些涌现的3D对应在HouseCorr3D上达到了新的最优水平,证明了无需直接对应监督即可实现语义3D对象理解。数据和代码公开于https://github.com/GenIntel/HouseCorr3D。

English abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space (predicting, from a single image, 3D locations that remain consistent across instances within a category) and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.

2605.28241 2026-05-28 cs.CV 版本更新

PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment

PointQ-Bench:诊断性和可解释的点云质量评估基准

Duanchu Wang, Cheng Li, Junjie Yang, Jing Huang, Zihang Cheng, Zhi Gao, ZhuBohong, Di Wang

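Affiliations: Xi’an Jiaotong University; Xidian University; University of Chinese Academy of Sciences; Ningxia University

AI summary: Introduces the PointQ-Bench benchmark, which extends point cloud quality assessment from scalar scoring to comprehensive quality understanding through anomaly sensing, defect diagnosis, usability grading, and open-ended quality reporting tasks, and reveals a gap between perception and diagnosis in current models.

Details
AI Chinese abstract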

点云质量在3D采集、重建、渲染和感知中起着关键作用,然而现有的点云质量评估(PCQA)研究主要集中于标量分数预测。在实际检测场景中,质量评估通常涉及识别缺陷、表征主要问题类型、评估下游可用性以及提供基于证据的描述,而这些并未被当前基准明确评估。我们引入了PointQ-Bench,一个旨在将PCQA从标量评分扩展到全面质量理解的基准。PointQ-Bench包含3,083个点云,涵盖真实扫描、模拟失真和AI生成内容,覆盖八种主要问题类型。每个样本都标注有平均意见分数(MOS)、质量等级、问题标签、专家依据的描述以及12,332个问答对。该基准支持三个感知导向任务:异常感知、缺陷诊断和可用性分级,以及一个认知导向任务:开放式质量报告。为了评估自由形式的质量描述,我们进一步提出了SSFRQ-5D,一个通过人机一致性分析验证的五维评估协议。在14个视觉语言模型和传统PCQA基线上的大量实验揭示了一致的感知-诊断差距:虽然当前模型在粗粒度缺陷感知方面表现出新兴能力,但在基于证据的诊断和质量校准方面存在困难。强大的2D多模态大语言模型通常优于现有的3D视觉语言模型,而额外视图或点级输入的收益并不均匀,在不同任务、数据源和模型之间变化,特别是在边界模糊条件下。总体而言,PointQ-Bench为推进可靠且可解释的点云质量理解提供了一个诊断性测试平台。

English abstract

Point cloud quality plays a critical role in 3D acquisition, reconstruction, rendering, and perception, yet existing point cloud quality assessment (PCQA) research remains largely centered on scalar score prediction. In practical inspection scenarios, quality assessment often involves identifying defects, characterizing dominant issue types, assessing downstream usability, and providing evidence-supported descriptions, which are not explicitly evaluated by current benchmarks. We introduce PointQ-Bench, a benchmark designed to extend PCQA from scalar scoring toward comprehensive quality understanding. PointQ-Bench consists of 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, covering eight major issue types. Each sample is annotated with mean opinion scores (MOS), quality levels, issue tags, expert-grounded descriptions, and 12,332 question-answer pairs. The benchmark supports three perception-oriented tasks: anomaly sensing, defect diagnosis, and usability grading, as well as a cognition-oriented task of open-ended quality reporting. To evaluate free-form quality descriptions, we further propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a consistent perception-diagnosis gap: while current models exhibit emerging abilities in coarse defect perception, they struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs is non-uniform, varying across tasks, data sources, and models, particularly under boundary-ambiguous conditions. Overall, PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.

2605.28239 2026-05-28 cs.CV 版本更新

Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation

学习标注:一种用于半监督指代表达分割的强化自进化框架

Runlong Cao, Ying Zang, Chuanwei Zhou, Tianrun Chen, Tong Zhang, Zhen Cui, Chunyan Xu

Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Information Engineering, Huzhou Normal University; School of Artificial Intelligence, Nanjing University of Posts and Telecommunications; National Key Laboratory of Tibetan Language Intelligence; Zhejiang University; Beijing Normal University

AI summary: Proposes the L2L framework, which uses reinforcement learning to turn pseudo-label construction into a learnable decision process and draws priors from a multimodal large language model, achieving joint optimization for semi-supervised referring expression segmentation.

Comments 24 pages, 13 figures

Details
AI Chinese abstract

半监督指代表达分割(SS-RES)旨在有限标注下实现精确的像素级语言定位,但在利用未标注图像-文本对时面临监督不足和伪标签不可靠的问题。本文提出学习标注(L2L),一种强化自进化框架,将伪标签构建视为可学习的决策过程。为建立基础理解,我们利用多模态大语言模型提取语义-空间先验,将其实例化为初始软分割提议,并与文本线索一起提升为可学习引导信号,以条件化层次分割网络。为确保稳定学习,强化伪标签选择被表述为探索性决策过程,基于多模态先验和模型预测自适应地奖励高实用性的像素级监督。这种强化自进化循环实现了分割模型和伪标签的联合优化,在稀疏监督下逐步增强标签可靠性。在RefCOCO、RefCOCO+和RefCOCOg上的大量实验表明,该方法优于现有方法,验证了其有效性和泛化能力。

English abstract

Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.

2605.28237 2026-05-28 cs.RO cs.CV 版本更新

POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

POINav: 在真实世界视觉语言导航中基准测试与增强最终米级到达

Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

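Affiliations: Amap CV Lab, Alibaba Group

AI summary: Targeting the "final-meters" challenge of real-world POI navigation, the paper presents POINav-Bench, the first closed-loop evaluation benchmark for this task, and designs a Brain-Action framework combined with 70K real-world signage-entrance pairs to achieve high-fidelity navigation.

Comments 25 pages, 9 figures

Details
AI Chinese abstract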

真实世界导航本质上由兴趣点(POI)驱动,然而到达精确的POI仍然是一个关键的“最后几米”挑战。现有的POI目标导航的视觉语言导航(VLN)基准通常由于生成的场景而存在粗粒度或显著的模拟到现实差距。为弥合这一差距,我们提出了POINav-Bench,这是第一个专为真实世界POI目标导航闭环评估设计的基准。它包含使用3D高斯泼溅(3DGS)从真实世界捕获重建的11个商业区域,总面积达126,398平方米,涵盖163个不同的POI。通过可通行性感知标注和参考轨迹,POINav-Bench能够在真实、POI丰富的现实环境中对导航智能体进行高保真评估。在此基础上,我们提出了POINav脑-动作框架,其中脑模块执行基于POI的推理以指导动作模块预测用于真实世界执行的连续航点。我们进一步整理了POINav-Dataset,包含70K个真实世界标志-入口对。实验表明,我们的框架为改进真实世界POI目标导航提供了一条可行路径。

English abstract

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 m² in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

2605.28234 2026-05-28 cs.CV 版本更新

Bridging the Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm

桥接无线电地图估计中的采样分布偏移:一种轨迹感知范式

Feng Qiu, Zheng Fang, Shuhang Zhang, Kangjun Liu, Longkun Zou, Jing Liu, Ke Chen

Affiliations: School of Artificial Intelligence, Xidian University; Pengcheng Laboratory; Department of Computer Science and Engineering, Southern University of Science and Technology; Department of Electronics, Peking University; Guangzhou Institute of Technology, Xidian University

AI summary: To counter the performance drop caused by the mismatch between UAV trajectory sampling and random sampling, the paper proposes a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which effectively reduces estimation error.

Details
AI Chinese abstract

基于学习的无线电地图估计(RME)在无人机辅助无线感知中扮演关键角色,支持覆盖预测和网络优化等任务。当前大多数方法假设基于随机采样的独立同分布(i.i.d.)训练和测试设置。然而,实际无人机测量是沿着可行轨迹顺序收集的,导致高度结构化和空间相关的模式。这种不匹配引入了采样分布偏移,增加了空间场恢复的内在难度,并损害了在i.i.d.假设下训练的模型的泛化能力。为缓解这一问题,我们提出了一种基于随机触发轨迹采样(ST-TBS)的轨迹感知训练范式,该范式在保持轨迹连续性的同时引入采样变异性。此外,从统计角度来看,我们表明与随机采样相比,基于轨迹的采样降低了空间多样性并增加了信息冗余。在RadioMapSeer和SpectrumNet数据集上的大量实验表明,在基于轨迹的观测下,使用随机采样训练的模型性能显著下降,在SpectrumNet上RMSE从0.0391增加到0.2632。相反,我们提出的ST-TBS方法有效将RMSE降低到0.0571。这些结果强调了对齐训练和部署采样分布对于可靠RME的必要性。

English abstract

Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independent and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.
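The contrast between random sampling and trajectory-based sampling described above can be made concrete with a toy sketch. This is not the ST-TBS method: the plain random walk, the 64x64 grid, the 200-sample budget, and the block-coverage measure of spatial diversity are all assumptions chosen only to show why sequential measurements cover less of the map than i.i.d. ones.

```python
import random


def random_sampling(rng, size, n):
    """Pick n distinct grid cells uniformly at random (the i.i.d. setting)."""
    cells = [(r, c) for r in range(size) for c in range(size)]
    return rng.sample(cells, n)


def trajectory_sampling(rng, size, n):
    """Collect n measurements along a simple random walk (a stand-in trajectory)."""
    r, c = rng.randrange(size), rng.randrange(size)
    path = [(r, c)]
    while len(path) < n:
        dr, dc = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        # Clamp to the grid so the walk never leaves the map.
        r = min(max(r + dr, 0), size - 1)
        c = min(max(c + dc, 0), size - 1)
        path.append((r, c))
    return path


def blocks_covered(points, block=8):
    """Count the coarse block x block regions that contain at least one sample."""
    return len({(r // block, c // block) for r, c in points})


rng = random.Random(0)
iid_points = random_sampling(rng, size=64, n=200)
walk_points = trajectory_sampling(rng, size=64, n=200)
iid_blocks = blocks_covered(iid_points)
walk_blocks = blocks_covered(walk_points)
```

With the same sample budget, the walk stays near its starting point while the random samples spread over most of the map, a simple picture of the reduced spatial diversity the abstract attributes to trajectory-based sampling.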

2605.28230 2026-05-28 cs.CV 版本更新

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Proprio: 用于物理合理视频生成的潜在自评分与推理时精炼

Mariam Hassan, Kaouther Messaoud, Wuyang Li, Alexandre Alahi

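Affiliations: École Polytechnique Fédérale de Lausanne; Télécom Paris

AI summary: Proposes Proprio, a training-free framework that uses the model's flow residual under latent perturbations as a self-scoring signal and combines it with best-of-N search and gradient-based self-refinement to improve the physical plausibility of a frozen video generator's outputs.

Details
AI Chinese abstract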

现代视频生成模型在视觉上效果显著,但经常违反基本物理原理。我们提出Proprio,一种无需训练的框架,使冻结的视频生成器能够评估和改进自身输出的物理合理性。受本体感觉(生物对自身运动的感知)启发,Proprio将模型在受控潜在扰动下的流残差视为自评分信号。能被生成器学习到的动力学更好解释的样本会产生更小且更稳定的残差。我们跨时间步和扰动聚合该信号,通过动态时空掩码聚焦于运动相关区域,并将其用于最佳N搜索、基于梯度的自精炼或两者结合。在文本到视频和图像到视频基准测试中,Proprio持续提升物理合理性,在多种设置下优于基于VLM的评分和外部世界模型基线。使用TurboWan2.2,Proprio将Physics-IQ从32.2提升至37.5(+16.5%),VideoPhy2-hard物理常识从45.6提升至55.0(+20.6%)。人类评估进一步显示,在大约三分之二的比较中,评估者更偏好Proprio选择或精炼的视频的物理合理性。这些结果表明,冻结的视频生成器包含可操作的内部信号,用于评估和改进自身输出的物理合理性。

English abstract

Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.
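The best-of-N use of the self-scoring signal can be sketched as follows. This is only a toy illustration: the residual values are made up, the names `self_score` and `best_of_n` are hypothetical, and averaging over perturbations and timesteps is an assumed aggregation (the abstract says the signal is aggregated but does not say how, and the spatiotemporal mask is omitted here).

```python
from statistics import mean


def self_score(residuals):
    """Aggregate residual magnitudes into one score (lower means more plausible).

    `residuals` is a list with one entry per perturbation; each entry is a
    list of residual magnitudes, one per timestep.
    """
    return mean(mean(per_timestep) for per_timestep in residuals)


def best_of_n(candidates):
    """Return the name of the candidate with the smallest aggregated residual."""
    return min(candidates, key=lambda name: self_score(candidates[name]))


# Made-up residuals for two candidate videos, 2 perturbations x 2 timesteps each.
candidates = {
    "video_a": [[0.9, 1.1], [1.0, 1.2]],
    "video_b": [[0.2, 0.3], [0.25, 0.35]],
}
chosen = best_of_n(candidates)
```

Because selection only compares scores of already generated samples, it needs no training and no gradient through the generator, which matches the training-free framing in the abstract.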

2605.28229 2026-05-28 cs.CV cs.AI 版本更新

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism: 用于图像到视频迁移的异构混合专家模型

Rui Lin, Chuanming Wang, Huadong Ma

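Affiliations: State Key Laboratory of Networking and Switching Technology

AI summary: Proposes VidPrism, a heterogeneous temporal Mixture-of-Experts framework that addresses expert homogenization in conventional MoE through functionally specialized experts, content-aware multi-rate sampling, and a dynamic bidirectional fusion mechanism, reaching state-of-the-art performance on video recognition benchmarks.

Comments CVPR2026 camera ready

Details
AI Chinese abstract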

随着预训练技术的快速发展,适应大规模视觉-语言模型(VLM)进行视频理解(即图像到视频迁移学习)已成为主导范式。为了获得卓越性能,近期进展中采用混合专家(MoE)来增强VLM的时间建模能力是一种有效策略。然而,传统的MoE设计存在专家同质化问题,即所有专家充当相同的通才,从无差异的视频流中低效地学习时空特征。为解决此问题,我们提出VidPrism,一种新颖的异构时间混合专家框架。VidPrism通过部署功能专业化的专家开创了分工机制,每个专家承担从空间理解到时间建模的不同角色。为了适当地为这些专家提供输入,我们引入了一个内容感知的多速率采样模块,动态生成从语义丰富到运动聚焦的表示流,为专家提供专业化输入。此外,一种动态双向融合机制实现了这些路径之间的协同信息交换,从而产生全面的视频表示。在各种视频识别基准上的大量实验表明,VidPrism达到了最先进的性能,并有效促进了专家专业化。我们的源代码可在https://github.com/Lrrrr549/VidPrism.git获取。

English abstract

With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy for achieving superior performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.

2605.28217 2026-05-28 cs.CV 版本更新

A Patient-Specific Pulmonary Arterial Tree Digital Twin to Extract Pulmonary Embolism Biomarkers

患者特异性肺动脉树数字孪生以提取肺栓塞生物标志物

Morgane des Ligneris, Nathan Painchaud, Allan Serva, Laurent Bertoletti, Pierre Croisille, Carole Frindel, Odyssée Merveille

Affiliations: Univ Lyon, INSA-Lyon, Université Claude Bernard Lyon 1, UJM-Saint Etienne, CNRS, Inserm, CREATIS UMR 5220, U1294; Department of Radiology, CHU Saint-Etienne, UJM Saint-Etienne, Saint-Etienne, France; IUF, Institut Universitaire de France, Paris

AI summary: Proposes an automated pipeline that builds a directed graph representation of the pulmonary arterial tree and extracts image-based biomarkers (local artery-level features and global severity scores) to generate a patient digital twin for pulmonary embolism risk assessment.

Comments 11 pages + 2 pages of supplementary materials. Submitted to special issue of JBHI

Details
AI Chinese abstract

肺栓塞是由血凝块阻塞肺动脉引起的,是急性心血管综合征的主要原因之一。在临床实践中,通过计算机断层扫描肺血管造影诊断后的治疗决策依赖于风险分层,该分层将30天死亡风险分为三类。这种分层取决于右心室与左心室直径比以及两种心脏酶的血液水平。然而,在急诊情况下,血液生物标志物并不总是可用,而手动计算既定的严重程度评分(如Qanadli和Mastora评分)耗时且很少在临床常规中进行。本研究引入了一种自动化流程,该流程对肺动脉树的有向图表示进行建模,标记其层次结构并表征肺栓塞。该流程推导出基于图像的生物标志物,包括局部动脉级特征(形态学信息、层次位置、血凝块体积和由此产生的阻塞)以及全局患者级生物标志物,如自动计算的严重程度评分(Qanadli和Mastora评分)以及按肺叶和层次划分的总栓塞体积分布。利用人工智能生成的动脉、栓塞、肺和肺叶的二元掩码,它创建了动脉结构的患者数字孪生。通过与现有流程、解剖学期望和手动严重程度评分计算的比较验证,证明了该流程能够自动生成解剖学上准确的数字孪生和具有高度一致性的严重程度评分。这支持了这些基于图像的生物标志物自动提供关于血栓负荷和空间血凝块分布的快速、精确信息的潜力。

English abstract

Pulmonary embolism, the obstruction of a pulmonary artery by a blood clot, is one of the leading causes of acute cardiovascular syndrome. In clinical practice, therapeutic decisions after diagnosis via computed tomography pulmonary angiography rely on risk stratification, which categorizes 30-day mortality risk into three categories. This stratification depends on the right-to-left ventricular diameter ratio and blood levels of two cardiac enzymes. However, blood biomarkers are not always available in emergency settings, and manual calculation of established severity scores (such as Qanadli and Mastora) is time-consuming and rarely performed in routine clinical practice. This study introduces an automated pipeline that models a directed graph representation of the pulmonary arterial tree, labeling its hierarchical structure and characterizing pulmonary embolism. The pipeline derives image-based biomarkers, including local artery-level features (morphological information, hierarchical position, clot volume, and resulting obstruction) and global patient-level biomarkers such as automatically calculated severity scores (Qanadli and Mastora) and the total embolic volume distribution by lobes and hierarchical levels. Using artificial-intelligence-generated binary masks of arteries, emboli, lungs, and lobes, it creates a patient digital twin of the arterial structure. Validation of the pipeline through comparison to an existing pipeline, anatomical expectations, and manual severity score calculations demonstrates the pipeline's ability to automatically generate anatomically accurate digital twins and severity scores with strong agreement. This supports the potential of these image-derived biomarkers to automatically provide rapid, precise information on thrombotic burden and spatial clot distribution.

2605.28176 2026-05-28 cs.CV 版本更新

From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment

从Kellgren-Lawrence到焦磷酸钙晶体沉积:一种用于膝骨关节炎评估的软标签框架

Francisco Bérchez-Moreno, Riccardo Rosati, Maria Chiara Fiorentino, Víctor M. Vargas, Edoardo Cipolletta, Emilio Filippucci, Luca Romeo, Pedro A. Gutiérrez, César Hervás-Martínez

Affiliations: Department of Political Science, Communication and International Relations, University of Macerata, Macerata, Italy; Department of Economics and Law, University of Macerata, Macerata, Italy; Department of Innovative Technologies in Medicine & Dentistry, Università degli Studi "G. D'Annunzio" Chieti - Pescara, Chieti, Italy; Department of Internal Medicine, Azienda Ospedaliero Universitaria delle Marche, Ancona, Italy; Academic Rheumatology, University of Nottingham, Nottingham, UK; Department of Rheumatology, Polytechnic University of Marche, Ancona, Italy

AI summary: Proposes an ordinal deep learning framework based on soft-labelling that replaces one-hot encoding with unimodal probability distributions, handling both the ordinal uncertainty and the asymmetric relationship in KL and CPPD grading, and significantly improving grading performance on knee X-ray images.

Details
AI Chinese abstract

背景与目标。传统的膝骨关节炎(KOA)分级深度学习方法依赖于独热标签,未能捕捉Kellgren-Lawrence(KL)和焦磷酸钙沉积病(CPPD)严重程度评分的序数不确定性,以及临床实践中观察到的两个量表之间的不对称关系。方法。我们回顾性收集了2172张膝关节X光图像,包括968张同时标注了KL和CPPD严重程度的X光片。开发了一个基于软标签的序贯深度学习框架用于两项任务,用以标注等级为中心的单峰概率分布替代独热目标。研究了四种分布形式:二项分布、贝塔分布、三角分布和指数分布。结果。所有软标签策略均持续优于名义基线。对于CPPD分级,三角分布实现了最高的二次加权卡帕(QWK)和最低的平均绝对误差(MAE)(QWK = 0.796;MAE = 0.438),而贝塔分布在考虑各类别的平均MAE(AMAE)和最大MAE(MMAE)时产生了最平衡的类别性能(AMAE = 0.458;MMAE = 0.573)。对于KL分级,基于贝塔的方法提供了最佳整体性能,实现了最高的QWK以及最低的MAE和类别误差(QWK = 0.777;MAE = 0.529;AMAE = 0.523;MMAE = 0.775)。统计分析表明,与传统的独热监督相比有显著改进(p < 0.001)。

English abstract

Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren-Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
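To illustrate what replacing a one-hot target with a unimodal distribution centred on the annotated grade can look like, here is a minimal sketch of a triangular soft label. The exact parameterisation used in the paper is not given in the abstract, so the function name `triangular_soft_label`, the `spread` parameter, and the five-class setting are assumptions for illustration only.

```python
def triangular_soft_label(grade, num_classes, spread=2.0):
    """Build a unimodal target that peaks at `grade` and decays linearly.

    `spread` is the class distance at which the weight reaches zero
    (an assumed parameter). The weights are normalised to sum to 1.
    """
    weights = [
        max(0.0, 1.0 - abs(k - grade) / spread) for k in range(num_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def one_hot(grade, num_classes):
    """Conventional target: all probability mass on the annotated grade."""
    return [1.0 if k == grade else 0.0 for k in range(num_classes)]


# Five ordinal classes (assumed); annotated grade 2.
soft = triangular_soft_label(2, 5)
hard = one_hot(2, 5)
```

Unlike the one-hot target, the soft target gives some probability to the neighbouring grades, so a prediction one grade off is penalised less than one several grades off when the target is used in a cross-entropy style loss.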

2605.28174 2026-05-28 cs.CV cs.AI 版本更新

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

FLORO:面向跨传感器与尺度的生态遥感多模态地理空间基础模型

Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker, Kasper Johansen, Bernard Ghanem, Fernando T. Maestre, Matthew F. McCabe

发表机构 * Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学生物与环境科学与工程学部) Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学计算机、电气与数学科学与工程学部)

AI总结 提出FLORO多模态地理空间基础模型,通过掩码自编码在异构遥感数据上预训练,利用可用性感知输入统一异构传感器配置,在PANGAEA基准上实现强迁移性能。

Comments 29 pages, 9 figures

详情
AI中文摘要

基础模型为可迁移的遥感表示提供了有前景的途径,但许多当前方法依赖于非常大的预训练数据集和固定的传感器配置,限制了它们在生态和环境应用中的适用性,这些应用中的观测通常跨平台、空间和光谱分辨率以及可用模态而变化。我们提出了FLORO,一个多模态地理空间基础模型,旨在从一个小型但高度多样化的遥感语料库中学习可迁移表示。FLORO使用掩码自编码在Sentinel-1、Sentinel-2、SkySAT影像、高程和无人机数据的异构组合上进行预训练。为了适应传感器变异性,FLORO结合了可用性感知输入,指示每个样本中存在哪些光谱波段和辅助模态,从而在异构传感器配置上实现统一的输入空间。我们在PANGAEA基准上,在冻结编码器协议下,评估了FLORO的场景分类、分割和回归任务。尽管在比竞争基础模型更小的语料库上预训练,FLORO在跨光学、光学-SAR和光学-高程基准(涵盖中分辨率卫星、航空和超高分辨率无人机影像)上实现了强大且稳定的迁移。FLORO在六个PANGAEA基准上取得了第二好的平均分割性能,仅次于最近引入的预训练图像数量超过两个数量级的基础模型,在场景分类上保持竞争力,在回归任务中表现稳健,而定性结果显示在洪水、城市、生物量和冠层高度预测设置中空间结构的保存有所改善。在EuroSAT-MS上的单独对照实验中,相对于绝对位置编码,地理位置编码进一步提高了分类性能。

英文摘要

Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.
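One way to picture availability-aware inputs is a fixed channel layout paired with a binary presence mask. The sketch below is an illustration under stated assumptions rather than FLORO's actual input pipeline: the channel list `CHANNELS`, the function `pack_sample`, and the zero-fill choice are all hypothetical, and single floats stand in for image patches.

```python
# Hypothetical fixed channel layout shared by every sample. The names follow
# common Sentinel-2 band and Sentinel-1 polarisation labels; the abstract does
# not specify the real layout.
CHANNELS = ["B02", "B03", "B04", "B08", "VV", "VH", "elevation"]


def pack_sample(observed):
    """Map a dict of observed channels onto the fixed layout.

    Returns (values, mask): missing channels are zero-filled in `values`
    and flagged with 0 in `mask`, so a model can tell "absent" from "zero".
    """
    unknown = set(observed) - set(CHANNELS)
    if unknown:
        raise KeyError(f"unexpected channels: {sorted(unknown)}")
    values = [float(observed.get(name, 0.0)) for name in CHANNELS]
    mask = [1 if name in observed else 0 for name in CHANNELS]
    return values, mask


# An optical-only sample and an optical-SAR sample end up with one input shape.
optical = pack_sample({"B02": 0.12, "B03": 0.10, "B04": 0.08, "B08": 0.31})
optical_sar = pack_sample({"B04": 0.08, "B08": 0.31, "VV": -11.5, "VH": -18.2})
print(optical[1])
print(optical_sar[1])
```

Both samples have the same length, and the mask records which entries carry real measurements, which is the property the abstract describes as a unified input space across heterogeneous sensor configurations.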

2605.28173 2026-05-28 cs.CV 版本更新

MangaFlow: An End-to-End Agentic Framework for Controllable Story to Manga Generation

MangaFlow: 一种用于可控故事到漫画生成的端到端代理框架

Muyao Wang, Zeke Xie, Yanhao Chen, Lixin Xiu, Hideki Nakayama

发表机构 * The University of Tokyo(东京大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出MangaFlow代理框架,通过将漫画创作分解为规划、定位、布局构建、参考条件渲染、合成和文字放置等步骤,实现可控的长篇漫画生成,支持布局和视觉参考作为显式中间变量,并引入故事段落记忆以保持跨面板一致性。

详情
AI中文摘要

端到端漫画生成是一项结构化的视觉叙事任务,需要故事分解、重复角色和场景定位、页面布局设计、面板渲染、页面合成和文字放置。然而,现有的生成模型通常直接进行页面合成,将这些因素纠缠在单个视觉输出中,限制了对布局几何、视觉参考和跨面板一致性的精确控制。为了解决这些限制,我们提出了MangaFlow,一个用于可控长篇漫画生成的代理框架,它将漫画创作分解为规划、定位、布局构建、参考条件渲染、合成和文字放置。通过将布局和视觉参考视为显式中间变量,MangaFlow既支持简单的文本到漫画生成,也支持更精确的用户控制漫画创作。这种设计将布局、视觉资产和文字放置暴露为可编辑的中间控制,用于细化面板几何、参考和文字位置。为了支持长篇一致性,MangaFlow引入了故事段落记忆,将段落描述与相应的角色、场景和对象参考链接起来,以便在面板间重用。我们进一步提出了一个元基准,用于评估布局可控性、视觉一致性和生成质量。实验表明,MangaFlow在布局遵循和跨面板一致性方面优于直接生成基线,同时支持灵活的人工控制。

英文摘要

End-to-end manga generation is a structured visual storytelling task that requires story decomposition, recurring character and scene grounding, page layout design, panel rendering, page composition, and lettering. However, existing generative models often perform direct page synthesis, entangling these factors in a single visual output and limiting precise control over layout geometry, visual references, and cross-panel consistency. To address these limitations, we propose MangaFlow, an agentic framework for controllable long-form manga generation that decomposes manga creation into planning, grounding, layout construction, reference-conditioned rendering, composition, and text placement. By treating layout and visual references as explicit intermediate variables, MangaFlow enables both simple text-to-manga generation and more precise user-controlled manga creation. This design exposes layout, visual assets, and lettering as editable intermediate controls for refining panel geometry, references, and text placement. To support long-form consistency, MangaFlow introduces a story section memory that links section descriptions with corresponding character, scene, and object references for reuse across panels. We further present a meta-benchmark for evaluating layout controllability, visual consistency, and generation quality. Experiments show that MangaFlow improves layout adherence and cross-panel consistency over direct generation baselines while supporting flexible human control.

2605.28167 2026-05-28 cs.CV 版本更新

DebFilter: Eradicating Biases Stashed in Value

DebFilter: 消除隐藏在值中的偏见

Seung Hyuk Lee, Songkuk Kim

发表机构 * School of Integrated Technology, BK21 Graduate Program in Intelligent Semiconductor Technology(整合技术学院,智能半导体技术BK21研究生项目)

AI总结 提出DebFilter,一种轻量级、无需训练的方法,通过调整交叉注意力中的值分量来纠正文本到图像扩散模型中的社会偏见,实现推理时偏差缓解。

Comments 8 pages, 7 figures, supplementary material included, CVPR 2026

详情
AI中文摘要

文本到图像扩散模型,理论上等价于基于分数的生成模型,通过由预训练视觉语言模型(如CLIP)提取的文本嵌入引导的多步去噪过程生成图像。然而,这些文本嵌入固有地编码了社会和语义偏见——例如与性别和年龄相关的偏见——这些偏见随后通过引导机制以及模型在相对于这些偏见概念不平衡的大规模数据集上的训练被传播和放大,常常导致文本到图像生成中的输出偏差。我们提出了DebFilter,一种轻量级且无需训练的框架,用于缓解文本到图像扩散模型中的此类偏见。观察到模型在每个去噪步骤中的误差预测主要受交叉注意力动态影响,我们引入了一种偏差校正策略,调整交叉注意力中的值分量。具体地,我们对引导嵌入的切片施加固定偏移,有效地将交叉注意力值的语义方向转向无偏表示。这种调整重新配置了分数景观以产生平衡的输出,同时保持与预期文本语义的对齐。与依赖微调或重新训练的先前方法不同,DebFilter完全在推理时运行,无需额外数据或模型更新。我们的结果表明,该方法有效缓解了生成图像中的社会偏见,为更公平和更包容的文本到图像生成提供了一条高效且可扩展的途径。

英文摘要

Text-to-image diffusion models, which are theoretically equivalent to score-based generative models, generate images through a multi-step denoising process guided by text embeddings extracted from pretrained vision-language models such as CLIP. However, these text embeddings inherently encode social and semantic biases -- such as those related to gender and age -- that are subsequently propagated and amplified through the guidance mechanism, along with the model's training on large-scale datasets that are imbalanced with respect to these bias-related concepts, often leading to skewed outputs in text-to-image generation. We propose DebFilter, a lightweight and training-free framework for mitigating such biases in text-to-image diffusion models. Observing that the model's error prediction at each denoising step is primarily influenced by cross-attention dynamics, we introduce a bias-correction strategy that adjusts the value components within cross-attention. Specifically, we apply a fixed offset to the slice of guidance embedding, effectively steering the semantic direction of cross-attention values toward unbiased representations. This adjustment reconfigures the score landscape to produce balanced outputs while maintaining alignment with the intended text semantics. Unlike prior approaches that rely on fine-tuning or retraining, DebFilter operates entirely at inference time, requiring no additional data or model updates. Our results demonstrate that this method effectively mitigates social biases in generated images, offering an efficient and scalable pathway toward fairer and more inclusive text-to-image generation.

2605.28161 2026-05-28 cs.CV 版本更新

MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment

MeniOmni:用于整体半月板损伤评估的结构化多模态基准

Shurui Xu, Siqi Yang, Weiping Ding, Hui Wang, Mengzhen Fan, Yuyu Sun, Shuyan Li

发表机构 * (1) School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK; (2) Radiology Department, Affiliated Nantong Clinical College of Nantong University, Nantong First People's Hospital, School of Clinical Medicine, Nantong University, Nantong, Jiangsu, China; (3) School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China; (4) Faculty of Data Science, City University of Macau, Macau, China; (5) Department of Chemistry, University of Oxford, Oxford, UK; (6) Orthopedics Department, Nantong First People's Hospital, Southeast University, Nantong, Jiangsu, China

AI总结 提出MeniOmni基准,包含多中心MRI、临床先验和专家标注文本,支持细粒度Stoller分级和诊断报告生成,并引入风险感知序数评估和语义一致性指标Meni-Score。

Comments Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2026 (Oral Presentation)

详情
AI中文摘要

半月板损伤的临床诊断需要放射科医生将体积MRI证据与患者背景(如性别、年龄、BMI)相结合,并生成结构化诊断报告。现有的膝关节MRI基准通常是单模态的,依赖粗粒度标签,限制了评估整体临床推理的能力。我们提出了MeniOmni,一个用于半月板损伤评估的结构化多模态基准,包含746个多中心MRI研究,具有三平面体积输入、临床先验和专家标注的临床文本。MeniOmni支持两个任务:(1)细粒度Stoller严重程度分级和(2)诊断报告生成。我们进一步提出了风险感知序数评估和语义一致性指标(Meni-Score),以更好地反映临床相关性。基线实验表明,纳入临床先验可提高分级性能并减少严重错误,凸显了多模态上下文对更安全评估的价值。代码和数据可在https://github.com/ShuruiXu/MeniOmni获取。

英文摘要

Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at https://github.com/ShuruiXu/MeniOmni.

2605.28157 2026-05-28 cs.CV 版本更新

Intra-YOLO: A Small Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photography Based on Transfer Learning with Reinforcement Learning

Intra-YOLO:基于迁移学习与强化学习的口内摄影龋齿与磨牙-切牙矿化不良小目标检测模型

Po-Lun Chwang, Po-Yu Chang, Wen-Liang Lin, Tung-Sheng Wu, Min-Ching Wang, Yun-Chien Cheng

发表机构 * Department of Mechanical Engineering, College of Engineering, National Yang Ming Chiao Tung University(国立阳明交通大学工学院机械工程系) Taipei Medical University Hospital(台北医学大学医院) Wan Fang Hospital, Taipei Medical University(台北医学大学万芳医院)

AI总结 提出Intra-YOLO模型,结合迁移学习与强化学习,解决口内照片中龋齿和MIH小目标检测难题。

详情
AI中文摘要

本研究开发了一种计算机辅助诊断(CAD)系统,用于检测口内照片中的龋齿和磨牙-切牙矿化不良(MIH)。这些病变外观相似,使得临床鉴别具有挑战性,尤其是考虑到它们尺寸小且成像条件多变。

英文摘要

This study developed a computer-aided diagnosis (CAD) system for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. These lesions share similar appearances, making clinical differentiation challenging, especially given their small size and variability in imaging conditions.

2605.28151 2026-05-28 cs.CV 版本更新

A novel ordinal multi-view aggregation scheme for oak defoliation

一种用于橡树落叶的新型有序多视图聚合方案

Francisco Bérchez-Moreno, Ricardo Enrique Hernández-Lambraño, David Guijo-Rubio, Víctor Manuel Vargas, Francisco José Ruiz-Gómez, Juan Carlos Fernández, Pablo González-Moreno

发表机构 * Department of Forest Engineering, Laboratory of Dendrochronology, Silviculture and Global Change – DendrodatLab, Universidad de Córdoba(森林工程系、树轮学实验室、林学与全球变化——DendrodatLab,科尔多瓦大学) ERSAF. Andalusian Institute for Earth System Research (IISTA), Universidad de Córdoba(安达卢西亚地球系统研究所(IISTA)、科尔多瓦大学)

AI总结 提出一种基于有序分类的多视图集成框架,通过聚合从不同视角(北、南、树冠)训练的CNN预测,实现更稳健准确的橡树落叶估计。

详情
AI中文摘要

由气候和生物胁迫驱动的森林衰退威胁着生态系统功能,使得准确监测树木健康至关重要。在这项工作中,我们将树木落叶估计视为一个有序分类问题,使用地面图像。我们提出了一种新颖的多视图集成框架,该框架聚合了从不同视角(北、南和树冠)训练的卷积神经网络(CNN)的预测。该方法通过同质集成设计利用互补的视觉信息,同时保持建模一致性。通过比较多种有序分类方法并分析每个视图及其组合的贡献,进行了全面评估。结果表明,对落叶水平的有序结构进行建模比名义方法提高了性能,而所提出的多视图集成始终优于单视图和成对配置。特别是,三视图集成在所有评估指标上实现了最稳健和准确的预测。这些发现凸显了结合深度学习(DL)、有序分类(OC)和多视图聚合在地中海牧场等复杂生态系统中进行可扩展、一致和客观的森林健康评估的潜力。

英文摘要

Forest decline driven by climate and biotic stressors threatens ecosystem functioning, making accurate monitoring of tree health essential. In this work, we address tree defoliation estimation as an ordinal classification problem using ground-level imagery. We propose a novel multi-view ensemble framework that aggregates predictions from Convolutional Neural Networks (CNNs) trained on different perspectives of individual trees (north, south, and crown). This approach leverages complementary visual information while preserving modelling consistency through a homogeneous ensemble design. A comprehensive evaluation is conducted by comparing multiple ordinal classification methods and analysing the contribution of each view and their combinations. Results show that modelling the ordinal structure of defoliation levels improves performance over nominal approaches, while the proposed multi-view ensemble consistently outperforms single-view and pairwise configurations. In particular, the three-view ensemble achieves the most robust and accurate predictions across all evaluation metrics. These findings highlight the potential of combining Deep Learning (DL), Ordinal Classification (OC), and multi-view aggregation for scalable, consistent, and objective forest health assessment in complex ecosystems such as Mediterranean dehesas.
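The multi-view aggregation step can be illustrated with a small sketch. The abstract does not state the aggregation rule, so the plain averaging in `aggregate_views` and the median-based decision in `ordinal_median` are assumptions chosen for this example, as are the three-level class set and the probability values.

```python
def aggregate_views(view_probs):
    """Average per-view class probabilities (assumed aggregation rule)."""
    views = list(view_probs.values())
    num_classes = len(views[0])
    if any(len(p) != num_classes for p in views):
        raise ValueError("all views must predict the same number of classes")
    return [sum(p[k] for p in views) / len(views) for k in range(num_classes)]


def ordinal_median(probs):
    """Return the smallest class index whose cumulative probability reaches 0.5.

    The median of the predicted distribution minimises expected absolute
    error, which suits ordered defoliation levels.
    """
    cumulative = 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if cumulative >= 0.5:
            return k
    return len(probs) - 1


# Made-up softmax outputs of three CNNs for one tree, over three ordered levels.
views = {
    "north": [0.1, 0.6, 0.3],
    "south": [0.2, 0.5, 0.3],
    "crown": [0.0, 0.3, 0.7],
}
fused = aggregate_views(views)
print(ordinal_median(views["crown"]), ordinal_median(fused))
```

In this toy case the crown view alone predicts the highest level, while the fused distribution settles on the middle level supported by the other two views.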

2605.27351 2026-05-28 cs.CV 版本更新

Feedforward 3D Editing Learns from Semantic-Part Transformation

前馈3D编辑从语义部分变换中学习

Jiawei Weng, Saining Zhang, Zhenxin Diao, Peishuo Li, Henghaofan Zhang, Junhao Chen, Hao Zhao

发表机构 * Nanyang Technological University(南洋理工大学) Tsinghua University(清华大学)

AI总结 提出Pxform数据集和PartFlow网络,通过语义部分变换实现高质量前馈3D编辑,在几何和外观编辑基准上达到最优性能。

Comments 31 pages, 22 figures. Project Page: https://dennis-jwweng.github.io/pxform/

详情
AI中文摘要

3D编辑是可扩展3D内容创作的基本能力。虽然图像编辑已迅速向大规模前馈生成范式发展,但3D AI生成仍以无需训练的编辑流程为主。前馈3D编辑的核心挑战在于缺乏高质量配对监督。可编辑的3D资产需要同时保持几何、多视图一致性、结构连贯性和局部编辑可控性。现有的3D编辑数据集通常依赖于独立生成的资产、图像介导的重建或狭窄的编辑分类,导致定位不准确、保持性弱、编辑边界模糊和语义一致性有限。在这项工作中,我们引入了一个新视角:可扩展的前馈3D编辑应从语义部分变换中学习。基于这一见解,我们提出了Pxform,一个高质量的3D编辑数据集,包含超过10万对七种编辑类型的一致前后编辑对。我们的流程不是将对象视为无结构形状,而是直接将编辑锚定在语义3D部分。基于Pxform,我们进一步提出了PartFlow,一个前馈3D编辑网络,它将源感知潜在控制注入预训练的3D生成先验中。PartFlow引入了掩码感知速度保持和渲染空间一致性监督,以共同提高编辑保真度和源保持,同时在推理时不需要3D编辑掩码。大量实验表明,高质量的语义部分监督显著改进了可扩展的3D编辑,使PartFlow在几何和外观编辑基准上均达到了最先进的性能。

英文摘要

3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative paradigms, 3D AI generation remains dominated by training-free editing pipelines. A central challenge of feedforward 3D editing lies in the lack of high-quality paired supervision. Editable 3D assets require simultaneous preservation of geometry, multi-view consistency, structural coherence, and localized edit controllability. Existing 3D editing datasets often rely on independently generated assets, image-mediated reconstruction or narrow edit taxonomies, leading to inaccurate localization, weak preservation, blurred edit boundaries, and limited semantic consistency. In this work, we introduce a new perspective: scalable feedforward 3D editing should be learned from semantic-part transformations. Based on this insight, we propose Pxform, a high-quality 3D editing dataset with over 100K consistent before/after editing pairs across seven edit types. Instead of treating objects as unstructured shapes, our pipeline grounds edits directly in semantic 3D parts. Built upon Pxform, we further propose PartFlow, a feedforward 3D editing network that injects source-aware latent control into pretrained 3D generative priors. PartFlow introduces mask-aware velocity preservation and render-space consistency supervision to jointly improve edit fidelity and source preservation, while requiring no 3D edit mask during inference. Extensive experiments demonstrate that high-quality semantic-part supervision substantially improves scalable 3D editing, enabling PartFlow to achieve state-of-the-art performance on both geometric and appearance editing benchmarks.

2605.27102 2026-05-28 cs.CV cs.LG 版本更新

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

JLT: 潜在扩散Transformer中的干净潜在预测

Funing Fu, Tenghui Wang, Guanyu Zhou, Junyong Cen, Qichao Zhu

发表机构 * Independent Researcher(独立研究者) Wuhan University of Technology(武汉理工大学) Hangzhou Jiyi Artificial Intelligence Co., Ltd.(杭州智益人工智能有限公司)

AI总结 本文提出JLT,一种在冻结的FLUX.2 VAE编码上训练的130M潜在扩散Transformer,通过干净潜在预测相比速度预测在ImageNet 256×256上获得更优的FID分数,表明潜在扩散中的预测目标是依赖于表示的几何选择。

详情
AI中文摘要

使用干净数据预测的流匹配表明,与预测环境空间中的加噪量相比,回归干净点能更有效地利用低维结构。我们探究在图像被映射到学习到的潜在空间后,这一原则是否仍然有用,因为压缩已经去除了原始像素的大部分变异性。我们引入了JLT,一个在冻结的FLUX.2 VAE编码上的130M潜在扩散Transformer,并在相同的表示、主干和训练设置下,将干净潜在预测与匹配的速度预测DiT进行比较。尽管三个变量x、epsilon和v在固定的加噪时间下是线性可转换的,但局部高斯分析表明,速度回归继承了各向同性的目标协方差下限,并放大了低方差潜在方向,而干净预测则抑制了它们。在ImageNet 256×256上,JLT-B/1在使用无分类器引导时获得了FID-50K 2.50,在其余设置匹配的条件下明显优于速度预测目标。这些结果表明,潜在扩散中的预测目标是依赖于表示的几何选择,而不是可互换的代数参数化。

英文摘要

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
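The statement that x, epsilon, and v are linearly convertible at a fixed corruption time can be checked numerically. The sketch below assumes the common linear interpolation z_t = (1 - t) x + t epsilon with velocity v = epsilon - x; the abstract does not state the paper's convention, so this choice and all function names are assumptions.

```python
def corrupt(x, eps, t):
    """Noised latent z_t = (1 - t) * x + t * eps (assumed interpolation)."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def velocity(x, eps):
    """Velocity target v = eps - x under the assumed interpolation."""
    return [ei - xi for xi, ei in zip(x, eps)]


def clean_from_velocity(z_t, v, t):
    """Recover x from (z_t, v): x = z_t - t * v."""
    return [zi - t * vi for zi, vi in zip(z_t, v)]


def noise_from_velocity(z_t, v, t):
    """Recover eps from (z_t, v): eps = z_t + (1 - t) * v."""
    return [zi + (1.0 - t) * vi for zi, vi in zip(z_t, v)]


def velocity_from_clean(z_t, x_hat, t):
    """Recover v from (z_t, x): v = (z_t - x) / t, defined for t > 0."""
    return [(zi - xi) / t for zi, xi in zip(z_t, x_hat)]


x = [1.0, -2.0]      # toy clean latent
eps = [0.5, 0.25]    # toy noise sample
t = 0.25
z_t = corrupt(x, eps, t)
v = velocity(x, eps)
print(clean_from_velocity(z_t, v, t), noise_from_velocity(z_t, v, t))
```

The conversions are exact algebra, so under this convention the choice of target changes only what the network is asked to regress. The abstract's covariance argument concerns how regression errors behave under each target, which this sketch does not attempt to reproduce.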

2605.26368 2026-05-28 cs.CV cs.AI 版本更新

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

统一全景几何估计:基于多视角基础模型

Vukasin Bozic, Isidora Slavkovic, Dominik Narnhofer, Nando Metzger, Denis Rozumny, Konrad Schindler, Nikolai Kalischek

发表机构 * ETH Zürich(苏黎世联邦理工学院) Google(谷歌)

AI总结 提出PaGeR框架,利用预训练3D基础模型,从单张全景图像中统一预测尺度不变深度、度量深度、表面法线和天空掩码,实现360度场景重建。

详情
AI中文摘要

从透视图像进行几何估计已取得巨大进展,成熟到现成的基础模型不仅能够从多视角图像重建3D场景结构,甚至能从单视图进行重建。一个自然的扩展是从全景图像进行3D重建,其令人兴奋的前景是从单张全景图像恢复完整的360度场景。在这项工作中,我们引入了PaGeR(全景几何重建),这是一个将专为透视图像设计的强大3D基础模型提升到全景领域的框架。我们的策略是从一个预训练的3D重建Transformer开始,将其转变为一个统一的高性能模型,该模型在单次前向传播中从透视和全向图像预测尺度不变深度、度量深度、表面法线和天空掩码。通过将架构改动保持在最小,并在训练中混合透视和全景图像,PaGeR保留了底层基础模型的丰富3D先验,同时学会从单张全景图像估计几何一致的360度场景。我们在室内和室外环境中广泛测试了我们的方法,发现它在各种场景中提供了最先进的性能和出色的零样本性能。代码、数据和模型可在此处获取:https://github.com/prs-eth/PaGeR。

英文摘要

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360-degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre-trained transformer for 3D reconstruction and turn it into a unified high-performance model that predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360-degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state-of-the-art performance and excellent zero-shot performance across a wide range of scenes. Code, data and models are available at https://github.com/prs-eth/PaGeR.

2605.26277 2026-05-28 cs.CV cs.AI 版本更新

VesselSim: learning 3D blood vessel segmentation without expert annotations

VesselSim: 无需专家标注的3D血管分割学习

Erin Rainville, Melissa Ananian, Tristan Mirolla, Hassan Rivaz, Yiming Xiao

发表机构 * Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada(计算机科学与软件工程系,康科迪亚大学,蒙特利尔,加拿大) Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada(电气与计算机工程系,康科迪亚大学,蒙特利尔,加拿大)

AI总结 提出VesselSim两阶段框架,通过几何驱动的合成血管生成和自监督测试时适应,实现无需真实标注的3D血管分割,在多个临床数据集上达到与有监督方法竞争的性能。

Comments This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October

详情
AI中文摘要

血管分割是医学图像分析中用于血管疾病护理和手术规划的核心任务,然而提供专家血管标注的挑战对相关深度学习技术的进展构成了主要障碍。为解决这一问题,我们提出了VesselSim,一个用于通用3D血管分割的两阶段框架,在训练过程中无需真实标注数据。首先,我们引入了一个随机的、几何驱动的血管模拟框架,该框架模拟递归分支、曲率控制生长和碰撞感知拓扑,随后通过域随机化强度合成生成16,500个解剖学上合理的3D血管造影体数据。其次,仅在此合成数据上训练3D U-Net。为了在推理时弥合从合成图像到真实图像的域差距,我们通过自监督掩码重建解码器引入了一种测试时适应策略,无需先验域知识即可适应未见过的临床扫描。我们在多个真实世界数据集上以零样本设置评估VesselSim,这些数据集涵盖多个解剖区域(包括脑和肾脏)的MR和CT。尽管仅在合成数据上训练,VesselSim的性能与最先进的血管分割基础模型相竞争。这些发现表明,从合成管状结构中学习血管几何对于鲁棒的跨域泛化是有效的,大大减少了对获取的医学成像数据以及更重要的是专家标注的依赖。

英文摘要

Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challenges of providing expert vascular annotations pose a major obstacle for the progress of related deep learning techniques. To address this, we propose VesselSim, a two-stage framework for universal 3D blood vessel segmentation that eliminates the need for real annotated data during training. First, we introduce a stochastic, geometry-driven vascular simulation framework that models recursive branching, curvature-controlled growth, and collision-aware topology, followed by domain-randomized intensity synthesis to generate 16,500 anatomically plausible 3D angiographic volumes. Second, a 3D U-Net is trained solely on this synthetic data. To bridge the domain gap from synthetic to real images at inference time, we introduce a test-time adaptation strategy via a self-supervised mask reconstruction decoder, enabling adaptation to unseen clinical scans without prior domain knowledge. We evaluate VesselSim in a zero-shot setting on multiple real-world datasets spanning MR and CT across several anatomical regions, including the brain and kidneys. Despite being trained exclusively on synthetic data, VesselSim achieves performance competitive with state-of-the-art vascular segmentation foundation models. These findings suggest that learning vessel geometry from synthetic tubular structures is effective for robust cross-domain generalization, substantially reducing the reliance on acquired medical imaging data and more importantly, expert annotations.

2605.25763 2026-05-28 cs.CV 版本更新

AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis

AI-T2I: 面向文本到图像合成的扩散模型的聚合与隔离交叉注意力

Shipeng Cao, Biao Qian, Haipeng Liu, Yang Wang, Meng Wang

发表机构 * Institute of Advanced Medicine and Frontier Technology(先进医学与前沿技术研究院) Hefei University of Technology(合肥工业大学) Key Laboratory of Knowledge Engineering With Big Data, Ministry of Education(大数据知识工程教育部重点实验室) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对扩散模型在文本到图像合成中交叉注意力图内文本-图像对齐不精确的问题,提出一种聚合与隔离交叉注意力方法(AI-T2I),通过聚合损失整合分散的令牌内激活并引入隔离损失分离令牌间激活,实现精确对齐。

Comments Accepted by IEEE Transactions on Multimedia (2026). 13 pages, 15 figures

详情
AI中文摘要

文本到图像合成取得了显著进展,得益于扩散模型强大的生成能力。然而,这些模型在去噪过程中难以在交叉注意力图中实现精确的文本-图像对齐。现有工作主要关注不同主体之间的主体间令牌激活(即交叉注意力分数)重叠,而忽略了相同主体的主体内令牌激活分散问题。在本文中,我们提出了一种面向文本到图像合成的扩散模型的聚合与隔离交叉注意力方法,称为AI-T2I。技术上,为了解决分散问题,我们设计了一个聚合损失来识别并整合分散的令牌内激活,这隐式地有助于缓解潜在的重叠问题。在此基础上,进一步引入隔离损失以推开令牌间激活,从而实现精确的文本-图像对齐。在各种基准上的大量实验表明,AI-T2I在文本到图像合成方面优于最先进的工作。此外,我们的AI-T2I在其他任务上表现出优异的泛化能力,例如可控布局生成和个性化生成。我们的代码可在https://github.com/Hatter77/AI-T2I获取。

英文摘要

Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on inter-subject-token activations (i.e., cross-attention scores) overlap for different subjects, overlooking the intra-subject-token activations scattering issue for identical subjects. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss to identify and consolidate the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Upon that, an isolation loss is further introduced to push the inter-token activations apart, thus fulfilling precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over the state-of-the-art works for text-to-image synthesis. Furthermore, our AI-T2I exhibits excellent generalization across other tasks, e.g., controllable layout generation and personalized generation. Our code is available at https://github.com/Hatter77/AI-T2I.

2605.24906 2026-05-28 cs.CV 版本更新

Where Detectors Fail: Probing Generative Space for Generalizable AI-Generated Image Detection

检测器失效之处:探索生成空间以实现可泛化的AI生成图像检测

Zijie Cao, Weijie Tu, Yao Xiao, Weijian Deng, Liang Lin, Pengxu Wei

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China(中山大学计算机科学与工程学院) Australian National University, Canberra, Australia(澳大利亚国立大学) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室) Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生学院)

AI总结 提出PROBE框架,通过主动探索生成过程中的困难区域来改进AI生成图像检测器的泛化能力。

Comments Accepted by ICML2026

详情
AI中文摘要

检测AI生成图像(AIGI)仍然具有挑战性,因为检测器通常无法泛化到未见过的生成器。尽管现有方法在大数据集上训练,但当生成设置改变时,其性能仍然下降,这表明仅靠数据规模是不够的,训练期间生成变化的有限覆盖是一个关键因素。关于生成模型编辑的研究表明,内部表示的微小变化可以产生多样且有意义的图像变化,其中许多在标准采样下未被探索。利用这一见解,我们提出了PROBE(通过边界探索探测鲁棒性)框架,通过主动探索生成过程中的困难区域来改进检测器的泛化能力。PROBE不是将生成器视为固定的数据源,而是使用检测器作为批评者,通过流形级别的修改引导生成器,产生难以分类的真实样本。这些样本暴露了在标准数据采样策略下不常见的失败案例,并用于改进检测器。在多个基准上的实验结果表明,PROBE增强了对未见生成器的泛化能力,从而实现了更可泛化的AIGI检测性能。代码和模型可在 https://github.com/Amamiya-C/PROBE-AIGI-Detection 获取。

英文摘要

Detecting AI-generated images (AIGI) remains challenging because detectors often fail to generalize to unseen generators. Although existing methods are trained on large datasets, their performance still degrades when generation settings change, indicating that data scale alone is insufficient and that limited coverage of generative variations during training is a key factor. Studies on generative model editing show that small changes in internal representations can produce diverse and meaningful image variations, many of which are not explored under standard sampling. Leveraging this insight, we propose PROBE (Probing Robustness via Boundary Exploration), a framework that improves detector generalization by actively exploring challenging regions of the generative process. Instead of treating the generator as a fixed data source, PROBE uses the detector as a critic to steer the generator through manifold-level modifications, producing realistic samples that are difficult to classify. These samples expose failure cases that are uncommon under standard data sampling strategies and are used to refine the detector. Experimental results across multiple benchmarks indicate that PROBE enhances generalization to unseen generators, resulting in more generalizable AIGI detection performance. Code and models are available at https://github.com/Amamiya-C/PROBE-AIGI-Detection

2605.18137 2026-05-28 cs.CV 版本更新

Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving

小米自动驾驶世界模型:一个融合重建与生成的联合世界模型

Lijun Zhou, Hongcheng Luo, Zhenxin Zhu, Cheng Chi, Mingfei Tu, Kaixin Xiong, Lei Gong, Zhanqian Wu, Zehan Zhang, Fangzhen Li, Hao Li, Yingying Shen, Jiale He, Haohui Zhu, Shan Zhao, Kai Wang, Zhiwei Zhan, Yuechuan Pu, Kaiyuan Tan, Ruiling Yang, Xianqi Wang, Tianyi Yan, Jiawei Zhou, Lei Zhang, Jingyang Zhao, Xi Zhou, Chitian Sun, Chenming Wu, Jiong Deng, Hongwei Xie, Ming Lu, Kun Ma, Long Chen, Guang Chen, Hangjun Ye, Bing Wang, Haiyang Sun

发表机构 * Xiaomi(小米)

AI总结 提出一个统一技术系统,通过稀疏场景查询驱动的重建模块WorldRec和两阶段训练框架WorldGen,实现高保真3D场景表示与高质量因果视频生成,并联合优化以提升生成稳定性、跨帧一致性和视觉保真度。

详情
AI中文摘要

本报告提出了一个统一的技术系统,解决自动驾驶世界模型的两个核心能力:世界表示和世界生成。对于世界表示,我们提出了WorldRec,一种由稀疏场景查询驱动的前馈重建架构。WorldRec在3D空间中初始化结构化查询,利用它们聚合跨视图、跨时间特征,从而自然地强制帧间空间一致性,并产生紧凑且高保真的3D高斯场景表示。对于世界生成,我们提出了WorldGen,一个两阶段训练框架,包括双向预训练和随后通过三个渐进阶段(教师强制、ODE蒸馏和DMD)的因果微调,使得在仅4个去噪步骤中实现高质量的在线因果视频生成。基于这两个模块,我们进一步引入了JWM,它深度融合了WorldRec和WorldGen,在生成稳定性、跨帧一致性和视觉保真度方面实现协同增益,为自动驾驶中的闭环仿真、数据合成和端到端训练提供了坚实基础。

英文摘要

This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.

2605.02417 2026-05-28 cs.CV 版本更新

DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

DirectEdit: 基于流的图像编辑的逐步骤精确反演

Desong Yang, Mang Ye

发表机构 * National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China(多媒体软件国家工程研究中心,计算机科学学院,武汉大学,中国武汉)

AI总结 提出DirectEdit方法,通过直接对齐前向路径消除反演过程中的累积漂移,实现精确重建和可靠特征共享,无需额外神经函数评估,在多种场景下优于现有方法。

Comments ICML 2026. Project page: https://desongyang.github.io/Directedit/

详情
AI中文摘要

随着大规模预训练文本到图像(T2I)模型的最新进展,免训练的图像编辑方法已展现出显著成功。通常,这些方法通过反演过程向干净图像添加噪声,随后在前向过程中分别对重建路径和编辑路径进行去噪步骤。然而,由于重建路径使用来自不匹配时间步的噪声潜变量进行近似,现有方法不可避免地遭受累积漂移,这从根本上限制了重建保真度。为了解决这一挑战,我们系统地分析了流变换器中的反演过程,并提出了DirectEdit,一种简单而有效的编辑方法,无需引入额外的神经函数评估(NFE)即可消除固有的重建误差。与大多数试图纠正反演路径的先前工作不同,DirectEdit专注于直接对齐前向路径,从而实现精确重建和可靠的特征共享。此外,我们引入了一种基于注意力特征注入和多分支掩码引导噪声混合的保留机制,有效平衡了保真度和可编辑性。跨多种场景的大量实验表明,DirectEdit实现了高效准确的图像编辑,其优越性能优于最先进的方法。代码和示例可在 https://desongyang.github.io/Directedit 获取。

英文摘要

With recent advancements in large-scale pre-trained text-to-image (T2I) models, training-free image editing methods have demonstrated remarkable success. Typically, these methods involve adding noise to a clean image via an inversion process, followed by separate denoising steps for the reconstruction and editing paths during the forward process. However, since the reconstruction path is approximated using noisy latents from mismatched timesteps, existing methods inevitably suffer from accumulated drift, which fundamentally limits reconstruction fidelity. To address this challenge, we systematically analyze the inversion process within the flow transformer and propose DirectEdit, a simple yet effective editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior works that attempt to rectify the inversion path, DirectEdit focuses on directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing. Furthermore, we introduce a preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending, which effectively balances fidelity and editability. Extensive experiments across diverse scenarios demonstrate that DirectEdit achieves efficient and accurate image editing, delivering superior performance that outperforms state-of-the-art methods. Code and examples are available at https://desongyang.github.io/Directedit.

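2604.21668 2026-05-28 cs.CV Version update

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao

Affiliations * Aalto University; Georgia Institute of Technology

AI summary Proposes Structured Motion Description (SMD), which converts joint position sequences into structured natural-language descriptions so that large language models can reason about motion directly without a dedicated encoder, surpassing existing methods on motion question answering and captioning.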

Abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach achieves state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
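
Illustrative sketch (not from the paper): the rule-based idea of turning joint positions into text can be pictured roughly as follows. The three-point angle rule, the function names `joint_angle` and `describe_knee`, the thresholds, and the output wording are all assumptions made for illustration; the abstract does not specify SMD's actual rules or text format.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by 3D points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.hypot(*ba) * math.hypot(*bc)
    if norm == 0.0:
        raise ValueError("joint positions must be distinct")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def describe_knee(hip, knee, ankle, side="left"):
    """Turn one frame of hip/knee/ankle positions into a text line.

    The thresholds and wording are hypothetical, not SMD's actual rules.
    """
    angle = joint_angle(hip, knee, ankle)
    if angle > 160:
        state = "nearly straight"
    elif angle > 100:
        state = "moderately bent"
    else:
        state = "deeply bent"
    return f"{side} knee: {angle:.0f} degrees ({state})"

# One frame of a standing pose: hip, knee, and ankle stacked vertically.
print(describe_knee((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)))
```

Because the output is plain text, a line like this could in principle be placed directly in an LLM prompt, which is the property the abstract emphasizes.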

2605.28137 2026-05-28 cs.CV cs.LG Version update

No Safe Dose: How Training Data Drives Unsafe Image Generation

Felix Friedrich, Lukas Helff, Niharika Hegde, Patrick Schramowski, Kristian Kersting

Affiliations * Black Forest Labs; TU Darmstadt & hessian.AI; DFKI; Lab1141

AI summary By controlling the fraction of unsafe images in the training data (0% to 9.6%), the study finds that the output unsafety rate rises monotonically with that fraction and that the proportion, not the absolute count, is the key factor. A safer text encoder (e.g., SafeCLIP) lowers the baseline risk, but the dose-response effect persists.

Abstract

Text-to-image models trained on large-scale data often inevitably ingest unsafe content. While input-output amplification has been observed, it remains unclear whether and how training data composition directly drives model output safety, or whether other factors are responsible. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). We then generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% irreducible baseline at zero contamination implicates the other components, e.g., the frozen text encoder, as a residual safety risk. This is confirmed by a text encoder ablation showing that SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, no quality degradation in terms of FID, CLIPScore, and ImageReward accompanies safety filtering. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety poses questions for future research about emerging capabilities and compositionality.
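
Illustrative sketch (not from the paper): a factorial design can separate proportion from absolute count because the same number of unsafe images occurs at different fractions once the dataset size varies. The helper `contamination_grid` and the specific cells below are hypothetical; the abstract only gives the ranges (100K to 8M images, 0% to 9.6% unsafe).

```python
def contamination_grid(sizes, fractions):
    """Map each (dataset size, unsafe fraction) cell to its unsafe-image count."""
    return {(n, f): round(n * f) for n in sizes for f in fractions}

# Hypothetical cells; the abstract only gives the ranges 100K-8M and 0%-9.6%.
grid = contamination_grid(sizes=[100_000, 1_000_000], fractions=[0.0, 0.005, 0.05])

# The same absolute count (5,000 unsafe images) appears at two different proportions.
same_count = [cell for cell, count in grid.items() if count == 5_000]
print(same_count)
```

Comparing models trained on such matched-count cells is one way to check whether output unsafety follows the fraction or the count.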

2605.28136 2026-05-28 cs.CV cs.RO Version update

SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

Toomas Tahves, Mauro Bellone, Junyi Gu, Raivo Sell

Affiliations * Department of Mechanical and Industrial Engineering, Tallinn University of Technology; FinEst Centre for Smart Cities, Tallinn University of Technology; Department of Computer Science and Engineering, Universitas Mercatorum; Department of Computer Science and Engineering, Chalmers University of Technology; University of Gothenburg

AI summary Proposes a SAM-based annotation pipeline that converts the bounding boxes of the ZOD dataset into dense pixel-level semantic masks, evaluates different architectures under class imbalance, and achieves effective transfer across sensor configurations via bidirectional transfer learning.

Abstract

Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.
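
Illustrative sketch (not from the paper): the reason rare classes matter for the reported mIoU can be seen on a toy example. The function `mean_iou` below is a plain-Python version of the standard per-class intersection-over-union average; the 98/2 pixel split and class ids are made up for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in pred or target (flat lists of class ids)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy image: 98 road pixels (class 0) and 2 pedestrian pixels (class 1).
target = [0] * 98 + [1] * 2
pred = [0] * 100  # the model ignores the rare class entirely

pixel_accuracy = sum(p == t for p, t in zip(pred, target)) / len(target)
print(pixel_accuracy, mean_iou(pred, target, num_classes=2))
```

Pixel accuracy stays high when a rare class is missed, while mIoU drops sharply because every class counts equally, which is why class imbalance needs separate attention.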

2605.28132 2026-05-28 cs.CV Version update

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Haozhan Shen, Tiancheng Zhao, Kangjia Zhao, Jianwei Yin

Affiliations * Zhejiang University; Om AI Research; Binjiang Institute of Zhejiang University

AI summary Through a frozen-feature probing study, this paper systematically compares vision-language models (VLMs) and video generation models (VGMs) along three axes of spatial intelligence (semantic tagging, instance grouping, and 3D geometry prediction), finding that the two are complementary and that a simple fusion improves overall performance.

Comments Code is here: https://github.com/om-ai-lab/Probing-VLM-VGM

Abstract

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using a lightweight probe, our framework enables a controlled comparison of what information is already encoded in frozen representations from the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.

2605.28125 2026-05-28 cs.CV cs.GR Version update

CLEAR-NeRF: Collinearity and Local-region Enhanced Accurate 3D Reconstruction in Unbounded Scenes

Vladislav Polianskii, Elijs Dima, Isabel Salmerón Marazuela, Gergő László Nagy, Sigurdur Sverrisson, Volodya Grancharov

Affiliations * Ericsson Research

AI summary Proposes CLEAR-NeRF, which achieves high-fidelity, metrically accurate 3D reconstruction in unbounded, complex scenes through automated local-region localization, collinearity-enforcing ray sampling, depth-localized neighborhood point extraction, and geometry-relevant color aggregation.

Abstract

Many real-world 3D reconstruction applications demand photorealism and metric accuracy across unbounded, complex scenes with challenging lighting and imperfect captures, requirements that current Neural Radiance Field (NeRF) pipelines only partly satisfy. This study adapts NeRF-based 3D reconstruction to unbounded scenes with multiple regions of interest, improving robustness to lighting and pose variation while enforcing metric accuracy suitable for digital-twin applications. Our approach introduces (i) automated local region localization/detection and reconstruction to seamlessly prioritize areas of interest without proliferating submodules, (ii) collinearity-enforcing ray sampling to learn smooth planar and curved surfaces, (iii) depth-localized neighborhood point extraction to suppress surface artifacts, and (iv) geometry-relevant color aggregation to mitigate lighting- and pose-caused variations. Results indicate that the proposed pipeline outperforms both baseline NeRF models and established Structure from Motion (SfM) and Multi-View Stereo (MVS) solutions.

2605.28100 2026-05-28 cs.CV cs.AI Version update

Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

Arthur Dérédel, Carlos Crispim-Junior, Pierre Lemaire, Johan Berthet, Laure Tougne Rodet

Affiliations * Université Lumière Lyon 2, CNRS, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, LIRIS, UMR5205; Styx4D, 19 rue lac Saint André, Le Bourget-du-Lac, 73370, France

AI summary To address the shape and lighting variations that time-lapse cameras face when monitoring serac falls, this paper introduces the sub-task of volumetric change detection and evaluates existing methods on the new SeracFallDet dataset, finding that dense and semi-dense feature matching perform robustly while supervised methods are limited by data scarcity.

Comments Preprint, 19 pages, 8 figures

Abstract

In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming crucial to mitigate the impacts of disastrous natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably linked to extreme shape and lighting variations. Overcoming these issues is essential to deploying them at large scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components, and assess their applicability to this context. To that end, we introduce a new dataset, SeracFallDet, which is thoroughly annotated with serac falls to support this assessment. Through generalization experiments, we demonstrate that dense and semi-dense feature matching, although not trained specifically for this task, exhibit robust performance. In contrast, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both approaches. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.

2605.28091 2026-05-28 cs.CV Version update

Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

Niantong Li, Guangzheng Hu, Weixu Qiao, Ying Ba, Qichen Hong, Shijun Shen, Jinlin Wang, Fan Zhou, Jianye Kang, Xin Shang, Ziyi He, Wei Wang, Dalin Li, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yuxiang Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, Hongzhu Shi, Yi Wang, Bing Zhao, Hu Wei, Lin Qu, Chenfei Wu

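Affiliations * Alibaba

AI summary Because existing text-to-image evaluation benchmarks do not account for real-world fidelity and creative generation, this paper proposes Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists. With a hierarchical taxonomy, 1000 stratified prompts, and Q-Judger, a unified judge model based on Qwen3.6-27B, it provides fine-grained, attributable diagnostics and effectively distinguishes leading T2I models.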

Abstract

Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users' pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.
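
Illustrative sketch (not from the paper): a hierarchical taxonomy yields attributable diagnostics because rubric-level scores can be rolled up level by level while the intermediate values are kept. The function `aggregate`, the plain-mean roll-up, and the sub-capability and rubric names below are assumptions for illustration; the abstract does not describe how Q-Judger scores are combined.

```python
from statistics import mean

def aggregate(scores):
    """Roll rubric scores up to sub-capability and pillar level.

    scores: {pillar: {sub_capability: {rubric: score in [0, 1]}}}
    Plain averaging is an assumption; the abstract does not give the weighting.
    """
    report = {}
    for pillar, subs in scores.items():
        sub_means = {name: mean(rubrics.values()) for name, rubrics in subs.items()}
        report[pillar] = {"score": mean(sub_means.values()), "sub_capabilities": sub_means}
    return report

# Hypothetical sub-capability and rubric names, for illustration only.
example = {
    "Creative Generation": {
        "style transfer": {"palette matches brief": 1.0, "brushwork is consistent": 0.0},
        "concept blending": {"both concepts are recognizable": 1.0},
    }
}
report = aggregate(example)
print(report["Creative Generation"])
```

Keeping the per-sub-capability means next to the pillar score is what lets a low pillar score be traced back to the specific rubric that failed.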

2605.28083 2026-05-28 cs.CV Version update

VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

Jiyuan Fu, Kaixun Jiang, Jingkai Jia, Zhaoyu Chen, Xueyao Chen, Lingyi Hong, Shuyong Gao, Chenzhi Tan, Dingkang Yang, Wenqiang Zhang

Affiliations * Fudan University; The Hong Kong Polytechnic University

AI summary Proposes the VLA-Hijack framework, which attacks the visual self-localization process through attention-guided proprioceptive suppression and multimodal proprioceptive injection, enabling cross-architecture black-box transfer attacks.

Abstract

While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm's features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment". By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent's true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

2605.28056 2026-05-28 cs.CV Version update

CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

Affiliations * Harbin Institute of Technology; University of New South Wales

AI summary Proposes CogPortrait, a two-stage framework in which multimodal large language model agents generate keypoints from high-level labels and a DiT video generation backbone then synthesizes the animation, enabling fine-grained eye-region control. It also introduces the EMH benchmark for evaluation.

Abstract

Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are especially restrictive for beyond-emotion states such as thinking and drowsiness. In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.

2605.28051 2026-05-28 cs.CV Version update

Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

Landi He, Mingde Yao, Shawn Young, Lijian Xu

Affiliations * Shenzhen University of Advanced Technology; CUHK MMLab; CPII under InnoHK

AI summary Proposes DiffPrune, which reformulates pruning as continuous control of token information rather than discrete selection learning and modulates tokens with an Information Throttler, yielding fully differentiable learning of token importance. It retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x.

Abstract

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
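
Illustrative sketch (not from the paper): one way to picture score-conditioned, variance-preserving noise is the blend `sqrt(s) * x + sqrt(1 - s) * noise`, which is smooth in the score `s` and leaves a unit-variance token at unit variance. This exact form, the function names `throttle` and `prune`, and the 0.5 threshold are assumptions; the abstract does not give the Information Throttler's formula.

```python
import math
import random

def throttle(token, score, rng):
    """Training-time sketch: blend a token with Gaussian noise by its score.

    score in [0, 1]; sqrt(score)^2 + sqrt(1 - score)^2 = 1, so a unit-variance
    token keeps unit variance. This exact form is an assumption.
    """
    keep = math.sqrt(score)
    noise = math.sqrt(1.0 - score)
    return [keep * x + noise * rng.gauss(0.0, 1.0) for x in token]

def prune(tokens, scores, threshold=0.5):
    """Inference-time sketch: drop tokens whose score is below the threshold."""
    return [t for t, s in zip(tokens, scores) if s >= threshold]

rng = random.Random(0)
token = [0.3, -1.2, 0.8]
print(throttle(token, 1.0, rng))  # score 1: the token passes through unchanged
print(throttle(token, 0.1, rng))  # low score: the output is mostly noise
print(prune(["t0", "t1", "t2"], [0.9, 0.2, 0.5]))
```

In an autodiff framework the same blend would let gradients reach the score directly, without a sampling step such as Gumbel-Softmax in between, which is the contrast the abstract draws.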

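2605.28036 2026-05-28 cs.CV cs.LG Version update

Stay Fair! Ensuring Group Fairness in Diffusion Models Across Guidance Scales

Myeongsoo Kim, Eunji Kim, Minwoo Chae, Sangwoo Mo

Affiliations * POSTECH; Amazon

AI summary Proposes StayFair, which decomposes total bias into model bias and guidance bias, extends Strong Demographic Parity to the guidance process, and designs fair guidance algorithms so that diffusion models maintain group fairness across guidance scales.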

Comments 28 pages, 18 figures

Abstract

Diffusion models steer conditional generation with a tunable guidance scale to trade off prompt alignment and diversity. However, existing debiasing techniques are optimized for a single scale, degrading fairness when users adjust this parameter. We trace this behavior to a previously overlooked source by decomposing total bias into two components: a model bias and a guidance bias. While prior work primarily targets the former, we show that the guidance bias grows monotonically with the guidance scale, eventually dominating the high-guidance regimes users prefer. To address this, we extend Strong Demographic Parity to guidance and derive a condition under which the target distribution retains its group ratio across guidance scales. We propose StayFair, which leverages this condition to design fair guidance algorithms in both regimes. For classifier guidance, it equalizes the classifier's output distributions across groups; for classifier-free guidance, it shifts the null embedding by a prompt-dependent offset. Because StayFair modifies only the guidance step, it is orthogonal to model debiasing and can be layered onto existing fair diffusion models to extend their fairness across guidance scales. Across class-conditional and text-to-image generation, StayFair decouples fairness from the guidance scale without sacrificing image quality.
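
For context (not from the paper): classifier-free guidance is commonly written as below, where $w$ is the guidance scale, $c$ the prompt, and $\varnothing$ the null (unconditional) embedding. Under this convention $w = 1$ recovers the conditional prediction and larger $w$ extrapolates further away from the unconditional one. The second line only restates the abstract's description of shifting the null embedding; the symbol $\delta(c)$ is notation assumed here, and the abstract does not give its form.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \, \bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)

\varnothing \;\longrightarrow\; \varnothing + \delta(c)
  \quad \text{(prompt-dependent offset; assumed notation)}
```

Since the difference term is multiplied by $w$, any group imbalance carried by that term is scaled as well, which fits the abstract's observation that the guidance bias grows with the guidance scale.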

2605.28023 2026-05-28 cs.CV cs.AI cs.CL cs.MM Version update

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

Affiliations * Tsinghua Shenzhen International Graduate School; Harbin Institute of Technology, Shenzhen; Chinese Academy of Sciences; Kuaishou Technology

AI summary Proposes VCap, a Witness-Adjudicator reward that verifies factual consistency between reference captions and policy-generated captions against the visual signal with hypergeometric-distribution-level precision, enabling weak-to-strong generalization and surpassing SOTA models on multiple image and video captioning benchmarks.

Comments 28 pages, 8 figures

Abstract

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, multimodal large language models (MLLMs) have achieved strong performance through scaling and high-quality data. Recently, reinforcement learning (RL) has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

2605.28018 2026-05-28 cs.CV Version update

Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song

Affiliations * Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University; Nanjing University of Science and Technology; Xi’an Research Institute of High Technology

AI summary Proposes the EATrack framework, which uses a teacher-guided dual-branch distillation strategy to strengthen feature expressiveness in a lightweight student model, balancing accuracy and speed in UAV tracking.

Comments CVPR 2026 Highlight

Abstract

Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher's capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: https://github.com/GXNU-ZhongLab/EATrack

2605.28016 2026-05-28 cs.CV physics.med-ph 版本更新

Enhancing Ultra-low-field MRI with Segmentation-guided Adversarial Learning

利用分割引导的对抗学习增强超低场MRI

James Grover, Andrew Phair, Michael Ferraro, David E. J. Waddington

Affiliations: Image X Institute, Sydney School of Health Sciences, Faculty of Medicine and Health

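AI summary: Proposes combining anatomically conditioned segmentation priors with model ensembling: a Swin UNETR generates tissue segmentation priors, and two enhancement networks, a CycleGAN and T-REX, synthesise 3 T-like MRIs to improve the image quality of 64 mT ultra-low-field MRI.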

详情
AI中文摘要

超低场(ULF)MRI提供便携且低成本的成像,但图像质量较差。为解决此问题,我们提交了2025年ULF增强挑战赛(ULF-EnC)的方案,目标是从64 mT扫描合成类似高场MRI的图像。我们的流程通过解剖条件化和模型集成来增强ULF MRI。首先,使用仅在挑战提供数据上训练的Swin UNETR生成组织分割先验。这些先验条件化两个独立的增强网络——一个CycleGAN和一个基于Transformer的残差增强模型(T-REX)——每个网络都训练用于合成3T级MRI。两个模型的输出通过加权平均结合。我们的方法产生的增强MRI在定量和定性上都与高场扫描相当。

英文摘要

Ultra-low-field (ULF) MRI offers portable and low-cost imaging but suffers from poor image quality. To address this, we present our submission to the 2025 ULF Enhancement Challenge (ULF-EnC), where the goal is to synthesise high-field-like MRIs from 64 mT scans. Our pipeline enhances ULF MRI through a combination of anatomical conditioning and model ensembling. We first generate tissue segmentation priors using a Swin UNETR trained solely on challenge-provided data. These priors condition two independent enhancement networks, a CycleGAN and a transformer-based residual enhancement model (T-REX), each trained to synthesise 3 T-like MRIs. Outputs from both models are combined using a weighted average. Our approach produces enhanced MRIs that are comparable to high-field scans both quantitatively and qualitatively.
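The final ensembling step, a weighted average of two model outputs, can be sketched as follows. The function name, the flat-list image representation, and the default weight are illustrative assumptions; the abstract does not state the weights the authors used.

```python
def weighted_ensemble(pred_a, pred_b, weight_a=0.5):
    """Voxel-wise weighted average of two predictions of equal length.

    weight_a is the weight on pred_a; pred_b receives 1 - weight_a.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same number of voxels")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]
```

In practice the weight would typically be chosen on held-out data, for example by sweeping `weight_a` and keeping the value with the best validation metric.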

2605.28011 2026-05-28 cs.CV 版本更新

Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras

使用事件相机自动估计羽毛球扣杀中的撞击时间、撞击位置和球速

Yudai Washida, Yuto Kase, Kai Ishibe, Ryoma Yasuda, Sakiko Hashimoto

Affiliations: MIZUNO Corporation, Suminoe-ku, Osaka-shi, Osaka

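AI summary: Proposes a method using two synchronized event cameras to automatically estimate, within the same trial, the impact time, the impact location on the racket face, and the shuttlecock speed of a badminton smash, and validates its agreement with a high-speed-camera reference method through Bland-Altman analysis.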

Comments 24 pages, 5 figures

详情
AI中文摘要

量化羽毛球扣杀中的撞击现象对于评估运动表现和装备性能都很重要;然而,传统测量系统在时间分辨率、数据效率和准备工作之间存在权衡。本研究提出了一种使用两台同步事件相机的测量方法,在同一试验中自动估计撞击时间、球拍面上的撞击位置以及撞击后的球速。通过事件率统计检测挥拍区间,从侧视事件数据中的羽毛球轨迹拐点估计撞击时间,通过椭圆拟合后视事件图像中的球拍面确定撞击位置,并在矢状面计算球速。为了验证所提出的方法,使用来自五名运动员的125次扣杀试验,与基于高速相机的参考方法进行了Bland-Altman分析。在所有124次可分析试验中估计了撞击时间和球速,在93.5%(116/124)的试验中估计了撞击位置。撞击时间、内侧-外侧撞击位置、纵向撞击位置和球速的偏差(95%置信区间)分别为1.84毫秒(1.45至2.23)、3.45毫米(2.18至4.72)、-1.92毫米(-2.97至-0.88)和-1.00米/秒(-2.46至0.46)。所有指标均未观察到比例偏差。这些结果表明,所提出的方法可以作为在实际环境中综合评估羽毛球扣杀性能和装备的有用工具。

英文摘要

Quantifying impact phenomena in badminton smashes is important for evaluating both athletic performance and equipment; however, conventional measurement systems involve trade-offs between temporal resolution, data efficiency, and preparation effort. This study proposes a measurement method using two synchronized event cameras to automatically estimate impact time, impact location on the racket face, and post-impact shuttlecock speed in an integrated manner within the same trial. The swing interval was detected from event rate statistics, impact time was estimated from the shuttlecock trajectory inflection in the lateral-view event data, impact location was determined by ellipse fitting to the racket face in the rear-view event image, and shuttlecock speed was calculated in the sagittal plane. To validate the proposed method, Bland-Altman analysis was performed against a high-speed camera-based reference method using 125 smash trials from five players. Impact time and shuttlecock speed were estimated in all 124 analyzable trials, and impact location was estimated in 93.5% (116/124). The bias (95% CI) for impact time, medio-lateral impact location, longitudinal impact location, and shuttlecock speed were 1.84 ms (1.45 to 2.23), 3.45 mm (2.18 to 4.72), -1.92 mm (-2.97 to -0.88), and -1.00 m/s (-2.46 to 0.46), respectively. No proportional bias was observed for any metric. These results suggest that the proposed method can serve as a useful tool for integrated assessment of badminton smash performance and equipment in practical settings.
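The Bland-Altman quantities reported above (bias and its 95% CI) can be computed as in the sketch below. It assumes a normal approximation with z = 1.96 for both the confidence interval of the bias and the limits of agreement; the paper may instead use a Student-t critical value, and the function name and return layout are hypothetical.

```python
import math
import statistics


def bland_altman(method, reference, z=1.96):
    """Return bias, an approximate 95% CI of the bias, and limits of agreement.

    Differences are method minus reference, so a positive bias means the
    method reads higher than the reference.
    """
    if len(method) != len(reference) or len(method) < 2:
        raise ValueError("need two equally sized samples with at least 2 trials")
    diffs = [m - r for m, r in zip(method, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(len(diffs))
    return {
        "bias": bias,
        "bias_ci": (bias - half_width, bias + half_width),
        "loa": (bias - z * sd, bias + z * sd),
    }
```

A bias CI that excludes zero, as for the impact-time estimate of 1.84 ms (1.45 to 2.23), indicates a systematic offset between the two methods; a CI that includes zero, as for shuttlecock speed, does not.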

2605.27990 2026-05-28 cs.LG cs.AI cs.CV 版本更新

Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

几何校正扩散后验采样:基于去噪器回拉曲率引导与流形对齐阻尼

Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

Affiliations: Department of Electrical and Computer Engineering, Inha University, Incheon, 22212, South Korea

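AI summary: Proposes a geometry-correct diffusion posterior sampling method based on denoiser-pullback curvature guidance and manifold-aligned damping, replacing scalar guidance with a per-noise-level damped Gauss-Newton correction for stable and efficient posterior sampling.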

Comments Code: https://github.com/Seunghyeok0715/CLAMP

详情
Journal ref
International Conference on Machine Learning 2026
AI中文摘要

扩散后验采样将扩散先验条件于测量值,但数据一致性更新通常由手动调整的引导权重缩放,并且在刚性、算子依赖的曲率下可能破坏采样稳定性。我们使用在扩散状态坐标中计算的每噪声水平阻尼高斯-牛顿校正替代标量引导。该校正通过去噪器回拉似然梯度,使用避免前向去噪器雅可比矩阵的单侧曲率模型,并应用与去噪器残差对齐的扩散校准秩一阻尼。每个校正通过自动微分的无矩阵GMRES求解,采样通过具有闭式漂移/噪声分离的方差保持朗之万转移进行。在FFHQ和ImageNet上的逆问题中,该方法在PSNR/SSIM/LPIPS上达到竞争性能,同时运行速度显著快于大多数对比基线;在加速MRI重建中,它在对比基线中取得了最佳的PSNR/SSIM。

英文摘要

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss-Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.
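For readers unfamiliar with damped Gauss-Newton corrections, the sketch below shows one such step on a toy two-parameter least-squares problem. It is a deliberate simplification: it uses identity damping (Levenberg-Marquardt style) and a direct 2x2 solve, whereas the abstract describes rank-one damping and matrix-free GMRES. The function name and arguments are hypothetical.

```python
def damped_gauss_newton_step(jacobian, residual, damping):
    """Solve (J^T J + damping * I) delta = -J^T r for a 2-parameter problem.

    jacobian: list of rows [dr_i/dx_0, dr_i/dx_1]; residual: list of r_i.
    """
    # Entries of the damped normal-equation matrix.
    a11 = sum(row[0] * row[0] for row in jacobian) + damping
    a12 = sum(row[0] * row[1] for row in jacobian)
    a22 = sum(row[1] * row[1] for row in jacobian) + damping
    # Gradient J^T r.
    g1 = sum(row[0] * r for row, r in zip(jacobian, residual))
    g2 = sum(row[1] * r for row, r in zip(jacobian, residual))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("singular system; increase damping")
    # Cramer's rule for the symmetric 2x2 system with right-hand side (-g1, -g2).
    delta0 = (-g1 * a22 + g2 * a12) / det
    delta1 = (-g2 * a11 + g1 * a12) / det
    return [delta0, delta1]
```

With zero damping on a linear problem the step reaches the least-squares solution at once; increasing the damping shortens the step, which is the stabilizing effect that a scalar guidance weight otherwise has to provide by hand.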

2605.27978 2026-05-28 cs.CV 版本更新

ABot-OCR Technical Report

ABot-OCR 技术报告

Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu

Affiliations: AMAP CV Lab

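AI summary: Proposes ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass, and uses Decoupled Heterogeneous Document Optimization, a reinforcement learning method, to improve textual accuracy and markup well-formedness, reaching state-of-the-art results among end-to-end systems on the OmniDocBench benchmarks.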

Comments 21 pages, 11 figures, technical report

详情
AI中文摘要

我们介绍了ABot-OCR,一个端到端的视觉语言模型,它通过单次前向传播将页面图像直接转录为干净的Markdown。通过这样做,我们的方法完全消除了脆弱的模块化编排。为了最大化解析保真度,我们开发了一个专用数据引擎,以提供大规模、结构一致的监督。此外,我们提出了解耦异构文档优化,一种结构约束的强化学习方法,它在监督微调之外进一步提高了文本准确性并严格强制执行标记格式的正确性。大量评估证明了我们框架的优越性能。在OmniDocBench v1.5和v1.6基准测试中,ABot-OCR在所有端到端系统中达到了最先进的分数92.81和93.30,显著缩小了与强流水线基线之间的性能差距。最后,跨十种不同语言的全面多语言文本识别进一步证实了ABot-OCR的鲁棒泛化能力。

英文摘要

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

2605.27962 2026-05-28 cs.CV 版本更新

Bridging the Generalization Gap in Adverse Weather Segmentation: A Training Recipe Perspective

缩小恶劣天气分割中的泛化差距:训练方案视角

Cong Xu, Pu Luo, Yumei Li, Boyou Xue

Affiliations: Xidian University

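AI summary: Approaching the problem from the training-recipe side, this paper uses domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation to substantially narrow the validation-test generalization gap in adverse-weather semantic segmentation.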

详情
AI中文摘要

本文描述了我们在第8届UG2+研讨会(CVPR 2026)Track 2中的方法,该赛道针对五种天气条件(模糊、黑暗、雪、雾和眩光)退化的户外场景进行语义分割。我们观察到一个核心挑战是严重的泛化差距——在验证集上表现良好的模型在测试集上往往崩溃。例如,SegFormer-B5从验证到测试下降了16.1 mIoU点,表明仅靠模型容量不足以实现鲁棒性。我们研究精心设计的训练方案(而非架构复杂性)是否可以解决这一差距。从预训练的SegMAN-S骨干开始,我们系统地研究了域自适应微调、多源数据混合、场景平衡采样和合成退化增强的效果。我们的最终系统在官方测试集上达到了59.9%的mIoU,同时验证-测试差距仅为6.5个点——不到更大模型的一半。我们分析了架构修改、损失函数变体和模型缩放的负面结果,为有限数据下天气鲁棒分割提供实用见解。

英文摘要

This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap: models that perform well on the validation set often collapse on the test set. For instance, SegFormer-B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre-trained SegMAN-S backbone, we systematically study the effects of domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation-test gap of only 6.5 points, less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather-robust segmentation under limited data.
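One common way to realize scene-balanced sampling is to weight each training image inversely to the frequency of its scene group, so that every group is drawn with equal total probability. The sketch below assumes this inverse-frequency scheme and a per-image scene label; the abstract does not specify the authors' actual sampler, and the function name is hypothetical.

```python
from collections import Counter


def scene_balanced_weights(scene_labels):
    """Per-sample sampling probabilities that give every scene equal total mass."""
    counts = Counter(scene_labels)
    num_scenes = len(counts)
    return [1.0 / (num_scenes * counts[label]) for label in scene_labels]
```

The returned list sums to 1 and can be passed as the `weights` argument of `random.choices` to draw a balanced mini-batch.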

2605.27960 2026-05-28 cs.CV 版本更新

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Mags-RL: 通过智能体强化学习为多模态大语言模型戴上放大镜以进行复杂场景推理

Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang

Affiliations: Arizona State University; Clemson University; Washington University in St. Louis; Halmstad University; Florida State University; Rice University

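AI summary: Proposes Mags-RL, a framework that uses agentic reinforcement learning to let multimodal large language models invoke a super-resolution agent for high-resolution fine-grained inspection, with two-round reasoning that improves visual reasoning in complex scenes.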

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)广受欢迎且成功,但它们在准确解释图像方面常常遇到困难,这限制了它们在复杂场景(如高物体密度和复杂背景杂乱)中的推理能力。先前的工作主要通过引入额外的显式视觉线索(如需要额外标注的边界框)来解决这一限制。此外,由此产生的低分辨率裁剪往往丢失了MLLMs进行准确推理所需的细粒度细节。因此,我们提出了Mags-RL,一个智能体强化学习(RL)框架,它为MLLMs配备了一个外部超分辨率“放大镜”代理,用于高分辨率细粒度检查。具体来说,该模型执行两轮推理:第一轮,它生成初始推理并自主识别感兴趣区域,无需依赖额外标注;第二轮,它调用超分辨率代理裁剪并放大这些区域,然后重新审视并验证其先前的推理以产生最终答案。我们还引入了一种新颖的课程学习策略,实现了数据高效的RL训练,仅需少至40个训练样本即可达到合理的性能。在VSR、TallyQA和GQA子集上的实验表明,与近期强竞争方法相比,它表现出优越的性能,展示了具有精确视觉基础的高质量推理。代码和权重将很快发布。

英文摘要

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

2605.27952 2026-05-28 cs.CV cs.RO 版本更新

Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry

Con-DSO:学习RGB-D直接稀疏里程计的短时一致性先验

Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang, Nak Young Chong

Affiliations: School of Information Science, Japan Advanced Institute of Science and Technology; College of Information Engineering, Shenyang University of Chemical Technology

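AI summary: Proposes Con-DSO, a framework that predicts photometric and depth-geometric consistency uncertainty to enable quality-aware pixel selection and weighting, improving the robustness of RGB-D direct sparse odometry in challenging environments with dynamic objects, occlusions, and similar disturbances.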

Comments Submitted

详情
AI中文摘要

视觉里程计(VO)是机器人和增强现实中的基础组件。RGB-D直接VO受益于度量深度测量,但在动态物体、遮挡、光照变化和不可靠深度违反直接对齐所使用的短时光度和深度几何一致性假设的挑战环境中,性能会下降。现有方法通过语义过滤、显式遮挡推理、光照适应或手工几何准则来缓解这些问题,但通常依赖外部模块或针对个别故障模式的固定假设,限制了其灵活性和以统一方式处理多样挑战的能力。本文提出Con-DSO,一种一致性感知的RGB-D直接稀疏里程计框架,从时间相邻的RGB-D帧对预测密集的光度和深度几何一致性不确定性。一致性网络通过流引导的光度误差和投影深度一致性误差进行训练,使得一致性违规可表示为像素级不确定性。这些成对不确定性预测被转换为关键帧跟踪的主机侧质量先验。该先验随后通过质量感知的支持像素选择和位姿估计中的解耦光度-几何加权应用于VO,使得不可靠观测持续衰减,而非硬拒绝或基于阈值的门控。在五个公开RGB-D基准上的实验表明,与直接RGB-D VO基线相比,在ICL-NUIM上绝对轨迹误差降低超过20%,在RGB-D Scenes V2、TUM/Bonn Dynamic和OpenLORIS序列上降低50%-80%。

英文摘要

Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50% to 80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.
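The contrast between continuous attenuation and threshold-based gating can be made concrete with a small sketch. The weighting function `1 / (1 + u^2)` is an assumed, illustrative choice and is not the formula used by Con-DSO; all function names are hypothetical.

```python
def soft_weights(uncertainties):
    """Continuous attenuation: weight decays smoothly as uncertainty grows."""
    return [1.0 / (1.0 + u * u) for u in uncertainties]


def hard_gate(uncertainties, threshold):
    """Threshold-based gating: an observation is either fully kept or dropped."""
    return [1.0 if u < threshold else 0.0 for u in uncertainties]


def weighted_estimate(observations, weights):
    """Weighted mean of scalar observations, a stand-in for a weighted residual solve."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all observations were rejected")
    return sum(w * o for w, o in zip(weights, observations)) / total
```

With soft weights an outlier still contributes a little, so the estimate changes smoothly as uncertainty varies, whereas a hard gate changes the estimate abruptly whenever an observation crosses the threshold.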

2605.27950 2026-05-28 cs.CV 版本更新

Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

从以自我为中心的饮食环境图像推断饮食行为改变接受度的可行性评估

Long Li, Yuning Huang, Heather A. Eicher-Miller, J. Graham Thomas, Fengqing Zhu, Edward Sazonov

Affiliations: The University of Alabama; Purdue University; Brown University

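AI summary: Using egocentric eating images collected by a wearable camera, this study combines a pre-trained CLIP vision encoder with a lightweight transformer classifier to provide preliminary evidence that passive sensing can be used to infer dietary behavior change receptivity.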

详情
AI中文摘要

准确评估饮食行为改变接受度对于设计有效的即时自适应干预措施(JITAIs)以促进更健康的饮食习惯至关重要。然而,基于自我报告的行为改变接受度评估稀疏且延迟,限制了其在持续监测中的实际应用。为探索被动感知是否有助于解决这一挑战,本研究进行了一项初步调查,从可穿戴相机收集的以自我为中心的饮食图像中推断参与者自我报告的行为改变接受度。我们使用自动摄入监测器v2(AIM-2)在自由生活饮食事件中获取的初步数据。数据包括饮食期间捕获的以自我为中心的图像序列,并配以评估行为改变接受度特定维度(意识、互动能力和动机)的问题的回答。为了检查视觉信息是否与这些回答相关,我们评估了一个结合预训练对比语言-图像预训练(CLIP)视觉编码器和轻量级Transformer分类器的迁移学习辅助框架。该模型处理饮食事件图像序列,以提取与行为改变接受度相关的潜在语义和时间线索。初步实验结果显示,在行为改变接受度指标上,相比于简单基线模型有显著改进。这些早期发现表明,以自我为中心的饮食事件图像可能包含与饮食行为改变接受度相关的线索,并需要在更大、更全面的数据集上进行进一步研究。

英文摘要

Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data include egocentric image sequences captured during eating, paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether visual information is relevant to these responses, we evaluate a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.

2605.27938 2026-05-28 cs.CV 版本更新

SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images

SEMAGIC: 从野外图像中学习语义一致的可变形3D表示

Sky Cen, Wufei Ma, Guofeng Zhang, Alan Yuille, Adam Kortylewski

Affiliations: Johns Hopkins University; CISPA Helmholtz Center for Information Security

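AI summary: To address the unstable semantic correspondences of existing deformable 3D reconstruction methods, proposes SEMAGIC, a framework that enforces semantic consistency during reconstruction through a feature-level consistency loss and vertex-index-conditioned deformation, improving category-level semantic correspondence.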

详情
AI中文摘要

从单视图野外图像中学习可变形3D物体模型已实现了无需监督的令人印象深刻的3D形状重建。然而,这些模型是否捕捉到下游任务所需的语义结构仍不清楚。我们发现,现有的可变形重建方法尽管生成了视觉上合理的几何形状,但在实例间产生了不稳定的对应关系,并在语义对应基准上表现不佳。我们引入了SEMAGIC,一个从单视图野外图像中学习语义一致的可变形3D表示的框架。SEMAGIC不将重建视为最终目标,而是将可变形建模作为发现类别级对应关系的机制。每个类别由一个规范模板网格和一个学习到的变形场表示,其功能类似于一个从图像特征重建实例几何的自编码器,使得顶点能够在实例间保持一致的语义含义。训练过程中通过(i)对齐规范网格和变形网格之间语义特征的特征级一致性损失,以及(ii)保持实例间语义对应的顶点索引条件变形,来强制语义一致性。通过将几何变形与语义对齐显式耦合,SEMAGIC生成了在类别内变化中保持稳定部件对应的表示。实验表明,SEMAGIC在SPair-71k上将可变形模型的语义对应提高了+14.7 PCK@0.1,确立了可变形模型作为有效语义3D表示的地位。

英文摘要

Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 PCK@0.1 on SPair-71k, establishing deformable models as effective semantic 3D representations.
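The PCK@0.1 metric quoted above counts a predicted keypoint as correct when it lies within a fraction alpha of a reference length from the ground truth. The sketch below assumes the common bounding-box convention, where the reference length is the larger side of the object box; the exact evaluation protocol of the paper is not stated in the abstract, and the function name is hypothetical.

```python
import math


def pck(pred_points, gt_points, bbox_size, alpha=0.1):
    """Fraction of keypoints within alpha * max(bbox width, bbox height) of ground truth."""
    if len(pred_points) != len(gt_points) or not gt_points:
        raise ValueError("need matching, non-empty keypoint lists")
    width, height = bbox_size
    threshold = alpha * max(width, height)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred_points, gt_points)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt_points)
```

For a 100 x 80 box and alpha = 0.1 the tolerance is 10 pixels, so a keypoint that is 5 pixels off counts as correct while one that is 20 pixels off does not.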

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG 版本更新

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

当图文推理遇上安全:什么决定了多模态越狱鲁棒性?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

Affiliations: Independent Researcher; Stanford University; Harvard University; Purdue University; Duke University

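AI summary: Studies how different think-with-image reasoning paradigms in multimodal large language models affect jailbreak robustness, finds that explicit image-tool interaction markedly lowers attack success rates, and explains the mechanism at the representation level with an image-tool safety vector framework.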

Comments 17 pages, 6 figures, 7 tables

详情
AI中文摘要

图文推理正成为大型视觉-语言模型的一种新推理范式,但其安全性影响尚不明确。现有系统已涵盖多种流程设计,包括直接响应生成、纯文本前轮、视觉状态操作以及显式外部图像工具调用。本文探究这些评估范式中哪一种能提升多模态越狱鲁棒性及其原因。在多个视觉-语言模型上,我们的实验表明显式图像工具交互的攻击成功率最低,平均相对降低约30%。这一发现起初令人惊讶:即使返回的图像工具输出被人为覆盖或本身不安全,攻击成功率仍保持较低,但在纯文本前轮控制下又恢复到接近直接回答的水平。这些结果表明,较低的攻击成功率并非由良性返回图像语义或仅文本图像工具轨迹解释。为解释这一模式,我们引入了一个图像工具安全向量框架,将图像工具调用建模为隐藏表示向安全相关方向的残差偏移。表征层面的分析和激活干预支持了这一解释。总体而言,我们的结果表明,显式图像工具交互是提升越狱鲁棒性的一种有前景的设计模式,同时也推动了针对特定流程的安全性评估。

英文摘要

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

2605.27927 2026-05-28 cs.CV cs.LG 版本更新

Structure-Guided Visual Perturbation Neutralization for LVLMs

结构引导的视觉扰动中和用于大型视觉语言模型

Yuanhe Zhang, Xueting Wang, YanBin Ren, Haoran Gao, Xinhan Zheng, Zhenhong Zhou, Fanyu Meng, Li Sun, Sen Su

Affiliations: Beijing University of Posts and Telecommunications; University of Science and Technology of China; JIUTIAN Research; Nanyang Technological University; Chongqing University of Posts and Telecommunications

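AI summary: Proposes Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play adversarial defense built on Prior Structural Extraction and Dynamic Guided Neutralization, achieving a defense success rate above 87% with only 0.5% pixel modification and 0.16 seconds per image.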

详情
AI中文摘要

图像输入使大型视觉语言模型(LVLMs)能够感知细粒度的视觉信息,但也引入了一个像素级攻击面,通过该攻击面,对抗性扰动可以引发不安全的模型行为。然而,大多数现有防御是为传统计算机视觉场景设计的,因此常常忽略LVLMs所需的跨模态对齐,导致性能下降。同时,针对LVLMs的有限防御通常需要大量的图像修改并引入可观的计算开销,从而损害推理质量和效率。为解决这些限制,我们提出了结构诱导引导中和(SIGN),一个轻量级、即插即用的防御框架,通过先验结构提取提高LVLM兼容性,并通过动态引导中和实现高效的扰动抑制。大量实验表明,SIGN在仅0.5%像素修改和每张图像0.16秒的情况下实现了超过87%的防御成功率,同时几乎保留了原始视觉表示和良性任务性能。我们的工作为需要昂贵模型训练的防御提供了一种轻量级替代方案,并突显了利用视觉编码器进行高效对抗性保护的潜力。我们的代码已在 https://anonymous.4open.science/r/SIGN-BCB1 开源。

英文摘要

Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is open source at https://anonymous.4open.science/r/SIGN-BCB1.

2605.27924 2026-05-28 cs.CV 版本更新

SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

SIGMA: 基于语义差异的指令引导掩码标注器用于文本驱动图像操作定位

Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie, Jishen Zeng, Baoying Chen, Jiwu Huang, Xiaochun Cao

Affiliations: Shenzhen Campus of Sun Yat-sen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security; Shenzhen University of Advanced Technology and Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Alibaba Group; Shenzhen MSU-BIT University

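AI summary: Proposes SIGMA, which uses semantic-feature differencing in a vision foundation model together with an instruction-derived spatial prior to automatically generate pixel-level masks from public editing datasets for training image manipulation localization models; it improves F1 by 12.20% on five benchmarks and produces a training set of about 1.1M samples that raises the F1 of six detectors by 18.34% on average.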

详情
AI中文摘要

文本驱动的图像编辑发展迅速,但可靠地定位这些操作需要在大规模像素标注数据集上训练的图像操作定位(IML)模型,目前尚无低成本获取此类训练数据的方法。我们观察到这些数据实际上已经以伪装形式存在:公开编辑数据集包含数百万个与IML训练样本结构相同的(原始、编辑)图像对,仅缺少像素级掩码。自动恢复这些掩码并非易事:像素差异被扩散引起的所有像素扰动淹没,而仅基于指令的定位只能定位提示描述的内容,遗漏了意外的编辑副作用。我们提出SIGMA(语义差异指令引导掩码标注器),它在视觉基础骨干网络中进行语义特征差异计算,并通过双向跨模态精炼将指令导出的空间先验注入视觉流,在编辑器忠实实现用户意图时放大预期编辑区域的差异信号。SIGMA通过两个互补阶段训练:第一阶段在修复掩码上进行监督;第二阶段通过VAE往返噪声校准、EMA自训练和编辑噪声解耦损失来弥合扩散域偏移。SIGMA在五个基准上优于现有自动掩码生成器(F1提升12.20%,IoU提升11.16%)。当应用于公开编辑语料库时,它生成了约110万IML训练集,使六个不同检测器在五个数据集上平均F1提升18.34%,将以前未使用的编辑数据转化为IML的模型无关监督资源。论文被接收后我们将立即发布完整代码库。

英文摘要

Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of structurally identical (original, edited) pairs to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We'll release the full codebase as soon as the paper is accepted.
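The F1 and IoU figures above are pixel-level mask metrics. A minimal sketch of how they are typically computed for one predicted mask against one ground-truth mask follows; masks are flattened to lists of 0/1 values, and treating two empty masks as a perfect match is an assumed convention. The function name is hypothetical.

```python
def mask_f1_iou(pred_mask, gt_mask):
    """Pixel-level F1 and IoU for two flattened binary masks of equal length."""
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    fp = sum(1 for p, g in zip(pred_mask, gt_mask) if p and not g)
    fn = sum(1 for p, g in zip(pred_mask, gt_mask) if not p and g)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match (assumption)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou
```

Because F1 counts the true positives twice, it is never lower than IoU on the same mask pair, which is worth keeping in mind when comparing the two columns of a results table.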

2605.27923 2026-05-28 cs.CV cs.AI cs.LG quant-ph 版本更新

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

我们真的需要量子机器学习吗?:一项多维实证研究

Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

Affiliations: Department of Computer Science, University of Alabama, AL 35487

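AI summary: Through multidimensional benchmarking of classical and quantum machine learning models on the MNIST handwritten digit dataset, finds that the quantum models match or exceed the classical ones in accuracy and are more parameter- and memory-efficient, but incur higher computational cost.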

详情
AI中文摘要

计算机视觉的快速发展和日益复杂的图像识别任务暴露了经典机器学习模型的基本计算限制,推动了量子计算作为一种新兴范式的探索。本文对MNIST手写数字数据集上的经典和量子机器学习模型进行了全面的基准测试,评估了传统模型(经典支持向量机CSVM和量子支持向量机QSVM)以及深度神经网络模型(经典卷积神经网络CCNN和量子卷积神经网络QCNN)在四个性能维度上的表现:分类准确率、计算运行时间、参数数量和内存需求。实验作为特征维度和样本量的函数进行,并在CPU和GPU执行环境下进行,提供了受控的多维比较,以解决先前工作中的空白。对于基于SVM的模型,QSVM在准确率上始终优于CSVM,在1000个样本时达到约0.90对比约0.85,但计算成本更高。10个量子比特的特征数和200-500的样本量成为平衡准确率和运行时间的实际工作点。对于神经网络模型,CCNN和QCNN实现了可比的分类准确率,在64个特征和60000个样本时均超过0.96,但QCNN在参数和内存效率上显著更优,在较高特征数下比CCNN少约94%的参数和约75%的内存,但运行时间更长。在两个模型家族中,随着特征维度或样本量的增加,量子模型在准确率上始终以更大优势超越经典模型。

英文摘要

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching approximately 0.90 versus approximately 0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200 to 500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring approximately 94% fewer parameters and approximately 75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

2605.27920 2026-05-28 cs.CV 版本更新

Rethinking Video-Language Model from the Language Input Perspective

从语言输入角度重新思考视频-语言模型

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

Affiliations: School of Software Engineering, Huazhong University of Science and Technology; Nanyang Technological University, Singapore; University College London; Huazhong University of Science and Technology; Wuhan University

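AI summary: Starting from the language input, this paper proposes a plug-and-play framework that improves video-language models by generating positive and negative texts, applying attribute-based text reasoning, and training with a self-weighted loss.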

Comments Published in AAAI 2026

详情
AI中文摘要

受大语言模型浪潮的驱动,视频-语言模型(VLM)已成为弥合视频与文本之间差距的重要且具有挑战性的技术。尽管先前的VLM工作取得了显著进展,但几乎所有工作都隐含地假设所有文本都是由特定模板预定义的。在实际应用中,这种严格的假设无法满足,因为1)预定义所有文本极其耗时费力;2)这些预定义的文本输入过于限制且不友好,限制了其应用。观察到,给定视频输入,语义相似但模板不同的文本会导致不同的性能。为此,本文提出了一种新颖的即插即用框架,用于各种基于VLM的方法,以充分弥合视频和文本。具体来说,我们首先从原始文本中生成正负文本,以针对特定的文本组件。然后,我们提出了一种基于属性的文本推理策略,以挖掘生成文本的细粒度语义。最后,我们利用视频作为指导,通过设计自加权损失来进行跨模态桥接。大量实验表明,所提出的方法可以作为即插即用模块,有效提升最先进VLM的性能。

英文摘要

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy since 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. We observe that, given a video input, texts with similar semantics but different templates lead to different performance. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module to effectively improve the performance of state-of-the-art VLMs.

2605.27916 2026-05-28 cs.CV cs.CL 版本更新

OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

OphIn-500K:策划网络规模的视觉指令以扩展眼科多模态大语言模型

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

发表机构 * Arizona State University(亚利桑那州立大学) Clemson University(克莱姆森大学) Washington University in St. Louis(圣路易斯华盛顿大学) University of Notre Dame(圣母大学) Florida State University(佛罗里达州立大学) Rice University(莱斯大学) NVIDIA(英伟达) Mayo Clinic(梅奥诊所)

AI总结 提出OphIn-Engine流水线从网络视频中构建高质量眼科指令数据,生成包含50万+指令实例的OphIn-500K数据集,并基于此开发眼科专用多模态大语言模型OphIn-VL,在多项任务上超越现有通用医学和专用模型。

详情
AI中文摘要

通用医学多模态大语言模型(MLLMs)的进步为构建支持临床诊断的对话助手展现了巨大潜力。然而,它们在高度专业化领域(如眼科)的适应性仍未得到充分探索,主要原因是缺乏大规模、领域特定的指令微调数据。现有的眼科对话数据集通常规模有限,且大多依赖于已建立的公共基准图像,限制了眼科MLLMs的可扩展性及其捕捉真实临床复杂性的能力。为解决这一问题,我们提出了OphIn-Engine,一个眼科特定的指令数据策划流水线,从开放获取的眼科网络规模视频中构建高质量指令数据。该流水线整合了多模态转录以提取图像-文本对、视觉线索分离与评分以识别临床相关的视觉描述,以及指令合成与质量控制以生成准确且多样的临床对话。利用该引擎,我们推出了OphIn-500K,一个大规模多模态眼科指令微调数据集,包含超过50万个指令实例和来自29,000多个视频片段的151,000多张独特图像,格式包括视觉问答(VQA)、多轮对话交互和思维链(CoT)推理。基于该数据集,我们进一步开发了OphIn-VL,一个具有高级视觉理解和对话能力的眼科专用MLLM。综合实验和案例研究表明,与最先进的通用医学和领域专用MLLMs相比,OphIn-VL实现了更优的性能。

英文摘要

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

2605.27900 2026-05-28 cs.CV 版本更新

Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

联邦学习中解耦训练与局部强化微调

Yuting Ma, Lechao Cheng, Xiaohua Xu

发表机构 * School of Computer Science and Technology, University of Science and Technology of China(中国科学技术大学计算机科学与技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院)

AI总结 提出FedDTL框架,通过解耦图像和文本编码器训练并引入两阶段局部微调(监督微调+强化学习),解决联邦学习中客户端间优化不一致和客户端内过专门化问题,平衡全局任务适应性与泛化能力。

Comments This work has been accepted by ICML 2026

详情
AI中文摘要

联邦学习(FL)与预训练视觉-语言模型(VLM)的结合已成为各种下游任务的有前景范式。通过利用其强大的表示,最近的研究在局部数据不足的情况下改进了任务适应性,同时保持了泛化能力。然而,这些方法强调完全局部优化和简单的参数聚合,这可能在异构和全数据FL设置下放大客户端间优化不一致和客户端内过专门化,使得平衡全局任务适应性和泛化变得困难。为了解决这些挑战,我们提出了FedDTL,一种新颖的联邦VLM框架,该框架在客户端和服务器之间解耦图像编码器和文本编码器。通过解耦编码器训练与服务器-客户端模态对齐,FedDTL促进了连贯的全局语义更新并减少了客户端间优化不一致,从而改善了全局任务适应性。为了进一步缓解客户端内过专门化,我们引入了两阶段局部微调,其中监督微调阶段实现了快速可靠的预热启动,随后是增强泛化的强化学习阶段。在多个基准测试上的大量实验,包括标签偏移和特征偏移,表明FedDTL在少样本和全数据设置下,在各种FL数据分布中实现了全局任务适应性和泛化之间的有效平衡。

英文摘要

Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging their strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce two-stage local fine-tuning, where a supervised fine-tuning stage enables a rapid and reliable warm start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.

2605.27894 2026-05-28 cs.CV 版本更新

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

面向不完整多模态输入的统一视觉-语言模型

Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, Wei Ji

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件工程学院) Nanyang Technological University, Singapore(新加坡南洋理工大学) University College London(伦敦大学学院) Guangzhou University(广州大学) Wuhan University(武汉大学) Nanjing University(南京大学)

AI总结 针对视频-语言模型在传感器失效导致模态不完整数据下的训练-测试不一致问题,提出首个统一的不完整视频-语言模型作为即插即用模块,提升多模态任务性能。

Comments Published in AAAI 2026

详情
AI中文摘要

视频-语言模型(VLM)在多种计算机视觉应用中展示了令人印象深刻的多模态推理能力。然而,这些VLM是任务特定的,并假设视频和语言输入都是完整的。然而,现实世界的VLM应用可能因传感器停用(例如,由于数据隐私导致摄像头不可用)而面临挑战,产生模态不完整的数据,并导致训练和测试数据之间的不一致。虽然简单的不完整输入可以提升训练泛化能力并导致训练失败,但其对VLM在安全性和可信度方面的潜在风险在很大程度上被忽视了。为此,我们首次尝试提出一个统一的不完整视频-语言模型来处理不完整的多模态输入。大量实验结果表明,我们的方法可以作为先前工作的即插即用模块,提高它们在各种多模态任务中的性能。

英文摘要

Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. However, real-world VLM applications might face challenges due to deactivated sensors (e.g., cameras are unavailable due to data privacy), yielding modality-incomplete data and leading to inconsistency between training and testing data. While straightforward incomplete input can boast training generalization-ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process the incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.

2605.27893 2026-05-28 cs.CV 版本更新

SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

SIGMA:弥合视觉基础模型适应的结构与分布差距

Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi, Ying Huang

发表机构 * Xiaomi Corporation(小米公司)

AI总结 提出SIGMA方法,通过尺度自适应融合和语义调制模块,以1.72%可训练参数实现视觉基础模型在密集预测任务上的高效微调,性能优于现有PEFT方法。

详情
AI中文摘要

视觉基础模型(VFM)展示了令人印象深刻的表示能力。然而,通过全微调将它们适应到下游任务会带来高昂的计算和存储开销。参数高效微调(PEFT)作为一种有吸引力的替代方案应运而生,旨在以最小的训练成本实现与全微调相当的性能。尽管如此,将PEFT应用于VFM进行密集预测任务仍然具有挑战性,因为存在结构和分布差距。为了弥合这些差距,我们提出了尺度集成全局调制适配器(SIGMA),一种新颖的轻量级PEFT方法,它由两个模块组成:尺度自适应融合和语义调制。具体来说,尺度自适应融合模块用于通过增强多粒度视觉信息的提取来弥合结构差距。此外,SIGMA在融合特征上引入语义调制以执行全局特征对齐,进一步消除分布差距。这种设计促进了统一的空间和分布适应,相对于VFM骨干网络仅需1.72%的可训练参数。在各种下游密集任务和多个VFM骨干网络上的全面实验表明,SIGMA在性能上一致且优于最先进的PEFT方法。

英文摘要

Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fused features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

2605.27891 2026-05-28 cs.CV cs.AI 版本更新

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

SmartDirector: 基于关键帧的叙事节奏可控电影视频生成

Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

发表机构 * Youku Moku-Lab(优酷莫酷实验室)

AI总结 提出SmartDirector框架,通过多关键帧条件控制视频生成中的叙事结构和时间节奏,采用两阶段方法(Director-Gen生成低分辨率视频,Director-SR利用高分辨率关键帧细化细节),显著优于现有方法。

详情
AI中文摘要

视频的叙事质量从根本上决定了其感知价值。尽管现有的视频生成方法可以生成视觉上吸引人的内容,但它们主要依赖于稀疏的条件信号,如文本提示或首尾帧,这限制了对叙事结构和时间节奏的精确控制。在本文中,我们提出了SmartDirector,一个通过多个关键帧增强视频生成模型叙事能力的框架。SmartDirector支持灵活的生成场景,包括单镜头生成、多镜头叙事合成和视频扩展。该框架分两个阶段运行:Director-Gen根据提供的关键帧生成低分辨率视频,Director-SR通过利用高分辨率关键帧作为语义锚点来恢复细粒度细节,从而优化输出。为了实现鲁棒的多关键帧训练,我们构建了一个数据管道,从电影中策划单镜头和多镜头序列。大量实验表明,SmartDirector显著优于现有的最先进方法。我们将发布代码以促进进一步研究。

英文摘要

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

2605.27885 2026-05-28 cs.CV 版本更新

Reflective Dialogue between Teacher and Solver Agents for Video Question Answering

教师与求解器智能体之间的反思性对话用于视频问答

Takuya Murakawa, Toru Tamaki

发表机构 * Nagoya Institute of Technology(名古屋工业大学)

AI总结 提出一种仅通过推理时上下文注入的适应方法,利用教师与求解器智能体之间的反思性对话(RD)来提升视频问答性能,在EgoCross基准上超越零样本和标准上下文学习,获得CVPR 2026 EgoVis Workshop跨域挑战赛开源赛道第三名。

Comments This paper serves as the technical report for the 1st Cross-Domain EgoCross Challenge @ EgoVis Workshop, CVPR 2026

详情
AI中文摘要

已经提出了各种方法来使视觉语言模型(VLM)适应视频问答的专门领域,包括微调和上下文学习。然而,在推理阶段仅从少量标记支持集获取任务特定知识而不进行微调仍然是一个挑战。在本文中,我们提出了一种仅通过推理时上下文注入来实现适应的方法。我们的方法首先构建一个反思性对话(RD)——两个智能体之间的多轮对话,其中教师提出每个支持问题并提供正确性反馈,求解器回答并提供视觉基础解释(或反思)以说明正确和错误的答案。然后,该对话历史在推理阶段用作上下文。在EgoCross基准上的实验表明,我们的方法优于基线零样本设置和直接传递支持集示例的标准上下文学习方法,在CVPR 2026 EgoVis Workshop的第一届跨域EgoCross挑战赛开源赛道中获得第三名,本文也作为该挑战赛的技术报告。

英文摘要

Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD) -- a multi-turn conversation between two agents, in which Teacher poses each support question and delivers correctness feedback, and Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.
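As a rough illustration of the inference-time context injection described in this abstract, the sketch below assembles a Teacher/Solver dialogue history from a labeled support set and reuses it as context for a new question. The message format, the `Solver` callable interface, the feedback wording, and all function names are hypothetical; the abstract does not specify them, video inputs are omitted, and this is not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
# Hypothetical interface: a solver maps (dialogue history, prompt) to a text reply.
Solver = Callable[[List[Message], str], str]


def build_reflective_dialogue(
    support_set: List[Tuple[str, str]],
    solver: Solver,
) -> List[Message]:
    """Build a multi-turn Teacher/Solver history from (question, gold answer) pairs."""
    history: List[Message] = []
    for question, gold in support_set:
        # Teacher poses the support question.
        history.append({"role": "teacher", "content": question})
        answer = solver(history, question)
        history.append({"role": "solver", "content": answer})
        # Teacher delivers correctness feedback.
        correct = answer.strip().lower() == gold.strip().lower()
        feedback = "Correct." if correct else f"Incorrect. The answer is {gold}."
        history.append({"role": "teacher", "content": feedback})
        # Solver reflects on the outcome, for correct and incorrect answers alike.
        reflection = solver(history, "Explain the visual evidence for the correct answer.")
        history.append({"role": "solver", "content": reflection})
    return history


def answer_with_context(history: List[Message], solver: Solver, query: str) -> str:
    """Use the finished dialogue as context for a new, unlabeled question."""
    return solver(history + [{"role": "teacher", "content": query}], query)
```

In this sketch no model weights change: the only adaptation is the extra dialogue history passed to the solver at inference time.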

2605.27884 2026-05-28 cs.CV 版本更新

A Road-Conditioned Traffic Movie Prediction Network with Spatiotemporal and Structure-Consistent Learning

一种基于道路条件的交通电影预测网络,具有时空和结构一致性学习

Joshua Kofi Asamoah, Blessing Agyei Kyem, Armstrong Aboah

发表机构 * North Dakota State University(北达科他州立大学)

AI总结 提出RCSNet,一种基于道路条件的时空网络,通过拓扑引导的未来状态生成和结构一致性学习,提高跨城市交通预测的准确性和结构一致性。

Comments 22 pages (double column), 7 Tables, 11 Figures

详情
AI中文摘要

城市范围的交通预测对于拥堵管理、路线引导和智能交通系统至关重要,但当未来交通必须作为整个城市网络的空间地图生成时,准确预测仍然具有挑战性。现有的交通电影预测方法提高了帧级精度,但许多方法仍主要将预测视为图像重建。这可能会产生数值上接近真实值但受道路布局、连通性、行驶方向和拥堵传播约束较弱的交通地图,尤其是在交通行为和道路结构都发生变化的跨城市场景中。为了解决这一局限性,本研究提出了RCSNet,一种基于道路条件的时空网络,将交通电影预测重新表述为拓扑引导的未来状态生成。RCSNet从静态道路地图中提取道路感知表示,从历史观测中建模多时域交通动态,将方向性交通特征与局部道路结构对齐,并逐步生成未来交通地图以提高时间一致性。结构一致性学习目标进一步鼓励预测保持准确、与道路对齐且空间稳定。跨多个城市的实验表明,RCSNet提高了预测准确性和结构一致性。在柏林、安特卫普和莫斯科的同城预测中,与最接近的基线相比,RCSNet平均MAE、MSE和RMSE分别降低了11.5%、10.0%和5.1%。在未见过的芝加哥和曼谷的跨城市测试中,无需目标城市微调,RMSE分别降低了10.6%和10.5%。额外的时域、道路结构、可解释性、统计和效率分析表明,RCSNet产生了更准确、可迁移、与道路对齐且计算高效的交通预测。

英文摘要

City-wide traffic forecasting is important for congestion management, route guidance, and intelligent transportation systems, but accurate prediction remains challenging when future traffic must be generated as spatial maps over an entire urban network. Existing traffic movie prediction methods have improved frame-level accuracy, yet many still treat forecasting mainly as image reconstruction. This can produce traffic maps that are numerically close to the ground truth but weakly constrained by road layout, connectivity, travel direction, and congestion propagation, especially in cross-city settings where both traffic behavior and road structure change. To address this limitation, this study proposes RCSNet, a road-conditioned spatiotemporal network that reformulates traffic movie prediction as topology-guided future-state generation. RCSNet extracts road-aware representations from static road maps, models multi-horizon traffic dynamics from historical observations, aligns directional traffic features with local road structure, and progressively generates future traffic maps for improved temporal consistency. A structure-consistent learning objective further encourages predictions to remain accurate, road-aligned, and spatially stable. Experiments across multiple cities show that RCSNet improves both forecasting accuracy and structural consistency. In same-city forecasting on Berlin, Antwerp, and Moscow, RCSNet reduces average MAE, MSE, and RMSE by 11.5%, 10.0%, and 5.1%, respectively, compared with the closest baseline. In cross-city testing on unseen Chicago and Bangkok, it reduces RMSE by 10.6% and 10.5% without target-city fine-tuning. Additional horizon-wise, road-structure, explainability, statistical, and efficiency analyses show that RCSNet produces more accurate, transferable, road-aligned, and computationally efficient traffic forecasts.
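To make the reported percentages concrete, the following minimal sketch shows how MAE, MSE, RMSE, and a relative reduction against a baseline are typically computed. The function names are illustrative and the paper's exact evaluation protocol (masking, averaging across cities) is not given in the abstract. Since RMSE is the square root of MSE, a 10.0% MSE reduction on a single dataset corresponds to roughly a 5.1% RMSE reduction, which matches the figures quoted above, although averages over several cities need not follow this relation exactly.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(mse(pred, target))


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# RMSE drop implied by a 10.0% MSE drop on the same data: 100 * (1 - sqrt(0.9)).
rmse_drop_from_mse_drop = relative_reduction(1.0, math.sqrt(1.0 - 0.10))
```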

2605.27843 2026-05-28 cs.CV 版本更新

A self-supervised learning approach to deep filter banks for texture recognition

一种用于纹理识别的深度滤波器组的自监督学习方法

Joao B. Florindo, Lucas O. Lyra, Antonio E. Fabris

发表机构 * Institute of Mathematics and Statistics of the University of Sao Paulo(圣保罗大学数学与统计学研究所) Institute of Mathematics, Statistics and Scientific Computing of the University of Campinas(坎皮纳斯大学数学、统计与科学计算研究所)

AI总结 针对纹理识别中训练数据有限的问题,提出一种基于卷积自编码器的自监督预训练框架,结合深度滤波器和Fisher向量池化,在不显著增加计算负担的情况下提升识别性能。

详情
AI中文摘要

纹理识别中的一个重要挑战是实际应用中经常遇到的训练数据有限。在计算机视觉中,缓解这一问题的一个成功策略是使用预训练阶段,其中神经网络以自监督方式学习识别数据各部分之间的关系。在这方面,一个成熟的框架是掩码自编码器。然而,这些模型通常依赖于计算密集型的架构,如视觉变换器。在纹理图像的特定情况下,大多数相关信息被压缩在每个像素周围的有限区域内,这表明通过注意力机制捕获长距离依赖可能是不必要的。基于这一假设,本文提出了一种预训练模型为卷积自编码器的框架。为了利用纹理模式传递的丰富信息,我们采用了深度滤波器与Fisher向量池化相结合的方法。通过这种方式,我们在不增加显著计算负担的情况下提高了纹理识别的性能。我们的方法与多个纹理数据库中的几种最先进方法进行了比较,证实了其在分类精度和计算复杂度方面的潜力。

英文摘要

An important challenge in texture recognition is the limited amount of training data frequently found in real-world applications. In computer vision in general, a successful strategy to mitigate this issue is the use of a pretraining stage where the neural network learns to identify relations between parts of the data in a self-supervised manner. A well-established framework in this direction is the masked autoencoder. Nevertheless, these models usually rely on computationally intensive architectures, such as vision transformers. In the particular case of texture images, most of the relevant information is concentrated within a delimited area around each pixel, which suggests that capturing long-range dependence via the attention mechanism may be unnecessary. Based on that assumption, here we propose a framework where the pretraining model is a convolutional autoencoder. To leverage the rich information conveyed by texture patterns, we employ deep filters coupled with Fisher vector pooling. In this way, we improve the performance of texture recognition without adding significant computational burden. Our approach is compared with several state-of-the-art methods on different texture databases, confirming its potential both in terms of classification accuracy and computational complexity.

2605.27823 2026-05-28 cs.CR cs.AI cs.CV 版本更新

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

解耦对抗性提示:基于语义图的鲁棒大语言模型安全防御

Xiang Fang, Wanlong Fang

发表机构 * Xiang Fang(1. 方翔) Wanlong Fang(2. 方万龙)

AI总结 提出对抗性提示解耦(APD)框架,通过互信息语义分解、图谱分析和轻量级分类器,在输入处理前识别并中和恶意组件,将有害输出减少85%以上。

Comments Published in AAAI 2026

详情
AI中文摘要

大语言模型(LLMs)越来越容易受到利用语义歧义绕过安全机制的对抗性提示的攻击,导致有害或不适当的输出。此类攻击,包括越狱和提示注入,对安全关键应用中LLMs的完整性和可用性构成重大风险。本文提出对抗性提示解耦(APD)框架,一种新颖的防御机制,在输入提示被LLM处理之前主动识别并中和其中的恶意组件。APD框架集成了三项关键创新:(1)基于互信息的语义分解方法,用于分离对抗性和良性提示组件,确保统计独立性;(2)基于图的意图分类方法,利用谱分析检测提示语义中的恶意模式;(3)轻量级基于Transformer的分类器,在真实世界的毒性和越狱提示数据集上训练,实现高效准确的对抗性意图检测。在包含对抗性提示的多样化数据集上评估,APD展现出卓越的鲁棒性,将有害输出生成减少超过85%,同时对模型性能的影响可忽略不计。该框架的计算效率支持实时部署,使其成为保护LLMs的实用解决方案。我们的工作解决了机器学习安全中关于新型攻击和ML系统完整性方法的关键挑战,并提供了一种可扩展、符合伦理的防御手段来对抗基于提示的对抗性威胁。

英文摘要

Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security-critical applications. This paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in input prompts before they are processed by the LLM. The APD framework integrates three key innovations: (1) a mutual information-based semantic decomposition method to isolate adversarial and benign prompt components, ensuring statistical independence; (2) a graph-based intent classification approach that leverages spectral analysis to detect malicious patterns in prompt semantics; and (3) a lightweight transformer-based classifier trained on real-world datasets of toxic and jailbreaking prompts, enabling efficient and accurate adversarial intent detection. Evaluated on diverse datasets containing adversarial prompts, APD demonstrates superior robustness, reducing harmful output generation by over 85% while maintaining negligible impact on model performance. The framework's computational efficiency supports real-time deployment, making it a practical solution for securing LLMs. Our work addresses critical challenges in machine learning security on novel attacks and integrity methods for ML systems, and offers a scalable, ethically grounded defense against prompt-based adversarial threats.

2605.27817 2026-05-28 cs.RO cs.AI cs.CV cs.LG 版本更新

Turning Video Models into Generalist Robot Policies

将视频模型转化为通用机器人策略

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

发表机构 * MIT(麻省理工学院) CMU(卡内基梅隆大学) Amazon FAR(亚马逊公司)

AI总结 提出一种解耦的视频到动作策略VERA,利用无动作视频世界模型和基于机器人雅可比矩阵的逆动力学模型,实现跨本体的零样本机器人控制。

Comments project page: https://vera.csail.mit.edu

详情
AI中文摘要

视频生成模型已成为一种有前景的机器人骨干网络,能够生成描绘跨本体和环境完成复杂任务的视频。最近的工作提出了机器人基础模型,通过使用带有动作标签的数据微调视频模型,联合预测未来观测和动作。在本文中,我们测试了一种替代方法的极限:保持视频规划器不变,同时训练一个特定本体的逆动力学模型(IDM)。这种解耦带来了几个自然的好处:视频规划器保持本体无关,不同的视频模型可以轻松互换而无需重新训练IDM,并且IDM可以独立地使用现成的自对弈数据进行训练。我们提出了一种闭环的视频到动作策略,该策略将无动作视频世界模型与基于机器人本体雅可比矩阵的精心设计的IDM相结合。我们证明了我们的IDM设计既数据高效又可扩展到高维动作空间。我们将该策略命名为视频到具身机器人动作模型(VERA),在模拟和真实世界基准测试中取得了强劲的性能,包括零样本的Panda机械臂操作和16自由度Allegro灵巧手立方体重新定向。通过将相同的视频规划器与不同的本体特定IDM配对,可以在多个本体上使用。我们的结果表明,解耦的视频规划加上忠实的视频到动作翻译是实现零样本、跨本体和可泛化机器人控制的可行替代途径。更多结果请访问我们的项目网站:https://vera.csail.mit.edu。

英文摘要

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

2605.27816 2026-05-28 cs.CV 版本更新

Pattern Recognition Tasks with Personalized Federated Learning

个性化联邦学习的模式识别任务

Md. Arifur Rahman, Isha Das, Mushfiqur Rahman Abir, B. M. Taslimul Haque, Abdullah Al Noman, Abir Ahmed, Md. Jakir Hossen

发表机构 * College of Graduate and Professional Studies, Trine University(特灵大学研究生与专业研究学院) Network Communication and IoT Lab, Chittagong University of Engineering and Technology(吉大港工程技术大学网络通信与物联网实验室) Department of Computer Science and Engineering, American International University-Bangladesh(美国国际大学-孟加拉国计算机科学与工程系) Information Systems, Central Michigan University(中央密歇根大学信息系统系) Wilmington University(威尔明顿大学) Department of Information Technology, Washington University of Science & Technology(华盛顿科学与技术大学信息科技系) Center for Advanced Analytics (CAA), COE for Artificial Intelligence, Faculty of Engineering & Technology (FET), Multimedia University, Melaka(多媒体大学(马六甲)工程与技术学院(FET)人工智能卓越中心(COE)高级分析中心(CAA))

AI总结 本文通过比较七种个性化联邦学习算法在MNIST、SignMNIST和Digit5数据集上的性能,发现APPLE、FedGC和FedProto在准确率、精确率、召回率和F1分数上表现优异。

Comments Comprehensive comparative analysis of 7 Personalized Federated Learning algorithms across MNIST, SignMNIST, and Digit5 datasets. The paper presents detailed methodology, workflow architecture, experimental evaluation, and privacy-preserving AI analysis for distributed intelligent systems, secure collaborative learning, and critical infrastructure applications

详情
Journal ref
Emerging Science Journal 10(2):974-990 (2026)
AI中文摘要

个性化联邦学习(PFL)构成了一种新颖的范式,它为每个客户端定制机器学习(ML)模型,从而在维护严格数据隐私原则的同时提供个性化的模型更新。与传统的标准联邦学习(FL)方法不同,PFL使模型适应不同的客户端数据分布,从而在最小化通信开销的同时,实现更高水平的准确性、定制化和数据安全性。这种方法在依赖于异构数据源且以隐私问题为关键的模式识别任务背景下尤为突出。在本研究工作中,本文对七种不同的PFL算法进行了全面的比较分析,这些算法在三个不同的数据集(即MNIST、SignMNIST和Digit5)上部署。总体目标是通过基于准确率、精确率、召回率和F1分数等指标的严格评估,确定在模式识别任务框架内最优秀的PFL算法。同时,对这些PFL算法进行了深入审查,阐明了它们的工作流程、优点和局限性。通过实证研究,结果表明APPLE、FedGC和FedProto是强有力的竞争者,在评估的数据集范围内始终提供优越的性能,同时承认其他算法的上下文特异性以及通过迭代改进实现最优结果的潜力。

英文摘要

Personalized Federated Learning (PFL) constitutes a novel paradigm that tailors Machine Learning (ML) models to individual clients, thereby furnishing personalized model updates whilst upholding stringent data privacy principles. Diverging from conventional standard Federated Learning (FL) approaches, PFL adapts models to distinct client data distributions, engendering heightened levels of accuracy, customization, and data security, all while minimizing communication overhead. This methodology proves particularly salient in contexts marked by pattern recognition tasks reliant upon heterogeneous data sources and underpinned by paramount privacy apprehensions. In the present research endeavor, this article undertakes a comprehensive comparative analysis of seven distinct PFL algorithms deployed across three diverse datasets, namely MNIST, SignMNIST, and Digit5. The overarching objective entails ascertaining the preeminent PFL algorithm, within the framework of pattern recognition tasks, through a rigorous evaluation anchored in metrics encompassing Accuracy, Precision, Recall, and F1 Score. Concurrently, an in-depth scrutiny of these PFL algorithms is conducted, elucidating their operative workflows, advantages, and limitations. Through empirical investigation, the findings evince that APPLE, FedGC, and FedProto emerge as stalwart contenders, consistently furnishing superior performance across the spectrum of assessed datasets, while acknowledging the contextual specificity of alternative algorithms and the potential for iterative refinement to realize optimal outcomes.

2605.27813 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

残差化时间稀疏自编码器用于解释扩散模型

Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou, Mohsen Imani

发表机构 * University of California, Irvine(加州大学 Irvine 分校)

AI总结 提出残差化时间稀疏自编码器,通过去噪时间步间的线性预测残差学习扩散激活轨迹中的可解释特征,并在Stable Diffusion 1.5上验证其有效性。

详情
AI中文摘要

文本到图像扩散模型通过迭代去噪过程生成图像,因此内部神经层产生激活轨迹而非单一静态表示。稀疏自编码器(SAE)最近被用于将扩散激活分解为可解释的特征方向,但大多数方法在单个时间步分析激活或基于时间条件,而非直接从完整激活轨迹中学习。在这项工作中,我们引入了用于扩散激活轨迹的残差化时间SAE。我们收集去噪时间上的激活,拟合相邻时间步之间的线性预测器,并使用初始激活以及这些线性动力学未解释的残差分量来表示每个轨迹。在这种残差化表示上训练SAE鼓励稀疏潜在变量捕捉超出线性可预测范围的结构。残差化解码器方向可以映射回激活空间,使得每个潜在变量可以作为去噪时间上的特征轨迹进行分析。通过在Stable Diffusion 1.5上的重建与消融研究、时空特征分析和定性引导实验,我们表明残差化时间SAE为研究时间结构化的扩散激活提供了一个有用的框架。

英文摘要

Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of activations rather than single static representations. Sparse autoencoders (SAEs) have recently been used to decompose diffusion activations into interpretable feature directions, but most approaches analyze activations at individual timesteps or condition on time rather than learning directly from full activation trajectories. In this work, we introduce residualized temporal SAEs for diffusion activation trajectories. We collect activations across denoising time, fit linear predictors between neighboring timesteps, and represent each trajectory using an initial activation together with residual components not explained by these linear dynamics. Training an SAE on this residualized representation encourages sparse latents to capture structure beyond what is linearly predictable. The residualized decoder directions can be mapped back into activation space, allowing each latent to be analyzed as a feature trajectory over denoising time. Through reconstruction and ablation studies, spatiotemporal feature analysis, and qualitative steering experiments on Stable Diffusion 1.5, we show that residualized temporal SAEs provide a useful framework for studying temporally structured diffusion activations.
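The residualization step described in this abstract can be sketched on a toy scale as follows. This is a simplified illustration under stated assumptions, not the paper's method: each trajectory is a single scalar activation per denoising step (real activations are high-dimensional), and each neighboring-step predictor is an affine least-squares fit `x[t+1] ~ a*x[t] + b`, where the bias term `b` is an assumption. All function names are hypothetical, and the SAE that would be trained on the resulting representation is not shown.

```python
from typing import List, Tuple

Trajectory = List[float]  # one scalar activation per denoising step (toy assumption)


def fit_step_predictors(trajs: List[Trajectory]) -> List[Tuple[float, float]]:
    """Least-squares fit of x[t+1] ~ a*x[t] + b for each pair of neighboring steps."""
    n, steps = len(trajs), len(trajs[0])
    params = []
    for t in range(steps - 1):
        xs = [tr[t] for tr in trajs]
        ys = [tr[t + 1] for tr in trajs]
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = cov / var if var > 0 else 0.0
        params.append((a, my - a * mx))
    return params


def residualize(traj: Trajectory, params: List[Tuple[float, float]]) -> List[float]:
    """Return [x_0, r_1, ..., r_{T-1}] with r_{t+1} = x_{t+1} - (a_t*x_t + b_t)."""
    out = [traj[0]]
    for t, (a, b) in enumerate(params):
        out.append(traj[t + 1] - (a * traj[t] + b))
    return out


def reconstruct(rep: List[float], params: List[Tuple[float, float]]) -> Trajectory:
    """Invert residualize: roll the fitted dynamics forward and add the residuals back."""
    traj = [rep[0]]
    for t, (a, b) in enumerate(params):
        traj.append(a * traj[t] + b + rep[t + 1])
    return traj
```

The representation is lossless in this sketch: `reconstruct(residualize(x, params), params)` recovers `x` up to floating-point error, so a decoder direction learned in residual space can be rolled forward into an activation-space trajectory.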

2605.27800 2026-05-28 cs.CV 版本更新

CuriosAI Submission to the CASTLE Challenge at EgoVis 2026

CuriosAI 在 EgoVis 2026 CASTLE 挑战赛中的提交

Yuto Kanda, Hayato Tanoue, Takayuki Hori

发表机构 * SoftBank Corp(软银公司)

AI总结 针对600多小时多视角自我中心视频的185道选择题,提出SVA(搜索-验证-回答)三阶段流水线和TMKG(时间多模态知识图谱)两种方法,SVA达到0.50准确率并作为最终提交。

Comments The 4th place solution for the CASTLE Challenge at the CVPR EgoVis Workshop 2026

详情
AI中文摘要

CASTLE 2026 在超过600小时的同步多视角自我中心视频中提出了185道多项选择题。我们在共享的多模态预处理层之上探索了两种方法,包括每人时间线、说话人解析的转录本和多VLM描述集成。方法A,SVA:搜索-验证-回答,是一个三阶段流水线,它分层缩小到主要窗口,在四个反事实规则下用VLM验证子窗口,并在证据优先级层次下用LLM法官融合证据。方法B,TMKG:时间多模态知识图谱,是相反的:它构建一个时间多模态知识图谱,通过图搜索定位主要单元,并用单个接地VLM产生最终答案。SVA在排行榜上达到0.50的准确率,是我们的最终挑战提交;TMKG达到0.35。

英文摘要

CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer, including per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles. Approach A, SVA: Search-Verify-Answer, is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG: Temporal-Multimodal-Knowledge-Graph, is the contrast: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.

2605.27796 2026-05-28 eess.IV cs.CV cs.LG eess.SP stat.AP 版本更新

Benchmarking Ultrasound Foundation Models for Fetal Plane Classification

超声基础模型在胎儿平面分类中的基准测试

Leya Barrientos, Yuexi Du, Nicha C. Dvornek

发表机构 * 1 Radiology & Biomedical Imaging, Yale School of Medicine, USA 2 Department of Biomedical Engineering, Yale University, USA

AI总结 本研究对四种超声基础模型(USFM、MOFO、UltraSAM、FetalCLIP)在胎儿平面分类任务上进行基准测试,发现FetalCLIP在线性探测设置中表现最佳,而USFM在全微调设置中表现最佳,且预训练目标显著影响迁移性能。

详情
AI中文摘要

超声因其安全性、可及性和实时成像能力被广泛应用于产科护理。然而,其解读仍依赖操作者,且易受噪声和伪影影响。深度学习模型在解决这些问题上表现出色,但通常需要大量标注数据集,这在临床超声中难以获得。基础模型(FMs)提供了一种替代方案,利用大量超声图像学习可迁移的表征,从而在有限标注数据下实现泛化。本文针对胎儿平面分类任务,对超声专用基础模型进行了全面基准测试。我们评估了四种超声基础模型(USFM、MOFO、UltraSAM、FetalCLIP),并与两个CNN基线(ResNet50、EfficientNet-V2)以及一个在自然图像上预训练的ViT(DINOv3)进行比较。我们在两种互补设置下训练所有模型:全微调和冻结编码器的线性探测。所有模型均使用西班牙胎儿超声数据集进行5折患者级交叉验证训练,并在域内数据和外部非洲队列上测试,以评估跨人群泛化能力。我们发现,FetalCLIP在线性探测设置中取得最佳结果(域内F1=0.9261,域外F1=0.9731),而USFM在全微调设置中表现最佳(域内F1=0.9476,域外F1=0.9515)。MOFO和UltraSAM在两种设置中性能下降最多,在某些情况下甚至不如自然图像预训练模型。这些发现强调了预训练模型的选择对胎儿平面分类性能的显著影响,因为不同的预训练目标导致不同的迁移能力。

英文摘要

Ultrasound is widely used in obstetric care due to its safety, accessibility, and real-time imaging. However, interpretation remains operator-dependent and susceptible to noise and artifacts. Deep learning models have shown strong performance on these problems, but they typically require large annotated datasets that are difficult to obtain in clinical ultrasound. Foundation models (FMs) offer an alternative, using a large number of ultrasound images to learn transferable representations that can generalize with limited labeled data. This work presents a comprehensive benchmark of ultrasound-specific FMs for fetal plane classification. We evaluated four ultrasound FMs (USFM, MOFO, UltraSAM, FetalCLIP) against two CNN baselines (ResNet50, EfficientNet-V2) and a ViT (DINOv3) pretrained on natural images. We trained all models under two complementary settings: full fine-tuning and linear probing with a frozen encoder. All models were trained using 5-fold patient-level cross-validation on a Spanish fetal ultrasound dataset and tested on both in-domain data and an external African cohort to assess cross-population generalization. We found that FetalCLIP achieved the best results in the linear probing setting (F1 = 0.9261 for in-domain, F1 = 0.9731 for out-of-domain), while USFM performed best in the full fine-tuning setting (F1 = 0.9476 for in-domain, F1 = 0.9515 for out-of-domain). MOFO and UltraSAM degraded most in both settings, underperforming natural image pretrained models in some cases. These findings highlight how the choice of pretrained model strongly affects fetal plane classification performance, since different pretraining objectives lead to different levels of transferability.
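The "patient-level" qualifier in the cross-validation protocol matters because images from one patient are highly correlated, so splitting by image would leak information between training and validation. The sketch below shows one simple way to build such folds; the function name, the image-to-patient mapping, and the round-robin assignment are illustrative assumptions, as the abstract does not describe how the authors constructed their splits.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def patient_level_folds(
    image_to_patient: Dict[str, str], k: int = 5, seed: int = 0
) -> List[Tuple[List[str], List[str]]]:
    """Split images into k (train, val) pairs so that no patient spans both sides."""
    by_patient = defaultdict(list)
    for image, patient in image_to_patient.items():
        by_patient[patient].append(image)
    # Shuffle patients (not images) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    folds = []
    for i in range(k):
        val_patients = set(patients[i::k])  # every k-th patient validates fold i
        train = [img for p in patients if p not in val_patients for img in by_patient[p]]
        val = [img for p in patients if p in val_patients for img in by_patient[p]]
        folds.append((train, val))
    return folds
```

Each patient lands in exactly one validation fold, so every image is validated once across the k folds.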

2605.27764 2026-05-28 cs.CV cs.AI 版本更新

Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

分割模型能理解世界吗?通过视觉思维链实现主动可供性推理

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

发表机构 * Northwestern University(西北大学) Northeastern University(东北大学) South China University of Technology(华南理工大学) Hong Kong Baptist University(香港浸会大学) Beijing Normal - Hong Kong Baptist University(北京师范大学-香港浸会大学)

AI总结 提出SegWorld框架,通过多级视觉思维链在意图级指令下进行主动场景观察和可供性推理,实现从目标到部件的高效分割。

详情
AI中文摘要

最近的分割模型将大语言模型(LLMs)与掩码解码器结合,将复杂的语言表达映射到掩码上,但其指令仍然是目标指涉的:它们描述、约束或暗示待分割的区域。然而,在现实世界的具身交互中,人类指令通常是意图级的,包括期望的结果而不指定实现该结果的区域。为弥合这一差距,我们引入SegWorld,其中模型在确定掩码之前通过多级视觉思维链(CoT)推理场景。在接收任何指令之前,它主动观察场景,描述可见对象并推断它们可能支持的可能事件。给定指令后,它继续思维链:从与意图相关的对象,到满足意图的动作,再到物理交互部位,即支持该动作的对象部分。我们将SegWorld形式化为概率推理,其中主动观察提供语言场景上下文,当指令以意图级别给出时,可改善掩码预测。我们构建了一个意图到部件的基准,用于评估从高层目标出发的可供性承载部件分割。实验表明,SegWorld在目标指涉指令上匹配指令驱动基线,并在意图级指令上显著提升。

英文摘要

Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often intent-level: they state the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

2605.27761 2026-05-28 cs.CV cs.SE 版本更新

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

AndroidDaily: 面向真实世界闭源应用的可验证移动GUI智能体基准

Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) StepFun Waseda University(早稻田大学)

AI总结 针对闭源应用无法获取内部状态导致自动验证困难的问题,提出AndroidDaily基准(350个日常任务)和GRADE评估器(基于可观察外部指南的三层系统),实现无需内部状态的可验证评估,最强模型成功率为62.0%。

Comments 11 pages, 6 figures. Preprint

详情
AI中文摘要

GUI基础模型和移动GUI智能体的快速发展催生了众多评估基准,但大多数依赖于模拟环境或开源应用,真实世界的闭源应用在很大程度上未得到评估。核心困难在于闭源应用不暴露内部状态,使得传统的自动验证不适用。为弥合这一差距,我们引入了AndroidDaily,一个大规模基准,包含跨94个高频Android应用的350个现实日常任务,涵盖交通、购物、本地服务、娱乐、内容创作、社交媒体和日常实用工具。为了在这些不透明环境中实现自动且可验证的评估,我们提出了基于指南的自动诊断评估评审器(GRADE),这是一个基于三层可观察外部指南系统构建的过程感知评估器:操作义务、输出质量和负面约束。GRADE根据这些标准跟踪智能体的视觉轨迹,并产生步骤级诊断判断,将长期、开放式的移动交互转化为可验证的评估,而无需依赖隐藏的内部状态。实验表明,GRADE与人类评估者的一致性达到87.37%。最强模型在AndroidDaily上的成功率为62.0%,凸显了当前推理能力与现实移动工作流实际执行之间的巨大差距。

英文摘要

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL 版本更新

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

阅读还是猜测?古希腊版本OCR中视觉语言模型的视觉定位失败

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

发表机构 * Inria(法国国家信息与自动化研究所)

AI总结 通过对比开放权重视觉语言模型与传统OCR基线在低资源古希腊批判版本上的表现,发现VLM即使错误也能生成流畅文本,表明其依赖语言先验,并引入扰动和标记级定位度量分析视觉证据。

详情
AI中文摘要

最近的研究表明,用于光学字符识别(OCR)的视觉语言模型(VLM)能够生成看似合理但缺乏视觉支持的文本,暗示其依赖语言先验。通过将开放权重VLM与传统OCR基线在低资源古希腊批判版本上进行对比,我们展示了VLM的错误即使在错误时也往往保持流畅,产生合理的希腊语替换,而传统引擎则产生局部识别噪声。为了分析解码过程中的视觉证据,我们引入了受控图像扰动和基于条件与无图像解码分布的标记级定位度量。在字符级扰动下,VLM与扰动的真实文本严重偏离,而传统OCR相对忠实;然而,标记级分析表明先验依赖是模型特定的:在OCR专业模型中,流畅的词汇错误几乎不依赖图像而产生,而通用VLM即使在错误时也仍然依赖于视觉输入。解码时干预未能可靠地恢复定位,而OCR后语言模型校正仅通过生成后修复文本改善了几个系统。我们的结果将先前关于OCR语言先验依赖的证据扩展到低资源历史文档和更广泛的模型集,表明流畅输出不一定具有视觉基础,并推动了超越总体准确性的可解释性驱动评估。

英文摘要

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
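
The abstract describes token-level grounding measures that compare conditional and image-free decoding distributions, without giving formulas. The sketch below shows one plausible pair of measures under that description: a log-probability ratio for the emitted token and a KL divergence over the vocabulary. The function name, the choice of measures, and the toy tokens are all assumptions, not the paper's definitions.

```python
import math

def grounding_scores(p_with_image, p_without_image, token):
    """Compare next-token distributions with and without the page image.

    Both arguments map candidate tokens to probabilities (each summing to 1).
    Returns (log_ratio, kl): log_ratio is how much the image raised the
    log-probability of the emitted token; kl is KL(with || without).
    """
    eps = 1e-12  # guards against log(0) for tokens missing from one side
    log_ratio = (math.log(p_with_image[token] + eps)
                 - math.log(p_without_image.get(token, 0.0) + eps))
    kl = sum(p * (math.log(p + eps) - math.log(p_without_image.get(t, 0.0) + eps))
             for t, p in p_with_image.items() if p > 0)
    return log_ratio, kl

# Toy distributions over three hypothetical candidate tokens.
prior_only = {"kai": 0.7, "kal": 0.2, "kan": 0.1}   # language prior alone
with_image = {"kai": 0.1, "kal": 0.8, "kan": 0.1}   # image shifts mass to "kal"

ungrounded = grounding_scores(prior_only, prior_only, "kai")  # image changed nothing
grounded = grounding_scores(with_image, prior_only, "kal")    # image drove the choice
```

A token whose scores stay near zero was predicted about equally well without the image, which is the signature of prior-driven output that the abstract discusses.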

2605.27748 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

马氏距离 PatchCore:协方差感知与流式兼容的工业异常检测

Niccolò Ferrari, Oligert Osmani, Evelina Lamma

发表机构 * Department of Engineering, University of Ferrara(费拉拉大学工程学院)

AI总结 提出马氏距离 PatchCore,通过协方差估计和流式处理改进 PatchCore,在保持性能的同时降低峰值内存并提升工业检测精度。

Comments 57 pages, 7 figures

详情
AI中文摘要

工业视觉异常检测通常是一类问题:正常图像丰富,而缺陷罕见、异质且常在系统设计时不可用。PatchCore 风格的检索适合此场景,因为它通过正常补丁特征的内存库对测试图像评分,但标准欧几里得几何忽略了特征相关性,且其离线构建在子采样前需实例化整个补丁池。我们引入马氏距离 PatchCore,一种协方差感知、流式兼容的 PatchCore 扩展。其人工智能贡献在于一种检索检测器,它在降维特征空间中估计正则化协方差模型并对嵌入进行白化,使得变换后的欧几里得最近邻搜索实现马氏距离检索。一个有界内存、可重复迭代的训练流程通过增量降维、在线协方差估计和流式聚合,无需一次性存储所有正常补丁即可构建内存库。工程应用是自动化工业检测,其中视觉异常检测必须在实际内存限制下保持准确。我们在一个公开的 15 类工业异常检测基准和三个工业数据集(涵盖吹灌封条带安瓿弯月面检测、琥珀色玻璃安瓿底部检测和冻干饼西林瓶检测)上评估该方法。马氏距离 PatchCore 在公开基准上保留了大部分离线 PatchCore 的图像级性能,同时将峰值内存从 5.41 GB 降至 2.78 GB,并将选定的工业平均图像接收者操作特征曲线下面积从 0.981 提升至 0.986。

英文摘要

Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unavailable during system design. PatchCore-style retrieval suits this setting because it scores test images from a memory bank of normal patch features, but the standard Euclidean geometry ignores feature correlations and its offline construction materialises the full patch pool before subsampling. We introduce Mahalanobis PatchCore, a covariance-aware, streaming-compatible extension of PatchCore. Its artificial intelligence contribution is a retrieval detector that estimates a regularised covariance model in reduced feature space and whitens embeddings, so Euclidean nearest-neighbour search after transformation implements Mahalanobis retrieval. A bounded-memory, re-iterable training pipeline builds the memory bank without storing all normal patches at once, using incremental dimensionality reduction, online covariance estimation, and streaming aggregation. The engineering application is automated industrial inspection, where visual anomaly detection must remain accurate under practical memory limits. We evaluate the method on a public 15-category industrial anomaly-detection benchmark and three industrial datasets covering blow-fill-seal strip-ampoule meniscus inspection, amber-glass-ampoule bottom inspection, and lyophilised-cake vial inspection. Mahalanobis PatchCore preserves most offline PatchCore image-level performance on the public benchmark while reducing peak memory from 5.41 to 2.78 GB, and improves the selected industrial mean image area under the receiver operating characteristic curve from 0.981 to 0.986.
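
The claim that Euclidean nearest-neighbour search on whitened embeddings implements Mahalanobis retrieval can be checked numerically. The following is a minimal 2D sketch in plain Python, assuming a ridge-style regulariser `lam` on the covariance diagonal and a Cholesky-based whitening; the paper's actual regularisation, dimensionality reduction, and streaming estimators are not reproduced here.

```python
import math

def regularised_cov_2d(points, lam=1e-3):
    """Sample mean and covariance of 2D points, with lam added to the diagonal."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1) + lam
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1) + lam
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def whiten(p, mean, cov):
    """Map p to L^{-1}(p - mean), where cov = L L^T (Cholesky factor)."""
    sxx, sxy, syy = cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    z1 = dx / l11                      # forward substitution, row 1
    z2 = (dy - l21 * z1) / l22         # forward substitution, row 2
    return (z1, z2)

def mahalanobis_sq(a, b, cov):
    """(a - b)^T cov^{-1} (a - b) using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Toy, strongly correlated "normal" features (hypothetical values).
normal = [(0.0, 0.0), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2)]
mean, cov = regularised_cov_2d(normal)
a, b = (1.0, 3.0), (2.0, 2.0)
za, zb = whiten(a, mean, cov), whiten(b, mean, cov)
d_whitened = (za[0] - zb[0]) ** 2 + (za[1] - zb[1]) ** 2
d_mahal = mahalanobis_sq(a, b, cov)
```

Because the two distances agree, an unmodified Euclidean nearest-neighbour index can be reused after whitening the memory bank once.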

2605.27737 2026-05-28 cs.CV 版本更新

Bounded-Compute Multimodal Regression for Product-Rating Prediction

有界计算多模态回归用于产品评分预测

William Leach, Ru He, Sizhuo Ma, Yizhen Jia, Min Cao, Jian Wang, Rick Cao

发表机构 * Snap Inc.

AI总结 针对严格延迟预算下的标量回归任务,提出一种有界计算适配方法,通过替换语言模型头为轻量MLP并固定输入,在LoViF 2026挑战赛中实现高效多模态回归。

Comments Accepted to the LoViF Workshop at CVPR 2026. 8 pages, 2 figures

详情
AI中文摘要

视觉语言模型在多模态质量评估中日益受欢迎,但其默认依赖自回归文本生成和动态视觉处理,在严格延迟预算下难以适配标量回归。我们提出一种有界计算适配方法,基于SmolVLM2-256M-Video-Instruct,用于LoViF 2026高效VLM挑战赛中的产品评分预测。受近期多模态参与度预测结果(显示基于特征的回归可优于基于token的分数生成)的启发,我们将语言建模头替换为轻量两层MLP,输入为池化后的解码器状态,并通过固定384x384图像和截断元数据强制执行确定性输入。在受控消融实验中,静态全局图像处理略优于动态平铺,且将训练样本从10万扩展到1600万显著提升了验证相关性。在官方留出评估中,我们的228M参数模型达到了0.39 PLCC和0.40 CES,为资源受限的多模态回归提供了强且可复现的基线。

英文摘要

Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.
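
The head replacement described above (a lightweight two-layer MLP fed by pooled decoder states) can be sketched in a few lines. Mean pooling, the ReLU activation, and the tiny dimensions are assumptions chosen for readability; the abstract does not specify them.

```python
def mean_pool(hidden_states):
    """Average a list of token vectors into a single vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[j] for tok in hidden_states) / n for j in range(dim)]

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp_head(pooled, w1, b1, w2, b2):
    """Two-layer MLP mapping a pooled state to one scalar rating."""
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]  # ReLU (assumed)
    return linear(hidden, w2, b2)[0]

# Hypothetical decoder output: 2 tokens, hidden size 2.
states = [[1.0, 2.0], [3.0, 4.0]]
w1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
w2, b2 = [[0.5, 0.5]], [1.0]
score = mlp_head(mean_pool(states), w1, b1, w2, b2)
```

A single forward pass through such a head has a fixed cost, whereas generating the rating as text requires a variable number of decoding steps, which is the latency argument made in the abstract.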

2605.27736 2026-05-28 cs.LG cs.CV 版本更新

Explicit Critic Guidance for Aligning Diffusion Models

显式评论家引导的对齐扩散模型

Zhengyang Liang, Qihang Zhang, Ceyuan Yang

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出一种状态对齐的潜在演员-评论家框架,通过将扩散模型自身作为时间步条件价值函数,实现轨迹级PPO训练和推理时引导,在单/多奖励基准上优于先前方法。

详情
AI中文摘要

在线强化学习对于将扩散模型与不可微目标对齐变得越来越重要。然而,现有方法在沿去噪轨迹分配细粒度信用和实现稳定的基于价值的优化方面仍面临限制。我们提出了一种用于扩散后训练的状态对齐潜在演员-评论家框架,其中扩散模型自身作为时间步条件价值函数,并直接在噪声潜在状态上预测价值。这使得轨迹级PPO训练成为可能,通过简单的条件和价值预训练策略支持稳定的演员-评论家优化,并自然地允许学习到的评论家用于推理时引导。我们进一步将框架扩展到多奖励优化,其中与互补奖励的联合训练有助于减轻奖励破解。在基于UNet和DiT的骨干网络上,我们的方法在单奖励和多奖励基准上始终优于先前的组相对RL和演员-评论家基线,同时测试时引导在生成质量上提供了额外提升。

英文摘要

Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.

2605.27726 2026-05-28 cs.CV 版本更新

Asynchronous Remote Sensing Time-Series Fusion for Cloud Removal and Anytime Reconstruction

异步遥感时间序列融合用于云去除与任意时间重建

Forouzan Fallah, Chia Yu Hsu, Wenwen Li, Anna Liljedahl, Yezhou Yang

发表机构 * School of Computing and Augmented Intelligence, Arizona State University(计算与增强智能学院,亚利桑那州立大学) School of Geographical Sciences and Urban Planning, Arizona State University(地理科学与城市规划学院,亚利桑那州立大学) Woodwell Climate Research Center(伍德韦尔气候研究中心)

AI总结 提出AGFlow模型,通过时间对齐生成流匹配融合异步S1/S2数据,实现云去除、缺失帧重建及任意时间查询。

Comments CVPR 2026 MORSE Workshop

详情
AI中文摘要

频繁的云层覆盖严重限制了Sentinel-2 (S2) 光学时间序列在地球表面监测中的可用性。Sentinel-1 (S1) SAR提供了全天候的互补观测,但由于采集不规则且异步,实际的S1/S2融合仍然困难。许多现有方法假设时间对齐的输入(或需要外部最近日期匹配),并且通常仅恢复观测时间戳,限制了长间隙下的重建并阻止了按需合成。我们提出AGFlow(时间对齐生成流匹配),一种用于S1/S2云去除和时间序列重建的时空流匹配模型,具有三种能力:(1) 时间戳条件内部对齐,融合异步S1和含云S2观测,无需基于预处理的配对;(2) 时空上下文感知去噪,联合建模空间结构与时间动态(而非独立的逐像素时间序列);(3) 任意时间查询,能够在监测窗口内的观测时间戳和用户指定时间戳生成无云S2帧。我们在RESTORE-DiT基准协议上进行了评估,包括定量指标、定性比较和组件消融。AGFlow显著改善了完全缺失帧的重建(MAE和RMSE相比RESTORE-DiT降低16-19%),并在持续间隙下提供可靠重建,同时具有竞争力的云去除性能和灵活的时间查询能力,适用于密集植被监测等下游任务。

英文摘要

Frequent cloud cover severely limits the usability of Sentinel-2 (S2) optical time series for Earth surface monitoring. Sentinel-1 (S1) SAR provides all-weather complementary observations, but practical S1/S2 fusion remains difficult because acquisitions are irregular and asynchronous. Many existing approaches assume temporally aligned inputs (or require external nearest-date matching) and typically restore only observed timestamps, limiting reconstruction under long gaps and preventing on-demand synthesis. We propose AGFlow (Time Aligned Generative Flow Matching), a spatiotemporal flow-matching model for S1/S2 cloud removal and time-series reconstruction with three capabilities: (1) timestamp-conditioned internal alignment that fuses asynchronous S1 and cloudy S2 observations without preprocessing-based pairing; (2) spatiotemporal, context-aware denoising that models spatial structure jointly with temporal dynamics (rather than independent per-pixel time series); and (3) anytime querying, enabling generation of cloud-free S2 frames at both observed and user-specified timestamps within the monitoring window. We evaluate on the RESTORE-DiT benchmark protocol with quantitative metrics, qualitative comparisons, and component ablations. AGFlow notably improves fully missing-frame reconstruction (MAE and RMSE are reduced by 16-19% relative to RESTORE-DiT) and provides reliable reconstructions under persistent gaps, while also yielding competitive cloud removal performance and flexible temporal querying for downstream tasks such as dense vegetation monitoring.

2605.27686 2026-05-28 cs.CV cs.AI 版本更新

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

张量记忆:用于长程Transformer的固定大小循环状态

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA(麻省理工学院) IBM Research, Cambridge, MA, USA(IBM研究院) University of Toronto, Toronto, Canada(多伦多大学)

AI总结 提出张量记忆模块,通过固定大小的3D循环张量状态增强Transformer,以解耦状态容量与输入长度,并保持空间归纳偏置,适用于长程视频理解。

详情
AI中文摘要

Transformer通过将空间和时间展平为长令牌序列来处理图像和视频。虽然注意力和KV缓存保留了过去的特征,但其内存随序列长度增长,并且缺乏显式的、持久化的空间状态,这使得长程视频理解和遮挡敏感推理变得困难。我们提出张量记忆,一种轻量级模块,通过固定大小的循环3D记忆张量增强Transformer块:令牌通过可微的软写入将内容沉积为围绕预测连续3D位置的高斯加权体积到体素网格中,记忆通过高效的局部交互算子和门控循环动态更新,令牌通过连续采样和门控残差融合读取上下文。由于记忆张量大小固定,张量记忆将状态容量与输入长度解耦,同时保持空间归纳偏置。我们在标准语言、图像和视频基准测试以及一个旨在隔离持久状态何时有益的受控玩具诊断套件上评估该模块;它与标准Transformer训练流程集成,可以附加到现有块或从中移除,而无需其他架构更改。

英文摘要

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
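
The "differentiable soft write" into a fixed-size voxel grid can be pictured with a small sketch. It deposits a scalar instead of a feature vector, spreads it over the whole grid with a normalised Gaussian, and uses a hand-picked `sigma`; all three are simplifying assumptions, and the paper's update and read operators are not shown.

```python
import math

def soft_write(memory, location, value, sigma=0.75):
    """Add `value` to a D x H x W grid, spread by a normalised Gaussian
    centred on a continuous (z, y, x) location. Mutates and returns `memory`."""
    weights = {}
    for z, plane in enumerate(memory):
        for y, row in enumerate(plane):
            for x in range(len(row)):
                d2 = ((z - location[0]) ** 2
                      + (y - location[1]) ** 2
                      + (x - location[2]) ** 2)
                weights[(z, y, x)] = math.exp(-d2 / (2 * sigma ** 2))
    total = sum(weights.values())
    for (z, y, x), w in weights.items():
        memory[z][y][x] += value * w / total   # weights sum to 1, so mass is conserved
    return memory

# A 4 x 4 x 4 memory; its size never changes, however many tokens write to it.
grid = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
soft_write(grid, (1.2, 2.0, 0.7), value=1.0)
```

Since every weight is a smooth function of the predicted location, gradients can flow back to whatever network predicts that location, which is what makes the write differentiable.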

2605.27679 2026-05-28 cond-mat.soft cs.CV cs.LG 版本更新

On the Equivariant Learning of the $Q$-tensor Order Parameter

$Q$ 张量序参数的等变学习

Julia Navarro, Mark Wilkinson

发表机构 * Nottingham Trent University, UK(诺丁汉特伦特大学,英国) Berea College, USA(伯里亚学院,美国)

AI总结 本文构建并评估了群等变神经网络,用于从合成生成的微观纹理预测向列液晶的二维 $Q$ 张量序参数,发现等变模型相比非等变基准具有更低的误差和更强的泛化能力。

Comments 15 pages (excluding 7-page appendix); 6 figures

详情
AI中文摘要

我们构建并评估了群等变神经网络,用于从合成生成的微观纹理预测向列液晶的二维 $Q$ 张量序参数。使用权重共享约束、等变激活和正则化技术的组合,构建了七个等变于 $k=4,8,16,32,64,128,256$ 阶循环群 $C_k$ 的架构。为此,我们构造了旋转类置换矩阵群,其元素 $\varrho_{C_k}(g)$ 作用于按行向量化的图像,从而近似方形图像上圆形子域的 $\frac{2\pi}{k}$ 旋转。我们展示了所有七个等变模型在单精度浮点精度内满足 $Q$ 张量等变性约束。与近似参数匹配的非等变基准(有或没有数据增强)相比,我们发现等变模型始终实现更低的误差,并且对未见过的缺陷配置具有更强的泛化能力。性能随群阶增加而提高,表明纳入更精细的旋转对称性会导致更低的误差。

英文摘要

We construct and evaluate group-equivariant neural networks for the prediction of the two-dimensional $Q$-tensor order parameter of nematic liquid crystals from synthetically generated microscopic textures. Seven architectures, equivariant to cyclic groups $C_k$ of order $k$ for $k=4,\,8,\,16,\,32,\,64,\,128,\, 256$, are built using a combination of weight-sharing constraints, equivariant activations and regularization techniques. To do this, we construct rotation-like permutation matrix groups with elements $\varrho_{C_k}(g)$ that act on row-wise vectorized images, thereby approximating a $\frac{2\pi}{k}$ rotation of the circular subdomain on square images. We show that all seven equivariant models satisfy the $Q$-tensor equivariance constraint to within single-precision floating point accuracy. Comparing against approximately parameter-matched non-equivariant benchmarks, with and without data augmentation, we find that the equivariant models consistently achieve lower errors and generalize more robustly to unseen defect configurations. Performance increases with group order, suggesting that the incorporation of finer rotational symmetry leads to lower errors.
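
For the smallest group in the abstract, $C_4$, the rotation acting on a row-wise vectorised square image is an exact permutation, and the matching action on a 2D $Q$-tensor is a rotation of its components by twice the angle. The sketch below illustrates both. The convention $Q = S(n \otimes n - I/2)$ and all function names are assumptions; the rotation-like permutations the paper builds for $k > 4$ are only approximate rotations and are not reproduced here.

```python
import math

def rot90_permutation(n):
    """Index map for a 90-degree counter-clockwise rotation of a row-wise
    vectorised n x n image: rotated[i] = flat[perm[i]]."""
    return [c * n + (n - 1 - r) for r in range(n) for c in range(n)]

def apply(perm, flat):
    """Apply an index permutation to a flat image."""
    return [flat[j] for j in perm]

def q_tensor(theta, s=1.0):
    """2D Q-tensor S * (n n^T - I/2) for director n = (cos theta, sin theta),
    returned as its two independent components (Q_xx, Q_xy)."""
    return (0.5 * s * math.cos(2 * theta), 0.5 * s * math.sin(2 * theta))

def rotate_q(q, phi):
    """Q -> R Q R^T for a rotation by phi: the components turn by 2 * phi."""
    qxx, qxy = q
    c, s = math.cos(2 * phi), math.sin(2 * phi)
    return (c * qxx - s * qxy, s * qxx + c * qxy)

perm3 = rot90_permutation(3)
image = list(range(9))
four_turns = image
for _ in range(4):                      # the generator of C_4 has order 4
    four_turns = apply(perm3, four_turns)

q_rotated = rotate_q(q_tensor(0.3), math.pi / 2)
q_expected = q_tensor(0.3 + math.pi / 2)
```

An equivariant predictor `f` would satisfy `f(apply(perm, x)) == rotate_q(f(x), 2 * math.pi / k)` up to floating-point error, which is the kind of constraint the abstract reports verifying to single precision.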

2605.27616 2026-05-28 cs.CV cs.AI 版本更新

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

并非所有 NVFP4 QAT 配方都相同:架构和规模如何影响异常分割的模型质量

Zijian Du, Oleg Rybakov

发表机构 * NVIDIA

AI总结 本研究通过统一协议评估多种架构、规模和 FP4 量化感知训练 (QAT) 配方在脑肿瘤异常分割任务中的交互作用,发现架构选择对量化鲁棒性影响最大,注意力机制架构对配方选择具有显著韧性,而 CNN 在大规模下受梯度量化配方影响性能下降。

详情
Journal ref
CVPR2026
AI中文摘要

实时异常分割要求高召回率和高效的低精度推理。我们研究了模型架构、模型规模和 FP4 量化感知训练 (QAT) 配方在召回关键的脑肿瘤分割任务中的三方交互,在统一协议下评估了多种架构、规模和 QAT 配方。我们发现架构选择对量化鲁棒性影响最大,基于注意力的架构对配方选择表现出显著的韧性,而 CNN 在大规模下在梯度量化配方下性能下降。在低容量下,FP4 可能离散化 softmax 注意力,但高级 QAT 配方可防止这种崩溃。在更大规模下,高级配方减轻了降低 CNN 质量的梯度量化噪声。五折患者级交叉验证证实这些发现对数据划分具有鲁棒性。我们的结果表明,Swin Transformer 在所有规模下对 QAT 配方选择都具有鲁棒性,使其成为 FP4 量化异常分割的推荐架构。

英文摘要

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNNs degrade under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.
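
To make "FP4 quantization-aware training" more concrete, the sketch below shows the forward-pass fake quantization that QAT recipes typically build on: values in a block share one scale and are rounded to the eight magnitudes representable by a 4-bit E2M1 float. The real-valued per-block scale, the tie-breaking rule, and the function name are simplifying assumptions; the NVFP4 recipes compared in the paper, including any gradient quantization, are not reproduced.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quantize_block(block):
    """Round one block to the E2M1 grid with a shared scale, then dequantize.

    The scale maps the largest magnitude in the block onto 6.0, the largest
    E2M1 value. Ties resolve to the smaller magnitude (an arbitrary choice).
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_VALUES[-1]
    out = []
    for v in block:
        mag = min(E2M1_VALUES, key=lambda q: abs(abs(v) / scale - q))
        out.append(math.copysign(mag * scale, v))
    return out

quantized = fake_quantize_block([6.0, 2.4, -0.2, 0.0])
```

With so few levels, small values such as `-0.2` collapse to zero. The same coarse rounding is consistent with the abstract's observation that FP4 can discretize softmax attention at low capacity.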

2605.27595 2026-05-28 cs.CV cs.AI 版本更新

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

多模态大语言模型在农业图像解释与生成任务中的幻觉行为

Partho Ghose, Al Bashir, Prem Raj, Azlan Zahid

发表机构 * Texas A&M University System(德克萨斯农工大学系统)

AI总结 本研究系统评估了多模态大语言模型在农业图像解释(图像到文本)和生成(文本到图像)任务中的幻觉行为,发现模型存在生物不一致、上下文不准确和农学不合理等错误模式,并通过少样本提示等方法分析了幻觉的残留影响。

详情
AI中文摘要

大型语言模型(LLMs)正迅速被应用于农业成像领域,从作物解释到合成田间图像生成。然而,这些模型经常表现出看似自信但偏离生物或环境现实的幻觉输出,可能导致错误的农学见解。本研究从两个互补方向调查此类幻觉:图像到文本,即LLMs解释作物或田间图像以描述生物和非生物胁迫等条件;以及文本到图像,即模型基于描述性提示生成合成农业场景。我们检查涉及生物不一致、上下文不准确和农学不合理的错误,并在多个成像模态下根据领域知情标准评估输出。我们的分析识别了解释性和生成性任务中反复出现的幻觉模式。在图像解释中,LLMs(例如Gemma、LLAVA、Qwen和MiniCPM)实现了适度的零样本准确率(63%至75%),而少样本提示将性能提升至高达86.8%,但仍表现出虚假检测和漏检感染,表明存在残留幻觉效应。在文本到图像任务中,高级模型如GPT-5和Gemini 2.5 Flash在宽松提示约束下生成高达91%的生物不一致场景,揭示了当前LLMs的根本弱点。这种对视觉推理和生成的系统评估为增强基于LLM的农业成像平台的可靠性和可信度提供了关键见解。

英文摘要

Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations, that is, outputs that appear confident yet deviate from biological or environmental reality, potentially leading to misinformed agronomic insights. This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities. Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLaVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent), whereas few-shot prompting improved performance up to 86.8 percent; even so, the models still exhibited false detections and missed infections, indicating residual hallucination effects. In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.

2605.27589 2026-05-28 cs.CV 版本更新

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

What-If World: 具身场景中通用世界模型的因果基准

Kunlin Cai, Rui Song, Jinghuai Zhang, Kaiyuan Zhang, Pranav Bodapati, Alicia Yu, Fnu Suya, Mohammad Rostami, Jiaqi Ma, Yuan Tian

发表机构 * UCLA(加州大学洛杉矶分校) University of Tennessee(田纳西大学) Amazon(亚马逊)

AI总结 提出 What-If World 基准,通过成对提示测试视频生成模型在物理变化下的因果一致性,发现现有模型在因果干预上表现不佳。

Comments 38 pages, World Model Benchmark

详情
AI中文摘要

视频生成模型越来越多地被用作世界模拟器,用于驾驶和机器人操作等任务。在这些场景中,重要的不是单个视频看起来是否正确,而是模型的输出在输入变化时是否随之变化。我们通过给模型两个描述同一场景但一个物理细节不同的提示,并检查两个视频是否按照物理预测的方式产生差异来测试这一点。提示之间的措辞差异在设计上很小,因为只改变了一个变量,但正确的物理差异并不小。忽略这一点的模型仍然可以生成两个各自看起来合理的视频,而现有基准一次只评分一个视频,无法检测到这种失败。我们引入了 What-If World,包含 319 个这样的提示对,基于 nuScenes 和 DROID 的真实帧构建,并按驾驶和操作中共享的六个物理变量的分类法组织。每个对使用 APEO 评分,这是一个包含四个部分的评分标准,检查每个视频是否遵循其提示(遵循性)、物理上一致(物理性)、保持共享场景(环境性)以及最终产生正确的差异(结果性)。在九个最先进的模型中,没有系统在配对得分上超过 52%,开源模型集中在 28% 附近。每个测试的模型在大量因果干预上失败,表明这些模型在能够可靠支持动作条件模拟或基于模型的规划之前还有很大差距。在模型得分较高的地方,性能似乎与干预的视觉显著性相关,而不是其底层物理的可处理性。一些视觉上微妙的干预得分低至 14.2%,而视觉上显著的干预得分达到 40.4%。

英文摘要

Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.

2605.27582 2026-05-28 cs.RO cs.CV 版本更新

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Uni-LaViRA:面向统一具身导航的语言-视觉-机器人动作翻译

Hongyu Ding, Sizhuo Zhang, Ziming Xu, Jinwen Guo, Hongxiu Liu, Xingzhi Cheng, Zixuan Chen, Haifei Qi, Duo Wang, Hao Xu, Jieqi Shi, Yifan Zhang, Jing Huo, Jian Cheng, Yang Gao, Jiebo Luo

发表机构 * Nanjing University(南京大学) Beihang University(北京航空航天大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) BMW (Nanjing) Information Technology Co., Ltd.(宝马(南京)信息技术有限公司) University of Rochester(罗切斯特大学)

AI总结 提出Uni-LaViRA统一智能体架构,通过语言-视觉-机器人动作翻译结构,结合待办列表记忆和二次机会回溯机制,在零训练下实现四类导航任务和四种真实机器人的零样本泛化,性能匹配或超越近期训练式导航基础模型。

Comments Project page: https://xetroubadour.github.io/Uni-LaViRA/

详情
AI中文摘要

具身导航要求智能体将语言和视觉观测映射为一系列空间动作,驱动真实机器人在未见环境中移动。主流方法是在不断增大的机器人轨迹数据集上扩展视觉-语言-动作(VLA)基础模型。本文认为,对于导航而言,通用性可以通过结构获得,而不仅仅依赖数据规模。导航的底层决策结构可简化为单一的语言-视觉-机器人动作翻译。语言动作发出语义级方向指令,视觉动作发出像素级视觉目标。这两个输出都位于预训练多模态大语言模型(MLLM)的自然输出流形内,因此任务可以由智能体推理而非从机器人数据中学习。为此,我们提出Uni-LaViRA,一种统一的智能体架构,将相同的见解零样本地扩展到四个任务族(VLN-CE、ObjectNav、EQA和Aerial-VLN)和四种异构真实机器人(轮式、四足、人形机器人和自建无人机)。两种智能体循环机制使这种统一变得实用。待办列表记忆(TDM)在每一步重写待办子目标的结构化检查清单,将未完成项重新注入智能体的最近注意力窗口。二次机会回溯(SCB)将机器人回滚到错误前状态,并基于失败的子轨迹调整智能体的下一步计划,将单次导航转变为自我纠正过程。无需任何训练,Uni-LaViRA在VLN-CE R2R上达到60.7%的成功率(SR),在VLN-CE RxR上达到51.3%,在HM3D-v2上达到77.7%,在HM3D-OVON上达到60.0%,在MP3D-EQA上达到54.7%,在OpenUAV上达到40.0%,匹配甚至超越了近期消耗数百万样本和数千GPU小时的训练式导航基础模型。

英文摘要

Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action (VLA) foundation models on ever-larger collections of robot trajectories. This paper argues that, for navigation specifically, generality can be obtained structurally, not only through data scale. The underlying decision structure of navigation reduces to a single Language-Vision-Robot Actions Translation. The language action emits a semantic-level directional command, and the vision action emits a pixel-level visual target. Both outputs lie inside the natural output manifold of pretrained multimodal large language models (MLLMs), so the task can be reasoned about by an agent rather than learned from robot data. Therefore, we present Uni-LaViRA, a unified agentic architecture that extends the same insight to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (Wheeled, Quadruped, Humanoid robot, and a self-built UAV) in a zero-shot manner. Two agent-loop mechanisms make this unification practical. TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step, reciting the unfinished items back into the agent's most recent attention window. Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent's next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA reaches 60.7% SR on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent training-based navigation foundation models that consume millions of samples and thousands of GPU-hours.

2605.27561 2026-05-28 cs.CV cs.AI 版本更新

Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Melanoscope AI移动皮肤镜临床决策支持系统的临床验证

Elena Sergeevna Kozachok, Sergey Sergeevich Seregin

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences(俄罗斯科学院伊万尼科夫系统编程研究所) Orel Regional Oncology Dispensary(奥廖尔州肿瘤防治所)

AI总结 本研究提出了一种级联深度学习模型的定量可解释性评估方法和三区患者分流算法,并在俄罗斯门诊实践中对Melanoscope AI CDSS进行了前瞻性单中心临床验证,结果显示无假阴性且特异性为88.3%。

Comments 24 pages, 6 figures, 5 tables, 21 references

详情
AI中文摘要

引言:恶性皮肤病变的早期检测对预后至关重要,但俄罗斯地区皮肤科医生短缺限制了筛查覆盖。移动皮肤镜临床决策支持系统(CDSS)提供了一种有前景的方法,但模型可解释性和标准化患者分流仍是采用的关键障碍。目的:开发一种级联深度学习模型的定量可解释性评估方法和三区患者分流算法,并在俄罗斯门诊实践中对Melanoscope AI CDSS进行初步的单中心前瞻性临床验证。材料与方法:皮肤镜图像的两阶段级联分类;注意力图可视化(ViT和Swin使用注意力展开;ConvNeXt和EfficientNetV2使用Grad-CAM);激活图与专家标注之间基于IoU的定量一致性评估;在四次“黑色素瘤日”活动(俄罗斯奥廖尔,2025年6月至2026年4月)中进行前瞻性单中心验证。结果:在176名患者中:与专家评估一致率为88.6%;5例恶性病变中无假阴性(95% CI: 47.8-100.0%);特异性为88.3%。组织学证实了3例黑色素瘤和2例基底细胞癌;6例发育不良痣被纳入随访。平均IoU(n=180):ViT - 0.69;Swin - 0.64;ConvNeXt - 0.53;EfficientNetV2 - 0.51。分流阈值:P<0.15 / 0.15-0.50 / >=0.50。结论:未观察到假阴性;特异性为88.3%,支持筛查应用。集成的级联分类、带IoU评估的注意力图可视化和三区分流提供了可重复、可解释的临床决策支持,可适应不同资源水平。

英文摘要

Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit screening coverage. Mobile dermoscopy clinical decision support systems (CDSS) offer a promising approach, with model interpretability and standardised patient routing remaining key barriers to adoption. Aim. To develop a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and to conduct a preliminary single-centre prospective clinical validation of the Melanoscope AI CDSS in Russian outpatient practice. Material and methods. Two-stage cascade classification of dermoscopic images; attention map visualisation (attention rollout for ViT and Swin; Grad-CAM for ConvNeXt and EfficientNetV2); quantitative IoU-based agreement assessment between activation maps and expert annotations; prospective single-centre validation across four "Melanoma Day" sessions (Orel, Russia, June 2025 - April 2026). Results. On 176 patients: agreement with expert assessment 88.6%; no false negatives among 5 malignant lesions (95% CI: 47.8-100.0%); specificity 88.3%. Three melanomas and two basal cell carcinomas were histologically confirmed; six dysplastic naevi placed under follow-up. Mean IoU (n=180): ViT - 0.69; Swin - 0.64; ConvNeXt - 0.53; EfficientNetV2 - 0.51. Routing thresholds: P<0.15 / 0.15-0.50 / >=0.50. Conclusion. No false negatives were observed; specificity was 88.3%, supporting screening use. The integrated cascade classification, attention map visualisation with IoU assessment, and three-zone routing provide reproducible, interpretable clinical decision support adaptable to varying resource levels.
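
Two numbers in the abstract can be reproduced with a short sketch: the three-zone routing thresholds (P<0.15 / 0.15-0.50 / >=0.50) and the lower end of the 95% confidence interval for "no false negatives among 5 malignant lesions". The zone labels and function names are hypothetical, and the interval is assumed to be an exact (Clopper-Pearson) binomial interval, which matches the reported 47.8%.

```python
def route(p_malignant):
    """Three-zone routing using the thresholds reported in the abstract.
    The zone names are illustrative; the abstract gives only the cut-offs."""
    if p_malignant < 0.15:
        return "low"
    if p_malignant < 0.50:
        return "intermediate"
    return "high"

def exact_lower_bound_all_detected(n, alpha=0.05):
    """Clopper-Pearson lower bound on sensitivity when all n positives are
    detected: solve p**n = alpha / 2 for p."""
    return (alpha / 2) ** (1 / n)

zones = [route(p) for p in (0.05, 0.15, 0.49, 0.50, 0.93)]
lower = exact_lower_bound_all_detected(5)   # about 0.478
```

The wide interval shows why five malignant cases cannot establish high sensitivity on their own, consistent with the abstract calling the validation preliminary.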

2605.27495 2026-05-28 cs.CV cs.LG 版本更新

Representation-Conditioned Diffusion Models for Guided Training Data Generation

表示条件扩散模型用于引导训练数据生成

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

发表机构 * Linköping University(林雪平大学)

AI总结 本文提出表示条件扩散模型,通过DINOv2、DINOv3和CLIP的表示条件生成合成图像,在ImageNet100上分类准确率比类条件生成高10.76个百分点,甚至超过真实数据训练的模型2.0个百分点。

详情
AI中文摘要

数据可用性仍然是许多深度学习应用中的关键瓶颈。大规模数据集通常收集、整理和标注成本高昂,这可能限制监督学习方法的可扩展性和适用性。在这项工作中,我们评估了在由生成式深度学习产生的合成图像数据集上训练的模型的分类性能。具体而言,我们使用基于DINOv2、DINOv3和CLIP学习表示的潜在扩散模型。我们的结果表明,这种表示条件公式通过提高样本质量和模式覆盖,显著优于类条件生成(在ImageNet100上top-1准确率提高10.76个百分点)。此外,通过扩大合成数据集的规模,我们能够超越在真实数据上训练的分类器(top-1准确率提高2.0个百分点)。我们还展示了生成的图像如何用于增强目的,优于经典增强方法,以及如何利用条件空间进行样本过滤以进一步提高训练价值。总的来说,这些发现表明,表示条件扩散模型为在大规模视觉学习任务中增强、补充或潜在替代真实世界数据集提供了一种有前景的方法。

英文摘要

Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100) by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy). We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.

2605.27487 2026-05-28 cs.CV cs.AI 版本更新

Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

基于扩散的乌克兰手写文本生成与跨域风格迁移

Andrii Ahitoliev, Pavlo Berezin

发表机构 * Ukrainian Catholic University, Lviv, Ukraine(乌克兰天主教大学,利沃夫,乌克兰) National University of "Kyiv-Mohyla Academy", Kyiv, Ukraine(基辅莫吉拉学院国立大学,基辅,乌克兰)

AI总结 针对乌克兰语等非拉丁文字手写文本生成缺乏数据和模型泛化研究的问题,构建了乌克兰手写单词数据集并重新训练DiffusionPen模型,通过跨语言、零样本和少样本迁移实验验证了潜在扩散模型在跨域风格迁移中的有效性。

Comments 16 pages, 7 figures. Submitted to ICTERI 2026

详情
AI中文摘要

基于书写者风格的手写文本生成(HTG)在拉丁文字中已被广泛研究,但在低资源和非拉丁书写系统中仍探索不足,现有模型在拉丁域之外的泛化能力尚不明确。西里尔字母,尤其是乌克兰语,缺乏大规模书写者标注数据集和此类泛化的经验证据。为填补这一空白,我们使用连通分量分割、质量过滤和对代表性不足的乌克兰字符进行针对性过采样,构建了一个包含308位书写者、126,177张图像的乌克兰手写单词数据集。我们在不修改架构的情况下,在该数据集上重新训练了DiffusionPen——一种带有MobileNetV2三元组损失风格编码器和CANINE条件潜在扩散U-Net的模型,测试了从拉丁到西里尔字母的直接迁移。我们在三种设置下评估跨域风格迁移:从IAM英文样本的跨语言迁移、对20世纪早期乌克兰手稿的零样本迁移,以及对当代书写者的少样本模仿。该模型生成可读且风格一致的单词图像,表明少样本潜在扩散模型能够泛化到拉丁文字域之外。我们发布了数据集、训练模型和评估协议,作为书写者感知的西里尔HTG的可复现基准,为将风格化HTG扩展到其他代表性不足的书写系统奠定了基础。

英文摘要

Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. We retrain DiffusionPen, a MobileNetV2 triplet-loss style encoder with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalize beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.
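
The connected-component segmentation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binarised page stored as a list of rows (1 = ink, 0 = background) and 8-connectivity; the function name and the toy page are hypothetical, and the authors' actual pipeline (including how components are grouped into words and filtered for quality) is not described in the abstract.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions.

    Boxes are reported in the order their first pixel is met in a
    row-by-row scan of the grid.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 1, 0],
    ]
    print(connected_components(page))
```

In a real segmentation pipeline the boxes would typically be merged by horizontal proximity to form word crops; that grouping rule is left out here because the abstract does not specify it.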

2605.27467 2026-05-28 cs.LG cs.AI cs.CV 版本更新

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

液态神经网络与LSTM在序列模式识别中的比较分析:鲁棒性、效率与临床实用性

Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

发表机构 * National Electronics and Computer Technology Center (NECTEC)(国家电子与计算机技术中心) Language Understanding Lab.(语言理解实验室)

AI总结 本文通过对比液态神经网络(LNN)与LSTM在四种序列数据上的性能,发现LNN在参数效率和鲁棒性方面更优,尤其适用于数据稀疏的临床环境。

Comments 9 pages, 7 figures, 6 tables, The conference paper will appear in Proceedings of JCSSE 2026

详情
AI中文摘要

传统的循环神经网络(RNN)和长短期记忆网络(LSTM)在离散时间步上运行,往往无法捕捉现实世界物理过程的流体时间动态。液态神经网络(LNN),特别是闭式连续时间(CfC)网络,通过将隐藏状态演化建模为连续微分方程来解决这一问题。在本文中,我们在四种不同的序列模态上进行了全面的基准测试研究:神经形态事件数据(N-MNIST)、基于笔画的绘图(QuickDraw)、视觉手写(IAM)和生理时间序列(PhysioNet Sepsis-3)。此外,我们使用时间丢弃法进行了严格的压力测试,以评估模型对缺失数据的鲁棒性。我们的研究结果表明,LNN在原生时间域和数据稀疏普遍的临床环境中,始终提供优越的参数效率和显著更高的鲁棒性。本扩展预印本提供了关于相关数据集和LNN理论谱系的额外背景,并附有详细附录,记录了我们的完整实现和实验设置。

英文摘要

Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.
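
A temporal-dropout stress test of the kind described above can be sketched as follows. The sketch assumes each sequence is a list of (timestamp, value) pairs and that steps are dropped independently with a fixed probability; the function names, the seed, and the drop probability are hypothetical, and the paper's exact protocol may differ.

```python
import random


def temporal_dropout(sequence, drop_prob, seed=0):
    """Randomly remove time steps while keeping the survivors in order.

    At least one step is always kept so a model still receives input.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must lie in [0, 1)")
    rng = random.Random(seed)
    kept = [step for step in sequence if rng.random() >= drop_prob]
    return kept or list(sequence[:1])


def time_deltas(sequence):
    """Elapsed time since the previous kept step (0.0 for the first step).

    A continuous-time model can consume these irregular gaps directly,
    whereas a discrete-time LSTM sees only the step index.
    """
    if not sequence:
        return []
    deltas = [0.0]
    for (t_prev, _), (t_next, _) in zip(sequence, sequence[1:]):
        deltas.append(t_next - t_prev)
    return deltas


if __name__ == "__main__":
    signal = [(0.1 * t, float(t)) for t in range(10)]
    sparse = temporal_dropout(signal, drop_prob=0.5, seed=1)
    print(sparse)
    print(time_deltas(sparse))
```

Feeding both models the same thinned sequences at increasing drop probabilities gives a robustness curve; passing the time gaps as an extra input is one common way to let a continuous-time model account for the missing steps.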

2605.27465 2026-05-28 cs.CV cs.AI 版本更新

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge: 面向视觉Transformer无训练加速的显著性感知自适应令牌合并

Semi Lee, Hyejin Go, Hyesong Choi

发表机构 * Electronic Engineering(电子工程) Soongsil University(崇实大学)

AI总结 提出AdaMerge框架,通过显著性加权相似度和自适应合并强度两个互补机制,在无训练条件下提升令牌合并的精度-计算量帕累托前沿。

Comments 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

视觉Transformer(ViT)中自注意力的二次计算成本构成了实际部署的基本瓶颈,激发了令牌缩减方面的活跃研究。在现有方法中,令牌合并(ToMe)已成为一种优雅的无训练解决方案;然而,其设计基于令牌平等的隐含前提,这与自注意力已充分证明的非均匀性相悖,并在激进压缩下导致高显著性令牌的信息丢失。我们通过AdaMerge解决了这一局限,该框架基于两个互补机制。首先,显著性加权相似度利用列式特征亲和度中心性作为令牌重要性代理,并将所得显著性分数纳入二分匹配分数,确保关键令牌对合并表示贡献更大。其次,自适应合并强度使用预先计算的逐层相似度统计量,根据输入特定的冗余性动态调整每层缩减数量。在ImageNet-1k上使用ViT-B/16,AdaMerge在所有FLOPs匹配条件下均持续优于ToMe、PiToMe和DSM。精度差距随压缩单调增大:在13.4G FLOPs操作点,AdaMerge的Top-1下降仅为-1.06%,而PiToMe为-1.45%,DSM为-4.62%。据我们所知,AdaMerge是首个将显著性加权相似度和自适应逐层缩减结合到单一无训练令牌合并框架中的方法,推动了ViT加速的精度-FLOPs帕累托前沿。

英文摘要

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.
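
To make the merging idea concrete, here is a minimal sketch of ToMe-style bipartite matching in which merged tokens are averaged with salience weights. It is not the AdaMerge implementation: the abstract does not give the formula that folds salience into the matching score, so this sketch keeps plain cosine similarity for matching and uses salience only in the weighted average. Salience is taken as the column-wise mean of the cosine-affinity matrix, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def salience_scores(tokens):
    """Column-wise mean affinity, used here as a token-importance proxy."""
    n = len(tokens)
    return [sum(cosine(tokens[i], tokens[j]) for i in range(n)) / n
            for j in range(n)]


def merge_tokens(tokens, r):
    """Merge up to r tokens from set A (even indices) into set B (odd indices)."""
    if len(tokens) < 2 or r <= 0:
        return [list(t) for t in tokens]
    # Clamp weights so the weighted average is always well defined.
    weights = [max(s, 1e-6) for s in salience_scores(tokens)]
    a_idx = range(0, len(tokens), 2)
    b_idx = range(1, len(tokens), 2)
    # Each A token proposes its most similar B token.
    edges = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        edges.append((cosine(tokens[i], tokens[j]), i, j))
    edges.sort(reverse=True)
    chosen = edges[:r]  # the r most similar pairs are merged
    dropped = {i for _, i, _ in chosen}
    acc = {j: ([weights[j] * x for x in tokens[j]], weights[j]) for j in b_idx}
    for _, i, j in chosen:
        vec, total = acc[j]
        acc[j] = ([v + weights[i] * x for v, x in zip(vec, tokens[i])],
                  total + weights[i])
    merged = []
    for idx, tok in enumerate(tokens):
        if idx in dropped:
            continue
        if idx in acc:
            vec, total = acc[idx]
            merged.append([v / total for v in vec])
        else:
            merged.append(list(tok))
    return merged


if __name__ == "__main__":
    toks = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(len(merge_tokens(toks, 1)), len(merge_tokens(toks, 2)))
```

The adaptive part of the method, choosing a different r per layer from similarity statistics, would sit outside this function; it is omitted because the abstract does not say how those statistics map to a reduction count.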

2605.27464 2026-05-28 cs.CV cs.AI 版本更新

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

超越运动基元:基于头戴式IMU的行为活动识别

Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab, Harvard University(哈佛人工智能与机器人实验室,哈佛大学)

AI总结 提出HiT-HAR层次模型,利用头戴式IMU数据实现行为级活动识别,超越传统运动基元,在五类动作和八类场景识别中优于现有模型。

详情
AI中文摘要

AR智能眼镜需要连续的行为上下文来提供主动辅助,但其最实用的常开传感器——头戴式惯性测量单元(IMU)仅能检测行走或站立等运动基元。我们突破运动基元,实现行为级识别,定义了五个类别以平衡AR应用需求与传感器可观测性。为此,我们构建了一个包含16万样本的Ego4D数据集,采用四层质量保证框架覆盖8个活动场景,并提出了HiT-HAR,一个70.3万参数的层次模型,在五类动作和八类场景识别中优于先前的头戴式IMU模型。我们通过每类可分离性分析进一步绘制了头戴式IMU的可观测性边界,识别出哪些行为类别可靠可观测(移动),哪些受益于时间上下文(物体传递、任务操作),以及哪些场景依赖的信号重叠仍构成挑战。我们的结果表明,利用时间上下文和场景结构的架构选择优于单纯扩大模型规模。代码和数据集公开于https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR。

英文摘要

AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR.

2605.27460 2026-05-28 cs.CV 版本更新

D$^2$Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

D$^2$Turb: 深度感知仿真与解耦学习用于单帧大气湍流抑制

Zixiao Hu, Tianyu Li, Guoqing Wang, Wei Li, Guoguo Xin, Xun Liu, Peng Wang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Beijing Institute of Space Mechanics and Electricity(北京空间机电研究所) School of Physics, Northwest University Xi'an(西安西北大学物理学院)

AI总结 提出D$^2$Turb框架,通过深度感知湍流合成协议和自适应结构先验注入机制,将物理仿真与解耦恢复结合,实现单帧大气湍流下的纹理去模糊与几何校正。

Comments 14 pages, 7 figures

详情
AI中文摘要

单帧大气湍流抑制由于空间变化模糊与非刚性几何畸变并存而本质上是病态的。现有的基于平面场仿真的端到端方法通常难以平衡纹理恢复与几何校正。为克服这一限制,我们提出D$^2$Turb,一个将物理仿真与显式解耦恢复相结合的统一框架。首先,我们引入深度感知湍流合成协议,将场景深度纳入相位到空间公式中,生成物理一致、深度相关的退化,并为解耦学习提供关键的中间倾斜监督信号。基于该仿真引擎,D$^2$Turb将恢复分解为两个交互阶段:纹理去模糊和几何校正。纹理去模糊阶段采用去模糊骨干网络恢复细节,同时保留几何畸变以供后续校正阶段使用。为缓解级联设计中常见的信息碎片化问题,我们进一步提出自适应结构先验注入(ASPI)机制,动态传递去模糊模块的深层结构表示以指导密集流预测进行空间去扭曲。大量实验表明,D$^2$Turb在合成和真实数据集上均达到最先进性能,在纹理恢复和几何保真度方面均有持续改进。我们的代码和预训练模型已在 https://github.com/HertzDot222/D2Turb 公开。

英文摘要

Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D$^2$Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D$^2$Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D$^2$Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at https://github.com/HertzDot222/D2Turb.

2605.27452 2026-05-28 cs.CV 版本更新

Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

微调视觉语言模型以理解当前损伤并使用质量卫士代理进行优先级评分

Takato Yasuno

AI总结 本文通过微调LLaVA-1.5-7B视觉语言模型,结合规则引擎和质量卫士代理,实现了桥梁损伤自动理解与修复优先级评分,有效降低了评分变异并提升了效率。

Comments 23 pages, 11 figures, 13 tables

详情
AI中文摘要

日本的桥梁检查要求每五年进行一次强制性目视评估,然而不同工程师分配的定性损伤等级(a-e级)存在显著的评分者间变异性——这是实现一致基础设施管理的关键障碍。资深工程师的老龄化进一步威胁检查能力。本文提出了一种使用微调视觉语言模型(VLM)自动化桥梁损伤理解和修复优先级评分的方法。 我们使用QLoRA在多达4000对桥梁损伤图像和检查文本记录上微调LLaVA-1.5-7B,然后在固定的800张图像测试集上进行评估。模型输出识别结构构件和损伤模式的自然语言描述,基于此,一个基于规则的评分引擎计算五级修复优先级指数。一项渐进式训练研究(1k/2k/3k/4k样本)表明,2000个训练样本在仅2.9小时的训练中即可达到接近最优的验证损失;超过2000后,训练样本每翻倍,验证损失改善不超过0.2%,表现出明显的收益递减。此外,在保留测试集上的语义相似度在3000样本时达到峰值(0.6909),在4000样本时下降(0.6739),表明质量策划的中等规模数据优于更大但噪声更多的语料库。结合torch.compile()和批处理(batch_size=8)的推理优化实现了每张图像10.06秒——比未优化基线降低了70.2%。 我们的方法有助于桥梁检查中的数据治理,减少评分者间变异性,并提供AI辅助分诊以增强检查流程中的专家工程师。此外,我们引入了一个两阶段质量卫士,使用微调的Swallow-8B SLM在优先级评分前拒绝低质量的VLM输出,防止来自损坏或无法识别图像的虚假评分。

英文摘要

Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability -- a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). We fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining torch.compile() and batch processing (batch_size=8) achieves 10.06 seconds per image -- a 70.2% reduction over the unoptimized baseline. Our approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images.
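
The latency figures above imply a baseline that the abstract does not state directly. The short sketch below derives it, assuming the 70.2% reduction is measured on per-image latency and that the 10.06 s figure applies uniformly to every test image; the function names are hypothetical.

```python
def implied_baseline(optimized_seconds: float, reduction_fraction: float) -> float:
    """Baseline latency implied by an optimized latency and a fractional reduction."""
    if not 0.0 <= reduction_fraction < 1.0:
        raise ValueError("reduction_fraction must lie in [0, 1)")
    return optimized_seconds / (1.0 - reduction_fraction)


def speedup(reduction_fraction: float) -> float:
    """Speed-up factor equivalent to a fractional latency reduction."""
    if not 0.0 <= reduction_fraction < 1.0:
        raise ValueError("reduction_fraction must lie in [0, 1)")
    return 1.0 / (1.0 - reduction_fraction)


if __name__ == "__main__":
    baseline = implied_baseline(10.06, 0.702)
    print(f"implied baseline: {baseline:.2f} s per image")
    print(f"speed-up factor: {speedup(0.702):.2f}x")
    print(f"800-image test set: {800 * 10.06 / 3600:.2f} h at the optimized rate")
```

Under these assumptions the unoptimized baseline is roughly 33.8 seconds per image, a speed-up of about 3.4x, and the 800-image test set takes a little over two hours at the optimized rate.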

2605.27451 2026-05-28 cs.CV 版本更新

From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop & Competition

从情感到复杂行为:第十届ABAW研讨会与竞赛推进多模态以人为中心的人工智能

Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia, Eric Granger, Marco Pedersoli, Simon Bacon, Jens Madsen, Soufiane Belharbi, Muhammad Haseeb Aslam, Chunchang Shao, Guanyu Hu

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) Hume AI Google DeepMind(谷歌DeepMind) Imperial College London(伦敦帝国理工学院) Cogitat LIVIA ILLS ETS Montreal(蒙特利尔ETS) Concordia University(康考迪亚大学) Xi’an Jiaotong University(西安交通大学)

AI总结 本文介绍了第十届ABAW研讨会与竞赛,通过多模态挑战和论文,推动真实环境下人类情感与行为的建模、分析和理解。

Comments accepted at CVPR 2026

详情
AI中文摘要

第十届真实世界情感与行为分析(ABAW)研讨会与竞赛,与CVPR 2026同期举办,持续推动在真实、无约束环境中对人类情感与行为的建模、分析和理解研究。研讨会保持双重结构,包括竞赛和论文轨道。ABAW竞赛引入了一系列多样化的挑战,针对情感与行为理解的关键方面,包括连续情感(效价-唤醒度)估计、离散情感(表情和动作单元)识别,以及更复杂的行为分析任务,如情感模仿强度估计、矛盾/犹豫识别和细粒度暴力检测。这些挑战基于大规模真实世界数据集,为最先进方法提供了全面的基准。与此同时,论文轨道展示了广泛的贡献,涵盖姿态、运动与行为估计、情感建模与多模态学习、基准、数据集与评估协议、公平性、鲁棒性与部署。总体而言,第十届ABAW研讨会与竞赛继续作为基准测试、合作与创新的关键平台,塑造下一代多模态、以人为中心的人工智能系统的发展。

英文摘要

The 10th Affective & Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on modelling, analysis, and understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, as well as more complex behavior analysis tasks, such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion & behavior estimation, affect modelling & multimodal learning, benchmarks, datasets & evaluation protocols, fairness, robustness & deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration and innovation, shaping the development of next-generation multimodal, human-centered AI systems.

2605.27436 2026-05-28 cs.IR cs.AI cs.CV 版本更新

RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?

RE-TRIANGLE:TRIANGLE 在检索中能否实现超越余弦相似度的多模态对齐?

Arijit Ghosh, Aritra Bandyopadhyay, Chiranjeev Bindra, Jingfen Qiao

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 本文复现 TRIANGLE 框架,验证其通过最小化超球面上模态三元组面积实现多模态对齐的几何目标在检索任务中的鲁棒性,发现其在零样本设置下优于成对基线,但存在优化不稳定和领域依赖问题。

详情
AI中文摘要

多模态对齐对于弥合信息检索中的语义鸿沟至关重要。然而,传统的成对策略存在几何盲点:虽然它们将锚定模态(如文本)与其他模态对齐,但缺乏强制外围模态(如视频和音频)之间相互一致性的约束。TRIANGLE 框架通过最小化超球面上模态三元组的面积来实现整体对齐。在这项可重复性研究中,我们验证了该几何目标在检索任务中的鲁棒性。我们确认 TRIANGLE 在零样本设置下优于成对基线,Recall@1 提升高达 +8.7 个百分点,但收益依赖于领域。然而,我们未能复现报告中的从头学习结果。使用合成玩具数据集的分析表明,这是由于联合优化几何对齐与数据-文本匹配(DTM)损失时的不稳定性。此外,我们发现余弦正则化主要稳定文本到视频检索,而使用领域监督进行微调会放大几何收益但降低跨数据集泛化能力。我们的发现支持了几何对齐的有效性,同时突出了关键的优化敏感性。代码可在 https://github.com/ARIJIT00171/RE-TRIANGLE 获取。

英文摘要

Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero-shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain-dependent. However, we fail to reproduce the reported learning-from-scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data-Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text-to-video retrieval, and fine-tuning with domain supervision amplifies geometric benefits but reduces cross-dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at https://github.com/ARIJIT00171/RE-TRIANGLE.
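
The geometric objective can be illustrated by computing the area of the triangle whose vertices are three unit-normalised modality embeddings. This is a sketch of the idea as stated in the abstract, assuming the area is obtained from the Gram determinant of two edge vectors; the function names are hypothetical, and the loss in the original framework may add temperature scaling or batching that is not shown here.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def triangle_area(a, b, c):
    """Area of the triangle spanned by three embeddings after normalization.

    With edges u = b - a and v = c - a, the area equals
    0.5 * sqrt(|u|^2 * |v|^2 - (u . v)^2).
    """
    a, b, c = normalize(a), normalize(b), normalize(c)
    u = [y - x for x, y in zip(a, b)]
    v = [z - x for x, z in zip(a, c)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    # Clamp tiny negative values caused by floating-point rounding.
    return 0.5 * math.sqrt(max(uu * vv - uv * uv, 0.0))


if __name__ == "__main__":
    text, video, audio = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
    print(triangle_area(text, video, audio))  # mutually orthogonal modalities
    print(triangle_area(text, text, text))    # perfectly aligned triplet
```

The area is zero only when the three normalised embeddings coincide or are collinear, so shrinking it constrains video and audio relative to each other as well as to text, which pairwise cosine objectives anchored on text do not do.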

2605.27378 2026-05-28 cs.CL cs.CV cs.MA 版本更新

OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

OralAgent: 融合推理、工具与知识的交互式牙科影像分析

Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

发表机构 * Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China(香港大学牙科学院,中国香港特别行政区) Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA(匹兹堡大学电气与计算机工程系,美国宾夕法尼亚州匹兹堡) Shenzhen University, China(深圳大学,中国) Department of Craniomaxillofacial Surgery, Shanghai Ninth People’s Hospital, China(上海第九人民医院口腔颌面外科部,中国) Nanyang Technological University, Singapore(南洋理工大学,新加坡) School of Biomedical Engineering, Southern Medical University, China(南方医科大学生物医学工程学院,中国) Singapore University of Technology and Design, Singapore(新加坡科技设计大学,新加坡) University of Auckland, New Zealand(奥克兰大学,新西兰) Shanghai Artificial Intelligence Laboratory, China(上海人工智能实验室,中国)

AI总结 提出首个牙科专用AI智能体OralAgent,通过集成22种视觉分析工具和368本经典牙科教科书,实现多模态推理、工具决策与知识检索的自动化框架,在多个基准上达到最优性能。

Comments 14 pages, 7 figures, 6 tables

详情
AI中文摘要

牙科影像分析在支持口腔医疗的准确诊断和治疗规划中起着关键作用。尽管近期进展产生了针对特定任务和单一成像模态的牙科AI模型,但其孤立的设计限制了在实际临床工作流程中的实用性。在本文中,我们提出了OralAgent,这是首个牙科专用AI智能体,它在端到端自动化框架内统一了多模态推理、基于工具的决策和基于知识的检索。它集成了22种视觉分析工具和368本广泛使用的经典牙科教科书,实现了自主推理、规划、工具使用、知识检索和多步骤工作流执行。此外,我们引入了OralCorpus,这是一个大规模、高质量的双语文本资源,包含1.348亿个标记,专为牙科检索增强生成(RAG)而构建。为了评估模型的多学科牙科知识,我们构建了OralQA-ZH,这是一个中文选择题基准,包含来自11个口腔亚专业的798个项目。大量实验表明,OralAgent在MMOral-Uni、MMOral-OPG和OralQA-ZH基准上达到了最先进的性能,突显了其在真实临床环境中的有效性、可解释性和适应性。代码和模型已在https://github.com/isjinghao/OralAgent公开。

英文摘要

Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.

2605.27365 2026-05-28 cs.CV cs.AI cs.LG cs.RO 版本更新

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

LocateAnything: 基于并行框解码的快速高质量视觉定位

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Princeton University(普林斯顿大学) Nanjing University(南京大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出并行框解码(PBD)方法,将边界框和点作为原子单元单步解码,结合大规模数据集LocateAnything-Data,实现高效统一的目标定位与检测,在保持高精度同时显著提升解码吞吐量。

Comments fix github link

详情
AI中文摘要

视觉语言模型(VLM)通常将视觉定位和检测表述为坐标令牌生成问题,将每个2D框序列化为多个1D令牌,这些令牌在很大程度上独立学习和解码。这种逐令牌解码与框几何的耦合结构不匹配,并且由于严格的顺序生成而造成了实际的推理瓶颈。我们引入了LocateAnything,一个基于并行框解码(PBD)的统一生成式定位和检测框架。通过将边界框和点等几何元素作为原子单元单步解码,LocateAnything保持了框内几何一致性并实现了显著的并行性。我们证明PBD提高了解码吞吐量和定位精度。我们进一步开发了一个可扩展的数据引擎,并策划了LocateAnything-Data,这是一个包含超过1.38亿个训练样本的大规模数据集,大大增加了高精度定位的数据多样性。大量评估表明,LocateAnything推进了速度-精度前沿,在多个基准测试中实现了显著更高的解码吞吐量,同时提高了高IoU定位质量。结果突显了并行框解码和大规模训练数据在实现高效精确的统一视觉定位和检测中的互补优势。

英文摘要

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
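
Because the abstract reports gains in "high-IoU localization quality", a short sketch of how such a metric is typically computed may help. It assumes boxes are (x1, y1, x2, y2) tuples and that predictions are already paired one-to-one with ground-truth boxes; the 0.75 threshold and the function names are hypothetical, and the paper's benchmarks may use different matching rules.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def high_iou_rate(preds, targets, threshold=0.75):
    """Fraction of paired predictions whose IoU meets a strict threshold."""
    pairs = list(zip(preds, targets))
    if not pairs:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(high_iou_rate([(0, 0, 2, 2)], [(0, 0, 2, 2)]))
```

IoU depends on all four coordinates jointly, which is the coupling the abstract argues is poorly served by decoding each coordinate token independently.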

2605.27348 2026-05-28 cs.CV cs.AI 版本更新

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

当眼睛背叛AI:社交注视一致性作为AI生成图像检测的语义线索

Jihyeon Kim, Sohee Kim, Soosan Lee, Souhwan Jung, James Matthew Rehg, Hyesong Choi

发表机构 * School of Computer Engineering(计算机工程学院) Hoseo University(湖西大学) School of Electronic Engineering(电子工程学院) Soongsil University(崇实大学) School of Computer Science(计算机科学学院) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出社交注视一致性作为高层语义线索,通过构建诊断数据集、块组合描述监督和跨架构验证,证明该线索能有效检测AI生成图像,并解释其跨生成器迁移的机制。

Comments 23 pages, 2 figures, 17 tables

详情
AI中文摘要

最近的生成模型在很大程度上缩小了低级伪影(像素指纹、频率异常、上采样痕迹)的差距,特别是在以人为中心和局部编辑的设置中,其中被操纵的区域很小且被光度真实的内容包围。我们引入了社交注视一致性,这是一个高层语义线索,定义为互动个体之间注视方向、头眼对齐和瞳孔放置的相互一致性,并表明它构成了一个先前未被充分利用的检测轴,与现有的低级范式正交。我们通过三个耦合机制实例化这一见解:(i) 一个受控的诊断数据集,具有注视一致图像的特定区域扰动,其中严格的成对分组阻止了生成器指纹记忆作为优化时间捷径,而不是依赖增强;(ii) 块组合描述监督,它在1250个宏观组合描述中保持一个单一的5块推理骨架不变,将推理一致性与表面多样性解耦;(iii) 跨架构验证表明,相同的监督在COCOAI交互子集上将视觉语言骨干(FakeVLM)的平衡准确率提高了3.7个百分点(67.8 -> 71.5),在COCOAI人物子集上提高了1.3个百分点(83.0 -> 84.3),并且在仅视觉骨干(Effort)上也有持续提升,证明了骨干无关的线索。真实类和伪造类召回率同时上升,排除了“全预测为伪造”的伪影。一个四步机制解释——成对编辑捷径阻断、难到易难度转移、CLIP先验保留以及扩散族在眼周结构中的共享频谱弱点——解释了为什么在单个修复模型(FLUX.1-Fill)上训练能够迁移到多生成器套件。我们将在论文被接收后发布代码以促进可重复性。

英文摘要

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
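
The remark that both class recalls rise together matters because balanced accuracy is the mean of the per-class recalls. The sketch below shows the metric and why a detector that labels everything as fake cannot exceed 0.5. The label strings and function names are hypothetical.

```python
def class_recalls(y_true, y_pred, labels=("real", "fake")):
    """Recall for each class: correctly predicted items / items of that class."""
    recalls = {}
    for label in labels:
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / total if total else float("nan")
    return recalls


def balanced_accuracy(y_true, y_pred, labels=("real", "fake")):
    """Unweighted mean of the per-class recalls."""
    recalls = class_recalls(y_true, y_pred, labels)
    return sum(recalls.values()) / len(recalls)


if __name__ == "__main__":
    truth = ["real"] * 4 + ["fake"] * 4
    always_fake = ["fake"] * 8
    print(class_recalls(truth, always_fake))
    print(balanced_accuracy(truth, always_fake))
```

A degenerate "predict-all-fake" detector reaches full fake-class recall but zero real-class recall, so its balanced accuracy is 0.5. Gains in balanced accuracy accompanied by higher recall on both classes therefore cannot come from that shortcut.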

2605.27155 2026-05-28 cs.CV cs.AI 版本更新

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

通过修复进行语义鲁棒性探测:面向安全关键目标检测的交互工具

Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock

发表机构 * Federal Institute for Occupational Safety and Health (BAuA)(联邦职业安全与健康研究所) Fraunhofer Institute for Manufacturing Engineering and Automation IPA(弗劳恩霍夫制造工程与自动化研究所IPA)

AI总结 提出SemProbe工具,通过扩散模型可控修复生成语义探针,支持用户自定义掩码和因素,自动评估并记录目标检测模型的鲁棒性变化。

详情
AI中文摘要

在安全关键领域测试目标检测器需要超越像素级损坏的语义上有意义的探针。我们提出了SemProbe,一个用于语义鲁棒性探测的工具:用户上传部署图像,手动或自动创建掩码,选择操作设计域衍生因素(或自定义提示),并运行基于扩散的可控修复。系统支持批量作业、并行种子/工作流变体以及可配置的生成参数。每次输出后,自动运行模型推理并显示带有性能差异的带注释的前后对比。所有探针都作为结构化工件记录,从而能够提供与安全评估工作流一致的可追溯鲁棒性证据。我们在尺寸锯的手部检测上演示了SemProbe,针对保险导向测试标准中的因素。

英文摘要

Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually or automatically, select operational design domain-derived factors (or custom prompts), and run diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters. After each output, model inference runs automatically and displays annotated before/after comparisons with performance deltas. All probes are logged as structured artifacts, enabling traceable robustness evidence aligned with safety evaluation workflows. We demonstrate SemProbe on hand detection for dimension saws, targeting factors from insurance-oriented test criteria.

2605.26624 2026-05-28 cs.CV 版本更新

MSCGC-KAN: Multi-scale Causal Graph Convolution and Kolmogorov-Arnold Feature Mapping for EEG Emotion Recognition

MSCGC-KAN: 用于脑电情感识别的多尺度因果图卷积与Kolmogorov-Arnold特征映射

Haoliang Gong, Qingshan She, Jiale Xu, Yunyan Gao, Xugang Xi

发表机构 * School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China(杭州电子科技大学自动化学院) Zhejiang Provincial Key Laboratory of Brain Computer Collaborative Intelligence Technology(浙江省脑机协同智能技术与应用重点实验室)

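AI summary: Proposes MSCGC-KAN, which builds a structured task head from multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping on a pre-trained CBraMod backbone, strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping, and markedly improving EEG emotion recognition.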

详情
AI中文摘要

基于脑电图的情感识别是一项重要的情感计算任务,最近的脑电图基础模型为下游适应提供了有用的通用表示。然而,在微调设置下,三个局限性仍然突出:多尺度情感动态建模不足、通道间功能连接利用不充分以及简单线性分类头的表达能力有限。为了解决这些问题,本文提出了一种新的脑电情感识别方法,称为MSCGC-KAN,它引入了一个由多尺度因果图卷积和Kolmogorov-Arnold特征映射组成的结构化任务头。基于预训练的CBraMod骨干,MSCGC-KAN通过在紧凑的任务特定头中联合加强多尺度时间建模、可学习通道间连接建模和非线性判别映射来增强下游适应。这种设计保留了基础模型的表示优势,同时使分类器对情感相关的时空模式更加敏感。在公开的FACED和SEED-VII数据集上进行了大量实验。所提方法在FACED上实现了60.66%的平衡准确率、0.5525的Cohen's Kappa和60.40%的加权F1分数,在SEED-VII上分别获得了33.27%、0.2223和33.64%。与CBraMod+Linear基线相比,在两个数据集上平衡准确率分别提高了5.91和2.03个百分点。这些结果表明,在微调预训练脑电模型时,结构化任务头设计是改进脑电情感识别的有效方法。

英文摘要

Electroencephalogram (EEG)-based emotion recognition is an important affective computing task, and recent EEG foundation models provide useful generic representations for downstream adaptation. However, under the fine-tuning setting, three limitations remain prominent: insufficient modeling of multi-scale emotional dynamics, inadequate exploitation of inter-channel functional connectivity, and the limited expressive power of simple linear classification heads. To address these issues, this paper proposes a new EEG emotion recognition method, termed MSCGC-KAN, which introduces a structured task head composed of multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping. Built on a pre-trained CBraMod backbone, MSCGC-KAN enhances downstream adaptation by jointly strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping within a compact task-specific head. This design preserves the representation advantage of the foundation model while making the classifier more sensitive to emotion-related spatiotemporal patterns. Extensive experiments are conducted on the public FACED and SEED-VII datasets. The proposed method achieves a balanced accuracy of 60.66%, a Cohen's Kappa of 0.5525, and a weighted F1-score of 60.40% on FACED, and obtains 33.27%, 0.2223, and 33.64%, respectively, on SEED-VII. Compared with the CBraMod+Linear baseline, the balanced accuracy is improved by 5.91 and 2.03 percentage points on the two datasets, respectively. These results indicate that structured task-head design is an effective way to improve EEG emotion recognition when fine-tuning pre-trained EEG models.
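
For readers unfamiliar with two of the metrics reported above, the following is a minimal sketch of how balanced accuracy and Cohen's Kappa are conventionally computed from a confusion matrix. It is not taken from the paper's code; the function names and the row-is-true-class layout are assumptions made for illustration.

```python
def balanced_accuracy(confusion):
    """Mean per-class recall.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j (layout assumed for this sketch).
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total > 0:  # skip classes with no true samples
            recalls.append(row[i] / total)
    return sum(recalls) / len(recalls)


def cohens_kappa(confusion):
    """Agreement between predictions and labels, corrected for chance."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    # Undefined when expected == 1 (every sample in a single class).
    return (observed - expected) / (1 - expected)
```

On the toy matrix `[[8, 2], [4, 6]]` the per-class recalls are 0.8 and 0.6, so balanced accuracy is 0.7 and Kappa is about 0.4. Separately, the gains quoted in the abstract imply CBraMod+Linear balanced accuracies of roughly 54.75% on FACED and 31.24% on SEED-VII.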

2605.26391 2026-05-28 cs.GR cs.CV 版本更新

Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing

Garment Particles: 一种用于生成和编辑的2D-3D对称服装表示

Kiyohiro Nakayama, I-Chao Shen, Ruofan Liu, Yiming Wang, Gordon Wetzstein, Takeo Igarashi

发表机构 * Stanford University USA(斯坦福大学) The University of Tokyo Japan(东京大学) Institute of Science Tokyo Japan(东京科学研究所) ETH Zurich Switzerland(苏黎世联邦理工学院)

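AI summary: Proposes Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry; the Garment Particles Flow framework built on it supports generation from high-level inputs and a range of editing operations, achieving state-of-the-art garment generation results.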

详情
AI中文摘要

实际服装设计跨越两种模式:从高级意图(如参考图像或文本描述)进行直观创建,以及在2D裁剪图和3D悬垂几何之间进行复杂的低级编辑,这需要专业培训才能驾驭其复杂的相互依赖性。然而,现有框架仅解决了这一挑战的一部分,提供了从随意输入生成服装或直接在裁剪图上编辑的功能。为了支持这两种需求,我们提出了Garment Particles,一种5D点云表示,联合编码2D裁剪图和3D几何。这种表示使得Garment Particles Flow(GPF)成为可能,这是一个整流流框架,支持从高级输入(文本、图像、草图)进行直观生成,并通过扩散后验采样对2D裁剪图和3D几何进行各种编辑操作。最后,我们引入了Particles-to-Pattern Flow,将生成的服装粒子转换为基于曲线的裁剪图以进行模拟。我们在多个数据集上验证了模型的生成能力,与竞争基线相比实现了最先进的服装生成结果。我们的模型还支持许多服装编辑场景,包括服装插值、裁剪图编辑、点云和轮廓条件服装生成。我们的项目网站位于 https://garment-particles.github.io。

英文摘要

Practical garment design spans two modes: intuitive creation from high-level intent, such as a reference image or text description, and complex low-level editing across 2D sewing patterns and 3D draped geometry, which requires professional training to navigate their complex interdependencies. Yet existing frameworks address only part of this challenge, offering either garment generation from casual inputs or direct editing on sewing patterns. To support both ends of the spectrum, we propose Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry. This representation enables Garment Particles Flow (GPF), a rectified flow framework that supports intuitive generation from high-level inputs (text, images, sketches) and various editing operations on 2D sewing patterns and 3D geometries via diffusion posterior sampling. Finally, we introduce Particles-to-Pattern Flow, which converts generated garment particles into curve-based patterns for simulation. We validate our model's generation ability on multiple datasets, achieving state-of-the-art garment generation results against competitive baselines. Our model also enables many garment editing scenarios, including garment interpolation, sewing pattern editing, and point-cloud- and silhouette-conditioned garment generation. Our project website is at https://garment-particles.github.io.

2605.25767 2026-05-28 cs.CV 版本更新

SAFE-Diff: Scale-Aware Attention and Feature-Dispersive Diffusion with Uncertainty Estimation for Contrast-Enhanced Breast MRI Synthesis

SAFE-Diff: 用于对比增强乳腺MRI合成的尺度感知注意力与特征分散扩散及不确定性估计

Tianyu Zhang, Xinglong Liang, Jarek van Dijk, Luyi Han, Chunyao Lu, Antonio Portaluri, Xinghe Xie, Yaofei Duan, Nika Rasoolzadeh, Xin Wang, Yuan Gao, Muzhen He, Yue Sun, Jonas Teuwen, Tao Tan, Ritse Mann

Affiliations * Department of Medical Imaging, Radboud University Medical Center; Department of Radiology, Netherlands Cancer Institute; Maastro Clinic; Faculty of Applied Science, Macao Polytechnic University; Department of Radiation Oncology, Netherlands Cancer Institute

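AI summary: Proposes SAFE-Diff, which uses scale-aware attention, feature-dispersive diffusion, and uncertainty estimation to address the complex lesion textures and heterogeneous enhancement patterns that make contrast-enhanced breast MRI synthesis difficult.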

Comments Early accepted by MICCAI 2026

详情
AI中文摘要

合成高保真度的对比增强MRI对于更安全、更高效的乳腺癌筛查具有临床价值,但由于复杂的病灶纹理和异质性增强模式,仍然具有挑战性。

英文摘要

Synthesizing high-fidelity contrast-enhanced MRI is clinically valuable for safer and more efficient breast cancer screening, yet remains challenging due to complex lesion textures and heterogeneous enhancement patterns.

2605.25378 2026-05-28 cs.CV cs.AI 版本更新

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

CollectionLoRA: 通过多教师在线策略蒸馏将50种效果收集到1个LoRA中

Fangtai Wu, Hailong Guo, Shijie Huang, Jiayi Song, Yubo Huang, Mushui Liu, Zhao Wang, Yunlong Yu, Jiaming Liu, Ruihua Huang

发表机构 * Zhejiang University(浙江大学) Qwen Applications Business Group of Alibaba(阿里巴巴Qwen应用业务组) Xi'an Jiaotong University(西安交通大学)

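AI summary: Proposes CollectionLoRA, a multi-teacher on-policy distillation framework that consolidates up to 50 different effect LoRAs and few-step generation capability into a single LoRA, resolving parameter interference and lowering deployment cost.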

详情
AI中文摘要

定制图像编辑旨在使用有限的配对数据,通常通过低秩适配(LoRA)为预训练扩散模型配备特定的视觉效果。随着所需效果数量的增加,存储和动态加载这些效果LoRA会显著增加部署开销。此外,当前的流程通常将这些效果LoRA与加速模块级联以实现快速生成,这会引发严重的参数干扰,导致概念混淆和风格退化。我们提出了CollectionLoRA,一个多教师在线策略蒸馏框架,能够将多达50种不同效果LoRA的概念以及少步生成能力蒸馏到单个LoRA中。这从根本上解决了特征干扰问题,并显著降低了部署成本。具体来说,该方法引入了(i)概率双流路由机制,使模型在训练期间能够在数据源之间随机切换,有效增强其在未见场景中的泛化能力;(ii)非对称正交提示策略,在提示空间内实现概念隔离;(iii)从粗到细的蒸馏目标,以缓解教师模型与学生模型之间的分布差距。大量评估表明,CollectionLoRA将所有定制效果和少步生成蒸馏到单个LoRA中,降低了部署开销,同时实现了与独立训练的教师模型相当或更好的概念保真度。代码:https://github.com/Qwen-Applications/CollectionLoRA

英文摘要

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading these numerous effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that enables the model to randomly switch between data sources during training, effectively enhancing its generalization in unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space; (iii) a Coarse-to-Fine Distillation Objective to mitigate the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few-step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than independently trained teacher models. Code: https://github.com/Qwen-Applications/CollectionLoRA

2605.24302 2026-05-28 cs.CV 版本更新

Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies

基于Mamba的第一人称视频跨模态动作识别:通过CLS令牌融合策略整合RGB和手部骨架流

Juan Ignacio Bustos Gorostegui, Maria Elena Buemi

发表机构 * Univ. of Buenos Aires. Faculty of Exact and Natural Sciences. Dept. of Computer Science (DC)(布宜诺斯艾利斯大学。精确与自然科学学院。计算机科学系) CONICET-Univ. of Buenos Aires. Institute of Computer Sciences (ICC)(布宜诺斯艾利斯大学CONICET联合体。计算机科学研究所)

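AI summary: Proposes a Mamba-based cross-modal architecture that fuses RGB video and hand-skeleton data through four CLS token fusion strategies (Naive, Average, Weighted, and Context-based); on the H2O dataset the Average strategy performs best, improving Top-1 accuracy by more than 10% in the Tiny configuration.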

Comments 4 pages, 2 figures, EgoVis 2026, CVPR 2026

详情
AI中文摘要

第一人称动作识别由于相机运动不稳定、手部频繁遮挡以及随时间保持一致视觉表示的困难而具有挑战性。在这项工作中,我们提出了一种跨模态架构,将RGB视频和时间手部骨架数据结合在一个统一的基于Mamba的框架内,利用状态空间模型(SSMs)的线性时间复杂度。我们的架构由三个组件组成:用于视觉特征提取的VideoMamba模块、基于Mamba块堆叠的骨架编码器,以及将两种模态整合为单一表示的融合模块。本工作的一个核心贡献是设计和评估了四种用于多模态融合的类(CLS)令牌混合策略:朴素、平均、加权和基于上下文。这些策略在如何利用预训练的单模态CLS令牌(其作用是作为信息汇聚集所学表示)来初始化用于最终分类的混合CLS令牌方面有所不同。我们在H2O数据集上评估了所有策略。实验结果表明,平均策略实现了最佳性能,在Tiny配置下比VideoMamba基线提高了超过10%的Top-1准确率,在Small配置下提高了2%。

英文摘要

Egocentric action recognition is a challenging task due to erratic camera motion, frequent hand occlusion, and the difficulty of maintaining consistent visual representations over time. In this work, we propose a cross-modal architecture that combines RGB video and temporal hand skeleton data within a unified Mamba-based framework, exploiting the linear time complexity of State Space Models (SSMs). Our architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation. A central contribution of this work is the design and evaluation of four Class (CLS) token mixing strategies for multimodal fusion: Naive, Average, Weighted, and Context-based. These strategies differ in how the pretrained unimodal CLS tokens, whose role is to act as information sinks concentrating learned representations, are leveraged to initialize the mixed CLS token used for final classification. We evaluate all strategies on the H2O dataset. Experimental results show that the Average strategy achieves the best performance, yielding gains of over 10% Top-1 accuracy in the Tiny configuration and 2% in the Small configuration over the VideoMamba baseline.
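
The abstract names the mixing strategies without defining them, so the following is only one plausible reading of "Average" and "Weighted", sketched on plain Python lists. The function names, the scalar `rgb_weight`, and the assumption that both CLS tokens share the same dimensionality are all hypothetical and not confirmed by the paper.

```python
def average_fusion(rgb_cls, skeleton_cls):
    """Initialize the mixed CLS token as the element-wise mean of the
    two pretrained unimodal CLS tokens (assumed interpretation)."""
    if len(rgb_cls) != len(skeleton_cls):
        raise ValueError("CLS tokens must have the same dimensionality")
    return [(a + b) / 2 for a, b in zip(rgb_cls, skeleton_cls)]


def weighted_fusion(rgb_cls, skeleton_cls, rgb_weight):
    """Convex combination of the two CLS tokens.

    rgb_weight is a hypothetical scalar in [0, 1]; in a real model it
    could be a learned parameter rather than a fixed constant.
    """
    if len(rgb_cls) != len(skeleton_cls):
        raise ValueError("CLS tokens must have the same dimensionality")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    return [rgb_weight * a + (1.0 - rgb_weight) * b
            for a, b in zip(rgb_cls, skeleton_cls)]
```

Under this reading, Average is the special case of Weighted with `rgb_weight = 0.5`; the difference that matters in practice would be whether the weight is fixed or learned.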

2605.23908 2026-05-28 cs.AI cs.CL cs.CV cs.NE 版本更新

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

寻找开放性的要素:用大型视觉语言模型复现 Picbreeder

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

发表机构 * New York University(纽约大学) Massachusetts Institute of Technology(麻省理工学院)

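AI summary: Replicates Picbreeder with frontier vision-language models in place of human users to explore AI's capacity for open-ended, unguided discovery; analyzes how system outputs differ from the human baseline in phylogenetic complexity, visual and semantic salience, and novelty; and studies the effects of exploratory noise, behavioral diversity, and narrative momentum.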

Comments 26 pages, 21 figures, to be published at GECCO 2026

详情
AI中文摘要

我们正处于大规模工业和学术努力之中,旨在通过AI驱动的助手自动化科学、技术和创造性生产的过程。历史上,这些过程在人类形式中的一个基本属性是它们的开放性:即生成看似无穷无尽的新颖且有意义的新形式的能力。人工代理是否有能力进行这种富有成果的无引导发现?为了回答这个问题,我们转向Picbreeder,这是人类驱动的开放性搜索的典型范例,用户通过小型神经网络的交互式进化协作生成多样化的图像库。我们复现了Picbreeder,用前沿视觉语言模型(VLM)替代人类用户。我们观察到系统输出与历史人类基线之间存在明显的定性差异,并尝试使用系统发育复杂性、视觉和语义显著性及新颖性的指标来表征这些差异。为了识别导致这些差异的一些因果因素,我们研究了在代理的选择过程中添加探索性噪声、代理之间的行为多样性以及以过去行动记忆形式的叙事动量。我们的代码可在 https://github.com/smearle/picbreeder-vlm 获取。

英文摘要

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

2605.23192 2026-05-28 cs.CV 版本更新

Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing

遮挡感知的物理-语义关键帧选择用于鲁棒视频编辑

Lin Liu, Zhihan Xiao, Haohang Xu, Rong Cong, Zhibo Zhang, Xiaopeng Zhang, Qi Tian

发表机构 * Huawei(华为) Tsinghua University(清华大学) East China Normal University(华东师范大学)

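AI summary: Proposes an occlusion-aware physics-semantic keyframe selection framework that scores candidate frames on structural completeness, tracking stability, and attribute visibility to automatically select the best anchor frame, then uses bidirectional tracking to generate spatiotemporal masks for robust, temporally consistent video editing.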

详情
AI中文摘要

近年来,基于扩散的生成模型在视频编辑领域取得了显著进展,能够根据自然语言指令实现多样化的对象级操作。然而,现有方法在遮挡、视角变化和快速物体运动场景下常常表现不佳,不可靠的视觉观测导致定位不准确、时间闪烁和编辑不一致。在本工作中,我们识别出缺乏可靠视觉锚点是遮挡鲁棒视频编辑的一个根本瓶颈。为解决此问题,我们提出了一种遮挡感知的物理-语义关键帧选择框架,该框架自动为下游编辑识别最优锚定帧。具体而言,我们的方法从三个互补角度评估候选帧:避免截断观测的结构完整性、衡量物理可靠性的循环一致跟踪稳定性、以及确保语义清晰性的基于视觉语言的属性可见性。选定的关键帧随后通过双向跟踪传播,生成密集的时空掩码,这些掩码作为扩散视频编辑骨干的辅助监督。通过将遮挡处理从显式重建转变为可靠锚点选择,我们的框架无需手动标注即可实现精确且时序一致的编辑。在具有挑战性的视频编辑基准上的大量实验证明了我们方法的有效性和高质量性能。

英文摘要

Video editing has recently achieved remarkable progress with diffusion-based generative models, enabling diverse object-level manipulations from natural language instructions. However, existing methods often struggle under occlusion, viewpoint changes, and fast object motion, where unreliable visual observations lead to inaccurate localization, temporal flickering, and inconsistent edits. In this work, we identify the absence of reliable visual anchors as a fundamental bottleneck in occlusion-robust video editing. To address this issue, we propose an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame for downstream editing. Specifically, our method evaluates candidate frames from three complementary perspectives: structural completeness for avoiding truncated observations, cycle-consistent tracking stability for measuring physical reliability, and vision-language-based attribute visibility for ensuring semantic clarity. The selected keyframe is then propagated through bidirectional tracking to generate dense spatiotemporal masks, which are used as auxiliary supervision for a diffusion-based video editing backbone. By transforming occlusion handling from explicit reconstruction into reliable anchor selection, our framework enables precise and temporally consistent editing without requiring manual annotations. Extensive experiments on challenging video editing benchmarks demonstrate the effectiveness and high-quality performance of our method.

2605.23137 2026-05-28 eess.IV cs.CV 版本更新

STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding

STAMBRIDGE:用于脑电视觉解码的谱时幅度感知中间特征桥

Jiahe Meng, Weiming Zeng, Yueyang Li, Bo Chai, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang

Affiliations * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University; Department of Language Science and Technology, The Hong Kong Polytechnic University; Department of Neurology, Affiliated Lianyungang Hospital of Xuzhou Medical University; Institute of Computing and Intelligence, Harbin Institute of Technology Shenzhen

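AI summary: Proposes STAMBRIDGE, a two-stage framework that extracts robust EEG features with Spectral-Temporal Amplitude-aware Modulation (STAM) and achieves stable cross-modal alignment with a Mid-Feature Semantic Bridge (MFSB), reaching 34.50% Top-1 and 65.95% Top-5 accuracy in 200-way zero-shot retrieval on the THINGS-EEG benchmark.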

详情
AI中文摘要

脑电图(EEG)视觉解码由于低信噪比神经信号与高度结构化的视觉-语言空间之间的模态差距而仍然具有挑战性,使得直接的跨模态对齐不稳定。为了解决这个问题,我们提出了STAMBRIDGE,一个通用的两阶段框架,依次处理特征条件和跨模态对齐。首先,我们引入谱时幅度感知调制(STAM)来提取良好条件的EEG表示。通过用幅度导出的软通道权重和多尺度时间卷积替代硬频率掩蔽,STAM明确保留了频率感知的瞬态,同时降低了时域振铃伪影的风险。在这些稳健的神经特征基础上,我们进一步引入了一个模型无关的中间特征语义桥(MFSB),通过定向的跨模态交互构建一个正则化的中间空间,实现分阶段蒸馏和更稳定的语义对齐。在THINGS-EEG基准上的实验显示了具有竞争力的200路零样本检索性能,Top-1准确率为34.50%,Top-5准确率为65.95%。此外,STAMBRIDGE学习的嵌入使用扩散模型产生了语义连贯的图像重建,展示了稳健的EEG到视觉语义对齐。代码可在https://github.com/thabeatmjh/STAMBRIDGE获取。

英文摘要

Electroencephalography (EEG) visual decoding remains challenging due to the modality gap between low-SNR neural signals and highly structured vision-language spaces, making direct cross-modal alignment unstable. To address this, we propose STAMBRIDGE, a versatile two-stage framework that sequentially tackles feature conditioning and cross-modal alignment. First, we introduce Spectral-Temporal Amplitude-aware Modulation (STAM) to extract well-conditioned EEG representations. By replacing hard frequency masking with amplitude-derived soft channel weighting and multi-scale temporal convolutions, STAM explicitly preserves frequency-aware transients while reducing the risk of time-domain ringing artifacts. Building upon these robust neural features, we further introduce a model-agnostic Mid-Feature Semantic Bridge (MFSB) that constructs a regularized intermediate space through directed cross-modal interactions, enabling staged distillation and more stable semantic alignment. Experiments on the THINGS-EEG benchmark show competitive 200-way zero-shot retrieval performance, with 34.50% Top-1 and 65.95% Top-5 accuracy. In addition, embeddings learned by STAMBRIDGE produce semantically coherent image reconstructions with a diffusion model, demonstrating robust EEG-to-vision semantic alignment. The code is available at: https://github.com/thabeatmjh/STAMBRIDGE.
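
To make the retrieval numbers above concrete, here is a minimal sketch of how Top-k accuracy is typically scored in an N-way retrieval setup. It is not the authors' evaluation code; the function name and the convention that query `i` matches candidate `i` are assumptions for illustration.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose correct candidate is ranked in the top k.

    similarity[i][j] is the score between EEG query i and candidate
    image j. The correct candidate for query i is assumed to sit at
    index i (a convention chosen for this sketch).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```

For scale, uniform random guessing over 200 candidates would give 0.5% Top-1 and 2.5% Top-5 accuracy, which is the floor against which the reported 34.50% and 65.95% should be read.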

2605.22547 2026-05-28 cs.CV cs.AI 版本更新

Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

基于多模态知识图谱和可靠性引导精化的病例感知医学图像分类

Yiming Xu, Yixuan Liu, Yuhang Zhang, Ling Zheng, Yihan Wang, Qi Song

Affiliations * University of Science and Technology of China

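AI summary: Proposes a case-aware reasoning framework built on multimodal knowledge graphs that improves the performance and interpretability of medical image classification by constructing a structured diagnostic memory, adaptively retrieving similar cases, propagating and injecting knowledge, and refining decisions with confidence calibration.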

详情
AI中文摘要

深度学习为医学图像分类带来了显著进展,但现有方法大多依赖孤立的视觉证据,无法有效利用相似病例或外部知识。在临床实践中,诊断通常由相似历史病例及其相关症状支持。为了显式建模这一循证诊断过程,我们提出了一种由多模态知识图谱驱动的病例感知推理框架,用于医学图像分类。具体而言,我们构建了一个病例感知的多模态知识图谱作为结构化的诊断记忆,其中疾病、图像和症状按层次组织。给定输入图像,我们的方法自适应地从该记忆中检索相似病例,并提取相应的以病例为中心的子图。我们进一步引入了一种知识传播与注入机制,其中以图像为中心的图注意力网络将异质语义聚合为基于病例的特征,随后通过双向跨模态注意力机制将这些特征注入视觉表示以实现跨模态对齐。为了减轻噪声检索,我们设计了一种置信度校准的决策精化方案,通过联合考虑预测置信度和样本相似性来估计每个检索病例的可靠性,并重新加权其对最终预测的贡献,提供可解释的病例级证据。在多个医学影像数据集上的大量实验表明,我们的方法一致优于强基线,而消融和定性分析验证了其有效性和可解释性。代码可在 https://anonymous.4open.science/r/MKG-CARE-8B7B 获取。

英文摘要

Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by similar historical cases and their associated symptoms. To explicitly model this evidence-based diagnostic process, we propose a case-aware reasoning framework driven by multimodal knowledge graphs for medical image classification. Specifically, we construct a case-aware multimodal knowledge graph as a structured diagnostic memory, where diseases, images, and symptoms are hierarchically organized. Given an input image, our method adaptively retrieves similar cases from this memory and extracts their corresponding case-centered subgraphs. We further introduce a knowledge propagation and injection mechanism, in which an image-centric Graph Attention Network aggregates heterogeneous semantics into case-based features, followed by a bidirectional cross-modal attention mechanism that injects these features into visual representations for cross-modal alignment. To mitigate noisy retrieval, we design a confidence-calibrated decision refinement scheme that estimates the reliability of each retrieved case by jointly considering prediction confidence and sample similarity, and reweights its contribution to the final prediction, providing interpretable case-level evidence. Extensive experiments on multiple medical imaging datasets demonstrate that our approach consistently outperforms strong baselines, while ablation and qualitative analyses validate its effectiveness and interpretability. The code is available at https://anonymous.4open.science/r/MKG-CARE-8B7B.
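
The reliability-guided refinement step can be illustrated with a small sketch. The abstract says only that reliability jointly considers prediction confidence and sample similarity; the product used below, the normalization, and the function names are assumptions for illustration, not the paper's formulation.

```python
def reliability_weights(confidences, similarities):
    """Score each retrieved case by confidence * similarity, then
    normalize so the weights sum to 1 (the product is an assumed
    combination rule)."""
    raw = [c * s for c, s in zip(confidences, similarities)]
    total = sum(raw)
    if total == 0:
        # No case is trusted at all; fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def refine_prediction(case_probs, weights):
    """Reliability-weighted average of per-case class distributions.

    case_probs[n][c] is the probability of class c suggested by
    retrieved case n.
    """
    n_classes = len(case_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, case_probs))
            for c in range(n_classes)]
```

Because the weights are explicit per retrieved case, they double as the case-level evidence the abstract mentions: a reader can see which historical cases drove a prediction and by how much.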

2605.19729 2026-05-28 cs.CV cs.AI 版本更新

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

LIFT and PLACE: 一种简单、稳定且有效的轻量级扩散模型知识蒸馏框架

Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo

发表机构 * Ulsan National Institute of Science and Technology (UNIST)(ulsan国家科学技术研究所)

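AI summary: Proposes the LIFT and PLACE framework, whose coarse-to-fine distillation strategy addresses the difficulty a student has in imitating a high-complexity teacher network; training remains stable and performs well even under extreme compression.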

Comments Project page: https://hyun-s.github.io/LIFT_PLACE_site, 15 pages, 11 figures, 9 tables, To appear in CVPR 2026

详情
AI中文摘要

我们证明,在扩散模型的知识蒸馏中,教师网络由于其更大的容量而具有高度复杂的去噪过程,这给学生模型忠实模仿带来了重大挑战。为了解决这个问题,我们提出了一种基于线性拟合蒸馏(LIFT)和分段局部自适应系数估计(PLACE)的粗到细蒸馏框架。首先,LIFT将目标分解为“粗”对齐和“细”细化。学生先在粗对齐上训练,然后进行困难的细化。其次,PLACE通过将输出划分为基于误差的组来扩展LIFT以处理空间非均匀误差,提供局部自适应指导。我们的实验表明,LIFT和PLACE在扩散空间(图像/潜在)、骨干网络(U-Net/DiT)、任务(无条件/条件)、数据集上均有效,甚至扩展到基于流的模型如MMDiT(SD3)。此外,在极端压缩下(学生参数1.3M,仅为教师的1.6%),传统KD无法为稳定训练提供足够指导,FID分数常退化到50-200+,但我们的方法仍稳定收敛并达到15.73的FID。

英文摘要

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process, stemming from its substantially larger capacity, poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE are effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), and datasets, and even extend to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

2605.20150 2026-05-28 cs.CV cs.PF 版本更新

TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

TideGS: 通过外存优化训练超过十亿个3D高斯溅射基元

Chonghao Zhong, Linfeng Shi, Hua Chen, Tiecheng Sun, Hao Zhao, Binhang Yuan, Chaojian Li

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

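AI summary: To address the memory bottleneck in large-scale 3D Gaussian Splatting training, proposes TideGS, an out-of-core training framework that uses hierarchical SSD-CPU-GPU management and three synergistic techniques to train more than one billion Gaussian primitives on a single GPU with the best reconstruction quality among the evaluated single-GPU baselines.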

Comments Accepted to ICML 2026 as Spotlight. Website: https://sponge-lab.github.io/TideGS

详情
AI中文摘要

训练十亿基元规模的3D高斯溅射(3DGS)本质上是内存受限的:每个高斯基元携带一个大的属性向量,总参数表迅速超出GPU容量,限制了先前系统在商用单GPU硬件上只能处理数千万高斯基元。我们观察到3DGS训练本质上是稀疏且轨迹条件的:每次迭代仅激活当前相机批次可见的高斯基元,因此GPU内存可以作为工作集缓存而非持久参数存储。基于这一洞察,我们引入了TideGS,一个外存训练框架,通过三种协同技术管理SSD-CPU-GPU层次结构中的参数:用于SSD对齐空间局部性的块虚拟化几何、用于将I/O与计算重叠的分层异步流水线,以及轨迹自适应差分流,该流在迭代之间仅传输增量工作集变化。实验表明,TideGS能够在单个24 GB GPU上训练超过十亿个高斯基元,同时在大规模场景中实现评估的单GPU基线中最佳的重建质量,超越了先前的外存基线(例如约1亿高斯基元)和标准内存训练(例如约1100万高斯基元)。

英文摘要

Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).
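
The idea of transferring only incremental working-set deltas can be sketched with plain set arithmetic. This is a toy illustration under simplifying assumptions, not the TideGS implementation: it tracks individual primitive IDs rather than SSD-aligned blocks, and it evicts everything that is no longer visible, whereas the real eviction policy is not described in the abstract.

```python
def working_set_delta(resident, visible):
    """Return (to_load, to_evict) for moving the GPU cache from the
    currently resident IDs to the IDs visible in the next camera batch."""
    to_load = visible - resident    # needed next, not yet on the GPU
    to_evict = resident - visible   # on the GPU, no longer needed
    return to_load, to_evict


def stream_trajectory(visible_sets):
    """Count primitives transferred to the GPU over a camera trajectory
    when only deltas are streamed between consecutive iterations."""
    resident = set()
    transferred = 0
    for visible in visible_sets:
        to_load, to_evict = working_set_delta(resident, visible)
        transferred += len(to_load)
        resident = (resident - to_evict) | to_load
    return transferred
```

For a trajectory whose visible sets are `{1, 2, 3}`, `{2, 3, 4}`, and `{3, 4, 5}`, delta streaming moves 5 primitives in total, compared with 9 if each iteration reloaded its full working set. The saving grows with the overlap between consecutive camera batches, which is why the abstract describes the workload as trajectory-conditioned.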

2511.14159 2026-05-28 cs.CV 版本更新

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

MVI-Bench:评估大型视觉语言模型对误导性视觉输入鲁棒性的综合基准

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng

Affiliations * Department of Computer Science, University of Illinois Chicago, Chicago, USA; School of Computer Science & Engineering, Southeast University, Nanjing, China; Guohao School, Tongji University, Shanghai, China

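AI summary: To address existing robustness benchmarks' neglect of misleading visual inputs, proposes MVI-Bench, which builds 1,248 VQA instances in six categories on a three-level hierarchy of visual primitives (visual concept, visual attribute, visual relationship) and introduces the MVI-Sensitivity metric for fine-grained evaluation, revealing pronounced vulnerabilities in 18 LVLMs.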

Comments 18 pages, 9 figures

详情
AI中文摘要

评估大型视觉语言模型(LVLMs)的鲁棒性对于其持续发展和在现实世界应用中的负责任部署至关重要。然而,现有的鲁棒性基准通常关注幻觉或误导性文本输入,而在很大程度上忽视了评估视觉理解时由误导性视觉输入带来的同样关键的挑战。为填补这一重要空白,我们引入了MVI-Bench,这是首个专门设计用于评估误导性视觉输入如何削弱LVLMs鲁棒性的综合基准。基于基本视觉基元,MVI-Bench的设计围绕三个层次的误导性视觉输入:视觉概念、视觉属性和视觉关系。利用这一分类法,我们策划了六个代表性类别,并整理了1248个专家标注的VQA实例。为了促进细粒度的鲁棒性评估,我们进一步引入了MVI-Sensitivity,这是一种新颖的指标,可在细粒度上表征LVLM的鲁棒性。在18个最先进的LVLM上的实证结果揭示了它们对误导性视觉输入的显著脆弱性,我们在MVI-Bench上的深入分析提供了可操作的见解,可以指导开发更可靠和鲁棒的LVLM。基准和代码库可在https://github.com/chenyil6/MVI-Bench获取。

英文摘要

Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.

2605.19342 2026-05-28 cs.CV 版本更新

Semantic-Enriched Latent Visual Reasoning

语义增强的潜在视觉推理

Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu

发表机构 * Department of Automation, Tsinghua University, Beijing, China(清华大学自动化系) Department of Electronic Engineering, Tsinghua University, Beijing, China(清华大学电子工程系) Zhongguancun Academy, Beijing, China(中关村学院) China Agricultural University, Beijing, China(中国农业大学) Peking University, Beijing, China(北京大学) Beijing Institute of Technology, Beijing, China(北京理工大学) Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学人工智能学院) WeChat Vision, Tencent Inc, Beijing, China(微信视觉,腾讯公司)

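AI summary: Proposes SLVR, a two-stage learning framework that enriches latent representations with attribute-level semantic supervision and Multi-query Group Relative Policy Optimization, improving the robustness and semantic consistency of latent visual reasoning.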

详情
AI中文摘要

多模态潜在空间推理旨在通过在紧凑的潜在空间中直接进行视觉推理,来替代使用图像的显式思考。然而,现有方法主要依赖视觉监督,产生的潜在表示缺乏足够的语义丰富性,限制了它们支持多样化区域级推理任务的能力。在这项工作中,我们引入了语义增强的潜在视觉推理(SLVR),这是一个两阶段学习框架,用属性级视觉语义丰富潜在表示,并将其与多样化的推理目标对齐。在第一阶段,SLVR在细粒度属性监督下学习语义增强的区域中心潜在表示。在第二阶段,我们设计了多查询组相对策略优化(M-GRPO),以对齐基于同一区域的多个查询的潜在表示。为了支持这一框架,我们构建了SLV-Set,包含约40万条区域级属性标注和80万个多查询问答样本,并引入了SV-QA,一个评估语义变化下潜在推理的基准。实验表明,与现有基线相比,SLVR提高了潜在视觉推理的鲁棒性和语义一致性。

英文摘要

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

2605.15864 2026-05-28 cs.CV cs.CL 版本更新

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

VLMs 是在看还是只是在说?揭示视觉重新检查的幻觉

Chufan Shi, Cheng Yang, Yaokang Wu, Linghao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

发表机构 * University of Southern California(南加州大学) University of California San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

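AI summary: Using the image-swap probing framework VisualSwap and the 800-image-pair benchmark VS-Bench, finds that the "re-examine the image" statements vision-language models make during reasoning are mostly textual patterns rather than genuine visual re-examination; thinking models are more susceptible, and user instructions restore visual grounding whereas self-reflection does not.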

Comments ICML 2026 Oral

详情
AI中文摘要

视觉语言模型(VLM)在推理过程中经常产生自我反思的语句,如“让我再检查一下图片”。这样的语句是否触发了真正的视觉重新检查,还是仅仅是习得的文本模式?我们通过 VisualSwap(一种图像交换探测框架)对此进行研究:在模型对一张图像进行推理后,我们将其替换为视觉上相似但语义不同的图像,并测试模型是否注意到这一变化。我们引入了 VS-Bench,包含从 MathVista、MathVerse、MathVision 和 MMMU-Pro 中精选的 800 对图像。在 Qwen3-VL、Kimi-VL 和 ERNIE-VL 上的实验揭示了一个惊人的失败:模型绝大多数情况下忽略了图像交换,准确率下降高达 60%。与直觉相反,思考模型比其指令对应模型脆弱近 3 倍,且扩展规模无法缓解。多轮用户指令可以恢复视觉基础,但连续生成过程中自我生成的反思语句则不能。注意力分析解释了原因:用户指令显著提高了对视觉标记的注意力,而自我反思则没有。当前的 VLM 在声称执行视觉重新检查时倾向于“说”而非真正“看”。我们的代码和数据集可在项目页面获取:https://visualswap.github.io

英文摘要

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

2605.15523 2026-05-28 cs.CV 版本更新

Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

基于上下文学习的自提示扩散Transformer用于开放词汇场景文本编辑

Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao, Xiaochao Qu, Xinxiao Wu, Luoqi Liu, Ting Liu

发表机构 * MT Lab, Meitu Inc., Beijing, China School of Computer Science & Technology, Beijing Institute of Technology, Beijing, China

AI总结 提出一种自提示场景文本编辑方法,通过构建风格和字形提示,利用多模态扩散Transformer的上下文学习能力,实现开放词汇和风格一致的文本编辑。

Comments ICML 2026

详情
AI中文摘要

场景文本编辑旨在修改图像目标区域中的文本,同时保留周围的背景风格和纹理。现有方法仅依赖图像背景信息,而忽略了目标区域的视觉细节,这丢弃了原始文本中的风格特征,本质上将任务降级为文本渲染。此外,预训练的字形编码器施加的条件限制了可编辑文本的范围。为了解决这些问题,本文提出了一种自提示场景文本编辑方法,直接从原始图像构建风格和字形提示,无需引入额外的风格或字形编码器。我们采用两阶段训练策略:扩散Transformer首先在大规模自监督数据上训练,然后使用少量配对图像进行微调。通过利用多模态扩散Transformer(MM-DiT)的上下文学习能力,它实现了开放词汇和风格一致的文本编辑。在各种语言上的实验结果表明,我们的方法在文本准确性和风格一致性方面均达到了最先进的性能。我们的项目页面:hongxiii.github.io/mstedit。

英文摘要

Scene text editing aims to modify text in a target region of an image while preserving surrounding background style and texture. Existing methods rely solely on image background information while neglecting the visual details of target regions, which discards stylistic features in the original text and essentially degrades the task to text rendering. Moreover, the conditions imposed by a pre-trained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), it achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Our project page: hongxiii.github.io/mstedit.

2605.13517 2026-05-28 cs.CV cs.AI cs.LG 版本更新

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

ArcVQ-VAE:一种带有反余弦加性边界的球面向量量化框架

Jaeyung Kim, YoungJoon Yoo

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea(韩国首尔中央大学人工智能系) SNUAILAB, Seoul, Republic of Korea(韩国首尔 SNUAILAB 实验室)

AI总结 针对VQ-VAE有限码本容量限制表示能力的问题,提出ArcVQ-VAE框架,通过引入球面角边先验(包括球界范数正则化和反余弦加性边界损失)增强潜在表示的判别性和均匀分散性,提升码本利用率,在图像重建和生成任务上取得竞争性能。

Comments To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

向量量化变分自编码器(VQ-VAE)已成为图像建模中学习离散表示的基本框架。然而,VQ-VAE模型必须使用有限的码本向量集对整张图像进行分词,这种容量限制限制了其捕获丰富多样表示的能力。在本文中,我们提出反余弦加性边界VQ-VAE(ArcVQ-VAE),一种新颖的向量量化框架,该框架为传统VQ-VAE的码本引入了球面角边先验(SAMP)。所提出的SAMP由球界范数正则化(将所有码本向量约束在时间相关的欧几里得球内)和反余弦加性边界损失(鼓励潜在向量之间更大的角度可分性)组成。这种公式在受限空间内促进了更具判别性和均匀分散的潜在表示,从而提高了有效的潜在空间覆盖范围,并导致码本利用率提升。在标准图像重建和生成任务上的实验结果表明,ArcVQ-VAE在重建精度、表示多样性和样本质量方面与基线模型相比取得了竞争性能。代码可在 https://github.com/goals4292/ArcVQ-VAE 获取。

英文摘要

Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ-VAE
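
To make the idea of an additive angular margin concrete, here is a minimal sketch in the style of ArcFace-type losses, applied between one latent vector and a small codebook. It is an assumption-laden illustration, not the paper's ArcCosine Additive Margin Loss: the function name `angular_margin_loss`, the `margin` and `scale` values, and the use of a softmax cross-entropy over codebook entries are all choices made here for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_margin_loss(z, codebook, target, margin=0.2, scale=10.0):
    """Softmax cross-entropy over scaled cosines, with an additive angular
    margin on the target code (hypothetical sketch, not the paper's loss)."""
    logits = []
    for k, code in enumerate(codebook):
        cos_theta = max(-1.0, min(1.0, cosine(z, code)))  # clamp for acos
        theta = math.acos(cos_theta)
        if k == target:
            # Penalize the assigned code by widening its angle, so the latent
            # must sit closer to it than to the other codes by the margin.
            theta = min(math.pi, theta + margin)
        logits.append(scale * math.cos(theta))
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[target]
```

With the margin enabled, the same latent incurs a larger loss than without it, which is the pressure that pushes latents toward greater angular separation.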

2506.22726 2026-05-28 cs.CV cs.LG 版本更新

XTransfer: Modality-Agnostic Few-Shot Model Transfer for Human Sensing at the Edge

XTransfer: 面向边缘人体感知的模态无关小样本模型迁移

Yu Zhang, Xi Zhang, Hualin Zhou, Xinyuan Chen, Shang Gao, Hong Jia, Jianfei Yang, Yuankai Qi, Tao Gu

发表机构 * Macquarie University, Sydney, NSW, Australia(麦考瑞大学,悉尼,新南威尔士州,澳大利亚) Nanyang Technological University, Singapore(南洋理工大学,新加坡) The University of Auckland, Auckland, New Zealand(奥克兰大学,奥克兰,新西兰)

AI总结 提出XTransfer方法,通过模型修复和层重组实现模态无关的小样本模型迁移,降低传感器数据收集、模型训练和边缘部署成本。

Comments Accepted at ICML 2026

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, 6-11 July 2026
AI中文摘要

边缘系统上用于人体感知的深度学习具有巨大的智能应用潜力。然而,其训练和开发受到传感器数据有限和边缘系统资源约束的限制。虽然将预训练模型迁移到不同的感知应用很有前景,但现有方法通常需要大量的传感器数据和计算资源,导致成本高且可迁移性有限。在本文中,我们提出了XTransfer,这是一种首创的方法,实现了模态无关、小样本模型迁移,并具有资源高效的设计。XTransfer通过以下方式灵活地使用预训练模型并在不同模态间迁移知识:(i) 模型修复,通过仅使用少量传感器数据适配预训练层来安全地缓解模态偏移;(ii) 层重组,以逐层方式高效地搜索和重组源模型中的感兴趣层以重构模型。我们在跨不同模态的多种人体感知数据集上对各种基线进行了基准测试。结果表明,XTransfer实现了最先进的性能,同时显著降低了传感器数据收集、模型训练和边缘部署的成本。

英文摘要

Deep learning for human sensing on edge systems presents significant potential for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. While transferring pre-trained models to different sensing applications is promising, existing methods often require extensive sensor data and computational resources, resulting in high costs and limited transferability. In this paper, we propose XTransfer, a first-of-its-kind method enabling modality-agnostic, few-shot model transfer with resource-efficient design. XTransfer flexibly uses pre-trained models and transfers knowledge across different modalities by (i) model repairing that safely mitigates modality shift by adapting pre-trained layers with only a small amount of sensor data, and (ii) layer recombining that efficiently searches and recombines layers of interest from source models in a layer-wise manner to restructure models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. The results show that XTransfer achieves state-of-the-art performance while significantly reducing the costs of sensor data collection, model training, and edge deployment.

2605.12929 2026-05-28 cs.CV cs.AI 版本更新

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

Anatomy-Slot: 用于视网膜诊断中同源双侧推理的无监督解剖分解

Yingzhe Ma, Xiao Yang, Yuguo Yin, Zheyu Wang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Peking University(北京大学)

AI总结 提出Anatomy-Slot方法,通过无监督解剖瓶颈将图像块token分解为结构一致的解剖区域槽,并利用双向交叉注意力对齐双眼槽,在ODIR-5K上相比ViT-L基线提升AUC 4.2点,验证了显式结构对应改善诊断的假设。

Comments 15 pages, 3 figures

详情
AI中文摘要

视网膜诊断本质上是双侧的:临床医生比较双眼的同源结构(例如,视盘不对称),然而大多数深度模型基于单眼表示。我们研究显式结构对应是否改善诊断,并提出Anatomy-Slot来操作化这一假设。Anatomy-Slot通过将图像块token分解为一组涌现的、结构一致的槽(对应于解剖区域)来引入无监督解剖瓶颈,然后通过双向交叉注意力对齐双眼的槽。在ODIR-5K上使用$n=10$个种子,该方法相比匹配的ViT-L基线在AUC上提升$4.2$个点(95%置信区间;Wilcoxon符号秩检验,$W=0$,$p=0.002$)。配对破坏和高斯噪声下的压力测试提供了对应依赖性和鲁棒性的受控测试。我们进一步在REFUGE上报告了定量视盘定位和交叉注意力定位分析。除了报告的性能提升外,这些结果表明,以对象为中心的解剖对应为与临床双侧比较一致的可解释诊断系统提供了一条原则性路径。

英文摘要

Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into a set of emergent, structurally-coherent slots that correspond to anatomical regions, then aligning these slots across eyes via bidirectional cross-attention. On ODIR-5K with $n=10$ seeds, the method improves AUC by $4.2$ points over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, $W=0$, $p=0.002$). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis. Beyond the reported gains, these results indicate that object-centric anatomical correspondence offers a principled path toward interpretable diagnostic systems aligned with clinical bilateral comparison.
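
The reported statistics ($n=10$, $W=0$, $p=0.002$) can be sanity-checked with a short calculation. The sketch below assumes an exact two-sided Wilcoxon signed-rank test with no ties and no zero differences, which the abstract does not state explicitly; the function name `wilcoxon_exact_p` is introduced here for illustration.

```python
def wilcoxon_exact_p(n, w):
    """Exact two-sided p-value for the Wilcoxon signed-rank statistic
    w = min(W+, W-), assuming n paired differences with no ties or zeros."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of sign assignments whose positive ranks sum to s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    total = 2 ** n  # every sign pattern is equally likely under the null
    lower_tail = sum(counts[: w + 1]) / total
    return min(1.0, 2 * lower_tail)


p = wilcoxon_exact_p(10, 0)
print(round(p, 3))  # 0.002
```

Under these assumptions, $W=0$ means all ten seeds favored the same method, and the smallest attainable two-sided p-value for ten pairs is $2/2^{10} \approx 0.00195$, which matches the reported $p=0.002$ after rounding.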

2605.11755 2026-05-28 cs.LG cs.CV stat.ML 版本更新

One-Step Generative Modeling via Wasserstein Gradient Flows

通过Wasserstein梯度流的一步生成建模

Jiaqi Han, Puheng Li, Qiushan Guo, Renyuan Xu, Stefano Ermon, Emmanuel J. Candès

发表机构 * Stanford University(斯坦福大学) ByteDance(字节跳动)

AI总结 提出W-Flow框架,通过Wasserstein梯度流将参考分布到目标分布的演化压缩为一步生成,结合Sinkhorn散度实现高效最优传输,在ImageNet 256×256上达到1.29 FID且采样速度提升约100倍。

Comments 40 pages, 14 figures

详情
AI中文摘要

扩散模型和基于流的方法展现了令人印象深刻的生成能力,尤其对于图像,但其采样成本高昂,因为需要多次迭代更新。我们引入了W-Flow,一个训练生成器的框架,该生成器在单步中将来自简单参考分布的样本转换为来自目标数据分布的样本。这通过两步实现:首先,通过最小化能量泛函的Wasserstein梯度流,定义从参考分布到目标分布的演化;其次,训练一个静态神经生成器将此演化压缩为一步生成。我们用Sinkhorn散度实例化能量泛函,该散度产生一种高效的基于最优传输的更新规则,捕获全局分布差异并改善目标分布的覆盖。我们进一步证明了在适当假设下,有限样本训练动力学收敛到连续时间分布动力学。实验上,W-Flow为一步ImageNet 256×256生成设立了新的最先进水平,实现了1.29 FID,并改善了模式覆盖和域迁移。与具有相似FID分数的多步扩散模型相比,我们的方法实现了约100倍的采样加速。这些结果表明,Wasserstein梯度流为快速且高保真的生成建模提供了原则性和有效的基础。

英文摘要

Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256$\times$256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100$\times$ faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
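
For readers unfamiliar with the Sinkhorn divergence named above, the following sketch computes it for two small 1-D point clouds with uniform weights and squared-distance cost, using the standard debiased definition $S_\varepsilon(a,b) = \mathrm{OT}_\varepsilon(a,b) - \tfrac12\mathrm{OT}_\varepsilon(a,a) - \tfrac12\mathrm{OT}_\varepsilon(b,b)$. It does not reproduce W-Flow's gradient-flow update or training procedure; the function names, `eps`, and the iteration count are assumptions chosen for the example.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def entropic_ot(x, y, eps=1.0, iters=200):
    """Entropic OT value between uniform empirical measures on 1-D points,
    computed from dual potentials with log-domain Sinkhorn updates."""
    n, m = len(x), len(y)
    log_a, log_b = -math.log(n), -math.log(m)
    cost = [[(xi - yj) ** 2 for yj in y] for xi in x]
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(iters):
        f = [
            -eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(m)])
            for i in range(n)
        ]
        g = [
            -eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(n)])
            for j in range(m)
        ]
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(x, y, eps=1.0, iters=200):
    """Debiased Sinkhorn divergence between two 1-D point clouds."""
    return (
        entropic_ot(x, y, eps, iters)
        - 0.5 * entropic_ot(x, x, eps, iters)
        - 0.5 * entropic_ot(y, y, eps, iters)
    )
```

The two self-terms remove the entropic bias, so the divergence is zero for identical clouds. For a pure translation under squared cost it reduces to the squared shift: `sinkhorn_divergence([0.0, 1.0], [3.0, 4.0])` is approximately 9.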

2605.10583 2026-05-28 cs.CV 版本更新

FrequencyCT: Frequency Domain Self-supervised Low-dose CT Denoising

FrequencyCT:频域自监督低剂量CT去噪

Guoquan Wei, Liu Shi, Chong Chen, Qiegen Liu

发表机构 * School of Information Engineering, Nanchang University(南昌大学信息工程学院) SKLMS, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院SKLMS、ICMSEC)

AI总结 提出FrequencyCT,一种在频域中利用噪声与真实信号分布差异生成伪样本的零样本自监督方法,用于低剂量CT去噪,并通过数据截断稳定优化,实验验证了其临床潜力。

详情
AI中文摘要

尽管对计算机断层扫描(CT)去噪进行了广泛研究,但很少有研究利用投影域数据特性来减轻噪声相关性。为填补这一空白,本文提出FrequencyCT,这是第一种在频域中为低剂量CT去噪生成伪样本的零样本自监督方法。具体而言,通过利用噪声和真实信号在频域分布上的差异,提出了一种区域低频锚定技术。对高频区域应用相位保持噪声和掩膜扰动,生成用于自监督的伪样本。基于含噪投影的噪声方差与底层真实信号之间的指数相关性,对生成的样本进行一致的数据截断,以稳定优化梯度。在多个公开和真实数据集上的评估结果证实了本研究的临床应用潜力,为去噪领域提供了创新视角。代码可在 https://github.com/yqx7150/FrequencyCT 获取。

英文摘要

Despite extensive research on computed tomography (CT) denoising, few studies exploit projection-domain data characteristics to mitigate noise correlation. To bridge this gap, this work proposes FrequencyCT, the first zero-shot self-supervised method for pseudo-sample generation in the frequency domain for low-dose CT denoising. Specifically, by exploiting the distinct frequency-domain distributions of noise and true signal, a regional low-frequency anchoring technique is proposed. Applying phase-preserving noise and mask perturbations to the high-frequency region generates pseudo-samples for self-supervision. Driven by the exponential correlation between noise variance of noisy projections and the underlying true signal, consistent data truncation is applied to the generated samples to stabilize optimization gradients. Evaluation results on multiple public and real datasets confirm the clinical application potential of this research, which provides an innovative perspective for the field of denoising. The code is available at: https://github.com/yqx7150/FrequencyCT.

2605.10581 2026-05-28 cs.CV 版本更新

Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention

Polygon-mamba: 使用多边形扫描Mamba和空间-频率协同注意力进行视网膜血管分割

Yuanyuan Peng, Wen Li

发表机构 * School of Electrical and Automation Engineering, East China Jiaotong University(华东交通大学电气与自动化工程学院)

AI总结 针对视网膜小血管分割难题,提出一种混合CNN-Mamba融合网络,通过多边形扫描Mamba和空间-频率协同注意力机制,有效保留像素连通性并增强关键特征,在三个公开数据集上取得优异性能。

详情
AI中文摘要

视网膜血管分割对于眼部疾病的诊断和评估至关重要。值得注意的是,小血管的分割一直被认为是一项具有挑战性和复杂性的任务。为了应对这一挑战,我们设计了一种混合CNN-Mamba融合网络,该网络集成了多边形扫描Mamba和空间-频率协同注意力机制,用于检测小血管。考虑到传统的水平-垂直扫描Mamba架构可能会破坏目标结构的拓扑完整性,并导致视网膜小血管的局部不连续性,我们提出了一种多边形扫描视觉状态空间模型(PS-VSS),通过多层反向扫描方式识别小血管结构特征。该方法有效保留了像素连通性,从而显著减轻了小血管信息的丢失。此外,众所周知,空间域优先考虑位置和结构信息,而频率域强调全局感知和局部细节成分,我们在跳跃连接中引入了空间-频率协同注意力机制(SFCAM),以从空间域和频率域提取高效特征。该策略使模型能够动态增强关键特征,同时有效抑制杂乱信息。为了评估模型的有效性,我们在三个公开数据集:DRIVE、STARE和CHASE_DB1上进行了测试。与手动标注相比,我们的模型在三个数据集上的F1分数分别为0.8283、0.8282和0.8251,曲线下面积(AUC)值分别为0.9806、0.9840和0.9866,灵敏度(SE)值分别为0.8268、0.8314和0.8484。通过视觉检查和定量分析验证了模型的有效性。

英文摘要

Retinal vessel segmentation is crucial for the diagnosis and assessment of ocular diseases. Segmenting small retinal vessels in particular has long been recognized as a challenging task. To tackle this challenge, we design a hybrid CNN-Mamba fusion network that integrates a polygon scanning Mamba and a space-frequency collaborative attention mechanism for the detection of small vessels. Because the traditional Mamba architecture with horizontal-vertical scanning may compromise the topological integrity of target structures and cause local discontinuities in small retinal vessels, we present a polygon scanning visual state space model (PS-VSS) that identifies small-vessel structural features through multi-layer reverse scanning. This scanning scheme effectively preserves pixel connectivity, thereby substantially mitigating the loss of information pertaining to small vessels. Furthermore, since the spatial domain prioritizes positional and structural information while the frequency domain emphasizes global perception and local detail components, a space-frequency collaborative attention mechanism (SFCAM) is introduced within the skip connection to extract efficient features from the spatial and frequency domains. This strategy empowers the model to dynamically enhance key features while effectively suppressing clutter. To assess the efficacy of our model, we tested it on three publicly available datasets: DRIVE, STARE, and CHASE_DB1. Compared to manual annotations, our model achieved F1 scores of 0.8283, 0.8282, and 0.8251, Area Under Curve (AUC) values of 0.9806, 0.9840, and 0.9866, and Sensitivity (SE) values of 0.8268, 0.8314, and 0.8484 on the three datasets, respectively. The effectiveness of our model was validated through both visual inspection and quantitative analysis.

2508.11011 2026-05-28 cs.CV 版本更新

Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

大型预训练视觉语言模型能否成为有效的施工安全检查员?

Xuezheng Chen, Zhengbo Zou

发表机构 * Mechanical Engineering(机械工程)

AI总结 本文提出ConstructionSite 10k数据集,包含1万张施工图像及三项任务标注,评估大型预训练视觉语言模型在零样本和小样本下的泛化能力,为施工安全检查提供基准。

详情
AI中文摘要

施工安全检查通常涉及人类检查员在现场识别安全问题。随着强大的视觉语言模型(VLM)的兴起,研究人员正在探索将其用于从现场图像中检测安全违规等任务。然而,目前缺乏公开数据集来全面评估和进一步微调VLM在施工安全检查中的应用。当前VLM的应用使用小型监督数据集,限制了它们在未直接训练的任务中的适用性。在本文中,我们提出了ConstructionSite 10k数据集,包含10,000张施工场地图像,并为三个相互关联的任务提供标注,包括图像描述、安全违规视觉问答(VQA)和施工元素视觉定位。随后我们对当前最先进的大型预训练VLM的评估显示,它们在零样本和小样本设置下具有显著的泛化能力,但需要额外训练才能应用于实际施工场地。该数据集允许研究人员使用新的架构和技术训练和评估自己的VLM,为施工安全检查提供了有价值的基准。

英文摘要

Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose ConstructionSite 10k, a dataset featuring 10,000 construction site images with annotations for three inter-connected tasks: image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.

2508.05417 2026-05-28 cs.CV 版本更新

Smoothing Slot Attention Iterations and Recurrences

平滑槽注意力迭代与循环

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

发表机构 * Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland(电气工程与自动化系,阿尔托大学,埃斯波,芬兰) Department of Computer Science, Aalto University, Espoo, Finland(计算机科学系,阿尔托大学,埃斯波,芬兰) Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland(机器视觉与信号分析中心,奥卢大学,奥卢,芬兰)

AI总结 针对槽注意力在图像首帧冷启动查询缺乏样本特异性及视频帧间聚合变换同质化的问题,提出SmoothSA方法,通过预热冷启动查询和差异化迭代次数来平滑迭代与循环,提升目标发现、识别与推理性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

槽注意力(SA)是主流面向对象学习(OCL)的核心。图像特征可以通过SA迭代地细化冷启动查询槽来聚合成对象级表示。对于视频,这种聚合通过SA在帧间共享的循环进行,查询在第一帧冷启动,之后从上一帧的槽过渡。然而,冷启动查询缺乏样本特异性,从而阻碍了图像或视频第一帧的精确聚合;非第一帧的查询已经具有样本特异性,因此需要与第一帧不同的聚合变换。我们通过SmoothSA解决这些问题:(1)为了平滑图像或视频第一帧上的SA迭代,我们通过OCL内部自蒸馏的微型模块预热冷启动查询,使其具有丰富的输入特征信息;(2)为了平滑视频第一帧和非第一帧之间的SA循环,我们分别使用完整迭代和单次迭代来区分同质的聚合变换。在目标发现、识别和视觉推理上的综合实验验证了我们方法的有效性。进一步的视觉分析阐明了其潜在机制。我们的源代码、模型检查点和训练日志可在https://github.com/Genera1Z/SmoothSA获取。

英文摘要

Slot Attention (SA) lies at the heart of mainstream Object-Centric Learning (OCL). Image features can be aggregated into object-level representations by SA iteratively refining cold-start query slots. For video, such aggregation proceeds by SA recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots thereafter. However, cold-start queries lack sample-specific cues, which hinders precise aggregation on an image or a video's first frame; non-first frames' queries are already sample-specific and thus require aggregation transforms different from those of the first frame. We address these issues with SmoothSA: (1) To smooth SA iterations on an image or a video's first frame, we preheat cold-start queries with rich input-feature information, using a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across a video's first and non-first frames, we differentiate the homogeneous aggregation transforms by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and visual reasoning validate our method's effectiveness. Further visual analyses illuminate the underlying mechanisms. Our source code, model checkpoints and training logs are provided at https://github.com/Genera1Z/SmoothSA.

2604.25491 2026-05-28 cs.CV cs.AI 版本更新

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

水印移除的取证代价:从专用攻击到图像编辑

Gautier Evennou, Ewa Kijak

发表机构 * IMATAG(IMATAG公司) IRISA, Univ. Rennes, INRIA, CNRS(IRISA、雷恩大学、INRIA、CNRS)

AI总结 本文提出水印移除检测(WRD)作为新评估维度,通过训练分类器检测移除痕迹,在$10^{-3}$假阳性率下实现最优检测,证明取证隐蔽性是水印移除的必要条件。

Comments v1: The Forensic Cost of Watermark Removal, accepted at IH&MMSEC 2026, Special Session "Watermarking Across the Lifecycle of Generative Models". v2: extended version, under review

详情
AI中文摘要

当前水印移除方法在两个轴上进行评估:攻击成功率和感知质量。我们证明这是不够的。虽然最先进的攻击成功地在没有可见失真的情况下降低了水印信号,但它们留下了明显的统计伪影,暴露了移除尝试。我们将这个被忽视的轴命名为水印移除检测(WRD),并证明基于这些伪影训练的现代分类器在$10^{-3}$假阳性率下,对每种测试的移除方法都达到了最先进的检测率。没有现有的攻击考虑到这种取证泄漏。我们在扩展的评估三元组(攻击成功率、感知质量和取证可检测性)下,对领先的水印方案与标准移除流水线进行了基准测试,发现当前没有方法能平衡所有三个。我们的结果确立了取证隐蔽性作为水印移除的必要要求。

英文摘要

Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.
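
Reporting a detection rate "at $10^{-3}$ FPR" means fixing the decision threshold on untouched images first, then measuring how many removal attempts are caught. A minimal sketch of that bookkeeping follows; the function name `tpr_at_fpr` and the convention that higher scores mean "removal detected" are assumptions for illustration, not details taken from the paper.

```python
def tpr_at_fpr(neg_scores, pos_scores, target_fpr=1e-3):
    """Pick the threshold from negative (untouched) scores so that the false
    positive rate does not exceed target_fpr, then report the true positive
    rate on positive (watermark-removed) scores.

    A sample is flagged when its score is strictly above the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of negatives we can afford to flag at this operating point.
    allowed = int(target_fpr * len(neg_sorted))
    if allowed < len(neg_sorted):
        threshold = neg_sorted[allowed]
    else:
        threshold = float("-inf")
    fpr = sum(s > threshold for s in neg_scores) / len(neg_scores)
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    return threshold, fpr, tpr
```

Note that a target of $10^{-3}$ is only meaningful with at least a thousand negative samples; with fewer, `allowed` is zero and the threshold is simply the largest negative score.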

2604.23282 2026-05-28 cs.CV cs.MM 版本更新

Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

弥合姿态-语义鸿沟:基于文本的人物异常搜索的级联框架

Zequn Xie, Guijin Luo, Chuxin Wang, Sihang Cai, Tao Jin, Zhou Zhao, Yixuan Tang

发表机构 * Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学)

AI总结 提出结构-语义解耦级联(SSDC)框架,通过两阶段检索(结构感知粗检索和多智能体语义验证)平衡效率与语义推理,在PAB基准上达到最优性能。

Comments Accepted to ACL 2026. 10 pages, 5 figures

详情
AI中文摘要

基于文本的人物异常搜索利用自然语言查询从监控档案中检索特定行为事件。尽管最近的姿态感知方法能够很好地对齐几何结构,但它们面临一个根本性的姿态-语义鸿沟:语义不同的动作可能共享相似的骨骼几何结构。虽然多模态大语言模型(MLLMs)可以减少这种歧义,但将其用于大规模检索在计算上代价高昂。我们提出了结构-语义解耦级联(SSDC)框架,将检索解耦为两个阶段:(1)结构感知粗检索,其中轻量级模型通过骨骼相似性快速筛选候选;(2)侦探小组交互,一个多智能体语义验证模块。该小组包括一个用于快速二元过滤的侦探、一个用于证据提取的分析师和一个用于语义合成的写手。最后,通过将合成描述与结构先验融合,对候选进行重新排序。在PAB基准上的实验表明,SSDC通过平衡效率和语义推理实现了最先进的性能。

英文摘要

Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.

2601.11632 2026-05-28 cs.CV 版本更新

KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering

KG-ViP:在多模态大语言模型中桥接知识基础与视觉感知以进行视觉问答

Zhiyang Li, Ao Ke, Yukun Cao, Xike Xie

发表机构 * University of Science and Technology of China(中国科学技术大学) Data Darkness Lab, MIRACLE Center, USTC(数据黑暗实验室,MIRACLE中心,中国科学技术大学) School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院)

AI总结 提出KG-ViP框架,通过检索与融合场景图和常识图,统一外部知识与细粒度视觉细节,缓解多模态大语言模型在视觉问答中的知识幻觉和视觉感知不足问题。

详情
AI中文摘要

用于视觉问答(VQA)的多模态大语言模型(MLLMs)通常面临双重限制:知识幻觉和细粒度视觉感知不足。关键的是,我们发现常识图和场景图通过提供丰富的外部知识和捕捉细粒度视觉细节,恰好为这些缺陷提供了互补的解决方案。然而,先前的工作通常孤立地处理它们,忽视了它们的协同潜力。为了弥合这一差距,我们提出了KG-ViP,一个统一的框架,通过融合场景图和常识图来增强MLLMs。KG-ViP框架的核心是一个新颖的检索与融合流程,利用查询作为语义桥逐步整合两种图,合成统一的结构化上下文,促进可靠的多模态推理。在FVQA 2.0+和MVQA基准上的大量实验表明,KG-ViP显著优于现有的VQA方法。

英文摘要

Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.

2604.17110 2026-05-28 cs.CV 版本更新

From Clinical Intent to Clinical Model: Autonomous Coding-Agents for Clinician-driven AI Development

从临床意图到临床模型:面向临床医生驱动AI开发的自主编码代理

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Mathis Bode, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

AI总结 提出一种自主编码代理系统,允许临床医生用自然语言描述任务,系统自动生成并迭代优化模型,在五项临床任务中达到竞争性能,并显著减少胸部X光片模型对胸腔引流管的依赖。

Comments Code is available at https://github.com/zhaozh10/clinical-automata/

详情
AI中文摘要

开发在临床实践中有用的AI模型需要临床医生和AI开发者之间的高效协作。这带来了一个实际挑战:临床医生必须反复与AI开发者沟通并完善其需求,然后这些需求才能转化为可执行的模型开发。这种迭代过程耗时,即使经过反复讨论,由于双方未能完全共享专业知识,仍可能存在不一致。编码代理可能有助于弥合这一差距。它们可以自主编写和优化代码,并具备医学和AI的工作知识,以理解医学专家和开发者制定的命令。我们提出了一个原型,让临床医生直接驱动AI开发。临床医生用自然语言描述任务,系统将描述转化为可工作的流程,通过与临床医生一起反复实验进行优化,并返回满足既定临床目标的模型。在五项临床任务中,该系统可靠地生成了与临床医生请求匹配且达到竞争性能的模型。最值得注意的是,在胸部X光片上,该系统显著减少了模型对胸腔引流管的依赖(气胸分类的已知捷径),在一个数据集上从60%降至31%,在另一个数据集上从50%降至18%。我们的结果表明,编码代理可以将临床AI开发转向更以临床医生驱动的模式,使领域专家能够直接塑造模型,而不是通过专门的AI团队传递需求。

英文摘要

Developing AI models that are useful in clinical practice requires efficient collaboration between clinicians and AI developers. This poses a practical challenge: clinicians must repeatedly communicate and refine their requirements with AI developers before those requirements can be translated into executable model development. This iterative process is time-consuming, and even after repeated discussion, misalignment may still exist because the two sides do not fully share each other's expertise. Coding agents may help close this gap. They can write and refine code on their own, and they carry working knowledge of both medicine and AI to understand commands formulated by both medical experts and developers. We present a prototype that lets clinicians drive AI development directly. A clinician describes the task in plain language, and the system turns the description into a working pipeline, refines it through repeated experiments together with the clinician, and returns a model that meets the stated clinical objective. Across five clinical tasks, the system reliably produced models that matched the clinician's request and reached competitive performance. Most notably, on chest radiographs the system sharply reduced the model's reliance on chest drains, a well-known shortcut for pneumothorax classification, from 60% to 31% on one dataset and from 50% to 18% on another. Our results suggest that coding agents can shift clinical AI development toward a more clinician-driven mode, allowing domain experts to shape models directly instead of relaying requirements through specialized AI teams.

2506.01247 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

超越可解释性:稀疏自编码器何时、为何以及如何实现无标签视觉引导

Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

发表机构 * Department of Computer Science, Rutgers University(罗格斯大学计算机科学系) Department of Statistics, Rutgers University(罗格斯大学统计系)

AI总结 本文提出无标签视觉稀疏引导方法VS2,通过训练稀疏自编码器并利用其重构误差和稀疏特征放大来引导冻结的视觉语言模型,在九个图像分类数据集上提升零样本准确率。

详情
AI中文摘要

稀疏自编码器(SAE)越来越多地被用于解释基础模型,但它们作为可操作干预空间的作用仍不太被理解,尤其是在视觉领域。我们研究稀疏视觉特征是否不仅可用于事后分析,还可用于引导冻结的视觉语言模型。我们引入视觉稀疏引导(VS2),一种无标签方法,它在冻结的CLIP图像编码器的无标签激活上训练一个top-$k$ SAE,并在测试时通过放大输入的活跃稀疏特征并解码诱导的变化来构建一个可解释的引导向量。我们证明该过程可分解为质心偏差引导:每个输入沿着其与SAE学习到的质心的偏差移动。残差项由SAE的每样本重构误差(通过FVU测量)精确控制,从而产生基于FVU的残差界限,并促使在SAE重构不可靠时回退到零样本CLIP的可靠性门控。通过使用在无标签CLIP图像编码器激活上训练的目标域SAE,VS2在九个图像分类数据集上提高了零样本准确率,在推理计算量增加不到0.1%的情况下实现了高达+4.12%的提升。最后,一项受控的上界研究VS2++表明,选择性放大稀疏特征可带来高达+21.44%的提升,揭示了一个重构与任务显著性的差距:对重构显著的稀疏特征不一定与对下游预测有用的特征一致。

英文摘要

Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains less understood, especially in vision. We study whether sparse visual features can be used not only for post-hoc analysis, but also to steer frozen vision-language models. We introduce Visual Sparse Steering (VS2), a label-free method that trains a top-$k$ SAE on unlabeled activations from a frozen CLIP image encoder and, at test time, constructs an interpretable steering vector by amplifying the input's active sparse features and decoding the induced change. We show that this procedure admits a closed-form decomposition as centroid-deviation steering: each input is moved along its deviation from the SAE-learned centroid. The residual term is controlled exactly by the SAE's per-sample reconstruction error, measured by FVU, yielding an FVU-based residual bound and motivating a reliability gate that falls back to zero-shot CLIP when SAE reconstruction is unreliable. With target-domain SAEs trained on unlabeled CLIP image-encoder activations, VS2 improves zero-shot accuracy across nine image-classification datasets, achieving gains up to $+4.12\%$ with less than $0.1\%$ additional inference compute. Finally, a controlled upper-bound study, VS2++, shows that selective amplification of sparse features can yield gains up to $+21.44\%$, exposing a reconstruction-vs-task saliency gap: features salient for reconstruction need not align with features useful for downstream prediction.
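
The steering recipe and reliability gate described above can be sketched on toy vectors. Everything concrete here is an assumption made for illustration: a bias-free linear encoder and decoder, the amplification factor `gamma`, the gate threshold `fvu_max`, and a per-sample FVU defined as squared reconstruction error over squared deviation from a dataset mean. The paper's exact formulation may differ.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_k(values, k):
    """Keep the k largest positive activations and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep and v > 0 else 0.0 for i, v in enumerate(values)]


def steer(x, w_enc, w_dec, mean, k=1, gamma=2.0, fvu_max=0.5):
    """Return (steered embedding, per-sample FVU).

    Hypothetical sketch: amplify the active sparse features by gamma, decode
    the induced change, and add it to x. If the SAE reconstructs x poorly
    (FVU above fvu_max), fall back to the unmodified embedding.
    """
    z = top_k(matvec(w_enc, x), k)
    x_hat = matvec(w_dec, z)
    err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    var = sum((a - m) ** 2 for a, m in zip(x, mean))
    fvu = err / var if var > 0 else float("inf")
    if fvu > fvu_max:
        return list(x), fvu  # reliability gate: keep the zero-shot embedding
    amplified = matvec(w_dec, [gamma * v for v in z])
    delta = [a - b for a, b in zip(amplified, x_hat)]
    return [xi + d for xi, d in zip(x, delta)], fvu
```

With a linear decoder the induced change is just `(gamma - 1)` times the decoded active features, so the steering vector stays interpretable as a weighted sum of the dictionary directions that fired for this input.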

2604.09367 2026-05-28 cs.CV 版本更新

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

EpiAgent: 一种以智能体为中心的古铭文修复系统

Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang, Hui Xue

发表机构 * School of Computer Science and Engineering, Southeast University, China(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其跨学科应用关键实验室(东南大学),教育部,中国) Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China(计算机网络与信息集成关键实验室(东南大学),教育部,中国) Nanjing University Museum, Nanjing University, China(南京大学博物馆,南京大学,中国) The China Centre for Linguistic and Strategic Studies, Nanjing University, China(中国语言战略研究中心,南京大学,中国)

AI总结 提出基于智能体的EpiAgent系统,通过分层规划与LLM协调多模态分析、历史经验和专用工具,实现灵活自适应的古铭文修复,在真实退化铭文上取得更优修复质量和泛化能力。

Comments Accepted by CVPR 2026

详情
AI中文摘要

古铭文作为文化记忆的载体,历经数世纪的环境和人为退化。恢复其交织的视觉和文本完整性是数字遗产保护中最具挑战性的任务之一。然而,现有基于AI的方法通常依赖刚性流水线,难以泛化到如此复杂和异质的真实退化场景。受人类金石学家技能协调工作流程的启发,我们提出EpiAgent,一个以智能体为中心的系统,将铭文修复形式化为分层规划问题。遵循观察-构思-执行-重新评估范式,基于LLM的中央规划器协调多模态分析、历史经验、专用修复工具和迭代自我精炼之间的协作。这种以智能体为中心的协调使得修复过程比传统的单次通过方法更加灵活和自适应。在真实退化的铭文上,EpiAgent相比现有方法实现了更优的修复质量和更强的泛化能力。我们的工作标志着向专家级智能体驱动的文化遗产修复迈出了重要一步。代码可在 https://github.com/blackprotoss/EpiAgent 获取。

英文摘要

Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.

2604.05378 2026-05-28 cs.CL cs.CV 版本更新

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

ICR-Drive:面向端到端语言驱动自动驾驶的指令反事实鲁棒性

Kaiser Hamid, Can Cui, Nade Liang

发表机构 * Texas Tech University(德克萨斯科技大学) Bosch Center for Artificial Intelligence (BCAI)(博世人工智能中心(BCAI))

AI总结 提出ICR-Drive框架,通过生成四类扰动指令(改写、歧义、噪声、误导)并基于CARLA仿真评估,揭示语言条件驾驶模型对指令变化的脆弱性。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 872-880
AI中文摘要

视觉-语言-动作(VLA)模型的最新进展使得语言条件驾驶代理能够在闭环仿真中执行自然语言导航命令,但标准评估大多假设指令精确且格式良好。在实际部署中,指令的措辞和具体性各不相同,可能省略关键限定词,偶尔还包含误导性的权威框架文本,导致指令级鲁棒性未被充分衡量。我们提出了ICR-Drive,一个用于端到端语言条件自动驾驶中指令反事实鲁棒性的诊断框架。ICR-Drive生成受控的指令变体,涵盖四类扰动:改写、歧义、噪声和误导,其中误导变体与导航目标冲突并试图覆盖意图。我们在匹配的仿真器配置和种子下重放相同的CARLA路线,以隔离由指令语言引起的性能变化。鲁棒性通过标准CARLA排行榜指标和相对于基线指令的每族性能下降来量化。在LMDrive和BEVDriver上的实验表明,微小的指令变化可能导致显著的性能下降和不同的故障模式,揭示了在安全关键驾驶中部署具身基础模型的可靠性差距。

英文摘要

Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.
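The per-family degradation measure can be illustrated with a short sketch. The abstract does not say whether degradation is reported as an absolute or a relative drop; the version below assumes a relative drop of the mean score within each perturbation family, and the function names and any scores passed in are hypothetical.

```python
def relative_drop(baseline, perturbed):
    """Fractional drop of a metric relative to the baseline instruction."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (baseline - perturbed) / baseline


def per_family_degradation(baseline_score, family_scores):
    """Map each perturbation family to the relative drop of its mean score.

    family_scores: dict such as {"Paraphrase": [...], "Misleading": [...]},
    where each list holds one score per instruction variant on the same
    routes and seeds as the baseline run.
    """
    return {
        family: relative_drop(baseline_score, sum(scores) / len(scores))
        for family, scores in family_scores.items()
    }
```

Replaying the same routes under matched seeds is what makes this comparison meaningful: the instruction text is then the only variable that differs between the baseline and each variant.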

2604.03799 2026-05-28 cs.CV 版本更新

Next-Scale Autoregressive Models for Text-to-Motion Generation

Next-Scale 自回归模型用于文本到运动生成

Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出 MoScale 框架,通过从粗到细的时间分辨率分层生成运动,结合跨尺度和尺度内细化,实现高效、可扩展的文本到运动生成。

Comments Accepted to CVPR 2026

详情
AI中文摘要

自回归(AR)模型提供稳定高效的训练,但标准的下一 token 预测与文本条件运动生成所需的时间结构不太一致。我们引入 MoScale,一个下一尺度 AR 框架,从粗到细的时间分辨率分层生成运动。通过在最粗尺度提供全局语义并逐步细化,MoScale 建立了一个更适合长程运动结构的因果层次。为了提高在有限文本-运动数据下的鲁棒性,我们进一步结合了跨尺度层次细化以改进每个尺度的初始预测,以及尺度内时间细化用于选择性双向重新预测。MoScale 在文本到运动任务上实现了最先进的性能,具有高训练效率,能有效随模型大小扩展,并零样本泛化到多种运动生成和编辑任务。

英文摘要

Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.

2604.00913 2026-05-28 cs.CV cs.CL 版本更新

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

跨描绘装配指令对齐的视觉-语言模型基准测试与机制分析

Zhuchenyang Liu, Yao Zhang, Yu Xiao

发表机构 * Aalto University(阿尔托大学)

AI总结 构建IKEA-Bench基准,评估19个视觉-语言模型在装配图与视频帧对齐任务上的表现,发现视觉编码是提升跨描绘鲁棒性的关键瓶颈。

详情
AI中文摘要

二维装配图通常是抽象的且难以遵循,因此需要智能助手来监控进度、检测错误并提供逐步指导。在混合现实环境中,此类系统必须从摄像头画面中识别已完成和正在进行的步骤,并将其与图示指令对齐。视觉语言模型(VLM)在此任务上展现出潜力,但由于装配图和视频帧共享的视觉特征极少,面临描绘鸿沟。为系统评估这一鸿沟,我们构建了IKEA-Bench基准,包含29个宜家家具产品的6种任务类型共1623个问题,并在三种对齐策略下评估了19个VLM(2B-38B)。主要发现:(1)装配指令理解可通过文本恢复,但文本同时降低了图到视频的对齐性能;(2)架构族比参数数量更能预测对齐精度;(3)视频理解是难以通过策略影响的硬瓶颈。三级机制分析进一步揭示,图和视频占据不相交的ViT子空间,且添加文本会使模型从视觉驱动转向文本驱动的推理。这些结果表明,视觉编码是提升跨描绘鲁棒性的主要目标。项目页面:https://ryenhails.github.io/IKEA-Bench/

英文摘要

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

2604.00402 2026-05-28 cs.CV cs.AI 版本更新

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

COTTA: 面向自动驾驶轨迹预测的上下文感知迁移适应

Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

发表机构 * Ewha Womans University(梨花女子大学) Seoul National University(首尔国立大学) Sangmyung University(祥明大学) NVIDIA

AI总结 本文研究将基于美国数据训练的轨迹预测模型QCNet迁移到韩国道路环境,通过对比四种训练策略,发现冻结编码器并微调解码器可在精度和效率间取得最佳平衡,预测误差降低66%以上。

Comments 4 pages, 2 figures. Accepted at ICEIC 2026

详情
AI中文摘要

开发鲁棒模型以准确预测周围代理的轨迹是自动驾驶安全的基础。然而,大多数公开数据集(如Waymo Open Motion Dataset和Argoverse)是在西方道路环境中收集的,并未反映其他地区(包括韩国)独特的交通模式、基础设施和驾驶行为。当在西方数据上训练的最先进模型部署到不同地理环境时,这种领域差异会导致性能下降。在本工作中,我们研究了查询中心轨迹预测(QCNet)从美国数据迁移到韩国道路环境时的适应性。使用韩国自动驾驶数据集,我们比较了四种训练策略:零样本迁移、从头训练、全微调和编码器冻结。实验结果表明,利用预训练知识显著提高了预测性能。具体而言,在冻结编码器的同时选择性微调解码器,在精度和训练效率之间取得了最佳平衡,与从头训练相比,预测误差降低了66%以上。本研究为在新地理领域部署轨迹预测模型提供了有效的迁移学习策略的实用见解。

英文摘要

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.

2512.11524 2026-05-28 cs.CV cs.LG 版本更新

Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using Airborne LiDAR HD Reference Data across Metropolitan France

利用法国本土机载LiDAR HD参考数据从Sentinel-2时间序列进行超分辨率冠层高度制图

Ekaterina Kalinicheva, Florian Helen, Stéphane Mermoz, Florian Mouret, Milena Planells

发表机构 * CESBIO GlobEO

AI总结 提出THREASURE-Net端到端框架,利用Sentinel-2时间序列和LiDAR HD数据生成2.5m、5m和10m分辨率的年度冠层高度图,无需预训练模型或高分辨率光学图像,在法国本土实现优于现有方法的精度。

详情
AI中文摘要

精细尺度的森林监测对于理解冠层结构及其动态至关重要,这些是碳储量、生物多样性和森林健康的关键指标。深度学习特别有效,因为它整合了共同反映冠层结构的光谱、时间和空间信号。为满足这一需求,我们提出了THREASURE-Net,一种新颖的端到端树高回归与超分辨率框架。该模型使用来自法国本土多个空间分辨率的LiDAR HD数据导出的参考高度指标,在Sentinel-2时间序列上训练,以生成年度高度图。我们评估了三种模型变体,分别产生2.5米、5米和10米分辨率的树高预测。THREASURE-Net不依赖任何预训练模型或参考甚高分辨率光学图像来训练其超分辨率模块;相反,它仅从LiDAR导出的高度信息中学习。我们的方法优于现有基于Sentinel数据的最先进方法,并与基于甚高分辨率图像的方法具有竞争力。它可以部署生成高精度年度冠层高度图,在2.5米、5米和10米分辨率下分别实现2.63米、2.70米和2.88米的平均绝对误差。这些结果凸显了THREASURE-Net仅使用免费卫星数据对温带森林进行可扩展且经济高效的结构监测的潜力。THREASURE-Net的源代码可在以下网址获取:https://github.com/Global-Earth-Observation/threasure-net。

英文摘要

Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.63 m, 2.70 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: https://github.com/Global-Earth-Observation/threasure-net.

2601.17354 2026-05-28 cs.CV cs.GR 版本更新

PocketGS: On-Device Training of 3D Gaussian Splatting for High Perceptual Modeling

PocketGS: 用于高感知建模的3D高斯泼溅设备端训练

Wenzhi Guo, Guangchi Fang, Shu Yang, Bing Wang

发表机构 * Hong Kong Polytechnic University(香港理工大学) Nanjing University(南京大学)

AI总结 提出PocketGS,通过三个协同设计的算子(G、I、T)在移动设备上实现3D高斯泼溅的高效训练,在严格资源约束下保持高保真重建。

详情
AI中文摘要

虽然3D高斯泼溅(3DGS)能够实现实时渲染,但其训练需要工作站级别的计算和内存,使得在分钟级时间预算和有限峰值内存下移动部署不切实际。我们提出了PocketGS,一种移动场景建模范式,能够在这些紧密耦合的约束下实现设备端3DGS训练,同时保持高保真重建。PocketGS通过三个协同设计的算子解决了训练效率、内存紧凑性和建模质量之间的基本矛盾:$\mathcal{G}$构建几何保真的点云先验;$\mathcal{I}$注入局部表面统计以播种各向异性高斯,从而减少早期条件差距;$\mathcal{T}$使用缓存的中间结果和索引映射梯度散射展开alpha合成,以实现稳定的移动反向传播。大量实验表明,PocketGS在移动预算下优于强大的主流工作站3DGS基线,提供高质量重建,并实现了完全设备端的实用捕获到渲染工作流。

英文摘要

While 3D Gaussian Splatting (3DGS) enables real-time rendering, its training demands workstation-level compute and memory, making mobile deployment impractical under minute-scale time budgets and limited peak memory. We present PocketGS, a mobile scene modeling paradigm that enables on-device 3DGS training under these tightly coupled constraints while preserving high-fidelity reconstruction. PocketGS resolves the fundamental tension between training efficiency, memory compactness, and modeling quality through three co-designed operators: $\mathcal{G}$ builds geometry-faithful point-cloud priors; $\mathcal{I}$ injects local surface statistics to seed anisotropic Gaussians, thereby reducing early conditioning gaps; and $\mathcal{T}$ unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable mobile backpropagation. Extensive experiments demonstrate that PocketGS outperforms the powerful mainstream workstation 3DGS baseline under mobile budgets, delivering high-quality reconstructions and enabling a fully on-device, practical capture-to-rendering workflow.
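As background for the compositing operator, the standard front-to-back alpha compositing used in 3DGS, together with a backward pass that reuses cached transmittance values, can be sketched as follows. This is the textbook formulation for a single pixel with scalar colors, not PocketGS's operator $\mathcal{T}$ itself; the function names are hypothetical and every alpha is assumed to be strictly less than 1.

```python
def composite_forward(colors, alphas):
    """Front-to-back compositing: C = sum_i c_i * a_i * T_i, T_i = prod_{j<i} (1 - a_j).

    Returns the pixel value and the cached per-splat transmittances T_i.
    """
    transmittance = 1.0
    cached_T = []
    pixel = 0.0
    for c, a in zip(colors, alphas):
        cached_T.append(transmittance)
        pixel += c * a * transmittance
        transmittance *= 1.0 - a
    return pixel, cached_T


def composite_backward_alpha(colors, alphas, cached_T):
    """Gradient dC/da_i computed back to front from the cached T_i.

    dC/da_i = c_i * T_i - (sum_{j>i} c_j * a_j * T_j) / (1 - a_i)
    """
    grads = [0.0] * len(alphas)
    behind = 0.0  # running sum of contributions from splats behind splat i
    for i in reversed(range(len(alphas))):
        grads[i] = colors[i] * cached_T[i] - behind / (1.0 - alphas[i])
        behind += colors[i] * alphas[i] * cached_T[i]
    return grads
```

Caching `T_i` in the forward pass avoids recomputing the running product during backpropagation, which is the kind of memory-for-compute trade-off the abstract alludes to with "cached intermediates".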

2512.12887 2026-05-28 cs.CV 版本更新

Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

重新审视用于可扩展3D医学图像分类的2D基础模型

Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner, Riqiang Gao, Jianing Wang, Gengyan Zhao, Eli Gibson, Dorin Comaniciu, Sasa Grbic

发表机构 * Digital Technology and Innovation, Siemens Healthineers, Princeton NJ, USA(西门子医疗数字技术与创新,普林斯顿新泽西州,美国) Digital Technology and Innovation, Siemens Healthineers, Erlangen, Germany(西门子医疗数字技术与创新,埃尔兰根,德国)

AI总结 本文针对当前3D医学图像分类基础模型的数据偏差、适应不足和任务覆盖不全问题,提出AnyMC3D框架,通过冻结2D基础模型并添加轻量插件实现高效多任务扩展,并在12项任务上达到领先性能。

Comments 1st Place in VLM3D Challenge

详情
AI中文摘要

3D医学图像分类对于现代临床工作流程至关重要。医学基础模型(FMs)已成为扩展到新任务的有前途的方法,然而当前研究存在三个关键缺陷:数据体制偏差、次优适应和任务覆盖不足。在本文中,我们解决了这些缺陷,并引入了AnyMC3D,一种从2D FMs改编的可扩展3D分类器。我们的方法通过在单个冻结骨干网络上添加轻量级插件(每个任务约1M参数),高效地扩展到新任务。这个通用框架还支持多视图输入、辅助像素级监督和可解释的热力图生成。我们建立了一个涵盖12个任务的综合基准,包括不同的病理、解剖和模态,并系统分析了最先进的3D分类技术。我们的分析揭示了关键见解:(1)有效适应对于释放FM潜力至关重要,(2)通用FMs在适当适应后可以匹敌医学专用FMs,(3)基于2D的方法在3D分类上优于3D架构。我们首次证明了使用单一可扩展框架(包括在VLM3D挑战中获得第一名)在不同应用中实现最先进性能的可行性,消除了对单独任务特定模型的需求。

英文摘要

3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.

2603.21165 2026-05-28 cs.CL cs.CV 版本更新

Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

多种方言,多种语言,一种文化视角:评估多语言视觉语言模型对孟加拉文化的理解,涵盖历史关联语言和地区方言

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda

发表机构 * United International University(联合国际大学) BRAC University(BRAC大学) University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)

AI总结 提出 BanglaVerse 基准,通过手工标注图像和扩展至多种语言及方言,评估多语言视觉语言模型在孟加拉文化理解中的表现,发现标准孟加拉语评估高估模型能力,方言变化导致性能下降,文化知识缺失是主要瓶颈。

Comments https://labib1610.github.io/BanglaVerse/

详情
AI中文摘要

孟加拉文化通过地区、方言、历史、食物、政治、媒体和日常视觉生活丰富地表达,但在多模态评估中仍然代表性不足。为了解决这一差距,我们引入了BanglaVerse,这是一个文化基础的基准,用于评估多语言视觉语言模型(VLM)对孟加拉文化的理解,涵盖历史关联语言和地区方言。该基准由9个领域的1152张手动策划图像构建,支持视觉问答和字幕生成,并扩展为四种语言和五种孟加拉方言,产生约32.2K个工件。我们的实验表明,仅评估标准孟加拉语会高估真实模型能力:在方言变化下性能下降,尤其是字幕生成,而历史关联语言如印地语和乌尔都语保留了一些文化意义,但在结构化推理方面仍然较弱。跨领域来看,主要瓶颈是缺失文化知识而非仅视觉基础,尤其是知识密集型类别。这些发现将BanglaVerse定位为在语言变化下衡量文化基础多模态理解的更现实测试平台。

英文摘要

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.2K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, especially in knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

2602.20497 2026-05-28 cs.CV cs.AI 版本更新

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

LESA: 可学习的阶段感知预测器用于扩散模型加速

Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对扩散模型计算开销大、现有缓存策略难以适应去噪过程阶段动态变化的问题,提出基于两阶段训练的可学习阶段感知预测器框架,利用KAN网络学习时序特征映射并采用多阶段多专家架构,在保持高质量生成的同时实现显著加速。

Comments Accepted to CVPR 2026

详情
AI中文摘要

扩散模型在图像和视频生成任务中取得了显著成功。然而,扩散Transformer(DiTs)的高计算需求对其实际部署构成了重大挑战。虽然特征缓存是一种有前景的加速策略,但现有基于简单重用或无训练预测的方法难以适应扩散过程中复杂的、阶段相关的动态变化,常常导致质量下降,并无法保持与标准去噪过程的一致性。为解决这一问题,我们提出了一种基于两阶段训练的可学习阶段感知(LESA)预测器框架。我们的方法利用Kolmogorov-Arnold网络(KAN)从数据中准确学习时序特征映射。我们进一步引入了一种多阶段、多专家架构,为不同噪声水平阶段分配专门的预测器,从而实现更精确和鲁棒的特征预测。大量实验表明,我们的方法在保持高保真生成的同时实现了显著加速。实验显示,在FLUX.1-dev上实现了5.00倍加速,质量下降极小(1.0%);在Qwen-Image上实现了6.25倍加速,质量比之前的最优方法(TaylorSeer)提升20.2%;在HunyuanVideo上实现了5.00倍加速,PSNR比TaylorSeer提升24.7%。在文本到图像和文本到视频合成任务上的最先进性能验证了我们基于训练框架在不同模型上的有效性和泛化能力。我们的代码可在https://github.com/caipeiliang2004/LESA获取。

英文摘要

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is available at https://github.com/caipeiliang2004/LESA.

2603.12824 2026-05-28 cs.IR cs.CV cs.LG 版本更新

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

NanoVDR:将20亿参数的视觉语言检索器蒸馏为7000万参数的纯文本编码器用于视觉文档检索

Zhuchenyang Liu, Yao Zhang, Yu Xiao

发表机构 * Aalto University(阿alto大学)

AI总结 利用查询-文档不对称性,通过蒸馏将20亿参数的视觉语言模型教师蒸馏为7000万参数的纯文本学生编码器,采用点态余弦对齐目标,实现视觉文档检索的高效推理。

详情
AI中文摘要

基于视觉语言模型(VLM)的检索器已将视觉文档检索(VDR)提升到令人印象深刻的水平。它们需要相同的数十亿参数编码器用于文档索引和查询编码,即使对于纯文本查询也会导致高延迟和GPU依赖。我们观察到这种设计是不必要对称的:文档在视觉上复杂且需要强大的视觉理解,而查询只是短文本字符串。NanoVDR利用这种查询-文档不对称性,解耦两个编码路径:冻结的20亿VLM教师离线索引文档,而蒸馏后的纯文本学生(小至6900万参数)在推理时编码查询。关键设计选择是蒸馏目标。通过对三个骨干网络和22个ViDoRe基准数据集的六个目标进行系统比较,我们发现查询文本上的点态余弦对齐始终优于基于排序和对比的替代方案,同时在训练期间仅需要预缓存的教师查询嵌入,无需处理文档。此外,我们识别出跨语言迁移是主要性能瓶颈,并通过使用机器翻译的查询增强训练数据廉价地解决它。最终的NanoVDR-S-Multi(DistilBERT,6900万)保留了教师质量的95.1%,在v2和v3上以32倍更少的参数和50倍更低的CPU查询延迟优于DSE-Qwen2(20亿),总训练成本低于13 GPU小时。

英文摘要

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with $32\times$ fewer parameters and $50\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.
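A pointwise cosine alignment objective of the kind named above can be sketched as follows. The exact loss used by NanoVDR is not given in the abstract; this version assumes the loss is one minus the cosine similarity between the student's and the frozen teacher's embedding of the same query, averaged over a batch, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pointwise_cosine_loss(student_embs, teacher_embs):
    """Mean of (1 - cos) over query pairs; 0 when every pair is perfectly aligned.

    student_embs: embeddings from the text-only student for a batch of queries.
    teacher_embs: pre-cached teacher embeddings of the same queries.
    """
    losses = [1.0 - cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)
```

Because the loss touches only query embeddings, the teacher's outputs can be cached once and no document needs to be processed during training, which matches the cost argument in the abstract.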

2603.08264 2026-05-28 cs.CV 版本更新

Event-based Motion & Appearance Fusion for 6D Object Pose Tracking

基于事件的运动与外观融合的6D物体姿态跟踪

Zhichao Li, Chiara Bartolozzi, Lorenzo Natale, Arren Glover

发表机构 * Event-driven Perception for Robotics, Istituto Italiano di Tecnologia, Italy(机器人事件驱动感知,意大利理工学院) Humanoid Sensing and Perception, Istituto Italiano di Tecnologia, Italy(人形机器人传感与感知,意大利理工学院) University of Genoa, Genoa, Italy(热那亚大学,意大利)

AI总结 提出一种结合事件相机高时间分辨率优势的无学习方法,通过事件光流传播姿态并利用模板匹配校正,在高速运动物体上达到或超越现有算法性能。

详情
AI中文摘要

物体姿态跟踪是机器人在家庭和工业环境中执行任务的基本且必要的任务。最常用的传感器是RGB-D相机,但在高动态环境中,由于运动模糊和帧率限制,它们可能达到极限。事件相机具有高时间分辨率和低延迟等显著特性,使其成为高速物体姿态跟踪的理想视觉传感器。尽管如此,目前仅有少数工作涉及事件相机的6D姿态跟踪。在这项工作中,我们利用高时间分辨率的优势,提出了一种结合传播步骤与姿态校正策略的方法。具体而言,我们使用从事件光流中获得的6D物体速度进行姿态传播,然后利用基于模板的局部姿态校正模块进行姿态校正。我们的无学习方法与最先进的算法性能相当,并且在某些情况下对快速移动物体的表现更优。结果表明,在深度网络方法受限于低更新速率的高动态场景中,事件相机具有应用潜力。

英文摘要

Object pose tracking is a fundamental capability for robots performing tasks in home and industrial settings. The most commonly used sensors are RGB-D cameras, which can hit limitations in highly dynamic environments due to motion blur and frame-rate constraints. Event cameras offer high temporal resolution and low latency, which make them potentially ideal vision sensors for object pose tracking at high speed. Even so, there are still only a few works on 6D pose tracking with event cameras. In this work, we take advantage of the high temporal resolution and propose a method that fuses a propagation step with a pose correction strategy. Specifically, we use 6D object velocity obtained from event-based optical flow for pose propagation, after which a template-based local pose correction module is used for pose correction. Our learning-free method has comparable performance to state-of-the-art algorithms, and in some cases outperforms them for fast-moving objects. The results indicate the potential for using event cameras in highly dynamic scenarios where deep network approaches are limited by low update rates.

2603.05425 2026-05-28 cs.CV cs.AI 版本更新

RelaxFlow: Text-Driven Amodal 3D Generation

RelaxFlow: 文本驱动的非模态3D生成

Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

发表机构 * National University of Singapore(新加坡国立大学) Zhejiang University(浙江大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对遮挡下图像到3D生成的语义歧义问题,提出无训练的双分支框架RelaxFlow,通过多先验共识模块和松弛机制解耦控制粒度,实现文本提示引导下对未观察区域的补全,同时严格保留输入观测。

Comments Accepted as a spotlight presentation at ICML 2026. Code: https://github.com/viridityzhu/RelaxFlow

详情
AI中文摘要

图像到3D生成在遮挡下面临固有的语义歧义,仅凭部分观测通常不足以确定物体类别。在这项工作中,我们形式化了文本驱动的非模态3D生成,其中文本提示引导对未观察区域的补全,同时严格保留输入观测。关键的是,我们识别出这些目标需要不同的控制粒度:对观测的刚性控制与对提示的松弛结构控制。为此,我们提出RelaxFlow,一个无训练的双分支框架,通过多先验共识模块和松弛机制解耦控制粒度。理论上,我们证明我们的松弛等价于在生成向量场上应用低通滤波器,抑制高频实例细节以隔离适应观测的几何结构。为便于评估,我们引入了两个诊断基准:ExtremeOcc-3D和AmbiSem-3D。大量实验表明,RelaxFlow成功引导未观察区域的生成以匹配提示意图,同时不损害视觉保真度。

英文摘要

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

2602.23754 2026-05-28 cs.GR cs.CV 版本更新

Neural Image Space Tessellation efect

神经图像空间镶嵌效应

Youyang Du, Junqiu Zhu, Zheng Zeng, Lu Wang, Lingqi Yan

发表机构 * Shandong University(山东大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) University of California, Santa Barbara(加州大学圣芭芭拉分校)

AI总结 提出一种轻量级屏幕空间后处理方法NIST,通过隐式变形图像空间轮廓并重新分配外观,减少低多边形渲染中的面状轮廓,实现接近基于镶嵌的平滑效果,且每帧成本几乎恒定。

详情
AI中文摘要

我们提出神经图像空间镶嵌效应(NIST),一种轻量级的屏幕空间后处理方法,用于减少低多边形渲染中的面状轮廓。NIST不进行图元镶嵌、创建新几何体或修改底层网格,而是利用低多边形渲染结果和简单的辅助G缓冲区属性,学习在图像空间中几何引导的对象轮廓平滑。其核心是,NIST首先隐式变形图像空间轮廓,然后学习在整个图像空间(包括变形区域)重新分配外观,保持纹理连续性并避免接缝伪影。实验表明,NIST减少了视觉上明显的几何面状化,并产生接近基于镶嵌的平滑参考的平滑、连贯轮廓,在我们测试的设置中每帧成本几乎恒定。据我们所知,NIST是第一个将低多边形轮廓面状化解决方案从渲染前几何阶段转移到渲染后屏幕空间阶段的工作。

英文摘要

We present Neural Image Space Tessellation effect (NIST), a lightweight screen-space post-processing approach for reducing the faceted silhouettes of low-poly renderings. Instead of tessellating primitives, creating new geometry, or modifying the underlying mesh, NIST uses the low-poly rendering result together with simple auxiliary G-buffer attributes to learn geometry-guided smoothing of object contours in image space. At its core, NIST first deforms image-space contours implicitly and then learns to reassign appearance across the whole image space, including the deformed regions, preserving texture continuity and avoiding seam artifacts. Experiments show that NIST reduces visually apparent geometric faceting and produces smooth, coherent silhouettes close to tessellation-based smoothing references, with a nearly constant per-frame cost in our tested settings. To the best of our knowledge, NIST is the first work to move the solution of low-poly silhouette faceting from the pre-rendering geometry stage to a post-rendering screen-space stage.

2602.22096 2026-05-28 cs.CV 版本更新

WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

WeatherCity: 可控多天气变换的城市场景重建

Wenhua Wu, Huai Guan, Zhe Liu, Hesheng Wang

发表机构 * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(自动化与智能感知学院,上海交通大学)

AI总结 提出WeatherCity框架,利用文本引导的图像编辑、天气高斯表示和物理驱动模型,实现高保真、时间一致的4D城市场景重建与多天气编辑。

详情
AI中文摘要

可编辑的高保真4D场景对于自动驾驶至关重要,因为它们可以应用于端到端训练和闭环仿真。然而,现有的重建方法主要局限于复制观察到的场景,缺乏多样化的天气模拟能力。而图像级别的天气编辑方法往往引入场景伪影,并且对天气效果的可控性较差。为了解决这些限制,我们提出了WeatherCity,一个用于4D城市场景重建和天气编辑的新框架。具体来说,我们利用文本引导的图像编辑模型来实现图像天气背景的灵活编辑。为了应对多天气建模的挑战,我们引入了一种基于共享场景特征和专用天气解码器的新型天气高斯表示。这种表示进一步通过内容一致性优化得到增强,确保不同天气条件下的连贯建模。此外,我们设计了一个物理驱动模型,通过粒子和运动模式模拟动态天气效果。在多个数据集和各种场景上的大量实验表明,WeatherCity在4D重建和天气编辑中实现了灵活的可控性、高保真度和时间一致性。我们的框架不仅能够对天气条件(例如小雨和大雪)进行细粒度控制,还支持场景内的物体级操作。代码已发布在https://github.com/IRMVLab/WeatherCity。

英文摘要

Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. Image-level weather editing methods, meanwhile, tend to introduce scene artifacts and offer poor controllability over weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene. Code is released at https://github.com/IRMVLab/WeatherCity.

2602.18647 2026-05-28 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

Noise Scheduling as Information-Guided Allocation in Diffusion Training

噪声调度作为扩散训练中的信息引导分配

Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni

发表机构 * Tilburg University & JADS(蒂尔堡大学及JADS) Sony AI(索尼人工智能) University of Cambridge(剑桥大学) Radboud University(拉德堡德大学) Sony Group Corporation(索尼集团公司)

AI总结 提出InfoNoise,一种在线自适应噪声调度方法,通过估计条件熵率剖面动态调整训练噪声分布,以优化去噪任务中的信息增益,在图像、DNA和语言生成等任务中达到或超越基线,并节省高达3倍训练计算量。

详情
AI中文摘要

我们引入了InfoNoise,一种用于扩散训练的在线自适应噪声调度,它将优化努力重新分配到去噪最具信息量的噪声水平上。与损失加权一起,噪声调度在去噪问题之间诱导出有效的分配,而这种分配通常在知道信息性噪声水平之前就已固定。InfoNoise通过从训练期间的去噪损失中估计条件熵率剖面,使这种分配具有数据自适应性,无需辅助模型或离线搜索。通过I--MMSE,该剖面识别出噪声观测在何处能快速减少关于干净样本的不确定性,并指导训练噪声分布的适应。它只改变这个分布,保持目标、加权和参数化不变。在图像基准测试中,调度已被广泛调整,InfoNoise匹配或略微超过强基线,并且可以用更少的更新达到相同的质量。在表示、序列和模态转换(包括DNA和语言生成)上,InfoNoise优于固定和自适应基线,并且达到目标质量所需的训练计算量最多减少3倍。这些结果确立了条件熵率剖面作为噪声调度设计的数据依赖目标,并使在线自适应成为手动调度搜索的实用替代方案。

英文摘要

We introduce InfoNoise, an online adaptive noise schedule for diffusion training that reallocates optimization effort toward noise levels where denoising is most informative. Together with loss weighting, a noise schedule induces an effective allocation across denoising problems, often fixed before informative noise levels are known. InfoNoise makes this allocation data-adaptive by estimating a conditional-entropy-rate profile from denoising losses during training, without auxiliary models or offline search. Through I--MMSE, this profile identifies where noisy observations rapidly reduce uncertainty about the clean sample and guides adaptation of the training noise distribution. It changes only this distribution, keeping the objective, weighting, and parameterization fixed. On image benchmarks, where schedules have been extensively tuned, InfoNoise matches or slightly exceeds strong baselines and can reach the same quality with fewer updates. On representation, sequence, and modality shifts, including DNA and language generation, InfoNoise improves over fixed and adaptive baselines and reaches target quality with up to $3\times$ less training compute. These results establish the conditional-entropy-rate profile as the data-dependent target for noise schedule design and make online adaptation a practical alternative to manual schedule search.
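The allocation idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the InfoNoise algorithm: it assumes the running denoising MSE at each noise level is a usable stand-in for the true MMSE, and it uses the I-MMSE identity dI/d(snr) = 0.5 * mmse(snr) to estimate where information is gained fastest per unit of log-SNR. The function name `allocation_from_losses`, the log-SNR grid, and the uniform `floor` mixing are all hypothetical choices made for this example.

```python
import numpy as np

def allocation_from_losses(log_snr_grid, mse_per_level, floor=0.05):
    """Turn per-noise-level denoising MSEs into a sampling distribution.

    By the I-MMSE identity, dI/d(snr) = 0.5 * mmse(snr), so the rate of
    information gain per unit of log-SNR is 0.5 * snr * mmse(snr).
    The measured denoising MSE stands in for the MMSE (an assumption).
    """
    snr = np.exp(np.asarray(log_snr_grid, dtype=float))
    mse = np.asarray(mse_per_level, dtype=float)
    rate = 0.5 * snr * mse              # estimated information rate per log-SNR
    probs = rate / rate.sum()           # normalize into a distribution
    uniform = np.full_like(probs, 1.0 / probs.size)
    # Mix with uniform so no noise level is starved (hypothetical safeguard).
    return (1.0 - floor) * probs + floor * uniform

# Usage: a scalar Gaussian source has mmse(snr) = 1 / (1 + snr).
grid = np.linspace(-4.0, 4.0, 9)
probs = allocation_from_losses(grid, 1.0 / (1.0 + np.exp(grid)))
```

In an online setting the MSE vector would presumably be a running average updated during training, with the sampling distribution refreshed from it periodically; the abstract does not specify those details.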

2602.16872 2026-05-28 cs.CV 版本更新

DODO: Discrete OCR Diffusion Models

DODO: 离散OCR扩散模型

Sean Man, Gilad Deutch, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

发表机构 * Technion - Israel Institute of Technology, Haifa, Israel(以色列理工学院,海法,以色列) Amazon Web Services(亚马逊网络服务)

AI总结 针对OCR任务中自回归解码速度慢的问题,提出首个利用块离散扩散的VLM模型DODO,在保持高精度的同时实现高达5倍的推理加速。

详情
AI中文摘要

光学字符识别(OCR)是数字化信息的基础任务,是视觉数据与文本理解之间的关键桥梁。虽然现代视觉语言模型(VLM)在该领域取得了高精度,但它们主要依赖自回归解码,这需要为每个生成的token进行顺序前向传播,因此在处理长文档时计算成本高且速度慢。我们发现了一个克服这一瓶颈的关键机会:与开放式生成不同,OCR是一个高度确定性的任务,视觉输入严格决定了唯一的输出序列,理论上可以通过扩散模型实现高效的并行解码。然而,我们表明现有的掩码扩散模型未能利用这一潜力;它们引入了结构不稳定性,这在灵活任务(如字幕生成)中无害,但对于OCR的刚性精确匹配要求则是灾难性的。为了弥合这一差距,我们引入了DODO,这是首个利用块离散扩散并释放其OCR加速潜力的VLM。通过将生成分解为块,DODO减轻了全局扩散的同步误差。实验上,我们的方法在实现接近最先进精度的同时,与自回归基线相比,推理速度提高了5倍。

英文摘要

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 5x faster inference compared to autoregressive baselines.

2602.13748 2026-05-28 cs.CL cs.CV 版本更新

RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

RMPL:基于关系感知的多任务渐进学习与分阶段训练的多媒体事件抽取

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 提出RMPL框架,通过分阶段训练结合单模态事件抽取和多模态关系抽取的异构监督,在低资源条件下实现多媒体事件抽取,并在M2E2基准上取得一致改进。

Comments Accepted by ACM ICMR 2026

详情
AI中文摘要

多媒体事件抽取(MEE)旨在从包含文本和图像的文档中识别事件及其论元。它需要跨不同模态对事件语义进行 grounding。MEE 的进展受到缺乏标注训练数据的限制。M2E2 是唯一已建立的基准,但它仅提供评估用的标注。这使得直接监督训练不切实际。现有方法主要依赖于跨模态对齐或使用视觉-语言模型(VLM)进行推理时提示。这些方法没有显式学习结构化的事件表示,并且通常在多模态设置中产生较弱的论元 grounding。为解决这些限制,我们提出了 RMPL,一种用于低资源条件下 MEE 的基于关系感知的多任务渐进学习框架。RMPL 通过分阶段训练整合了来自单模态事件抽取和多模态关系抽取的异构监督。模型首先使用统一模式进行训练,以学习跨模态的共享事件中心表示。然后,使用混合文本和视觉数据对模型进行微调,以进行事件提及识别和论元角色抽取。在 M2E2 基准上使用多个 VLM 进行的实验表明,在不同模态设置下均取得了一致的改进。

英文摘要

Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision--Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

2602.12843 2026-05-28 cs.CV 版本更新

MMRad-22K: A Structured Multimodal Evidence Dataset for Chest X-ray Report Generation

MMRad-22K:用于胸部X光报告生成的结构化多模态证据数据集

Yichen Zhao, Zelin Peng, Fenghe Tang, Piao Yang, Yu Huang, Wei Shen

发表机构 * MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机学院人工智能研究院,教育部人工智能重点实验室) School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC)(中国科学技术大学生命科学与医学部生物医学工程学院) Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC(中国科学技术大学苏州高等研究院,医学影像、机器人、分析计算与学习中心(MIRACLE)) Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine(浙江大学医学院附属第一医院放射科)

AI总结 针对胸部X光报告生成中现有资源监督信号碎片化的问题,提出结构化多模态证据数据集MMRad-22K,并基于统一LVLM骨干进行适配,证明结构化多模态证据优于纯文本或边界框证据,在语言和临床指标上表现更优。

详情
AI中文摘要

胸部X光(CXR)报告遵循基于区域的临床工作流程,放射科医生检查解剖区域并将局部发现整合到最终报告中。然而,现有的CXR报告生成资源以碎片化形式提供这些监督信号。我们引入MMRad-22K,一个将区域文本观察、解剖定位坐标、局部图像证据和报告目标组织成结构化多模态证据单元的数据集,用于CXR报告生成。为了推动这一构想,我们首先比较了不同证据格式对报告生成的影响,发现结构化多模态证据通常比纯文本或基于边界框的证据更有用。然后,我们使用MMRad-22K适配统一的LVLM骨干,并证明多模态证据适配在语言和临床导向指标上均优于文本证据适配和端到端适配。在相同的评估协议下,适配模型也达到了与几个开源LVLM参考相当的性能水平。这些结果共同支持MMRad-22K作为实用的结构化多模态资源,用于训练和评估与临床阅读工作流程一致的CXR报告生成。

英文摘要

Chest X-ray (CXR) reporting follows a region-based clinical workflow in which radiologists inspect anatomical regions and integrate localized findings into a final report. However, existing resources for CXR report generation provide these supervision signals in fragmented forms. We introduce MMRad-22K, a dataset that organizes regional textual observations, anatomical grounding coordinates, localized image evidence, and report targets into structured multimodal evidence units for CXR report generation. To motivate this formulation, we first compare different evidence formats for report generation and find that structured multimodal evidence is generally more useful than text-only or bounding box-based evidence. We then adapt a unified LVLM backbone using MMRad-22K and show that adaptation with multimodal evidence outperforms both textual-evidence adaptation and end-to-end adaptation on language and clinically oriented metrics. Under the same evaluation protocol, the adapted model also reaches a performance level comparable to several open-source LVLM references. Together, these results support MMRad-22K as a practical structured multimodal resource for training and evaluating CXR report generation aligned with clinical reading workflows.

2602.11564 2026-05-28 cs.CV 版本更新

LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts

LUVE:基于双频专家的潜在级联超高分辨率视频生成

Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, Ying Tai

发表机构 * Nanjing University(南京大学) Nanyang Technological University(南洋理工大学)

AI总结 提出LUVE框架,通过三阶段潜在级联架构(低分辨率运动生成、潜在空间上采样、高分辨率内容精炼)结合双频专家,解决超高分辨率视频生成中的运动建模、语义规划和细节合成难题。

Comments ICML 2026

详情
AI中文摘要

近期视频扩散模型在视觉质量上取得了显著进步,但超高分辨率(UHR)视频生成由于运动建模、语义规划和细节合成的复合困难,仍然是一个严峻挑战。为解决这些限制,我们提出了LUVE,一个基于双频专家(Experts)的潜在级联(Latent-cascaded)UHR视频(Video)生成框架。LUVE采用三阶段架构,包括用于运动一致潜在合成的低分辨率运动生成、直接在潜在空间进行分辨率上采样以减少内存和计算开销的视频潜在上采样,以及集成低频和高频专家以共同增强语义连贯性和细粒度细节生成的高分辨率内容精炼。大量实验表明,我们的LUVE在UHR视频生成中实现了卓越的照片真实感和内容保真度,全面的消融研究进一步验证了每个组件的有效性。项目可在https://unicornanrocinu.github.io/LUVE_web/获取。

英文摘要

Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at https://unicornanrocinu.github.io/LUVE_web/.

2511.18894 2026-05-28 cs.CV cs.AI 版本更新

Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels

并非所有像素都平等:面向含噪标签医学分割的像素级元学习

Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng

发表机构 * Xidian University(西安电子科技大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出MetaDCSeg框架,通过动态学习像素级权重并引入动态中心距离机制建模边界不确定性,抑制噪声标签影响并提升边界分割性能。

详情
AI中文摘要

医学图像分割对于临床应用至关重要,但常常受到噪声标注和模糊解剖边界的干扰,限制了其在现实场景中的应用。现有方法通常直接适应为实例分类设计的噪声标签学习技术,忽视了医学分割中像素级异质性及其空间和解剖上的难度差异。因此,全局假设或简单的置信度指标无法解决这些局部变化,导致边界模糊问题未得到解决。为解决这一问题,我们提出MetaDCSeg,一个鲁棒的框架,动态学习最优像素级权重以抑制噪声标签的影响,同时保留可靠标注。通过动态中心距离(DCD)机制显式建模边界不确定性,我们的方法利用前景、背景和边界中心的加权特征距离,引导模型关注模糊边界附近的难分割像素。该策略能够更精确地处理结构边界(这些边界常被现有方法忽略),并显著提升分割性能。在四个不同噪声水平的基准数据集上的大量实验表明,MetaDCSeg优于现有最先进方法。

英文摘要

Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, limiting its application in real-world scenarios. Existing methods often directly adapt noisy label learning techniques designed for instance classification, overlooking the pixel-wise heterogeneity in medical segmentation with its spatially and anatomically varying difficulties. Consequently, global assumptions or simple confidence metrics fail to address these local variations, leaving boundary ambiguities unresolved. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg outperforms existing state-of-the-art methods.

2602.07574 2026-05-28 cs.CV cs.CL 版本更新

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

ViCA:仅视觉交叉注意力的高效多模态大语言模型

Wenjie Liu, Hao Wu, Xin Qiu, Xudong Wang, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology(宁波东方理工大学数字孪生研究院) Munich Center for Machine Learning, LMU Munich(慕尼黑机器学习中心,慕尼黑大学)

AI总结 提出ViCA架构,通过仅视觉交叉注意力减少视觉令牌计算,在保持98%准确率的同时将视觉计算降至4%,实现显著加速。

详情
AI中文摘要

现代多模态大语言模型(MLLMs)采用统一的自我注意设计,在每个Transformer层处理视觉和文本令牌,导致大量计算开销。在这项工作中,我们重新审视了这种密集视觉处理的必要性,并表明投影的视觉嵌入已经与语言空间良好对齐,而有效的视觉-语言交互仅发生在少数层中。基于这些见解,我们提出了ViCA(仅视觉交叉注意力),一种最小的MLLM架构,其中视觉令牌绕过所有自我注意和前馈层,仅通过稀疏的交叉注意力在选定层与文本交互。在三个MLLM骨干、九个多模态基准和26个基于剪枝的基线上的广泛评估表明,ViCA在将视觉侧计算减少到4%的同时保持了98%的基线准确率,始终实现了优越的性能-效率权衡。此外,ViCA提供了一个规则的、硬件友好的推理流水线,在单批推理中实现了超过3.5倍的加速,在多批推理中实现了超过10倍的加速,与仅文本的LLM相比,将视觉定位减少到接近零的开销。它还与令牌剪枝方法正交,可以无缝结合以进一步提高效率。我们的代码可在https://github.com/EIT-NLP/ViCA获取。

英文摘要

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
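The core mechanism described above, text tokens reading from visual tokens through cross-attention while the visual tokens themselves are never updated, can be sketched as follows. This is a single-head NumPy illustration under stated assumptions, not the ViCA implementation: the function name `vision_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the residual form are hypothetical, and the abstract does not say which layers are selected or how projections are shared.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vision_cross_attention(text, visual, w_q, w_k, w_v):
    """Update text states (T, d) by attending over visual embeddings (V, d).

    Queries come from text only; keys and values come from visual tokens.
    The visual tokens are read but never modified, so they pass through
    no self-attention or feed-forward computation in this sketch.
    """
    q = text @ w_q                               # (T, d)
    k = visual @ w_k                             # (V, d)
    v = visual @ w_v                             # (V, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T, V)
    attn = softmax(scores, axis=-1)              # each text token's weights over visual tokens
    return text + attn @ v                       # residual update of text states only

# Usage with random inputs (3 text tokens, 5 visual tokens, width 8).
rng = np.random.default_rng(0)
text, visual = rng.normal(size=(3, 8)), rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
updated_text = vision_cross_attention(text, visual, w_q, w_k, w_v)
```

The cost of this step scales with T x V rather than (T + V)^2, which gives some intuition for why skipping dense visual processing can reduce computation when V is much larger than T.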

2412.01004 2026-05-28 cs.CV 版本更新

Take Only What You Need: Rank Minimization as an Implicit Forgetting Regularizer in Continual Learning

只取所需:秩最小化作为持续学习中的隐式遗忘正则化器

Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong

发表机构 * University of New South Wales(新南威尔士大学) CSIRO(澳大利亚联邦科学工业研究组织)

AI总结 本文提出CoDyRA方法,通过秩最小化作为隐式遗忘正则化器,在持续学习中平衡可塑性与稳定性,在多个基准上优于现有方法。

Comments Preprint

详情
AI中文摘要

持续学习中的核心张力是可塑性(获取新知识)与稳定性(保留先前知识)之间的权衡。我们研究如何通过容量控制(即调节每次参数更新的有效秩,这是LoRA更新中可直接控制的逐步骤量)来持续更新预训练骨干网络,使其吸收新知识的同时保留现有能力。对模块和任务间LoRA秩和放置的受控探测揭示了一致的权衡,存在一个随放置和任务变化的中等秩最佳点,没有普遍最优的固定秩;一个形式化界限表明遗忘随秩增长。基于这些发现,我们提出了持续动态秩选择LoRA(CoDyRA),该方法通过在每个组件重要性权重上施加稀疏性促进正则化,联合训练每个LoRA更新与秩最小化。监督目标驱动可塑性;秩最小化正则化遗忘。我们证明秩最小化在持续学习机制中充当隐式遗忘正则化器,通过控制相对于当前模型状态的遗忘,同时保护通用能力和先前任务知识。在MTIL、X-TAIL和TRACE(CLIP、LLaMA、Gemma)上,CoDyRA在新知识学习和遗忘方面优于先前的持续学习方法,实现了强大的可塑性-稳定性平衡。代码可在https://github.com/jeff024/codyra获取。

英文摘要

The central tension in continual learning (CL) is the trade-off between plasticity (acquiring new knowledge) and stability (retaining prior knowledge). We study how a pre-trained backbone can be continually updated to absorb new knowledge while preserving existing capabilities, via capacity control: regulating the effective rank of each parameter update, a per-step quantity directly controllable inside a LoRA update. A controlled probe of LoRA rank and placement across modules and tasks reveals a consistent trade-off, with a moderate-rank sweet spot that varies by placement and task, leaving no universally optimal fixed rank; a formal bound shows forgetting grows with rank. Building on these findings, we propose Continual Dynamic Rank-Selective LoRA (CoDyRA), which jointly trains each LoRA update with rank minimization via sparsity-promoting regularization on per-component importance weights. The supervised objective drives plasticity; rank minimization regularizes forgetting. We show that rank minimization serves as an implicit forgetting regularizer in the CL regime, protecting general capability and prior-task knowledge simultaneously by controlling forgetting against the current model state. Across MTIL, X-TAIL, and TRACE (CLIP, LLaMA, Gemma), CoDyRA outperforms prior CL methods on new knowledge learning and forgetting, achieving a strong plasticity-stability balance. Code is available at https://github.com/jeff024/codyra.
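To make the rank-control idea above concrete, here is a small sketch of a LoRA update whose rank-one components carry learnable importance weights, with sparsity applied to those weights. It is an illustration under assumptions rather than the CoDyRA method: the abstract only states that a sparsity-promoting regularizer acts on per-component importance weights, so the L1 proximal step (soft-thresholding) used here, and the names `soft_threshold`, `lora_delta`, and `effective_rank`, are hypothetical choices.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal step for an L1 penalty: shrink toward zero, clip small entries to 0."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def lora_delta(B, A, w):
    """Weighted LoRA update: delta_W = B @ diag(w) @ A.

    B: (d_out, r), A: (r, d_in), w: (r,) importance weights.
    Components with w == 0 contribute nothing to the update.
    """
    return (B * w) @ A

def effective_rank(w):
    """Number of rank-one components that remain active."""
    return int(np.count_nonzero(w))

# Usage: four candidate components, two of which are pruned by thresholding.
rng = np.random.default_rng(0)
B, A = rng.normal(size=(6, 4)), rng.normal(size=(4, 5))
w = soft_threshold(np.array([0.5, 0.05, -0.3, 0.01]), tau=0.1)
delta_w = lora_delta(B, A, w)
```

Because pruned components drop out of `delta_w` entirely, the update's rank is bounded by the number of nonzero importance weights, which is the quantity the abstract ties to forgetting.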

2602.03668 2026-05-28 cs.RO cs.CV 版本更新

MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

MVP-LAM:通过跨视角重建学习以动作为中心的潜在动作

Jung Min Lee, Dohyeok Lee, Seokhun Ju, Taehyun Cho, Jin Woo Koo, Li Zhao, Sangwoo Hong, Jungwoo Lee

发表机构 * Seoul National University, Seoul, South Korea(首尔国立大学,首尔,韩国) Konkuk University, Seoul, South Korea(建国大学,首尔,韩国) Microsoft Research Asia, Beijing, China(微软亚洲研究院,北京,中国) HodooAI Labs, Seoul, South Korea(HodooAI实验室,首尔,韩国)

AI总结 提出MVP-LAM模型,利用多视角视频通过跨视角重建目标学习与真实动作高度相关的潜在动作,提升动作预测和下游操作性能。

详情
AI中文摘要

从多样化人类视频中学习的潜在动作作为视觉-语言-动作(VLA)预训练的伪标签,但只有当它们对底层真实动作保持信息量时才能提供有效监督。为了有效监督,潜在动作应包含关于底层动作的信息,尽管这些信息不可直接获取。我们提出多视角潜在动作模型(MVP-LAM),该模型从多视角视频中学习与真实动作高度相关的潜在动作。MVP-LAM通过跨视角重建目标训练潜在动作,使得一个视角的潜在动作必须解释另一个视角的未来,从而减少对视角特定线索的依赖。在Bridge V2上,MVP-LAM生成更以动作为中心的潜在动作,与真实动作的互信息更高,动作预测性能提升,包括在分布外评估下。最后,使用MVP-LAM潜在动作预训练VLA模型提高了各种基准上的下游操作性能。代码和训练好的检查点可在https://jmsnu.github.io获取。

英文摘要

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jmsnu.github.io.

2602.03491 2026-05-28 cs.CV cs.CL 版本更新

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

解耦骨架与血肉:基于解缠对齐和结构感知引导的高效多模态表格推理

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室)

AI总结 提出DiSCo解缠结构-内容对齐框架和Table-GLS全局到局部结构引导推理框架,高效增强LVLM的表格理解与推理能力,无需昂贵监督或外部工具。

Comments Accepted as a Spotlight Paper at ICML 2026

详情
AI中文摘要

由于复杂的布局和紧密耦合的结构-内容信息,对表格图像进行推理对于大型视觉语言模型(LVLM)仍然具有挑战性。现有解决方案通常依赖于昂贵的监督训练、强化学习或外部工具,限制了效率和可扩展性。这项工作解决了一个关键问题:如何以最少的标注且无需外部工具来使LVLM适应表格推理?具体来说,我们首先引入了DiSCo,一种解缠结构-内容对齐框架,在多模态对齐期间明确分离结构抽象和语义基础,高效地将LVLM适应于表格结构。在DiSCo的基础上,我们进一步提出了Table-GLS,一种全局到局部结构引导推理框架,通过结构化探索和基于证据的推理来执行表格推理。跨多个基准的大量实验表明,我们的框架高效地增强了LVLM的表格理解和推理能力,特别是泛化到未见过的表格结构。我们的数据和代码可在https://github.com/AAAndy-Zhu/TableVLM获取。

英文摘要

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures. Our data and code are available at https://github.com/AAAndy-Zhu/TableVLM.

2602.02259 2026-05-28 cs.LG cs.CV 版本更新

Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

聚焦分割:在干扰物存在下引导潜在动作模型

Marcus Fechner, Hamza Adnan, Constantin C. Lüth, Matthew T. Jackson, Alexey Zakharov, J. Marius Zöllner

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) University of Oxford(牛津大学)

AI总结 针对动作相关视觉干扰导致潜在动作模型失效的问题,提出MaskLAM方法,利用分割基础模型(如SAM)零样本获取智能体掩码,限制重建目标于智能体像素,迫使潜在动作编码内源动态,显著提升下游策略性能。

详情
AI中文摘要

潜在动作模型(LAMs)为在大规模无动作视频上预训练具身智能体提供了一条有前景的路径。它们推断连续观测之间的潜在动作,之后可以使用少量标签解码为真实动作。然而,近期工作表明,在真实世界视频中常见的动作相关视觉干扰物(如动态背景、相机抖动或其他移动物体)存在时,这一方法会失败。在这些场景中,标准重建目标会驱使潜在动作编码外源运动而非智能体控制的动态,导致微调后的策略性能不佳。然而,我们观察到内源和外源因素通常在像素空间中是空间分离的:控制相关的变化集中在智能体上,而干扰物运动发生在别处。我们利用这一观察,将重建目标限制在智能体像素上,迫使潜在动作解释智能体控制的动态而非外源动态。我们将该方法称为MaskLAM;它从现成的分割基础模型(如SAM)中零样本获取智能体掩码,并且在预训练期间不需要架构更改、辅助损失或动作标签。在两个连续控制基准(Distracting Control Suite、Distracting Meta-World)上,MaskLAM将归一化线性探针MSE降低了最多$3.51\times$,并将归一化回报提高了最多$4.97\times$,相比LAPO,同时缩小了与依赖真实动作监督的LAOM-Labels之间的差距。

英文摘要

Latent action models (LAMs) offer a promising path to pre-training embodied agents on large amounts of action-free video. They infer latent actions between consecutive observations that can later be decoded to ground-truth actions using a small number of labels. However, recent work has shown that this recipe fails in the presence of action-correlated visual distractors common in real-world video, such as dynamic backgrounds, camera shake, or other moving objects. In these scenarios, the standard reconstruction objective drives latent actions to encode exogenous motion instead of agent-controlled dynamics, resulting in policies that underperform when fine-tuned. We observe, however, that endogenous and exogenous factors are typically spatially separated in pixel space: control-relevant change is concentrated on the agent, while distractor motion occurs elsewhere. We exploit this observation by restricting the reconstruction objective to agent pixels, forcing latent actions to explain agent-controlled dynamics rather than exogenous ones. We call this method MaskLAM; it obtains the agent mask zero-shot from off-the-shelf segmentation foundation models (e.g., SAM) and requires no architectural changes, auxiliary losses, or action labels during pre-training. Across two continuous-control benchmarks (Distracting Control Suite, Distracting Meta-World), MaskLAM reduces normalized linear-probe MSE by up to $3.51\times$ and improves normalized return by up to $4.97\times$ over LAPO, while narrowing the gap to LAOM-Labels, which relies on ground-truth action supervision.
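Restricting the reconstruction objective to agent pixels, as described above, amounts to a masked loss. The sketch below is a minimal NumPy illustration, not the MaskLAM training code: the function name `masked_reconstruction_loss`, the mean-squared-error form, and the normalization by the number of masked entries are assumptions, and the mask is taken as given (in the paper it comes from a segmentation foundation model such as SAM).

```python
import numpy as np

def masked_reconstruction_loss(pred, target, agent_mask, eps=1e-8):
    """Mean squared error over agent pixels only.

    pred, target: (H, W, C) predicted and true next frames.
    agent_mask:   (H, W) array, 1 on agent pixels and 0 elsewhere.
    Errors on background (distractor) pixels contribute nothing.
    """
    mask = np.asarray(agent_mask, dtype=float)[..., None]   # (H, W, 1), broadcast over channels
    sq_err = (pred - target) ** 2 * mask
    n_active = mask.sum() * pred.shape[-1]                  # number of masked scalar entries
    return sq_err.sum() / (n_active + eps)

# Usage: the prediction is wrong only on the background, so the loss is zero.
target = np.zeros((4, 4, 3))
agent_mask = np.zeros((4, 4))
agent_mask[1:3, 1:3] = 1
pred = np.ones((4, 4, 3))
pred[1:3, 1:3] = 0.0
loss = masked_reconstruction_loss(pred, target, agent_mask)
```

Since background errors carry no gradient under this loss, a latent action gains nothing by encoding background motion, which matches the motivation given in the abstract.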

2601.21666 2026-05-28 cs.AI cs.CV 版本更新

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

SONIC-O1:用于评估多模态大语言模型在音视频理解上的真实世界基准

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

发表机构 * Vector Institute for Artificial Intelligence(向量人工智能研究所) University of Groningen(格罗宁根大学) York University(约克大学)

AI总结 提出SONIC-O1基准,包含60小时人工验证的音视频数据,评估多模态大语言模型在开放摘要、多项选择问答和时序定位上的能力,发现模型在时序定位上存在显著性能差距和人口统计偏差。

详情
AI中文摘要

多模态大语言模型(MLLMs)是近期AI研究的主要焦点。然而,大多数先前工作集中于静态图像理解,而它们处理序列音视频数据的能力仍未充分探索。这一差距凸显了需要一个高质量基准来系统评估MLLM在真实世界场景中的性能。我们介绍了SONIC-O1,一个全面的、完全人工验证的基准,包含60小时(231个片段)跨越13个真实世界对话领域的数据,带有4,958个注释和人口统计元数据。SONIC-O1评估三种能力:开放摘要、多项选择题(MCQ)回答以及带有支持理由(推理)的时序定位。在闭源和开源模型中,我们发现MCQ准确率显示模型家族之间的差距最小,但最好的闭源模型在时序定位上比最好的开源模型高出22.6%。我们进一步观察到不同人口统计组在时序定位上的准确率差距高达21.4%,表明模型行为存在持续差异。SONIC-O1为基于时序和人口统计鲁棒的多模态理解提供了一个开放评估套件。SONIC-O1公开可用于研究:项目页面(https://vectorinstitute.github.io/sonic-o1/)、数据集(https://huggingface.co/datasets/vector-institute/sonic-o1)、GitHub(https://github.com/vectorinstitute/sonic-o1)、排行榜(https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard)。

英文摘要

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark of 60 hours (231 clips) spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates three capabilities: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Across closed- and open-source models, we find that the MCQ accuracy shows the smallest gap between model families, but the best closed-source model outperforms the best open-source model by 22.6% on temporal localization. We further observe accuracy gaps of up to 21.4% on temporal localization across demographic groups, indicating persistent disparities in model behaviour. SONIC-O1 provides an open evaluation suite for temporally grounded and demographically robust multimodal understanding. SONIC-O1 is publicly available for research: Project page (https://vectorinstitute.github.io/sonic-o1/), Dataset (https://huggingface.co/datasets/vector-institute/sonic-o1), GitHub (https://github.com/vectorinstitute/sonic-o1), Leaderboard (https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard).

2601.17737 2026-05-28 cs.CV cs.AI 版本更新

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

脚本即一切:一个用于长程对话到电影视频生成的智能体框架

Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus

发表机构 * Tencent(腾讯)

AI总结 提出一个端到端智能体框架,通过训练ScripterAgent将对话转化为精细脚本,并利用DirectorAgent跨场景连续生成策略,实现长程对话到电影视频的连贯生成,显著提升脚本忠实度和时间保真度。

详情
AI中文摘要

近期视频生成的进展产生了能够从简单文本提示合成惊艳视觉内容的模型。然而,这些模型难以从对话等高层概念生成连贯的长篇叙事,揭示了创意想法与其电影执行之间的“语义鸿沟”。为弥合这一鸿沟,我们引入了一个新颖的、端到端的智能体框架,用于对话到电影视频的生成。我们框架的核心是ScripterAgent,一个经过训练将粗略对话转化为精细、可执行的电影脚本的模型。为此,我们构建了ScriptBench,一个具有丰富多模态上下文的新大规模基准,通过专家引导的流程进行标注。生成的脚本随后指导DirectorAgent,它使用跨场景连续生成策略协调最先进的视频模型,以确保长程连贯性。我们的全面评估,包括一个AI驱动的CriticAgent和一个新的视觉-脚本对齐(VSA)指标,表明我们的框架在所有测试的视频模型上显著提高了脚本忠实度和时间保真度。此外,我们的分析揭示了当前SOTA模型在视觉奇观与严格脚本遵循之间的关键权衡,为自动化电影制作的未来提供了宝贵见解。

英文摘要

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2601.10714 2026-05-28 cs.CV cs.GR 版本更新

Alterbute: Editing Intrinsic Attributes of Objects in Images

Alterbute: 编辑图像中物体的内在属性

Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen

发表机构 * Google(谷歌) The Hebrew University of Jerusalem(耶路撒冷希伯来大学) Reichman University(雷赫曼大学)

AI总结 提出Alterbute方法,通过扩散模型结合松弛训练目标和视觉命名实体,在保持物体身份和场景上下文的同时编辑颜色、纹理、材质和形状等内在属性。

Comments ICML 2026. Project page is available at https://talreiss.github.io/alterbute/

详情
AI中文摘要

我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中物体的内在属性。我们允许改变物体的颜色、纹理、材质甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖无监督先验,往往无法保持身份,要么使用过度严格的监督,阻止有意义的内部变化。我们的方法依赖于:(i) 一个松弛的训练目标,允许模型在身份参考图像、描述目标内在属性的文本提示以及定义外在上下文的背景图像和物体掩码的条件下,改变内在和外在属性。在推理时,我们通过重用原始背景和物体掩码来限制外在变化,从而确保只改变所需的内在属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如“保时捷911 Carrera”),这些类别将共享身份定义特征的物体分组,同时允许内在属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和内在属性描述,从而实现可扩展的、保持身份的监督。Alterbute在保持身份的物体内在属性编辑方面优于现有方法。

英文摘要

We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

2601.10334 2026-05-28 cs.CV cs.LG 版本更新

An analytic theory of convolutional neural network inverse problems solvers

卷积神经网络逆问题求解器的解析理论

Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss

发表机构 * IRIT & CBI, CNRS & Université Toulouse, France Toulouse School of Economics, Université Toulouse Capitole, France

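AI summary Introduces the inductive biases of translation equivariance and finite receptive fields into the minimum mean square error (MMSE) estimator, derives an analytic formula for the local-equivariant MMSE, and verifies across multiple inverse problems, datasets, and architectures that it closely matches neural network outputs.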

详情
Journal ref
Forty-Third International Conference on Machine Learning, 2026
AI中文摘要

监督卷积神经网络(CNN)被广泛用于解决成像逆问题,在众多应用中取得了最先进的性能。然而,尽管取得了经验上的成功,这些方法从理论角度仍缺乏理解,常被视为黑箱。为弥合这一差距,我们通过最小均方误差(MMSE)估计器的视角分析训练后的神经网络,并引入捕获CNN两个基本归纳偏置(平移等变性和通过有限感受野的局部性)的功能约束。在经验训练分布下,我们推导出这种约束变体(称为局部等变MMSE,LE-MMSE)的解析、可解释且易于计算的公式。通过在不同逆问题(去噪、修复、去卷积)、数据集(FFHQ、CIFAR-10、FashionMNIST)和架构(U-Net、ResNet、PatchMLP)上的大量数值实验,我们证明了我们的理论与神经网络输出相匹配(PSNR $\gtrsim25$dB)。此外,我们提供了对物理感知和物理无关估计器之间差异、训练(补丁)分布中高密度区域的影响以及其他因素(数据集大小、补丁大小等)影响的见解。

英文摘要

Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural network outputs (PSNR $\gtrsim 25$ dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).
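
For intuition, the unconstrained special case is easy to write down. For Gaussian denoising, y = x + n with n ~ N(0, sigma^2 I), the MMSE estimator under the empirical distribution of a training set is a softmax-weighted average of the training samples, which is a standard result. The sketch below implements only that baseline. It is not the paper's LE-MMSE formula, which the abstract does not state and which additionally imposes locality and translation equivariance. The function name and the toy data are illustrative assumptions.

```python
import math

def empirical_mmse_denoise(y, train, sigma):
    """Posterior mean E[x | y] for y = x + N(0, sigma^2 I),
    with x drawn uniformly from the finite set `train`."""
    # Log-weights: -||y - x_i||^2 / (2 sigma^2) for each training sample x_i.
    logw = [
        -sum((a - b) ** 2 for a, b in zip(y, x)) / (2.0 * sigma ** 2)
        for x in train
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logw)
    w = [math.exp(l - m) for l in logw]
    z = sum(w)
    # Weighted average of the training samples, coordinate by coordinate.
    return [sum(wi * x[k] for wi, x in zip(w, train)) / z for k in range(len(y))]

# Hypothetical two-sample "dataset" of 2-pixel images.
train = [[0.0, 0.0], [1.0, 1.0]]
low_noise = empirical_mmse_denoise([0.9, 1.1], train, sigma=0.1)
high_noise = empirical_mmse_denoise([0.9, 1.1], train, sigma=100.0)
print(low_noise, high_noise)
```

With low noise the estimate snaps to the nearest training sample, and with very high noise it tends to the dataset mean. The same softmax-over-samples structure applied to patches is one plausible reading of "locality", but that connection is an assumption here.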

2601.08617 2026-05-28 cs.CV 版本更新

SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

SoC: 测试时提示调优的语义正交校准

Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz

发表机构 * MICS, CentraleSupélec, Université Paris-Saclay(MICS,CentraleSupélec,巴黎萨克雷大学) LIVIA, ILLS, ÉTS Montréal(LIVIA,ILLs,蒙特利尔ÉTS)

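AI summary Addresses the neglect of calibration in test-time prompt tuning of vision-language models by proposing SoC, a Huber-based regularizer that achieves smooth prototype separation while preserving semantic proximity, improving calibration while maintaining discriminative ability.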

详情
Journal ref
CVPR 2026
AI中文摘要

随着视觉语言模型(VLM)在医疗或自动驾驶等关键决策系统中的日益普及,对其不确定性估计的校准变得至关重要。然而,这一维度在VLM测试时提示调优(TPT)文献中尚未得到充分探索,该领域主要侧重于提升其判别性能。最近的最先进方法主张对文本提示嵌入对实施完全正交约束以增强可分离性,从而改善校准。然而,正如我们在理论上所示,完全正交约束的固有梯度会强烈地将语义相关的类别推开,最终使模型过度自信。基于我们的发现,我们提出了语义正交校准(SoC),一种基于Huber的正则化器,它在保持语义邻近性的同时实现平滑的原型分离,从而相比于先前的基于正交性的方法改善了校准。通过全面的实证验证,我们证明SoC在保持竞争性判别能力的同时,持续改善了校准性能。

英文摘要

With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
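
The contrast between a full-orthogonality penalty and a Huber-based one can be sketched on pairwise cosine similarities of class prototypes. The abstract does not give SoC's exact loss, so the form below is an assumption: a quadratic penalty on similarity stands in for the orthogonality constraint, and a standard Huber loss with a hypothetical threshold `delta` stands in for SoC. The point it illustrates is that the Huber gradient is capped for highly similar (semantically related) pairs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def quadratic(s):
    """Quadratic penalty; its gradient grows linearly with the similarity s."""
    return 0.5 * s * s

def huber(s, delta=0.3):
    """Standard Huber loss: quadratic near zero, linear beyond delta."""
    a = abs(s)
    return 0.5 * s * s if a <= delta else delta * (a - 0.5 * delta)

def huber_grad(s, delta=0.3):
    """Derivative of the Huber loss; its magnitude never exceeds delta."""
    if abs(s) <= delta:
        return s
    return delta if s > 0 else -delta

def pairwise_penalty(prototypes, loss):
    """Sum a loss over the similarities of all distinct prototype pairs."""
    total = 0.0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            total += loss(cosine(prototypes[i], prototypes[j]))
    return total

# Toy prototypes (hypothetical): two related classes and one unrelated class.
cat, lynx, truck = [1.0, 0.1], [0.9, 0.3], [0.0, 1.0]
s = cosine(cat, lynx)
print("similarity:", s, "quadratic grad:", s, "huber grad:", huber_grad(s))
```

For the related pair the quadratic gradient equals the similarity itself, while the Huber gradient saturates at `delta`, so related classes are pushed apart less aggressively.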

2601.03549 2026-05-28 cs.CV cs.CL 版本更新

FEA-SLT: A Gloss-Free End-to-End Framework for Facial-Expression-Aware Sign Language Translation

FEA-SLT:一种面向面部表情感知的手语翻译的无词汇端到端框架

Guobin Tu, Di Weng

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院)

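AI summary Proposes the FEA-SLT framework, whose facial-expression-aware fusion module uses facial dynamics as semantic anchors to resolve manual ambiguity in gloss-free sign language translation, achieving the best BLEU scores among gloss-free methods on PHOENIX14T and CSL-Daily.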

详情
AI中文摘要

手语翻译(SLT)是一项具有挑战性的跨模态任务,需要对手部动作和非手动信号进行联合建模。现有的无词汇SLT方法有效捕捉手势动态,但常常未充分利用面部表情,而面部表情在语法和消除歧义中起着关键作用。当不同概念共享相似手部配置时,这一限制可能导致语义退化。为解决此问题,我们提出FEA-SLT(面部表情感知手语翻译),一种无词汇端到端框架,利用面部动态作为语义锚点来消除手部歧义。FEA-SLT采用领域迁移的面部编码器提取表情敏感表示,并通过语言约束的面部表情感知融合(FEAF)模块将其与手部特征集成。FEAF通过双向调制捕捉手部和面部通道之间的相互依赖关系,增强句法保真度。在PHOENIX14T和CSL-Daily上的实验表明,FEA-SLT在无词汇方法中实现了最先进的BLEU性能,而针对性分析证实了其对面部敏感语句翻译的改进。代码可在[https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT)获取。

英文摘要

Sign Language Translation (SLT) is a challenging cross-modal task requiring joint modeling of manual articulations and non-manual signals. Existing gloss-free SLT methods effectively capture gestural dynamics but often underutilize facial expressions, which play crucial grammatical and disambiguating roles. This limitation can cause semantic degradation when distinct concepts share similar manual configurations. To address this issue, we propose FEA-SLT (**F**acial-**E**xpression-**A**ware **S**ign **L**anguage **T**ranslation), a gloss-free end-to-end framework that uses facial dynamics as semantic anchors for resolving manual ambiguity. FEA-SLT employs a domain-transferred facial encoder to extract expression-sensitive representations and integrates them with manual features through a linguistically constrained *Facial-Expression-Aware Fusion* (FEAF) module. FEAF captures reciprocal dependencies between manual and facial channels via bidirectional modulation, enhancing syntactic fidelity. Experiments on PHOENIX14T and CSL-Daily show that FEA-SLT achieves state-of-the-art BLEU performance among gloss-free methods, while targeted analyses confirm improved translation of facial-sensitive utterances. Code is available at [https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT).

2601.03048 2026-05-28 cs.CV cs.AI cs.CC 版本更新

On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

关于Transformer图像嵌入在非可解空间推理中的内在限制

Siyi Lyu, Quan Liu, Feng Yan

发表机构 * School of Electronic Science and Engineering, Nanjing University, Nanjing, China(电子科学与工程学院,南京大学,南京,中国)

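AI summary By formalizing spatial understanding as a group homomorphism problem, this paper argues that constant-depth Transformers, being limited to the complexity class TC⁰, cannot capture the spatial structure of non-solvable groups (such as SO(3)) in a single forward pass, assuming the standard conjecture that TC⁰ is strictly contained in NC¹.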

详情
AI中文摘要

视觉Transformer(ViT)在语义识别方面表现出色,但在心理旋转等空间推理任务中却出现系统性失败。虽然这通常归因于数据规模,但本文认为该限制源于架构的内在电路复杂度。通过将空间理解形式化为学习一个群同态问题——其中潜在嵌入保留作用于图像的物理变换的代数结构——我们识别出一个基本的计算瓶颈。具体来说,对于非可解群(例如$\mathrm{SO}(3)$),维持这种保结构嵌入的下界由单词问题决定,该问题是$\mathsf{NC^1}$-完全的。相比之下,具有多项式精度的恒定深度ViT严格受限于复杂度类$\mathsf{TC^0}$。在标准猜想$\mathsf{TC^0} \subsetneq \mathsf{NC^1}$下,出现了一个复杂度边界:恒定深度架构缺乏在单次前向传播中捕获非可解空间结构所需的逻辑深度。为了实证验证这一理论差距,我们提出了潜在空间代数(LSA)基准,该基准揭示了随着非可解任务组合深度的增加,ViT表示出现显著退化。

英文摘要

Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as learning a Group Homomorphism Problem -- where latent embeddings preserve the algebraic structure of physical transformations acting on images -- we identify a fundamental computational bottleneck. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision are strictly bounded by the complexity class $\mathsf{TC^0}$. Under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, a complexity boundary emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures in a single forward pass. To empirically validate this theoretical gap, we propose the Latent Space Algebra (LSA) benchmark, which reveals a significant degradation in ViT representations as the compositional depth of non-solvable tasks increases.
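
The word problem referred to above asks whether a product of group elements equals the identity. A minimal sketch follows, using the symmetric group S5 (a standard example of a non-solvable finite group) in place of SO(3) as an assumption for illustration; the helper names are hypothetical and this is not the LSA benchmark. The loop composes elements one at a time, which is the kind of sequential, order-dependent computation the abstract argues constant-depth architectures cannot carry out in a single pass.

```python
def compose(p, q):
    """Return the permutation 'apply q first, then p' on indices 0..n-1."""
    return tuple(p[q[i]] for i in range(len(q)))

def word_is_identity(word, n=5):
    """Decide the word problem: does the product of `word` equal the identity?"""
    acc = tuple(range(n))
    for g in word:
        # Each new element acts after everything accumulated so far.
        acc = compose(g, acc)
    return acc == tuple(range(n))

# Elements of S5 written as tuples: position i is mapped to value p[i].
five_cycle = (1, 2, 3, 4, 0)
swap_01 = (1, 0, 2, 3, 4)
swap_12 = (0, 2, 1, 3, 4)

print(word_is_identity([five_cycle] * 5))                      # five applications of a 5-cycle
print(compose(swap_01, swap_12) == compose(swap_12, swap_01))  # order of composition matters
```

Because the group is non-commutative, the result depends on the order of every element in the word, so the compositional depth of a task grows with the word length.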

2504.10079 2026-05-28 cs.CV 版本更新

Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

层次化关系增强表示泛化用于少样本动作识别

Hongyu Qu, Ling Xing, Jiachao Zhang, Rui Yan, Yazhou Yao, Xiangbo Shu

发表机构 * School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院) Artificial Intelligence Industrial Technology Research Institute, Nanjing Institute of Technology(南京工程学院人工智能产业技术研究院)

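AI summary Proposes the HR2G-shot framework, which unifies inter-frame, inter-video, and inter-task relation modeling to learn task-specific temporal patterns from a holistic view and improve few-shot action recognition.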

详情
AI中文摘要

少样本动作识别(FSAR)旨在通过少量样本识别新动作类别。现有方法通常通过设计帧间时间建模策略或粗粒度视频级交互来学习每个视频的帧级表示。然而,它们孤立地处理每个情节任务,忽略了视频间的细粒度时间关系建模,因此无法捕获跨视频共享的细粒度时间模式,也无法重用历史任务的时间知识。鉴于此,我们提出了HR2G-shot,一种用于FSAR的层次化关系增强表示泛化框架,它统一了三种关系建模(帧间、视频间和任务间),从整体视角学习任务特定的时间模式。除了进行帧间时间交互外,我们进一步设计了两个组件分别探索视频间和任务间关系:i) 视频间语义相关性(ISC)以细粒度方式执行跨视频帧级交互,从而捕获任务特定的查询特征,并增强类内一致性和类间可分离性;ii) 任务间知识迁移(IKT)从存储历史情节任务中多样时间模式的库中检索和聚合相关时间知识。在五个基准上的大量实验表明,HR2G-shot优于当前领先的FSAR方法。

英文摘要

Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies or inter-video interaction at the coarse video-level granularity. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture shared fine-grained temporal patterns across videos and to reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Going beyond inter-frame temporal interactions, we further devise two components to explore inter-video and inter-task relationships, respectively: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from a bank that stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current leading FSAR methods.

2601.00501 2026-05-28 cs.CV 版本更新

CPPO: Contrastive Perception Policy Optimization for VLM Agents

CPPO: 面向VLM智能体的对比感知策略优化

Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Yong Zhang, Mohammad Akbari

发表机构 * Huawei Technologies Canada Co. Ltd.(华为技术加拿大有限公司) Huawei Cloud(华为云)

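AI summary Proposes CPPO, a self-supervised contrastive perception policy optimization method that strengthens the visual grounding of vision-language models through a contrastive perception loss, needs no extra models or annotations, and outperforms existing methods on perception-critical tasks.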

详情
AI中文摘要

我们引入了CPPO,一种用于微调视觉语言模型(VLM)的对比感知策略优化方法。可靠的感知是基于VLM的智能体在开放环境中推理和行动的核心要求:错误的视觉基础直接导致错误的行为、幻觉工具调用和不安全的决策。虽然强化学习(RL)显著提升了语言模型的推理能力,但将这些进展扩展到多模态智能体需要同时改进感知和推理。先前的工作主要通过显式感知奖励来解决这一挑战,这通常需要额外的LLM评判器、真实标注或强制将感知与推理分离。CPPO通过扩展RL目标,引入对比感知损失(CPL),以自监督方式解决了这一限制,为视觉基础提供了直接的学习信号。对比目标鼓励模型对输入的视觉信息更加敏感。为了有效应用这一信号,CPPO利用在扰动图像下模型输出分布中的熵移机制识别感知令牌,并在训练期间选择性地对这些令牌应用对比损失。实验表明,CPPO在避免额外模型的同时超越了先前方法,使训练更加高效和可扩展,并产生了更适合感知关键智能体任务的策略。

英文摘要

We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision--language models (VLMs). Reliable perception is a core requirement for VLM-based agents that must reason and act in open-ended environments: faulty visual grounding cascades directly into faulty actions, hallucinated tool calls, and unsafe decisions. While reinforcement learning (RL) has significantly improved reasoning in language models, extending these advances to multimodal agents requires improving both perception and reasoning. Prior works address this challenge mainly through explicit perception rewards, which often require extra LLM judges, ground-truth annotations, or forced separation of perception from reasoning. CPPO addresses this limitation in a self-supervised manner by extending the RL objective with a Contrastive Perception Loss (CPL) that provides a direct learning signal for visual grounding. The contrastive objective encourages the model to become more sensitive to input visual information. To apply this signal effectively, CPPO identifies perception tokens using an entropy-shift mechanism in the model's output distributions under perturbed images and applies the contrastive loss selectively to those tokens during training. Experiments show that CPPO surpasses prior methods while avoiding extra models, making training more efficient and scalable, and yielding policies that are better suited to perception-critical agentic tasks.
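
The entropy-shift idea can be sketched as follows: for each generated token, compare the entropy of the model's output distribution given the original image with the entropy given a perturbed image, and treat the tokens with the largest shift as perception tokens. The abstract does not specify the selection rule, so the top-k rule on the absolute shift below is an assumption, and the per-token distributions are hypothetical inputs rather than real model outputs.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def perception_tokens(clean_dists, perturbed_dists, top_k):
    """Return indices of the top_k tokens whose output entropy changes
    most when the input image is perturbed (assumed selection rule)."""
    shifts = [
        abs(entropy(pp) - entropy(pc))
        for pc, pp in zip(clean_dists, perturbed_dists)
    ]
    order = sorted(range(len(shifts)), key=lambda t: shifts[t], reverse=True)
    return sorted(order[:top_k])

# Hypothetical next-token distributions for three positions.
clean = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]
perturbed = [[0.5, 0.5], [0.5, 0.5], [0.98, 0.02]]
print(perception_tokens(clean, perturbed, top_k=1))
```

In a training loop, a contrastive loss would then be applied only at the returned positions; that loss is not sketched here because the abstract does not define it.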

2512.16483 2026-05-28 cs.CV 版本更新

FasterVAR: Plug-and-Play Acceleration for Visual Autoregressive Models

FasterVAR:视觉自回归模型的即插即用加速

Senmao Li, Kai Wang, Salman Khan, Fahad Shahbaz Khan, Jian Yang, Yaxing Wang

发表机构 * PCA Lab, VCIP, College of Computer Science, Nankai University(南开大学计算机学院、VCIP、PCA实验室) Program of Computer Science, City University of Hong Kong (Dongguan), China(香港城市大学(东莞)计算机系,中国) City University of Hong Kong, HK SAR, China(香港城市大学,香港特别行政区,中国) Mohamed bin Zayed University of Artificial Intelligence, UAE(阿联酋穆罕默德·本·扎耶德人工智能大学) Linkoping University, Sweden(林雪平大学,瑞典) PCA Lab, School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院、PCA实验室) College of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

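AI summary To address the high computational cost of VAR models at large-scale steps, proposes FasterVAR, a stage-aware plug-and-play acceleration framework that keeps the critical early steps intact and prunes or approximates the later detail-refining steps, achieving up to 3.4x speedup with almost no performance loss.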

Comments Accepted at ICML2026

详情
AI中文摘要

视觉自回归(VAR)建模通过下一尺度预测偏离了传统自回归(AR)模型的下一个标记预测范式,实现了高质量的图像生成。然而,VAR范式在大尺度步骤上面临计算复杂度和运行时间急剧增加的问题。尽管现有的加速方法减少了大尺度步骤的运行时间,但依赖于手动步骤选择,并忽略了生成过程中不同阶段的不同重要性。为了解决这一挑战,我们提出了FasterVAR,一个对VAR模型的系统研究和即插即用加速框架。我们的分析表明,早期步骤对于保持语义和结构一致性至关重要,应保持不变,而后期步骤主要细化细节,可以被剪枝或近似以加速。基于这些见解,FasterVAR引入了一种即插即用加速策略,利用后期计算中的语义无关性和低秩属性,无需额外训练。我们提出的FasterVAR实现了最高3.4倍的加速,且几乎没有性能损失,持续优于现有的加速基线。这些结果凸显了阶段感知设计作为高效视觉自回归图像生成的一个强大原则。

英文摘要

Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present FasterVAR, a systematic study and plug-and-play acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, FasterVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed FasterVAR achieves up to 3.4x speedup with almost no performance loss, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.

2512.00814 2026-05-28 cs.CV 版本更新

IRPO: Boosting Image Restoration via Post-training GRPO

IRPO:通过后训练GRPO提升图像恢复

Haoxuan Xu, Yi Liu, Tianfu Li, Ruolin Shen, Boyuan Jiang, Jinlong Peng, Donghao Luo, Xiaobin Hu, Shuicheng Yan, Haoang Li

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Tsinghua University(清华大学) Technical University of Munich(慕尼黑技术大学) Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学) National University of Singapore(新加坡国立大学)

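AI summary Proposes IRPO, a framework that uses GRPO post-training to optimize deterministic restoration models; with data selection and composite reward modeling, it markedly improves performance on both in-domain and out-of-domain tasks.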

详情
AI中文摘要

后训练在高层次生成任务中已变得有效,但在低层次视觉中的作用仍未被充分探索。现有的图像恢复方法通常依赖于对真实图像的固定逐像素拟合,这可能导致过度平滑和泛化能力弱。我们提出了IRPO,一个基于GRPO的后训练框架,用于确定性恢复模型。IRPO围绕两个轴构建:数据公式化和奖励建模。对于数据公式化,我们从预训练阶段选择表现最差的30%样本,这提高了准确性和训练效率。对于奖励建模,我们将面向保真度和面向质量的反馈与三个组件结合:用于结构保真度的通用奖励、使用视觉-语言模型作为粗粒度视觉质量评判的专家奖励,以及用于任务特定低级线索的恢复奖励。在六个域内和五个域外基准上的实验表明,IRPO在域内任务上将AdaIR基线提高了0.93 dB,在域外设置上提高了3.43 dB。我们的代码可在https://github.com/HaoxuanXU1024/IRPO查看。

英文摘要

Post-training has become effective for high-level generation, but its role in low-level vision remains underexplored. Existing image restoration methods often rely on fixed pixel-wise fitting to ground-truth images, which can lead to over-smoothing and weak generalization. We propose IRPO, a GRPO-based post-training framework for deterministic restoration models. IRPO is built around two axes: data formulation and reward modeling. For data formulation, we select the 30% underperforming samples from the pre-training stage, which improves both accuracy and training efficiency. For reward modeling, we combine fidelity-oriented and quality-aware feedback with three components: a General Reward for structural fidelity, an Expert Reward that uses a Vision-Language Model as a coarse visual-quality judge, and a Restoration Reward for task-specific low-level cues. Experiments on six in-domain and five out-of-domain (OOD) benchmarks show that IRPO improves the AdaIR baseline by 0.93 dB on in-domain tasks and 3.43 dB on OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
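
The two axes described above can be sketched in a few lines: picking the lowest-scoring 30% of samples, and turning a composite reward into group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation). The reward weights, the per-sample scores, and all function names are assumptions for illustration; the abstract does not say how the three rewards are combined.

```python
import statistics

def select_underperforming(scores, fraction=0.3):
    """Indices of the lowest-scoring `fraction` of samples, worst first."""
    k = max(1, round(len(scores) * fraction))
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]

def composite_reward(general, expert, restoration, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward components (weights are hypothetical)."""
    return weights[0] * general + weights[1] * expert + weights[2] * restoration

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidate outputs."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

# Hypothetical per-sample scores (e.g. PSNR in dB) from the pre-training stage.
psnr = [30.1, 25.0, 28.4, 22.3, 31.0, 27.5, 24.1, 29.9, 26.2, 33.0]
hard_subset = select_underperforming(psnr)

# Hypothetical reward components for three candidate restorations of one input.
rewards = [composite_reward(g, e, r) for g, e, r in [(0.2, 0.3, 0.5), (0.6, 0.7, 0.7), (1.0, 1.0, 1.0)]]
print(hard_subset, group_relative_advantages(rewards))
```

Candidates that beat their own group's average receive a positive advantage, so no separate value model is needed for the baseline.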

2508.13544 2026-05-28 cs.CV cs.AI 版本更新

FLAIR: Frequency- and Locality-Aware Implicit Neural Representations

FLAIR: 频率与位置感知的隐式神经表示

Sukhun Ko, Seokhyun Youn, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh

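AI summary To address the spectral bias caused by the lack of frequency selectivity and spatial localization in implicit neural representations, proposes Band-Localized Activation and Wavelet-Energy-Guided Encoding, improving 2D image representation, 3D shape reconstruction, and novel view synthesis.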

Comments CVPR Findings 2026 (camera ready ver.). Please visit our project page at https://cmlab-korea.github.io/FLAIR/

详情
AI中文摘要

隐式神经表示利用神经网络将坐标映射到对应信号,实现连续且紧凑的表示。该范式推动了各种视觉任务的重大进展。然而,现有的隐式神经表示缺乏频率选择性和空间定位,导致过度依赖冗余信号分量。因此,它们表现出频谱偏差,倾向于早期学习低频分量,而难以捕捉精细的高频细节。为了解决这些问题,我们提出了FLAIR(频率与位置感知的隐式神经表示),它包含两个关键创新。第一个是带限局部激活(BLA),这是一种新颖的激活函数,设计用于在时频不确定性原理(TFUP)约束下进行联合频率选择和空间定位。通过结构化的频率控制和空间局部响应,BLA有效减轻了频谱偏差并增强了训练稳定性。第二个是小波能量引导编码(WEGE),它利用离散小波变换计算能量分数,并显式地将频率信息引导到网络,实现精确的频率选择和自适应频带控制。我们的方法在2D图像表示、3D形状重建和新视角合成方面始终优于现有的隐式神经表示。

英文摘要

Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity and spatial localization, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is Band-Localized Activation (BLA), a novel activation designed for joint frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). Through structured frequency control and spatially localized responses, BLA effectively mitigates spectral bias and enhances training stability. The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform to compute energy scores and explicitly guide frequency information to the network, enabling precise frequency selection and adaptive band control. Our method consistently outperforms existing INRs in 2D image representation, as well as 3D shape reconstruction and novel view synthesis.

2512.01988 2026-05-28 cs.CV 版本更新

Artemis: Structured Visual Reasoning for Perception Policy Learning

Artemis: 用于感知策略学习的结构化视觉推理

Wei Tang, Yanpeng Sun, Shan Zhang, Weihao Bo, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li

发表机构 * NJUST IMAG(南京理工大学图像所) YZU(宜春大学) SUTD IMPL(新加坡科技设计大学智能感知实验室) Adelaide AIML(阿德莱德人工智能实验室) Data61 CSIRO(澳大利亚联邦科学与工业研究组织) ZJU(浙江大学) UNSW Sydney(新南威尔士大学悉尼分校) SenseTime Research(商汤科技研究院)

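AI summary Proposes Artemis, which replaces linguistic reasoning with structured visual reasoning (intermediate steps represented as (label, bounding-box) pairs) to improve visual perception policies and handle diverse perception tasks in a unified way.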

详情
AI中文摘要

最近的视觉感知策略强化学习框架通常结合用自然语言表达的中间推理链。经验观察表明,这种纯语言中间推理通常会降低感知任务的性能。我们认为核心问题不在于推理本身,而在于推理的形式:虽然这些链在非结构化的语言空间中进行语义推理,但视觉感知需要在空间和以对象为中心的空间中进行推理。为此,我们引入了Artemis,一种感知策略学习方法,它执行结构化的视觉推理,其中每个中间步骤都表示为一个(标签,边界框)对,捕获可验证的视觉状态。这种设计能够显式跟踪中间状态,直接监督提议质量,并避免基于语言的推理引入的歧义。基于可验证和空间定位的推理链,Artemis为各种感知任务提供了统一的架构,无需依赖先前感知策略模型所依赖的任务特定设计。使用自然图像域中的定位和检测样本进行训练,Artemis泛化到计数和几何感知任务。其核心是空间定位的、以对象为中心的链式规则,为可扩展和通用的感知策略提供了原则性基础。

英文摘要

Recent reinforcement-learning frameworks for visual perception policy usually incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning method that performs structured visual reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision of proposal quality, and it avoids the ambiguity introduced by language-based reasoning. Building upon verifiable and spatially grounded reasoning chains, Artemis provides a unified architecture for diverse perceptual tasks, without requiring the task-specific designs relied upon by prior perceptual policy models. Trained using grounding and detection samples in natural image domains, Artemis generalizes to counting and geometric perception tasks. At its core, a spatially grounded, object-centric chain rule provides a principled foundation for scalable and general perceptual policies.
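
What makes a (label, bounding-box) step "verifiable" can be illustrated with a standard intersection-over-union check against annotated boxes. This is a sketch under assumptions: the box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the function names are hypothetical, and the abstract does not describe how Artemis actually scores its intermediate steps.

```python
def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def verify_step(step, annotations, threshold=0.5):
    """A reasoning step passes if some annotation has the same label
    and overlaps the predicted box by at least `threshold` IoU."""
    label, box = step
    return any(l == label and iou(box, b) >= threshold for l, b in annotations)

# A hypothetical two-step reasoning chain and its ground-truth annotations.
chain = [("dog", (10, 10, 50, 50)), ("ball", (60, 60, 70, 70))]
truth = [("dog", (12, 12, 52, 52)), ("ball", (80, 80, 90, 90))]
print([verify_step(step, truth) for step in chain])
```

Unlike a free-form sentence, each step can be checked mechanically, which is what allows direct supervision of intermediate states.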

2511.20934 2026-05-28 cs.AI cs.CV cs.LG 版本更新

Guaranteed Optimal Compositional Explanations for Neurons

神经元的保证最优组合解释

Biagio La Rosa, Leilani H. Gilpin

发表机构 * Computer Science and Engineering Department, University of California, Santa Cruz, US(加州大学圣克鲁兹分校计算机科学与工程系)

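AI summary Proposes the first framework that, through a decomposition, a heuristic, and an algorithm, computes guaranteed optimal compositional explanations over the entire state space, and shows that 10-40% of beam-search explanations are suboptimal when concepts overlap.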

Comments Accepted at ICML 2026 (Oral), 43 pages, 10 figures

详情
AI中文摘要

组合解释是一类方法,旨在通过逻辑规则描述神经元感受野激活与概念之间的空间对齐,通常通过搜索所有可能的概念组合来计算。由于在整个状态空间上计算空间对齐在计算上不可行,文献中通常采用与组合结构相关的假设和波束搜索来限制状态空间。然而,波束搜索无法提供任何最优性的理论保证,且当前解释与真正最优解的接近程度仍不清楚。在这篇理论性论文中,我们通过引入首个框架来解决这一差距,该框架在采用假设所涵盖的整个状态空间上计算保证最优的组合解释。具体而言,我们提出:(i) 一种识别影响空间对齐因素的分解方法,(ii) 一种在搜索任何阶段估计对齐的启发式方法,以及(iii) 第一个能够在与穷举波束搜索相当的时间内计算最优组合解释的算法。使用该框架,我们证明当涉及重叠概念时,先前通过波束搜索获得的10-40%的解释是次优的。最后,我们评估了一种由我们提出的分解和启发式方法引导的波束搜索变体,表明它在超参数和计算资源方面提供更大灵活性的同时,匹配或改进了先前方法的运行时间。

英文摘要

Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations and concepts through logical rules, typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts assumptions related to the structure of the combinations and beam search to restrict the state space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations over the entire state space spanned by the adopted assumptions. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations in a time comparable to exhaustive beam search. Using this framework, we demonstrate that 10-40% of explanations previously obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.
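
A toy example shows why beam search can miss the optimum when concepts overlap. Masks are modeled as sets of pixel indices, alignment is intersection over union, and formulas are limited to two concepts joined by OR, AND, or AND NOT. Brute-force enumeration stands in for the paper's algorithm, which the abstract does not specify; the masks, concept names, and beam width of 1 are all assumptions chosen for illustration.

```python
from itertools import permutations

OPS = {
    "OR": lambda a, b: a | b,
    "AND": lambda a, b: a & b,
    "AND NOT": lambda a, b: a - b,
}

def iou(a, b):
    """Spatial alignment between two masks given as sets of pixel indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def beam1_pair(neuron, concepts):
    """Greedy search (beam width 1): keep the best single concept,
    then extend it with the best operator and second concept."""
    first = max(concepts, key=lambda c: iou(neuron, concepts[c]))
    return max(
        (iou(neuron, op(concepts[first], concepts[c])), f"{first} {name} {c}")
        for name, op in OPS.items()
        for c in concepts
        if c != first
    )

def exhaustive_pair(neuron, concepts):
    """Score every ordered pair of concepts under every operator."""
    return max(
        (iou(neuron, op(concepts[a], concepts[b])), f"{a} {name} {b}")
        for a, b in permutations(concepts, 2)
        for name, op in OPS.items()
    )

neuron = {0, 1, 2, 3, 4, 5}
concepts = {
    "water": {0, 1, 2, 6},
    "river": {3, 4, 5, 7},
    "blue": {0, 1, 2, 3, 8, 9},  # overlaps with both other concepts
}
print(beam1_pair(neuron, concepts), exhaustive_pair(neuron, concepts))
```

Here "blue" is the best single concept, so the greedy search commits to it and never considers "water OR river", which has the higher alignment.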

2511.20439 2026-05-28 cs.CV cs.AI 版本更新

Object-Centric Vision Token Pruning for Vision Language Models

面向视觉语言模型的以对象为中心的视觉令牌剪枝

Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

发表机构 * Aalto University(阿尔托大学) University of Electronic Science and Technology of China(电子科技大学) Delft University of Technology(代尔夫特理工大学)

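AI summary Proposes OC-VTP, which lightly pre-trains an object-centric vision token pruner to directly select the most representative vision tokens, improving VLM inference efficiency while preserving high accuracy.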

详情
AI中文摘要

在视觉语言模型(VLM)中,与语言令牌相比,视觉令牌数量庞大但信息分散,因此消耗了大量不必要的计算。为了提升VLM推理效率,剪枝冗余视觉令牌的研究一直在进行,但现有方法都采用间接且无保证的方式。我们提出了OC-VTP,一种直接且有保证的方法,用于选择最具代表性的视觉令牌,以实现高效且保持精度的VLM推理。我们的OC-VTP仅需对一个小型的以对象为中心的视觉令牌剪枝器进行轻量预训练,然后即可将其插入现有VLM中,无需在任何数据集上微调任何模型。通过最小化从所选令牌重建原始未剪枝令牌的误差,保证保留最具代表性的视觉令牌。在任何视觉剪枝比例(即推理效率)下,我们的OC-VTP都能一致地帮助主流VLM保持最高的推理精度。我们的剪枝还展示了有趣的可解释性。我们的代码可在 https://github.com/GarryLarry010131/OC-VTP 获取。

英文摘要

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires only light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs without fine-tuning any model on any dataset. The most representative vision tokens are guaranteed to be kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.
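
The selection criterion, keeping the tokens from which the full set is best reconstructed, can be illustrated with a deliberately simple stand-in. In the sketch below each original token is "reconstructed" by its nearest kept token and tokens are chosen greedily. This is an assumption made for illustration: OC-VTP uses a pre-trained object-centric pruner whose details the abstract does not give, and the toy token vectors are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two token vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recon_error(tokens, kept):
    """Total error when every token is replaced by its nearest kept token."""
    return sum(min(sq_dist(t, tokens[k]) for k in kept) for t in tokens)

def greedy_select(tokens, budget):
    """Greedily add the token that lowers reconstruction error the most."""
    kept = []
    for _ in range(budget):
        candidates = [i for i in range(len(tokens)) if i not in kept]
        best = min(candidates, key=lambda i: recon_error(tokens, kept + [i]))
        kept.append(best)
    return sorted(kept)

# Four toy "vision tokens" forming two tight groups (e.g. two objects).
tokens = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
kept = greedy_select(tokens, budget=2)
print(kept, recon_error(tokens, kept))
```

With a budget of two, one token is kept per group, which matches the intuition that representative tokens should cover each object rather than cluster on one.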

2511.02558 2026-05-28 cs.CV cs.LG q-bio.NC 版本更新

Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction

预测未来解剖结构:纵向脑MRI到MRI的预测

Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka

发表机构 * A.I. Virtanen Institute for Molecular Sciences, University of Eastern Finland, Kuopio, Finland(A.I. Virtanen分子科学研究所,东芬兰大学,库奥普io,芬兰)

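AI summary Studies the prediction of future brain MRI from baseline MRI, using five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on the ADNI and AIBL datasets to achieve high-fidelity voxel-level prediction, and validates cross-cohort generalization.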

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), Apr. 2026
AI中文摘要

从基线磁共振图像(MRI)预测未来脑状态是神经影像学的一个核心挑战,对研究阿尔茨海默病(AD)等神经退行性疾病具有重要意义。大多数现有方法预测未来认知评分或临床结果,例如从轻度认知障碍向痴呆的转化。相反,本文研究纵向MRI图像到图像的预测,该预测可以预测参与者未来数年的整个脑部MRI,内在建模复杂的、空间分布的神经退行模式。我们在两个纵向队列(ADNI和AIBL)上实施并评估了五种深度学习架构(UNet、U2-Net、UNETR、时间嵌入UNet和ODE-UNet)。使用捕捉全局相似性和局部差异的指标,将预测的随访MRI与实际随访扫描直接进行比较。表现最佳的模型实现了高保真预测,并且所有模型都能很好地泛化到独立的外部数据集,展示了稳健的跨队列性能。我们的结果表明,深度学习可以在体素水平上可靠地预测参与者特定的脑部MRI,为个体化预后提供了新的机会。

英文摘要

Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.

2511.15390 2026-05-28 cs.CV 版本更新

Automatic Pruning Discovery for Large Language Models

大型语言模型的自动剪枝发现

Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang

发表机构 * Northeastern University, Shenyang, China(东北大学(沈阳)) Hebei Key Laboratory of Marine Perception Network and Data Processing, Northeastern University at Qinhuangdao 066004, Hebei Province, China(河北省海洋感知网络与数据处理重点实验室,东北大学秦皇岛分校,066004,河北省) Sun Yat-sen University, Shenzhen, China(中山大学,深圳) Hong Kong Baptist University, Hongkong, China(香港浸会大学) Xidian University, Xian, China(西安电子科技大学)

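AI summary Proposes AutoPrune, which uses LLMs to design pruning algorithms automatically, optimizes prompts with a Graph-driven Chain-of-Thought, and addresses the outlier issue at high pruning ratios with Skew-aware Dynamic Sparsity Allocation, outperforming existing methods on mainstream benchmarks.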

Comments 15 pages, 10 figures

详情
AI中文摘要

大型语言模型(LLMs)在广泛任务上取得了显著性能,但由于其庞大的规模,阻碍了实际部署。现有的针对LLMs的剪枝方法(例如Wanda)严重依赖手动设计的剪枝算法,从而导致巨大的人力成本并需要专家知识。此外,我们首次识别出在高剪枝率下由均匀稀疏性导致的严重异常值问题,这引发了关于如何为LLMs设计自适应剪枝稀疏度的额外担忧。LLMs能否自行剪枝?在这项工作中,我们通过提出一种名为AutoPrune的新型剪枝方法给出了肯定答案,该方法首次通过利用LLMs自动为其自身设计最优剪枝算法,无需任何专家知识,从而克服了专家知识的限制。具体来说,为了缓解LLMs的黑箱性质,我们提出了一种图驱动思维链(GCoT)来优化提示,显著增强了学习剪枝算法中的推理过程,并使我们能够生成具有卓越性能和可解释性的下一代剪枝算法。最后,基于对异常值问题的洞察,我们引入了偏态感知动态稀疏分配(SDSA)来克服异常值问题,减轻高剪枝率下的性能下降。我们在主流LLMs基准上进行了广泛实验,证明了AutoPrune的优越性,它始终优于最先进的竞争对手。

英文摘要

Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods (e.g., Wanda) tailored for LLMs rely heavily on manually designed pruning algorithms, which leads to huge labor costs and requires expert knowledge. Furthermore, we are the first to identify a serious outlier-value issue behind the dramatic performance degradation under high pruning ratios caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity suited to LLMs. Can LLMs prune themselves? In this work, we answer affirmatively by proposing a novel pruning method called AutoPrune, which for the first time overcomes the limits of expert knowledge by leveraging LLMs to automatically design optimal pruning algorithms for themselves. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier-value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors.

2511.14558 2026-05-28 cs.CV 版本更新

Explaining Digital Pathology Models via Clustering Activations

通过激活聚类解释数字病理学模型

Adam Bajger, Jan Obdržálek, Vojtěch Kůr, Rudolf Nenutil, Petr Holub, Vít Musil, Tomáš Brázdil

发表机构 * Faculty of Informatics, Masaryk University, Brno, Czech Republic(马萨里克大学信息学院,布尔诺,捷克共和国) Institute of Computer Science, Masaryk University, Brno, Czech Republic(马萨里克大学计算机科学研究所,布尔诺,捷克共和国) Masaryk Memorial Cancer Institute, Brno, Czech Republic(马萨里克纪念癌症研究所,布尔诺,捷克共和国)

AI总结 提出一种基于卷积神经网络激活聚类的可解释性方法,通过展示模型全局行为并提供细粒度信息,增强对数字病理学模型的理解和信任。

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)
AI中文摘要

我们提出了一种基于聚类的可解释性技术,用于基于卷积神经网络的数字病理学模型。与常用的基于显著性图的方法(如遮挡、GradCAM或相关性传播)不同,这些方法突出显示对单个切片预测贡献最大的区域,而我们的方法展示了所考虑模型的全局行为,同时提供了更细粒度的信息。结果聚类不仅可以可视化以理解模型,还可以增加对其操作的信心,从而在临床实践中更快地采用。我们还评估了我们的技术在现有用于检测前列腺癌的模型上的性能,证明了其实用性。

英文摘要

We present a clustering-based explainability technique for digital pathology models based on convolutional neural networks. Unlike commonly used methods based on saliency maps, such as occlusion, GradCAM, or relevance propagation, which highlight regions that contribute the most to the prediction for a single slide, our method shows the global behaviour of the model under consideration, while also providing more fine-grained information. The resulting clusters can be visualised not only to understand the model, but also to increase confidence in its operation, leading to faster adoption in clinical practice. We also evaluate the performance of our technique on an existing model for detecting prostate cancer, demonstrating its usefulness.
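The idea of clustering activations can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or which network layer is used, so plain Lloyd's k-means, the toy 2-D "activation" vectors, and the function name `kmeans` are all assumptions made for illustration only.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means over activation vectors (lists of floats)."""
    # Deterministic init for this sketch: the first k points become centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical activation vectors collected from patches of many slides.
activations = [[0.1, 0.2], [5.0, 5.1], [0.0, 0.1], [5.2, 4.9], [0.2, 0.0]]
labels, centroids = kmeans(activations, k=2)
```

Because the vectors are pooled across slides before clustering, each cluster describes a recurring pattern in the model as a whole, which is what distinguishes this view from a per-slide saliency map.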

2510.27266 2026-05-28 cs.CV 版本更新

Enhancing Trustworthy GUI Grounding via Self-Critiqued Reinforcement Learning

通过自我批评强化学习增强可信的GUI定位

Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, Jian Luan

发表机构 * Xiaomi Inc(小米公司)

AI总结 提出HyperClick框架,通过自我批评强化学习联合优化定位准确性和置信度可靠性,实现可信的GUI定位。

详情
AI中文摘要

自主图形用户界面(GUI)代理依赖于准确的GUI定位,将语言指令映射到屏幕坐标,以执行用户命令。然而,当前的模型,无论是通过监督微调(SFT)还是强化学习(RL)训练的,通常提供的置信度信号与实际定位正确性对齐不良,导致过度自信且不可靠的预测。为了解决这个问题,我们提出了HyperClick,一种通过自我批评强化学习(SCRL)增强可信GUI定位的新框架。HyperClick结合了正确性奖励和置信度对齐奖励,训练策略模型同时输出点击预测和明确的置信度估计。这种方法通过基于置信度的自我评估,联合优化了定位准确性和置信度可靠性。在具有挑战性的基准测试上的大量实验表明,HyperClick在保持强大定位性能的同时,提供了更好对齐的置信度估计。通过在GUI动作旁边暴露不确定性,HyperClick支持GUI自动化中基于置信度的弃权。代码将在此处发布。

英文摘要

Autonomous graphical user interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement learning (RL), often provide confidence signals that are poorly aligned with actual grounding correctness, leading to overconfident and unreliable predictions. To address this, we propose HyperClick, a novel framework that enhances trustworthy GUI grounding through self-critiqued reinforcement learning (SCRL). HyperClick combines a correctness reward and a confidence alignment reward, training the policy model to output both a click prediction and an explicit confidence estimate. This approach jointly optimizes grounding accuracy and confidence reliability through confidence-based self-assessment. Extensive experiments on challenging benchmarks show that HyperClick maintains strong grounding performance while providing better-aligned confidence estimates. By exposing uncertainty alongside GUI actions, HyperClick supports confidence-based abstention in GUI automation. Code will be released here.
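A rough sketch of how a correctness reward and a confidence alignment reward could be combined is given below. The abstract does not state the reward formulas, so the point-in-box correctness test, the Brier-style alignment term, the `weight` parameter, and all function names are assumptions rather than HyperClick's actual definitions.

```python
def correctness_reward(click, box):
    """1.0 if the predicted click falls inside the target box, else 0.0."""
    x, y = click
    x0, y0, x1, y1 = box
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

def confidence_alignment_reward(confidence, correct):
    """Assumed Brier-style term: highest when confidence matches correctness."""
    return 1.0 - (confidence - correct) ** 2

def total_reward(click, confidence, box, weight=1.0):
    """Sum of the two terms; `weight` is a hypothetical balancing factor."""
    correct = correctness_reward(click, box)
    return correct + weight * confidence_alignment_reward(confidence, correct)

target = (0, 0, 10, 10)
confident_hit = total_reward((5, 5), 0.9, target)     # correct and confident
confident_miss = total_reward((50, 50), 0.9, target)  # wrong but confident
hedged_miss = total_reward((50, 50), 0.1, target)     # wrong and uncertain
```

Under this assumed shaping, a wrong click reported with low confidence scores higher than a wrong click reported with high confidence, which is the behaviour that makes confidence-based abstention usable downstream.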

2510.18668 2026-05-28 cs.LG cs.CV 版本更新

Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches

面向心血管传感器贴片的端到端多模态微型CNN原型设计

Mustafa Fuad Rifet Ibrahim, Tunc Alkanat, Felix Manthey, Maurice Meijer, Alexander Schlaefer, Peer Stelldinger

发表机构 * CTO System Innovation, NXP Semiconductors Germany GmbH(NXP半导体德国系统创新部) Advanced Chip Engineering, NXP Semiconductors(NXP半导体先进芯片工程部) Business Line Secure Connected Edge, NXP Semiconductors(NXP半导体安全连接边缘业务线) Institute of Medical Technology and Intelligent Systems, Hamburg University of Technology(汉堡技术大学医学技术与智能系统研究所) Department of Informatics, Hamburg University of Applied Sciences(汉堡应用科学大学信息学院)

AI总结 针对资源受限的医疗边缘设备,提出一种早期融合心电图和心音图数据的卷积神经网络,实现二分类,相比现有技术将内存和计算成本降低约三个数量级,并验证了在微控制器上的能效优势。

Comments 11 pages, 2 figures. Extended version of our 2024 IEEE PerCom paper, with direct on-device energy measurements, a BLE communication benchmark, architecture comparisons, and an extended evaluation. Submitted to Biomedical Signal Processing and Control; Fixed typos

详情
AI中文摘要

绝大多数心血管疾病如果早期发现风险因素和迹象是可以预防的。使用体戴式传感器贴片等设备进行心血管监测,可以在保持患者自由和舒适的同时检测这些迹象。然而,传感器数据的分析必须稳健、可靠、高效且高度准确。深度学习方法可以自动化数据解读,减轻临床医生的工作负担。在这项工作中,我们分析了在资源受限的医疗边缘设备上应用深度学习模型对同步心电图(ECG)和心音图(PCG)记录进行分类的可行性。我们提出了一种具有早期数据融合的卷积神经网络来解决二分类问题。该模型在Physionet Challenge 2016数据集的同步ECG和PCG记录上进行训练和验证。与现有技术相比,我们的方法将内存占用和计算成本降低了约三个数量级,同时保持了有竞争力的准确性。我们进一步通过测量配备神经处理单元(NPU)的微控制器上的能耗,并在代表性BLE评估套件上对一系列有效载荷大小的蓝牙低功耗(BLE)通信能耗进行基准测试,证明了所提模型在医疗边缘设备上的适用性。比较结果证实,设备端推理比连续数据流传输更节能。

英文摘要

The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. The model is trained and validated on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by approximately three orders of magnitude compared with the state-of-the-art while maintaining competitive accuracy. We further demonstrate the applicability of the proposed model on medical edge devices by measuring its energy consumption on a microcontroller equipped with a neural processing unit (NPU) and benchmarking the energy of Bluetooth Low Energy (BLE) communication on a representative BLE evaluation kit across a range of payload sizes. The comparison confirms that on-device inference can be more energy efficient than continuous data streaming.

2510.15541 2026-05-28 cs.LG cs.CV eess.IV 版本更新

An Empirical Study on Variance-based MC Dropout Uncertainty-Error Correlation in 2D Brain Tumor Segmentation

基于方差的MC Dropout不确定性-误差相关性在二维脑肿瘤分割中的实证研究

Saumya B

发表机构 * Project Associate DESE, Indian Institute of Science(DESE项目助理,印度科学研究院)

AI总结 通过U-Net在四种增强设置下的实验,发现基于方差的MC Dropout不确定性在全局和边界上与分割误差的相关性较弱,表明其局限性。

Comments v2: Updated title and framing to clarify that findings are specific to variance-based uncertainty estimation via MC Dropout, not MC Dropout broadly. Minor textual improvements throughout. Code and results available at https://github.com/Saumya4321/mcd-error-correlation

详情
AI中文摘要

从MRI中准确分割脑肿瘤对诊断和治疗规划至关重要。尽管蒙特卡洛(MC) Dropout被广泛用于估计模型不确定性,但基于方差的不确定性(通过随机前向传递的逐像素方差计算)在识别分割误差(尤其是肿瘤边界附近)方面的有效性尚未得到充分研究。本研究使用在四种增强设置(无增强、水平翻转、旋转和缩放)下训练的U-Net,实证检验了基于方差的MC Dropout不确定性与二维脑肿瘤MRI分割误差之间的关系。不确定性估计为50次随机前向传递的逐像素方差,并使用Pearson和Spearman系数与逐像素误差进行相关性分析。结果显示全局相关性较弱(r ~ 0.30-0.38),边界相关性可忽略(|r| < 0.05)。尽管不同增强设置之间的差异具有统计显著性(p < 0.001),但缺乏实际意义。这些发现表明,基于方差的MC Dropout不确定性为全局和边界误差定位提供的线索有限,且不确定性表示的选择对MC Dropout在医学图像分割中的效用有重要影响。替代表示如预测熵或互信息可能更好地捕捉分割误差,尤其是在边界处。

英文摘要

Accurate brain tumor segmentation from MRI is vital for diagnosis and treatment planning. Although Monte Carlo (MC) Dropout is widely used to estimate model uncertainty, the effectiveness of variance-based uncertainty - computed as pixel-wise variance across stochastic forward passes - in identifying segmentation errors, particularly near tumor boundaries, remains insufficiently studied. This study empirically examines the relationship between variance-based MC Dropout uncertainty and segmentation error in 2D brain tumor MRI segmentation using a U-Net trained under four augmentation settings: none, horizontal flip, rotation, and scaling. Uncertainty was estimated as the pixel-wise variance across 50 stochastic forward passes and correlated with pixel-wise errors using Pearson and Spearman coefficients. Results show weak global correlations (r ~ 0.30-0.38) and negligible boundary correlations (|r| < 0.05). Although differences across augmentations were statistically significant (p < 0.001), they lacked practical relevance. These findings suggest that variance-based MC Dropout uncertainty provides limited cues for global and boundary error localization, and that the choice of uncertainty representation critically affects the utility of MC Dropout in medical image segmentation. Alternative representations such as predictive entropy or mutual information may better capture segmentation errors, particularly at boundaries.
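The measurement described above can be sketched in a few lines: take the pixel-wise variance across stochastic forward passes, then correlate it with a pixel-wise error map. The toy probability maps (3 passes over 4 pixels instead of 50 passes over an image), the 0.5 threshold, and the helper names are assumptions; only the Pearson coefficient is shown, and Spearman would apply the same formula to ranks.

```python
import statistics

def pixelwise_variance(passes):
    """Variance of each pixel's predicted probability across T passes."""
    return [statistics.pvariance(pixel) for pixel in zip(*passes)]

def pixelwise_error(passes, truth, threshold=0.5):
    """1.0 where the thresholded mean prediction disagrees with the label."""
    mean_prob = [statistics.fmean(pixel) for pixel in zip(*passes)]
    return [float((p >= threshold) != bool(t)) for p, t in zip(mean_prob, truth)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical flattened probability maps from 3 stochastic passes.
passes = [[0.9, 0.6, 0.1, 0.2],
          [0.9, 0.4, 0.1, 0.8],
          [0.9, 0.8, 0.1, 0.6]]
truth = [1, 0, 0, 1]

variance = pixelwise_variance(passes)
errors = pixelwise_error(passes, truth)
r = pearson(variance, errors)
```

In the toy data the pixel with the largest variance is classified correctly, which gives an intuition for why the correlation between variance and error can be weak.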

2510.08555 2026-05-28 cs.CV 版本更新

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

VideoCanvas: 通过上下文条件化从任意时空补丁进行统一视频补全

Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue

发表机构 * MMLab, The Chinese University of Hong Kong(香港中文大学MMLab) Kling Team, Kuaishou Technology(快手科技Kling团队)

AI总结 提出VideoCanvas框架,通过混合条件化策略(空间上使用零填充全帧画布编码,时间上使用Temporal RoPE插值)实现任意时空视频补全的统一任务,无需修改或重新训练VAE。

Comments Project page: https://onevfall.github.io/project_page/videocanvas

详情
AI中文摘要

现有的可控视频生成方法通常针对刚性、任务特定的设置设计,例如首帧图像到视频、修复或插值,将时空控制视为一组孤立的问题。我们形式化了一个统一的任务——任意时空视频补全,其中模型从用户指定的、放置在任何空间位置和时间戳的补丁生成连贯视频。然而,在现代潜在视频扩散模型中实现这样的统一框架并非易事:因果视频VAE将多个帧压缩到单个潜在槽中,使得帧级条件化从根本上不适定,并且直接将稀疏填充、零填充的视频输入馈入VAE会导致严重的分布外伪影。为了解决这些挑战,我们提出了VideoCanvas,一个简单而有效的框架,它将上下文条件化范式适应于任意时空补全,无需修改或重新训练VAE。我们的关键思想是一种混合条件化策略,将空间和时间控制解耦:在空间上,我们以图像模式编码零填充的全帧画布,使VAE输入保持分布内;在时间上,我们使用Temporal RoPE插值为每个条件分配潜在序列中的连续分数索引,以实现精确的帧级对齐。为了评估这种能力,我们开发了VideoCanvasBench,这是第一个用于任意时空视频补全的基准测试,涵盖场景内保真度和场景间创造力。大量实验表明,VideoCanvas在单一统一框架下,在各种视频生成任务中实现了最先进的性能。

英文摘要

Existing controllable video generation methods are typically designed for rigid, task-specific settings, such as first-frame image-to-video, inpainting, or interpolation, treating spatio-temporal control as a set of isolated problems. We formalize a unified task, arbitrary spatio-temporal video completion, where a model generates a coherent video from user-specified patches placed at any spatial location and timestamp. However, realizing such a unified framework within modern latent video diffusion models is non-trivial: causal video VAEs compress multiple frames into a single latent slot, making frame-level conditioning fundamentally ill-posed, and directly feeding sparsely populated, zero-padded video inputs into the VAE leads to severe out-of-distribution artifacts. To address these challenges, we propose VideoCanvas, a simple yet effective framework that adapts the In-Context Conditioning paradigm to arbitrary spatio-temporal completion without modifying or retraining the VAE. Our key idea is a hybrid conditioning strategy that decouples spatial and temporal control: spatially, we encode zero-padded full-frame canvases in image mode to keep VAE inputs in-distribution, and temporally we use Temporal RoPE Interpolation to assign each condition a continuous fractional index in the latent sequence for precise frame-level alignment. To evaluate this capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Extensive experiments demonstrate that VideoCanvas achieves state-of-the-art performance across a diverse range of video generation tasks under a single, unified framework.
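To make the idea of a continuous fractional index concrete, here is a minimal sketch. It assumes a temporal compression stride of 4 and the simple mapping `frame_idx / stride`; the abstract gives neither, and the function names are hypothetical. The rotation itself is standard rotary position embedding (RoPE), which accepts non-integer positions without modification.

```python
import math

def fractional_latent_index(frame_idx, stride=4):
    """Assumed mapping from a pixel-frame index to a continuous latent position."""
    return frame_idx / stride

def rope_angles(position, dim, base=10000.0):
    """Standard RoPE rotation angles for one (possibly fractional) position."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by the position-dependent angles."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

# A condition placed at pixel frame 6 lands between latent slots 1 and 2.
pos = fractional_latent_index(6)
rotated = apply_rope([1.0, 0.0, 0.5, 0.5], pos)
```

The point of the sketch is that the position is encoded in the rotation angle, so a condition frame that falls inside a compressed latent slot can still be given its own position.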

2510.06928 2026-05-28 cs.CV 版本更新

IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

IAR2:通过语义-细节关联令牌预测改进自回归视觉生成

Ran Yi, Teng Hu, Zihan Su, Jiangning Zhang, Lizhuang Ma

发表机构 * Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学)

AI总结 提出IAR2框架,通过语义-细节关联双码本和分层预测机制,实现从粗到细的图像生成,在ImageNet上取得FID 1.50的领先性能。

详情
AI中文摘要

自回归模型已成为视觉内容创建的有力范式,但常常忽略视觉数据的内在结构特性。我们之前的工作IAR通过基于嵌入相似性重新组织视觉码本,开启了解决这一问题的方向,从而提高了生成的鲁棒性。然而,它受到预训练码本的刚性和硬均匀聚类的不准确性的限制。为了克服这些限制,我们提出了IAR2,一种先进的自回归框架,实现了层次化的语义-细节合成过程。IAR2的核心是一种新颖的语义-细节关联双码本,它将图像表示解耦为用于全局语义信息的语义码本和用于细粒度细节的细节码本。它将量化能力从线性扩展到多项式规模,显著增强了表达能力。为了适应这种双重表示,我们提出了一种语义-细节自回归预测方案,结合局部上下文增强自回归头,执行分层预测——先预测语义令牌,再预测细节令牌——同时利用局部上下文窗口增强空间连贯性。此外,对于条件生成,我们引入了一种渐进式注意力引导的自适应CFG机制,该机制根据每个令牌与条件的相关性及其在生成序列中的时间位置动态调节引导尺度,在不牺牲真实性的情况下改善条件对齐。大量实验表明,IAR2在自回归图像生成上树立了新的最先进水平,在ImageNet上实现了1.50的FID。我们的模型不仅在性能上超越了先前的方法,而且展示了卓越的计算效率,突显了我们结构化、从粗到细生成策略的有效性。

英文摘要

Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction (first the semantic token, then the detail token) while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving an FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.
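One way to read the "linear to polynomial" capacity claim is sketched below: two codebooks store a number of entries that grows with their sum, while the number of representable (semantic, detail) pairs grows with their product. The codebook sizes and function names are hypothetical, and this interpretation is an assumption based on the abstract alone.

```python
def dual_codebook_capacity(n_semantic, n_detail):
    """Return (entries stored, distinct pairs representable)."""
    stored = n_semantic + n_detail          # grows linearly with codebook size
    representable = n_semantic * n_detail   # grows with the product of the sizes
    return stored, representable

def pair_to_flat_index(semantic_id, detail_id, n_detail):
    """Map a (semantic, detail) pair to a unique index in the product space."""
    return semantic_id * n_detail + detail_id

# Hypothetical sizes; the paper's actual codebook sizes are not in the abstract.
stored, representable = dual_codebook_capacity(1024, 1024)
last_index = pair_to_flat_index(1023, 1023, 1024)
```

This also motivates the two-step prediction order described in the abstract: predicting the semantic token first narrows the choice before the detail token is selected.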

2502.17832 2026-05-28 cs.LG cs.AI cs.CR cs.CV 版本更新

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

MM-PoisonRAG:通过局部和全局投毒攻击破坏多模态RAG

Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California Los Angeles(加州大学洛杉矶分校)

AI总结 提出MM-PoisonRAG框架,通过局部投毒攻击(LPA)和全局投毒攻击(GPA)两种策略,系统研究多模态检索增强生成(RAG)在知识投毒下的脆弱性,实验表明攻击成功率高达56%且能绕过现有防御。

Comments Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG

详情
AI中文摘要

检索增强生成(RAG)已成为多模态大语言模型(MLLM)中增强事实基础并减少幻觉的常见做法。然而,其对检索的依赖使MLLM面临知识投毒攻击,攻击者故意将恶意多模态内容注入外部知识库,以引导模型生成不正确甚至有害的响应。我们提出MM-PoisonRAG框架,系统研究多模态RAG在知识投毒下的脆弱性。具体地,我们设计了两种新颖的攻击策略:局部投毒攻击(LPA),植入针对特定查询的多模态错误信息以操纵输出至攻击者控制的响应;以及全局投毒攻击(GPA),使用单一、非定向的对抗性注入广泛破坏推理并降低所有查询的生成质量。在多样化任务、多模态RAG组件和攻击者访问级别上的大量实验揭示了严重的脆弱性:LPA即使在受限访问下也能达到高达56%的攻击成功率,并且无需重新优化对抗样本即可在四种不同的检索器之间有效迁移。GPA仅需一个投毒内容即可完全破坏模型生成,使准确率降至0%。此外,LPA和GPA均能绕过现有防御,突显了多模态RAG的脆弱性,并将MM-PoisonRAG确立为未来保护RAG框架免受多模态知识投毒研究的基础。

英文摘要

Retrieval-augmented generation (RAG) has become a common practice in multimodal large language models (MLLM) to enhance factual grounding and reduce hallucination. Yet, its reliance on retrieval exposes MLLMs to knowledge poisoning attacks, in which adversaries deliberately inject malicious multimodal content into external knowledge bases to steer models toward generating incorrect or even harmful responses. We present MM-PoisonRAG, a framework to systematically study the vulnerability of multimodal RAG under knowledge poisoning. Specifically, we design two novel attack strategies: Localized Poisoning Attack (LPA), which implants targeted, query-specific multimodal misinformation to manipulate outputs toward attacker-controlled responses, and Globalized Poisoning Attack (GPA), which uses a single, untargeted adversarial injection to broadly corrupt reasoning and collapse generation quality across all queries. Extensive experiments on diverse tasks, multimodal RAG components, and attacker access levels reveal severe vulnerabilities: LPA achieves up to 56% attack success rate even under restricted access, and transfers effectively across four different retrievers without re-optimizing the adversaries. GPA completely disrupts model generation to 0% accuracy with just one poisoned content. Moreover, both LPA and GPA bypass existing defenses, underscoring the fragility of multimodal RAG and establishing MM-PoisonRAG as a foundation for future research on securing RAG frameworks against multimodal knowledge poisoning.

2508.21046 2026-05-28 cs.CV cs.RO 版本更新

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

CogVLA: 通过指令驱动路由与稀疏化实现认知对齐的视觉-语言-动作模型

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区计算机科学与技术学院)

AI总结 提出CogVLA框架,通过指令驱动路由和稀疏化机制,在LIBERO基准和真实机器人任务上以2.5倍训练成本降低和2.8倍推理延迟降低实现97.4%和70.0%的成功率。

Comments Accepted to NeurIPS 2025, Project Page: https://jiutian-vl.github.io/CogVLA-page

详情
AI中文摘要

最近基于预训练视觉-语言模型(VLM)构建的视觉-语言-动作(VLA)模型需要大量后训练,导致计算开销高,限制了可扩展性和部署。我们提出CogVLA,一个认知对齐的视觉-语言-动作框架,利用指令驱动路由和稀疏化来提高效率和性能。CogVLA受人类多模态协调启发,引入了一个3阶段渐进式架构。1)基于编码器-FiLM的聚合路由(EFA-Routing)将指令信息注入视觉编码器,以选择性聚合和压缩双流视觉标记,形成指令感知的潜在表示。2)基于这种紧凑的视觉编码,基于LLM-FiLM的剪枝路由(LFP-Routing)通过剪枝与指令无关的视觉接地标记将动作意图引入语言模型,从而实现标记级稀疏性。3)为确保压缩的感知输入仍能支持准确且连贯的动作生成,我们引入了V-L-A耦合注意力(CAtten),它将因果视觉-语言注意力与双向动作并行解码相结合。在LIBERO基准和真实机器人任务上的大量实验表明,CogVLA实现了最先进的性能,成功率分别为97.4%和70.0%,同时与OpenVLA相比,训练成本降低了2.5倍,推理延迟降低了2.8倍。CogVLA已开源,可在https://github.com/JiuTian-VL/CogVLA获取。

英文摘要

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
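Both routing stages are described as FiLM based. The generic FiLM operation (feature-wise linear modulation, `gamma * x + beta` with `gamma` and `beta` predicted from a conditioning vector) can be sketched as follows. This shows only the general mechanism, not CogVLA's routing modules; the weights, shapes, and function names are made up for illustration.

```python
def linear(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(tokens, instruction, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift every visual token with instruction-derived parameters."""
    gamma = linear(instruction, w_gamma, b_gamma)
    beta = linear(instruction, w_beta, b_beta)
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]

# Hypothetical 2-D instruction embedding and two 2-D visual tokens.
instruction = [1.0, 0.0]
tokens = [[1.0, 1.0], [3.0, 2.0]]
modulated = film(tokens, instruction,
                 w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
                 w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.0, 0.0])
```

Here the instruction yields `gamma = [2, 1]` and `beta = [0, 1]`, so the same visual tokens would be modulated differently under a different instruction.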

2503.11906 2026-05-28 cs.CV cs.AI 版本更新

A Survey on SAR ship classification using Deep Learning

基于深度学习的SAR船舶分类综述

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Emanuele Salerno

发表机构 * PhD School in Computer Science, University of Pisa(计算机科学博士学院,比萨大学) Institute of Information Science and Technologies, National Research Council of Italy(意大利国家研究委员会信息科学与技术研究所) National Biodiversity Future Center - NBFC(国家生物多样性未来中心 - NBFC)

AI总结 本文综述了深度学习在SAR船舶分类中的应用,建立了基于模型、手工特征、SAR属性利用和微调影响的分类法,并讨论了未来研究方向。

Comments in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

详情
AI中文摘要

深度学习(DL)已成为合成孔径雷达(SAR)船舶分类的强大工具。本综述全面分析了该领域使用的各种DL技术。我们识别了关键趋势和挑战,强调了整合手工特征、利用公共数据集、数据增强、微调、可解释性技术以及促进跨学科合作以提高DL模型性能的重要性。本综述建立了首个基于DL模型、手工特征使用、SAR属性利用和微调影响的分类法,用于对相关研究进行分类。我们讨论了SAR船舶分类任务中使用的方法论以及不同技术的影响。最后,本综述探讨了未来研究的潜在方向,包括解决数据稀缺问题、探索新型DL架构、融入可解释性技术以及建立标准化性能指标。通过应对这些挑战并利用DL的进步,研究人员可以为开发更准确和高效的船舶分类系统做出贡献,最终增强海上监视及相关应用。

英文摘要

Deep learning (DL) has emerged as a powerful tool for Synthetic Aperture Radar (SAR) ship classification. This survey comprehensively analyzes the diverse DL techniques employed in this domain. We identify critical trends and challenges, highlighting the importance of integrating handcrafted features, utilizing public datasets, data augmentation, fine-tuning, explainability techniques, and fostering interdisciplinary collaborations to improve DL model performance. This survey establishes a first-of-its-kind taxonomy for categorizing relevant research based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning. We discuss the methodologies used in SAR ship classification tasks and the impact of different techniques. Finally, the survey explores potential avenues for future research, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. By addressing these challenges and leveraging advancements in DL, researchers can contribute to developing more accurate and efficient ship classification systems, ultimately enhancing maritime surveillance and related applications.

2503.09675 2026-05-28 cs.CV 版本更新

Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

利用局部转移一致性加速扩散采样

Shangwen Zhu, Han Zhang, Zhantao Yang, Qianyu Peng, Zhao Pu, Huangji Wang, Fan Cheng

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出一种无需训练的加速方法LTC-Accel,通过利用相邻步骤间转移算子的统计关系来估计当前转移算子,从而加速文本到图像和视频的扩散采样,兼容多种网络结构和现有加速技术。

详情
AI中文摘要

基于文本的扩散模型在从文本描述生成高质量图像和视频方面取得了重大突破。然而,去噪过程漫长的采样时间仍然是实际应用中的一个显著瓶颈。以往的方法要么忽略相邻步骤之间的统计关系,要么依赖于它们之间的注意力或特征相似性,这通常只适用于特定的网络结构。为了解决这个问题,我们在相邻步骤之间的转移算子中发现了一种新的统计关系,重点关注网络输出之间的关系。这种关系对网络结构没有任何要求。基于这一观察,我们提出了一种新颖的无训练加速方法,称为LTC-Accel,它利用识别出的关系基于相邻步骤估计当前转移算子。由于对网络结构没有特定假设,LTC-Accel几乎适用于所有基于扩散的方法,并且与几乎所有现有的加速技术正交,因此易于与它们结合。实验结果表明,LTC-Accel在文本到图像和文本到视频合成中显著加速了采样,同时保持了具有竞争力的样本质量。具体来说,LTC-Accel在Stable Diffusion v2中实现了1.67倍的加速,在视频生成模型中实现了1.55倍的加速。当与蒸馏模型结合时,LTC-Accel在视频生成中实现了惊人的10倍加速,允许实时生成超过16FPS。

英文摘要

Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often only works with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship of the outputs from the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel training-free acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Due to no specific assumptions regarding the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a speedup of 1.67-fold in Stable Diffusion v2 and a speedup of 1.55-fold in video generation models. When combined with distillation models, LTC-Accel achieves a remarkable 10-fold speedup in video generation, allowing real-time generation of more than 16FPS.

2508.18271 2026-05-28 cs.CV 版本更新

ObjFiller3D: Scaling 3D Object Inpainting to Dense Multi-View Consistency

ObjFiller3D:将3D物体修复扩展到密集多视图一致性

Haitang Feng, Xinkai Chen, Jie Liu, Jie Tang, Gangshan Wu, Beiqi Chen, Jianhuang Lai, Guangcong Wang

发表机构 * Nanjing University(南京大学) Great Bay University(大湾区大学) Harbin Institute of Technology(哈尔滨工业大学) Sun Yat-sen University(中山大学)

AI总结 提出ObjFiller-3D方法,通过联合优化密集采样视图的时序生成、语义感知补全和循环一致性3D编码,实现高质量且一致的多视图3D物体修复与编辑。

Comments Project page: https://objfiller3d.github.io/ Code: https://github.com/objfiller3d/ObjFiller-3D

详情
AI中文摘要

3D物体修复通常通过多视图2D图像补全实现,但独立修复的视图常出现跨视图不一致,导致重建的3D物体纹理模糊、几何不连续和视觉伪影。为克服这些限制,我们提出ObjFiller-3D,一种用于高质量且一致的3D物体补全与编辑的新方法。该方法不依赖稀疏视图编辑或逐视图2D修复,而是沿360°轨迹联合优化密集采样视图序列,实现跨视角的全局一致性。我们设计了一个包含三个互补组件的新框架:用于建模密集视图依赖的时间驱动生成编码器、用于物体级修复的语义感知补全编码器,以及通过闭环公式强制全局一致性的循环一致性3D编码器。该框架还支持参考引导的3D修复,允许对外观进行细粒度控制。在多个数据集上的大量实验表明,ObjFiller-3D显著优于先前方法,实现了更高的重建保真度(PSNR 26.6 vs. NeRFiller的15.9)和感知质量(LPIPS 0.19 vs. Instant3dit的0.25),同时将重建时间从超过40分钟缩短至不到10分钟。这些结果突显了我们的方法在真实世界3D编辑应用中的有效性和实际潜力。项目页面:https://objfiller3d.github.io/ 代码:https://github.com/objfiller3d/ObjFiller-3D

英文摘要

3D object inpainting is commonly achieved via multi-view 2D image completion, yet independently inpainted views often suffer from cross-view inconsistencies, leading to blurred textures, geometric discontinuities, and visual artifacts in the reconstructed 3D objects. To overcome these limitations, we propose ObjFiller-3D, a novel method designed for the completion and editing of high-quality and consistent 3D objects. Instead of relying on sparse-view editing or per-view 2D inpainting, our method jointly optimizes a sequence of densely sampled views along a 360° trajectory, enabling global coherence across viewpoints. We design a new framework with three complementary components: a Temporal-Driven Generative Encoder for modeling dense-view dependencies, a Semantic-Aware Completion Encoder for object-level inpainting, and a Cycle-Consistent 3D Encoder that enforces global coherence through a closed-loop formulation. Our framework also supports reference-guided 3D inpainting, allowing fine-grained control over appearance. Extensive experiments on diverse datasets demonstrate that ObjFiller-3D significantly outperforms prior methods, achieving higher reconstruction fidelity (PSNR 26.6 vs. 15.9 of NeRFiller) and perceptual quality (LPIPS 0.19 vs. 0.25 of Instant3dit), while reducing reconstruction time from over 40 minutes to under 10 minutes. These results highlight the effectiveness and practical potential of our approach for real-world 3D editing applications. Project page: https://objfiller3d.github.io/ Code: https://github.com/objfiller3d/ObjFiller-3D

2508.09449 2026-05-28 cs.CV 版本更新

RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration

RASR: 面向实际参考图像复原的检索增强超分辨率

Jiaqi Yan, Shuning Xu, Xiangyu Chen, Dell Zhang, Jiantao Zhou, Jie Tang, Gangshan Wu, Jie Liu

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(新型软件技术国家重点实验室,南京大学) Institute of Artificial Intelligence (TeleAI), China Telecom(人工智能研究院(TeleAI),中国电信) State Key Laboratory of Internet of Things for Smart City(物联网智慧城市国家重点实验室) Department of Computer Science and Information Science, University of Macau(澳门大学计算机科学与信息科学系)

AI总结 提出检索增强超分辨率(RASR)范式,通过自动检索参考图像实现实际场景下的参考超分辨率,并构建基准数据集RASR-Flickr30及基线模型RASRNet。

Comments Accepted at ISCAS 2026

详情
AI中文摘要

基于参考的超分辨率(RefSR)通过利用高质量参考图像增强纹理保真度和视觉真实性,改进了单图像超分辨率(SISR)。然而,现有RefSR方法的一个关键限制是它们依赖于手动筛选的目标-参考图像对,这严重限制了其在现实场景中的实用性。为了克服这一点,我们引入了检索增强超分辨率(RASR),这是一种新的、实用的RefSR范式,它仅根据低质量输入自动从参考数据库中检索语义相关的高分辨率图像。这使得在现实用例中实现可扩展且灵活的RefSR成为可能,例如增强在动物园或博物馆等环境中拍摄的手机照片,其中可以轻松收集或预先整理特定类别的参考数据(例如动物、艺术品)。为了促进这一方向的研究,我们构建了RASR-Flickr30,这是第一个专为RASR设计的基准数据集。与之前具有固定目标-参考对的数据集不同,RASR-Flickr30提供每类参考数据库以支持开放世界检索。我们进一步提出了RASRNet,这是一个强大的基线模型,它结合了语义参考检索器和基于扩散的RefSR生成器。它基于语义相似性检索相关参考,并采用增强语义条件的扩散生成器。在RASR-Flickr30上的实验表明,RASRNet持续优于SISR基线,实现了+0.38 dB PSNR和-0.0131 LPIPS的提升,同时生成更逼真的纹理。这些发现表明检索增强是弥合学术RefSR研究与现实世界适用性之间差距的一个有前景的方向。

英文摘要

Reference-based Super Resolution (RefSR) improves upon Single Image Super Resolution (SISR) by leveraging high-quality reference images to enhance texture fidelity and visual realism. However, a critical limitation of existing RefSR approaches is their reliance on manually curated target-reference image pairs, which severely constrains their practicality in real-world scenarios. To overcome this, we introduce Retrieval-Augmented Super Resolution (RASR), a new and practical RefSR paradigm that automatically retrieves semantically relevant high-resolution images from a reference database given only a low-quality input. This enables scalable and flexible RefSR in realistic use cases, such as enhancing mobile photos taken in environments like zoos or museums, where category-specific reference data (e.g., animals, artworks) can be readily collected or pre-curated. To facilitate research in this direction, we construct RASR-Flickr30, the first benchmark dataset designed for RASR. Unlike prior datasets with fixed target-reference pairs, RASR-Flickr30 provides per-category reference databases to support open-world retrieval. We further propose RASRNet, a strong baseline that combines a semantic reference retriever with a diffusion-based RefSR generator. It retrieves relevant references based on semantic similarity and employs a diffusion-based generator enhanced with semantic conditioning. Experiments on RASR-Flickr30 demonstrate that RASRNet consistently improves over SISR baselines, achieving +0.38 dB PSNR and -0.0131 LPIPS, while generating more realistic textures. These findings highlight retrieval augmentation as a promising direction to bridge the gap between academic RefSR research and real-world applicability.
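The retrieval step can be sketched as a nearest-neighbour search over embeddings. The abstract says only that references are retrieved by semantic similarity, so the use of cosine similarity, the toy 3-D embeddings, the database keys, and the function names are assumptions; a real system would obtain embeddings from an image encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, database, k=2):
    """Return the names of the k references most similar to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical per-category reference database of high-resolution images.
database = {
    "tiger_01": [0.9, 0.1, 0.0],
    "vase_07": [0.0, 0.2, 0.9],
    "tiger_02": [0.8, 0.3, 0.1],
}
low_quality_query = [1.0, 0.2, 0.0]
references = retrieve(low_quality_query, database, k=2)
```

The retrieved references would then be passed to the generator as conditioning, replacing the manually curated target-reference pairs that earlier RefSR methods depend on.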

2508.06420 2026-05-28 cs.CV 版本更新

Feature-Space Oversampling for Addressing Class Imbalance in SAR Ship Classification

特征空间过采样解决SAR舰船分类中的类别不平衡问题

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

发表机构 * ISTI-CNR & University of Pisa, Pisa, Italy Cardiff University, Cardiff, UK

AI总结 针对SAR舰船分类中长尾数据集导致的类别不平衡问题,提出两种基于Major-to-minor (M2m)方法的特征空间过采样算法M2m$_f$和M2m$_u$,以ViT、VGG16和ResNet50作为特征提取器,在FuSARShip和OpenSARShip数据集上平均F1分数分别提升8.82%和4.44%。

Comments Accepted and presented at IGARSS

详情
Journal ref
IGARSS 2025 - 2025 IEEE International Geoscience and Remote Sensing Symposium, Brisbane, Australia, 2025, pp. 2010-2014
AI中文摘要

SAR舰船分类面临长尾数据集的挑战,这使得对代表性不足的类别的分类变得复杂。过采样方法已被证明在解决光学数据中的类别不平衡问题方面有效。在本文中,我们评估了在特征空间中进行过采样对SAR舰船分类的影响。我们提出了两种受Major-to-minor (M2m)方法启发的新算法M2m$_f$和M2m$_u$。这些算法在两个公开数据集OpenSARShip(6类)和FuSARShip(9类)上进行了测试,使用三种最先进的模型作为特征提取器:ViT、VGG16和ResNet50。此外,我们还分析了过采样方法对不同类别大小的影响。结果表明,我们的新方法优于原始的M2m和基线方法,在FuSARShip上平均F1分数提高了8.82%,在OpenSARShip上提高了4.44%。

英文摘要

SAR ship classification faces the challenge of long-tailed datasets, which complicates the classification of underrepresented classes. Oversampling methods have proven effective in addressing class imbalance in optical data. In this paper, we evaluate the effect of oversampling in the feature space for SAR ship classification. We propose two novel algorithms inspired by the Major-to-minor (M2m) method: M2m$_f$ and M2m$_u$. The algorithms are tested on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models as feature extractors: ViT, VGG16, and ResNet50. Additionally, we analyze the impact of oversampling methods on different class sizes. The results demonstrate the effectiveness of our novel methods over the original M2m and baselines, with an average F1-score increase of 8.82% for FuSARShip and 4.44% for OpenSARShip.
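For readers unfamiliar with oversampling in feature space, the sketch below balances classes by synthesising new minority feature vectors. It is not M2m, M2m$_f$, or M2m$_u$, whose details are not given in the abstract; it substitutes a simple SMOTE-style interpolation between two minority features, and the toy features and function name are hypothetical.

```python
import random

def oversample_features(features, labels, seed=0):
    """Balance classes by interpolating between same-class feature vectors."""
    rng = random.Random(seed)
    by_class = {}
    for feat, label in zip(features, labels):
        by_class.setdefault(label, []).append(feat)
    target = max(len(feats) for feats in by_class.values())
    out_features, out_labels = list(features), list(labels)
    for label, feats in by_class.items():
        for _ in range(target - len(feats)):
            a, b = rng.choice(feats), rng.choice(feats)
            t = rng.random()
            # New point on the segment between two real minority features.
            out_features.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            out_labels.append(label)
    return out_features, out_labels

# Hypothetical extracted features: class 0 has 4 samples, class 1 only 2.
features = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0],
            [10.0, 10.0], [12.0, 10.0]]
labels = [0, 0, 0, 0, 1, 1]
balanced_features, balanced_labels = oversample_features(features, labels)
```

Working on extracted features rather than raw images means the synthetic samples are independent of how the feature extractor (ViT, VGG16, or ResNet50 in the paper) was built.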

2507.06999 2026-05-28 cs.CV cs.CL cs.LG 版本更新

Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

有意学习,直觉行动:解锁多模态大语言模型的测试时推理能力

Yahan Yu, Yuyang Dong, Masafumi Oyamada

发表机构 * Kyoto University(京都大学) Initial S NEC Corporation, Japan(日本NEC公司)

AI总结 提出D2I框架,通过训练时使用基于规则的格式奖励进行有意推理以增强模态对齐,推理时移除显式策略转为直觉推理,从而提升多模态大语言模型的推理能力,无需额外标注或复杂奖励。

Comments 22 pages, 24 figures

详情
AI中文摘要

推理对于大型语言模型(LLMs)至关重要,尤其是在数学问题求解等复杂任务中。然而,多模态推理在模态对齐和训练可扩展性方面仍面临挑战,因为许多现有方法依赖于额外的标注或复杂的基于规则的奖励。为了解决这些问题,我们提出了“有意到直觉”推理框架(D2I),该框架无需额外标注或复杂奖励即可提升多模态大语言模型(MLLMs)的理解和推理能力。在训练过程中,D2I使用仅由基于规则的格式奖励监督的有意推理策略来增强模态对齐。在推理过程中,它通过移除这些显式策略转向直觉推理,使模型能够在其响应中隐式应用所获得的能力。D2I在域内和域外基准测试中均优于基线,突显了格式奖励在培养可迁移多模态推理技能方面的有效性,并表明将训练时的推理深度与测试时的响应灵活性解耦是有益的。

英文摘要

Reasoning is essential for large language models (LLMs), especially in complex tasks such as mathematical problem solving. However, multimodal reasoning still faces challenges in modality alignment and training scalability, as many existing methods rely on additional annotations or complex rule-based rewards. To address these issues, we propose the Deliberate-to-Intuitive reasoning framework (D2I), which improves the understanding and reasoning abilities of multimodal LLMs (MLLMs) without extra annotations or complex rewards. During training, D2I uses deliberate reasoning strategies supervised only by rule-based format rewards to enhance modality alignment. During inference, it shifts to intuitive reasoning by removing these explicit strategies, allowing the model to implicitly apply the acquired abilities in its responses. D2I outperforms baselines on both in-domain and out-of-domain benchmarks, highlighting the effectiveness of format rewards in fostering transferable multimodal reasoning skills and suggesting the benefit of decoupling training-time reasoning depth from test-time response flexibility.
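A rule-based format reward of the kind described above can be sketched in a few lines. The abstract does not specify D2I's actual output format, so the `<think>`/`<answer>` tags and the function name `format_reward` are assumptions made purely for illustration.

```python
import re

# Hypothetical response layout: one reasoning block followed by one answer block.
_PATTERN = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the expected layout, else 0.0.

    The reward checks structure only; it never looks at whether the
    answer is correct, so it needs no extra annotations.
    """
    # Each tag must appear exactly once.
    if any(response.count(tag) != 1 for tag in _TAGS):
        return 0.0
    return 1.0 if _PATTERN.fullmatch(response) else 0.0


if __name__ == "__main__":
    print(format_reward("<think>2 + 2 = 4</think><answer>4</answer>"))
    print(format_reward("The answer is 4."))
```

A reward like this is cheap to compute for every sampled response, which is the practical appeal of format-only supervision during training.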

2505.21771 2026-05-28 cs.CV cs.AI 版本更新

MMTABREAL: Real-World Benchmark for Multimodal Table Understanding

MMTABREAL:多模态表格理解的真实世界基准

Prasham Titiya, Jainil Trivedi, Chitta Baral, Vivek Gupta

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 针对多模态表格理解,构建了包含500个真实表格和4021个问答对的人工筛选基准MMTABREAL,评估发现现有模型在视觉定位、空间对齐和多步推理上存在显著差距,相对于现有基准性能下降20-40%。

详情
AI中文摘要

多模态表格,即与图表、地图、图标和颜色编码交织的表格布局,在实际应用中无处不在,但对多模态大语言模型(MLLMs)来说仍然困难。尽管在文本和图像理解方面取得了进展,但对以表格为中心的多模态推理的系统评估仍然有限。我们引入了MMTABREAL,一个多模态表格基准,包含人工筛选的500个真实世界表格及其对应的4021个问答对。MMTABREAL涵盖四种问题类型、五种推理类别和八种结构原型。对最先进模型的评估揭示了显著差距,特别是在视觉定位、空间对齐和多步推理方面,相对于现有基准性能下降了20-40%。这些结果强调了需要更紧密融合视觉与表格结构并支持显式数值/逻辑运算的架构。MMTABREAL仅用于评估,提供了一个严谨、可复现的测试平台,反映了真实世界多模态表格的语言、结构和推理复杂性。

英文摘要

Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet remain difficult for Multimodal Large Language Models (MLLMs). Despite advances in text and image understanding, systematic evaluation of table-centric multimodal reasoning is limited. We introduce MMTABREAL, a MultiModal Table Benchmark: a human-curated suite of 500 real-world tables paired with 4,021 question-answer pairs. MMTABREAL spans four question types, five reasoning categories, and eight structural archetypes. Evaluations of state-of-the-art models reveal substantial gaps, especially in visual grounding, spatial alignment, and multi-step inference, with 20-40% performance drops relative to existing benchmarks. These results highlight the need for architectures that more tightly fuse vision with tabular structure and support explicit numeric/logical operations. MMTABREAL is released for evaluation only, providing a rigorous, reproducible testbed that reflects the linguistic, structural, and reasoning complexity of real-world multimodal tables.

2502.05242 2026-05-28 cs.CL cs.AI cs.CV cs.LG 版本更新

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

超越外部监控:增强大型语言模型的透明度以便于监控

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) ICISEE, Shanghai Jiao Tong University(上海交通大学ICISEE) School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University(上海交通大学数学科学学院) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出TELLME方法,通过改进大型语言模型的内部表征透明度,帮助监控者识别不当和敏感行为,并在去毒化任务中验证其有效性。

Comments 28 pages, 8 figures, 15 tables

详情
AI中文摘要

大型语言模型(LLMs)的能力日益增强,但其思维和决策过程的机制仍不清楚。思维链(CoTs)常被用来外化LLMs的思维,但这一策略未能准确反映LLMs的思维过程。基于LLMs隐藏表征的技术提供了内部视角,以改善对其潜在思维的可监控性。然而,以往的方法仅尝试开发外部模块,而非使LLMs本身更易于监控。本文提出了一种新方法TELLME,提高了LLMs的透明度,并帮助监控者识别不合适和敏感的行为。此外,我们在去毒化任务上展示了TELLME的有效性,LLMs在多模态测试集、不同架构和不同参数规模上均取得了一致的改进。我们进一步从最优传输理论和实证角度分析了TELLME对LLMs泛化能力的提升。

英文摘要

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.

2503.02857 2026-05-28 cs.CV cs.AI cs.CY 版本更新

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Deepfake-Eval-2024:2024年传播的深度伪造多模态野外基准

Nuria Alina Chandra, Hannah Lee, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Changyeon Lee, Jongwook Choi, Sejin Paik, Aerin Kim, Oren Etzioni

发表机构 * University of Washington(华盛顿大学) Allen Institute for Artificial Intelligence(艾伦人工智能研究所) University of Maryland(马里兰大学) Chung-Ang University(中央大学) Georgetown University(乔治城大学) Miraflow AI

AI总结 针对现有学术基准过时且不反映真实深度伪造的问题,提出包含2024年社交媒体和用户提交的多模态深度伪造基准Deepfake-Eval-2024,评估发现开源模型性能大幅下降,而商业模型和微调模型表现更优但未达到专家水平。

详情
AI中文摘要

在生成式人工智能日益逼真的时代,稳健的深度伪造检测对于减少欺诈和虚假信息至关重要。尽管许多深度伪造检测器在学术数据集上报告了高准确率,但我们表明这些学术基准已经过时,不能代表现实世界的深度伪造。我们引入了Deepfake-Eval-2024,这是一个新的深度伪造检测基准,由2024年从社交媒体和深度伪造检测平台用户收集的野外深度伪造组成。Deepfake-Eval-2024包含45小时的视频、56.5小时的音频和1,975张图像,涵盖了最新的操纵技术。该基准包含来自52种不同语言、88个不同网站的多样化媒体内容。我们发现,在Deepfake-Eval-2024上评估时,开源最先进的深度伪造检测模型的性能急剧下降,与之前的基准相比,视频模型的AUC下降了50%,音频模型下降了48%,图像模型下降了45%。我们还评估了商业深度伪造检测模型和在Deepfake-Eval-2024上微调的模型,发现它们比现成的开源模型性能更优,但尚未达到深度伪造取证分析师的准确率。数据集可在https://github.com/nuriachandra/Deepfake-Eval-2024获取。

英文摘要

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.

2504.04924 2026-05-28 cs.CV eess.IV 版本更新

Inter-event Interval Microscopy for Event Cameras

事件相机的事件间间隔显微术

Changqing Su, Yanqin Chen, Zihan Lin, Zhen Cheng, You Zhou, Bo Xiong, Zhaofei Yu, Tiejun Huang

发表机构 * National Key Laboratory for Multimedia Information Processing(国家多媒体信息处理重点实验室) Westlake Laboratory of Life Sciences and Biomedicine(西湖生命科学与生物医学实验室) School of Automation(自动化学院) Department of Automation(自动化系) Nanjing University Medical School(南京大学医学院)

AI总结 提出基于事件相机的事件间间隔显微术(IEIM),通过量化连续事件的时间间隔实现静态和动态场景的强度重建,在荧光显微镜中实现高时空分辨率和动态范围。

详情
AI中文摘要

事件相机是一种创新的仿生传感器,与传统相机不同,它通过感知强度变化而非直接感知强度,并将这些变化记录为连续的“事件”流。从这些稀疏事件中重建强度一直是一个具有挑战性的问题。以往的方法主要集中在将运动诱发的事件转换为视频,或通过在事件相机采集端集成调制器件来实现静态场景的强度成像。在本文中,我们首次利用静态事件相机在荧光显微镜中实现了静态和动态场景的事件到强度转换。与主要依赖事件积分的传统方法不同,所提出的事件间间隔显微术(IEIM)量化了每个像素处连续事件之间的时间间隔。在事件相机中,由于阈值固定,时间间隔可以精确表示强度。在硬件层面,所提出的IEIM在配备事件相机的显微镜中集成了脉冲光调制器件,称为基于脉冲调制的事件驱动荧光显微镜。此外,我们收集了包含高动态范围和高速度场景的IEIMat数据集。在IEIMat数据集上的实验结果表明,所提出的IEIM在空间和时间分辨率、动态范围方面优于其他方法,且带宽更低。代码和IEIMat数据集将公开提供。

英文摘要

Event cameras, innovative bio-inspired sensors, differ from traditional cameras in that they sense changes in intensity rather than intensity itself, and record these changes as a continuous stream of "events". Reconstructing intensity from these sparse events has long been a challenging problem. Previous approaches mainly focused on transforming motion-induced events into videos or achieving intensity imaging for static scenes by integrating modulation devices at the event camera acquisition end. In this paper, for the first time, we achieve event-to-intensity conversion using a static event camera for both static and dynamic scenes in fluorescence microscopy. Unlike conventional methods that primarily rely on event integration, the proposed Inter-event Interval Microscopy (IEIM) quantifies the time interval between consecutive events at each pixel. With a fixed threshold in the event camera, the time interval can precisely represent the intensity. At the hardware level, the proposed IEIM integrates a pulse light modulation device within a microscope equipped with an event camera, termed Pulse Modulation-based Event-driven Fluorescence Microscopy. Additionally, we have collected the IEIMat dataset under various scenes, including high-dynamic-range and high-speed scenarios. Experimental results on the IEIMat dataset demonstrate that the proposed IEIM achieves superior spatial and temporal resolution, as well as a higher dynamic range, with lower bandwidth compared to other methods. The code and the IEIMat dataset will be made publicly available.

2504.20736 2026-05-28 cs.RO cs.CV 版本更新

A Survey on Event-based Optical Marker Systems

基于事件的光学标记系统综述

Nafiseh Jabbari Tofighi, Maxime Robic, Fabio Morbidi, Pascal Vasseur

发表机构 * MIS laboratory, University of Picardie Jules Verne(皮卡第儒勒·凡尔纳大学MIS实验室) DART Lab, Politecnico di Milano(米兰理工学院DART实验室)

AI总结 本文综述了基于事件的光学标记系统(EBOMS),分析其异步操作原理和鲁棒性,并介绍了在目标检测、姿态估计和光通信等领域的应用。

Comments 11 pages, 6 figures, 2 tables

详情
AI中文摘要

事件相机的出现,以其低延迟、高动态范围和低功耗,标志着机器感知和机器人视觉的转折点。特别是,这些神经形态传感器与广泛可用的被动或主动光学标记(例如AprilTags、闪烁LED阵列)的结合,最近开辟了一个新的机遇领域。本综述论文对基于事件的光学标记系统(EBOMS)进行了全面回顾。我们分析了这些系统所基于的基本原理和技术,特别关注其异步操作和对挑战性光照条件的鲁棒性。我们还描述了EBOMS最相关的应用,包括目标检测与跟踪、姿态估计和光通信。文章最后讨论了这一快速发展的多学科领域可能的未来研究方向。

英文摘要

The advent of event-based cameras, with their low latency, high dynamic range, and reduced power consumption, marked a turning point in machine perception and robotic vision. In particular, the combination of these neuromorphic sensors with widely available passive or active optical markers (e.g., AprilTags, arrays of blinking LEDs) has recently opened up a new field of opportunities. This survey paper provides a comprehensive review of Event-Based Optical Marker Systems (EBOMS). We analyze the underlying principles and technologies on which these systems are based, with a special focus on their asynchronous operation and robustness against challenging lighting conditions. We also describe the most relevant applications of EBOMS, including object detection and tracking, pose estimation, and optical communication. The article concludes with a discussion of possible future research directions in this rapidly emerging and multidisciplinary area.

2504.04540 2026-05-28 cs.CV cs.AI 版本更新

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

点、视觉与文本:点云能否提升大语言模型的空间推理能力?一项偏差控制研究

Weichen Zhang, Ruiying Peng, Xin Zeng, Jianjie Fang, Ziyou Wang, Kaiyuan Li, Heng Dong, Wei Li, Chen Gao, Xin Wang, Xinlei Chen, Yong Li

发表机构 * Tsinghua University(清华大学) ByteDance Seed(字节跳动Seed)

AI总结 本文通过引入包含文本、视觉和点云模态的3D空间推理基准ScanReQA,评估不同模态下大语言模型的空间推理能力,发现点云和视觉模态的模型表现优于纯文本模型,并揭示了3D大语言模型中的注意力下沉现象。

详情
AI中文摘要

利用点云中空间信息进行3D空间推理的3D大语言模型(LLMs)引起了广泛关注。尽管取得了一些有希望的结果,但点云相对于其他模态的优势仍不明确。此外,现有的3D基准不足以公平评估多模态大语言模型理解空间概念的能力。为了解决这些挑战,我们引入了ScanReQA,一个涵盖文本、视觉和点云模态的3D空间推理基准。然后,我们评估了文本、2D和3D大语言模型在该基准上的性能,以比较不同模态在理解空间概念方面的有效性。此外,我们分析了使用点云的3D大语言模型背后的推理机制。我们的发现表明:1)二元空间推理对当前的3D大语言模型仍然具有挑战性;2)基于点云和视觉模态的多模态大语言模型展现出比大语言模型更强的空间推理能力;3)3D大语言模型表现出类似于2D大语言模型中的注意力下沉现象,这损害了空间推理。我们认为这些结论有助于3D大语言模型的下一步发展,并为其他模态的基础模型提供见解。我们在项目页面发布了数据集和代码:https://github.com/EmbodiedCity/ScanReQA.code。

英文摘要

3D Large Language Models (LLMs) that leverage spatial information in point clouds for 3D spatial reasoning have attracted great attention. Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing text, vision, and point cloud modalities. We then evaluate the performance of text, 2D, and 3D LLMs on the benchmark to compare the effectiveness of different modalities in understanding spatial concepts. Furthermore, we analyze the reasoning mechanisms behind 3D LLMs using point clouds. Our findings reveal that: 1) binary spatial reasoning remains challenging for current 3D LLMs, 2) MLLMs based on point cloud and visual modalities demonstrate stronger spatial reasoning capabilities than LLMs, and 3) 3D LLMs exhibit an attention sink phenomenon similar to that in 2D LLMs, which impairs spatial reasoning. We believe these conclusions can inform the next steps for 3D LLMs and also offer insights for foundation models in other modalities. We release the datasets and code on the project page: https://github.com/EmbodiedCity/ScanReQA.code.

2501.04144 2026-05-28 cs.CV cs.GR 版本更新

Chirpy3D: Part-Aware Multi-View Diffusion for Creative Fine-Grained Object Generation

Chirpy3D: 面向创意细粒度物体生成的部件感知多视角扩散

Kam Woh Ng, Jing Yang, Jia Wei Sii, Chee Seng Chan, Jiankang Deng, Yi-Zhe Song, Tao Xiang, Xiatian Zhu

发表机构 * University of Surrey(萨里大学) University of Cambridge(剑桥大学) Universiti Malaya(马来亚大学) Imperial College London(伦敦帝国理工学院)

AI总结 提出Chirpy3D,一种部件感知多视角扩散框架,从无姿态2D图像中学习层次化部件潜在空间,实现部件级交换、插值和零样本组合,无需3D数据或手动标注。

Comments 20 pages. Code at https://github.com/kamwoh/chirpy3d

详情
AI中文摘要

理解并生成物体的细粒度结构——例如具有物种特异性喙、翅膀和尾巴的鸟类——是计算机视觉中长期存在的挑战。我们提出Chirpy3D,一种部件感知多视角扩散框架,它从无姿态的2D图像中学习层次化部件潜在空间,仅使用现成的2D部件分割掩码作为空间指导——无需任何3D数据、相机姿态或手动部件标注。该潜在空间支持直观的部件级交换、插值和零样本组合。自监督特征一致性损失进一步促进跨视角的结构对齐,即使在混合或未见过的部件组合下也能实现连贯生成。我们的核心贡献是可控制的部件感知潜在空间和多视角扩散模型。通过任何可微分渲染器(如NeRF)支持下游3D生成,但这与主框架正交,使Chirpy3D成为在缺乏结构化3D数据时进行创意物体生成的灵活基础。代码已发布在https://github.com/kamwoh/chirpy3d。

英文摘要

Understanding and generating the fine-grained structure of objects -- such as birds with species-specific beaks, wings, and tails -- is a long-standing challenge in computer vision. We propose Chirpy3D, a part-aware multi-view diffusion framework that learns a hierarchical part latent space from unposed 2D images, using only off-the-shelf 2D part segmentation masks as spatial guidance -- without requiring any 3D data, camera poses, or manual part annotations. This latent space enables intuitive part-level swapping, interpolation, and zero-shot composition. A self-supervised feature consistency loss further encourages structural alignment across views, allowing coherent generation even with hybrid or unseen part combinations. Our core contribution is the controllable part-aware latent space and multi-view diffusion model. Downstream 3D generation is supported via any differentiable renderer such as NeRF but is orthogonal to the main framework, making Chirpy3D a flexible foundation for creative object generation in the absence of structured 3D data. Code is released at https://github.com/kamwoh/chirpy3d.

2503.22655 2026-05-28 cs.AI cs.CV cs.MM 版本更新

Text-Only Data Synthesis for Vision Language Model Training

仅文本数据合成用于视觉语言模型训练

Xiaomin Yu, Wenjie Zhang, Ziyue Qiao, Chengwei Qin, Hui Xiong

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Great Bay University(大湾区大学)

AI总结 提出一个跨集成的三阶段多模态数据合成框架,仅从文本生成高质量多模态训练数据,用于视觉语言模型的预训练和指令微调。

详情
AI中文摘要

训练视觉语言模型(VLM)通常需要大规模、高质量的图像-文本对,但收集或合成此类数据成本高昂。相比之下,文本数据丰富且廉价,这引发了一个问题:能否仅从文本中合成高质量的多模态训练数据?为解决这一问题,我们提出了一个跨集成的三阶段多模态数据合成框架,生成了两个数据集:Unicorn-1.2M和Unicorn-471K-Instruction。在第一阶段:多样化字幕数据合成,我们通过使用大语言模型(LLM)扩展稀疏字幕种子,构建了120万语义多样的高质量字幕。在第二阶段:指令微调数据生成,我们进一步将47.1万个字幕处理为多轮指令微调任务,以支持复杂推理。最后,在第三阶段:模态表示迁移,这些文本字幕表示被转换为视觉表示,从而产生多样化的合成图像表示。这一三阶段过程使我们能够构建用于预训练的Unicorn-1.2M和用于指令微调的Unicorn-471K-Instruction,而无需依赖真实图像。通过消除对真实图像的依赖,同时保持数据质量和多样性,我们的框架为VLM训练提供了一种成本效益高且可扩展的解决方案。

英文摘要

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training.

2503.04863 2026-05-28 cs.CV cs.AI 版本更新

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Manboformer: 通过时空注意力机制学习高斯表示

Ziyue Zhao, Qining Qi, Jianfa Ma

AI总结 针对自动驾驶3D语义占用预测中高斯表示性能不足的问题,提出利用时空自注意力机制优化GaussianFormer,以提升模型性能。

Comments After careful self-check, we found several unnoticed deficiencies and incomplete discussions in this manuscript. To ensure the rigor and accuracy of academic results, we decide to withdraw this preprint. A refined, complete, and rigorous version will be submitted soon

详情
AI中文摘要

与基于体素的网格预测相比,在自动驾驶的3D语义占用预测领域,GaussianFormer提出使用3D高斯来描述场景,基于对象的稀疏3D语义高斯是另一种内存需求更低的方案。每个3D高斯函数表示一个灵活的兴趣区域及其语义特征,通过注意力机制迭代优化。实验中发现,该方法所需的高斯函数数量大于原始密集网格网络的查询分辨率,导致性能受损。因此,我们考虑通过利用未使用的时序信息来优化GaussianFormer。我们从先前的网格占用网络中学习时空自注意力机制,并将其改进应用于GaussianFormer。实验使用NuScenes数据集进行,目前正在进行中。

英文摘要

In 3D semantic occupancy prediction for autonomous driving, GaussianFormer proposed describing scenes with sparse, object-centric 3D semantic Gaussians, an alternative to voxel-based grid prediction with lower memory requirements. Each 3D Gaussian represents a flexible region of interest and its semantic features, which are iteratively refined by the attention mechanism. In experiments, we find that the number of Gaussians required by this method exceeds the query resolution of the original dense grid network, which impairs performance. Therefore, we consider optimizing GaussianFormer by exploiting previously unused temporal information. We take the spatial-temporal self-attention mechanism from earlier grid-based occupancy networks and adapt it to GaussianFormer. The experiments use the NuScenes dataset and are currently underway.

2405.09586 2026-05-28 eess.IV cs.AI cs.CV 版本更新

Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation

事实序列化增强:胸部X光报告生成的关键创新

Kang Liu, Zhuoqi Ma, Mengmeng Liu, Zhicheng Jiao, Xiaolu Kang, Qiguang Miao, Kun Xie

发表机构 * School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Xi’an Key Laboratory of Big Data and Intelligent Vision(西安大数据与智能视觉重点实验室) Key Laboratory of Collaborative Intelligence Systems, Ministry of Education(教育部协同智能系统重点实验室) School of Artificial Intelligence, Xidian University(西安电子科技大学人工智能学院) Department of Diagnostic Imaging, Brown University(布朗大学诊断影像科)

AI总结 提出FSE两阶段事实序列化增强方法,通过事实引导对比学习和证据驱动报告生成,提升胸部X光报告生成的临床准确性和自然语言质量。

Comments Code is available at https://github.com/mk-runner/FSE

详情
AI中文摘要

放射学报告包含呈现式词汇(确保清晰和组织)和事实性词汇(基于可观察发现提供准确客观描述)。手动编写这些报告耗时费力,而自动报告生成提供了一种有前景的替代方案。该过程中的关键步骤是将X光片与其对应报告对齐。然而,现有方法通常依赖完整报告进行对齐,忽略了呈现式词汇的影响。为解决此问题,我们提出FSE,一种两阶段事实序列化增强方法。在第一阶段,我们引入事实引导的对比学习用于视觉表示,通过最大化X光片与对应事实描述之间的语义对应关系。在第二阶段,我们提出证据驱动的报告生成,通过整合来自类似历史病例的结构化事实序列化见解,增强诊断准确性。在MIMIC-CXR和IU X-ray数据集上的实验(涵盖特定和一般场景)表明,FSE在自然语言生成和临床效能指标上均优于最先进方法。消融研究进一步强调了第一阶段和第二阶段中事实序列化的积极作用。代码可在https://github.com/mk-runner/FSE获取。

英文摘要

A radiology report comprises presentation-style vocabulary, which ensures clarity and organization, and factual vocabulary, which provides accurate and objective descriptions based on observable findings. While manually writing these reports is time-consuming and labor-intensive, automatic report generation offers a promising alternative. A critical step in this process is to align radiographs with their corresponding reports. However, existing methods often rely on complete reports for alignment, overlooking the impact of presentation-style vocabulary. To address this issue, we propose FSE, a two-stage Factual Serialization Enhancement method. In Stage 1, we introduce factuality-guided contrastive learning for visual representation by maximizing the semantic correspondence between radiographs and corresponding factual descriptions. In Stage 2, we present evidence-driven report generation that enhances diagnostic accuracy by integrating insights from similar historical cases structured as factual serialization. Experiments on MIMIC-CXR and IU X-ray datasets across specific and general scenarios demonstrate that FSE outperforms state-of-the-art approaches in both natural language generation and clinical efficacy metrics. Ablation studies further emphasize the positive effects of factual serialization in Stage 1 and Stage 2. The code is available at https://github.com/mk-runner/FSE.
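The Stage 1 idea, pulling each radiograph embedding toward its own factual description and away from the others in the batch, can be illustrated with a generic InfoNCE loss. The abstract does not state which contrastive objective FSE uses, so this is a sketch under that assumption; `info_nce`, the temperature value, and the toy embeddings are all hypothetical, and only the image-to-text direction is shown.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE loss; pair i of each list is the positive pair.

    Assumption: a generic contrastive objective, not necessarily the
    loss used in the paper.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]

    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity to every text in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching text as the target.
        total += log_z - logits[i]
    return total / len(imgs)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned = info_nce(images, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = info_nce(images, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

The loss is low when each image is most similar to its own description and high when the pairing is scrambled, which is the behavior a contrastive alignment stage relies on.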