arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21484 2026-05-21 cs.CV version update

One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

Chaoyang Wang, Yunhai Tong

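Affiliations: Peking University

AI summary: This paper proposes Fixed-Point Distillation (FPD), an end-to-end framework that builds local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. It lifts discrete tokens into continuous features and applies a multi-bandwidth drift loss that iteratively accumulates these corrections. A straight-through estimator routes continuous gradients back to the student logits, and an optional unconditional adversarial objective can be added to enhance perceptual realism. Evaluations on class- and text-conditional generation validate the framework: FPD achieves competitive visual fidelity and structural alignment in a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.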

Abstract

Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.
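
The straight-through estimator mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single token position, uses argmax instead of sampling so the example is deterministic, and the function names `ste_forward` and `ste_backward` are hypothetical. The forward pass emits a hard one-hot token, while the backward pass treats that token as if it were the softmax probabilities.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ste_forward(logits):
    """Forward pass: emit a hard one-hot token and keep the probabilities."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)  # assumption: argmax, not sampling
    one_hot = [1.0 if i == index else 0.0 for i in range(len(probs))]
    return one_hot, probs

def ste_backward(grad_wrt_token, probs):
    """Backward pass: pretend the output was `probs` and apply the softmax Jacobian."""
    weighted = sum(g * p for g, p in zip(grad_wrt_token, probs))
    return [p * (g - weighted) for g, p in zip(grad_wrt_token, probs)]

logits = [0.2, 1.5, -0.3]
token, probs = ste_forward(logits)                   # downstream modules see the hard token
grad_logits = ste_backward([0.0, -1.0, 0.5], probs)  # the student receives a dense gradient
```

The downstream consumer only ever sees a valid codebook index, yet every logit receives a gradient, which is the property the abstract relies on.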

2605.21479 2026-05-21 cs.CV cs.AI version update

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

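Affiliations: IBM Research San Jose

AI summary: This paper presents WikiVQABench, a knowledge-grounded visual question answering benchmark that combines Wikipedia images, article captions, and structured knowledge from Wikidata. Large language models generate candidate multiple-choice questions, which human annotators then review for factual correctness and visual-text consistency. The benchmark is used to evaluate a range of vision-language models on knowledge-intensive reasoning.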

Abstract

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

2605.21478 2026-05-21 cs.CV cs.GR version update

Latent Dynamics for Full Body Avatar Animation

Shichong Peng, Chengxiang Yin, Fei Jiang, Zhongshi Jiang, Lingchen Yang, Qingyang Tan, Amin Jourabloo, Jason Saragih, Ke Li, Christian Häne

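Affiliations: Simon Fraser University; Codec Avatars Lab, Meta

AI summary: This paper proposes a latent-dynamics approach to full-body avatar animation. By adding a transformer-based decoder and a dynamics residual latent, it models dynamics more accurately and improves animation quality.

Comments: Supplementary video: https://youtu.be/xjnr3YM0yIE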

Abstract

Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.
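
The decomposition of each latent update into driving, restoring, and dissipative forces can be pictured as a damped spring acting on the residual latent. The sketch below is only an analogy under stated assumptions: the paper's dynamics model is learned, whereas here `stiffness`, `damping`, `dt`, and the function names are hypothetical constants and helpers chosen for illustration.

```python
def step_latent(z, v, drive, stiffness=4.0, damping=1.0, dt=0.05):
    """One semi-implicit Euler step for a residual latent z with velocity v."""
    new_z, new_v = [], []
    for z_i, v_i, d_i in zip(z, v, drive):
        restoring = -stiffness * z_i    # pulls the latent back toward its rest state
        dissipative = -damping * v_i    # removes energy so the motion settles
        v_next = v_i + dt * (d_i + restoring + dissipative)
        new_v.append(v_next)
        new_z.append(z_i + dt * v_next)
    return new_z, new_v

def rollout(z0, v0, drives, **params):
    """Evolve the latent over a sequence of driving forces and return the trajectory."""
    z, v = list(z0), list(v0)
    trajectory = [z]
    for drive in drives:
        z, v = step_latent(z, v, drive, **params)
        trajectory.append(z)
    return trajectory

no_drive = [[0.0]] * 400
calm = rollout([1.0], [0.0], no_drive)    # starts at rest
kicked = rollout([1.0], [2.0], no_drive)  # same latent, different initial velocity
```

The two rollouts share a starting latent but diverge because of their initial velocities, which mirrors the abstract's point that different initial conditions give different yet plausible trajectories. Raising `stiffness` plays the role of the stiffness control the abstract mentions.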

2605.21466 2026-05-21 cs.CV version update

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Guanlong Jiao, Chenyangguang Zhang, Jia Jun Cheng Xian, Zewei Zhang, Renjie Liao

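Affiliations: The University of British Columbia; ETH Zürich; McMaster University; Vector Institute; Canada CIFAR AI Chair

AI summary: This paper proposes StreamGVE, a video editing method built on a noise-to-data perspective. It introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting, enabling efficient video editing that outperforms existing methods in few-step settings.

Comments: Project Page: https://dsl-lab.github.io/StreamGVE/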

Abstract

Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality, satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

2605.21454 2026-05-21 cs.CV q-bio.QM q-bio.TO version update

ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction

Amaya Gallagher-Syed, Costantino Pitzalis, Myles J. Lewis, Michael R. Barnes, Gregory Slabaugh

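Affiliations: Queen Mary University of London; Imperial College London

AI summary: This paper proposes ProtoPathway, a framework that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations, improving the biological interpretability and computational efficiency of cancer survival prediction.

Comments: Currently under peer review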

Abstract

We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, $K$ learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene--pathway graph. Cross-modal attention then operates over a compact prototype $\times$ pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: https://github.com/AmayaGS/ProtoPathway.

2605.21443 2026-05-21 cs.CV cs.AI version update

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

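Affiliations: University of Alberta; Sony Interactive Entertainment

AI summary: This paper introduces TempGlitch, a benchmark for evaluating how well vision-language models detect temporal glitches in gameplay videos. It finds that existing models perform poorly on temporal glitches, and that denser frame sampling and larger model size do not reliably resolve these failures.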

Abstract

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

2605.21440 2026-05-21 cs.CV version update

ReMATF: Recurrent Motion-Adaptive Multi-scale Turbulence Mitigation for Dynamic Scenes

Zhiming Liu, Zhicheng Zou, Nantheera Anantrasirichai

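Affiliations: Visual Information Laboratory, School of Computer Science, University of Bristol

AI summary: This paper proposes ReMATF, a lightweight recurrent framework that restores video using only two frames at a time while preserving spatial detail and temporal stability, effectively mitigating turbulence and improving video quality.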

Abstract

Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods rely on transformers, 3D architectures, and multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose ReMATF, a lightweight recurrent framework that restores videos using only two frames at a time while preserving spatial detail and temporal stability. ReMATF combines a multi-scale encoder-decoder with temporal warping and a motion-adaptive temporal fusion module that performs per-pixel fusion between the warped previous output and the current prediction to enhance coherence without enlarging the temporal window. This design reduces flicker, sharpens details, and remains efficient. Experiments on synthetic and real turbulence datasets show consistent improvements in PSNR/SSIM and perceptual quality (LPIPS), along with substantially faster inference than multi-frame transformer baselines, making ReMATF suitable for turbulence mitigation in resource-constrained scenarios.
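
The per-pixel fusion between the warped previous output and the current prediction can be sketched as a weighted blend. This is a hand-written stand-in, not the paper's module: ReMATF learns its fusion, whereas the exponential weight, the temperature `tau`, the cap `max_prev_weight`, and the function name below are assumptions made for illustration.

```python
import math

def motion_adaptive_fuse(warped_prev, current, tau=0.1, max_prev_weight=0.5):
    """Blend two grayscale frames pixel by pixel.

    Where the frames agree, the previous output is trusted up to max_prev_weight,
    which damps flicker. Where they disagree (likely motion or a warping error),
    the weight decays and the current prediction dominates.
    """
    fused = []
    for prev_row, cur_row in zip(warped_prev, current):
        row = []
        for p, c in zip(prev_row, cur_row):
            w = max_prev_weight * math.exp(-abs(p - c) / tau)
            row.append(w * p + (1.0 - w) * c)
        fused.append(row)
    return fused

warped_prev = [[0.50, 1.00]]  # the second pixel disagrees strongly with the current frame
current = [[0.50, 0.00]]
fused = motion_adaptive_fuse(warped_prev, current)
```

Because only the previous output and the current prediction enter the blend, the temporal window stays at two frames, which matches the efficiency argument in the abstract.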

2605.21418 2026-05-21 cs.LG cs.AI cs.CV cs.NI version update

FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G

Amin Farajzadeh, Melike Erol-Kantarci

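Affiliations: School of Electrical Engineering and Computer Science, University of Ottawa

AI summary: This paper studies inter-cell interference amplified by frequency reuse in 6G ultra-dense networks and proposes FedCritic, a framework with decentralized execution that federates the critic through lightweight gossip-based parameter averaging over the interference graph. This yields stable value estimation without a central coordinator and improves signal-to-interference-plus-noise ratio (SINR), cell-edge rate, network sum-rate, and fairness.

Comments: Submitted to IEEE for possible publication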

Abstract

In sixth-generation (6G) ultra-dense networks, aggressive frequency reuse amplifies inter-cell interference (ICI), making multi-cell orthogonal frequency-division multiple access (OFDMA) scheduling and power control strongly coupled across neighboring cells. We study distributed downlink resource management -- joint subcarrier scheduling and power allocation -- under interference coupling and long-term per-user quality-of-service (QoS) minimum-rate constraints. By using virtual-queue deficit weights to enforce long-term QoS, we develop FedCritic, a serverless federated multi-agent actor-critic framework with decentralized execution. Unlike centralized training with decentralized execution (CTDE) approaches that require centralized critic learning and joint trajectory aggregation, FedCritic federates the critic through lightweight gossip-based parameter averaging over the interference graph, enabling stable value estimation without a central coordinator while keeping policies local. Simulations in an interference-rich reuse-1 setting show that FedCritic improves mean signal-to-interference-plus-noise ratio (SINR) and cell-edge rate, increases network-wide average sum-rate and fairness relative to non-coordinated and CTDE baselines, and achieves more stable training with lower coordination overhead.
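
Gossip-based parameter averaging over a graph can be sketched in a few lines. The example below is a generic illustration, not FedCritic's protocol: it assumes synchronous rounds, Metropolis weights (a standard choice that preserves the network-wide average), a toy four-cell path graph, and one scalar critic parameter per cell. All names are hypothetical.

```python
def metropolis_weights(neighbors):
    """Build averaging weights from an undirected graph given as node -> set of neighbors."""
    weights = {}
    for i, nbrs in neighbors.items():
        row = {j: 1.0 / (1 + max(len(nbrs), len(neighbors[j]))) for j in nbrs}
        row[i] = 1.0 - sum(row.values())  # self-weight makes each row sum to one
        weights[i] = row
    return weights

def gossip_round(params, weights):
    """Each node replaces its parameters with a weighted mean over itself and its neighbors."""
    return {
        i: [sum(w * params[j][k] for j, w in row.items()) for k in range(len(params[i]))]
        for i, row in weights.items()
    }

# Toy interference graph: four cells in a line, each holding one critic parameter.
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
params = {0: [1.0], 1: [2.0], 2: [3.0], 3: [6.0]}
weights = metropolis_weights(neighbors)
for _ in range(200):
    params = gossip_round(params, weights)
```

Every cell exchanges parameters only with its graph neighbors, yet all cells converge to the global mean, so no central coordinator is needed.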

2605.21414 2026-05-21 cs.RO cs.CV version update

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

Shizhe Chen, Paul Pacaud, Cordelia Schmid

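Affiliations: Inria; École normale supérieure; CNRS; PSL Research University

AI summary: This paper proposes PointACT, a 3D-aware vision-language-action policy that integrates hierarchical 3D point cloud representations and uses a multi-scale point-action interaction mechanism to improve fine-grained geometric reasoning and spatial grounding for robots in 3D environments.

Comments: Accepted to RSS 2026; project webpage: https://cshizhe.github.io/projects/pointact.html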

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

2605.21411 2026-05-21 cs.CV version update

RoadTones: Tone Controllable Text Generation from Road Event Videos

Chirag Parikh, Siddhi Pravin Lipare, Ravi Kiran Sarvadevabhatla

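Affiliations: CVIT & iHub-Data, IIIT Hyderabad, India

AI summary: This paper presents the RoadTones-51K dataset and the RoadTones-VL-CoT model, which generates tone-conditioned reasoning drafts for interpretability, together with the RoadTones-Eval evaluation suite. Together they lay the foundation for context-sensitive, tone-controllable video captioning.

Comments: Accepted at CVPR Findings 2026. Project page: https://roadtones.github.io/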

Abstract

Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.

2605.21381 2026-05-21 cs.CV cs.LG version update

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

Yi Liu, Jia Ma, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang

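Affiliations: Tongji University; Fudan University

AI summary: This paper proposes DiSI, a framework that disentangles the generation and regression components of the stochastic interpolant process, enabling a continuous, controllable transition from pure regression to full generation and improving the efficiency and accuracy of image restoration.

Comments: 44 pages, 16 figures, 16 tables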

Abstract

Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

2605.21372 2026-05-21 cs.CV cs.AI cs.LG cs.RO version update

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

Hongzhi Ruan, Pei Liu, Weiliang Ma, Zhengning Li, Xueyang Zhang, Jun Ma, Dan Xu, Kun Zhan

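Affiliations: Li Auto; HKUST; HKUST (GZ)

AI summary: This paper proposes a closed-loop dynamic data mixture method that adjusts the training data mixture through a dynamic optimization process to improve model performance, addressing the problem of optimizing the data mixture under a limited budget.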

Abstract

Data scaling is fundamental to modern deep learning, and is increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing the data mixture under practical training budgets remains a critical yet under-explored problem. We therefore argue that the mixture of training data requires clear guidance in terms of scene types and quantities. In this work, we approximate data mixing as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

2605.21371 2026-05-21 cs.CV version update

A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica

Leyue Tang, Jonathan Louis Bamber, Gang Qiao, Yuanhang Kong

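Affiliations: College of Surveying and Geo-Informatics, Tongji University, Shanghai 200092, China

AI summary: This paper proposes DiffGF, a non-reference diffusion-based framework that restores Landsat 7 SLC-off imagery without external reference data. It is trained and evaluated on SLCANT, a dedicated Antarctic dataset. Results show high-fidelity restoration of Antarctic SLC-off imagery, and a downstream crevasse segmentation application demonstrates its practical value.

Comments: Submitted to IEEE JSTARS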

Abstract

Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

2605.21343 2026-05-21 cs.CV version update

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

Ziye Li, Henghui Ding

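Affiliations: Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China

AI summary: This paper proposes OcclusionFormer, a Diffusion Transformer framework that models Z-order by decoupling instances and compositing them via volume rendering to resolve inter-object occlusion in layout-to-image models. A queried alignment loss improves spatial precision and semantic consistency.

Comments: ICML 2026, Project Page: https://henghuiding.com/OcclusionFormer/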

Abstract

Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
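
Compositing decoupled instances in Z-order follows the familiar front-to-back accumulation used in volume rendering. The sketch below shows that rule for a single pixel. It is an illustration only: it assumes a smaller `z` means closer to the viewer, that each layer carries a scalar opacity, and the function name is hypothetical. How the paper applies this inside the diffusion model is not specified here.

```python
def composite_by_z_order(layers):
    """Front-to-back compositing of (z, alpha, rgb) layers for one pixel.

    Assumption: a smaller z is closer to the viewer. Each layer contributes
    alpha * color, attenuated by the transmittance left over from nearer layers.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for _, alpha, color in sorted(layers, key=lambda layer: layer[0]):
        for channel in range(3):
            pixel[channel] += transmittance * alpha * color[channel]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

red_front = (1.0, 1.0, (1.0, 0.0, 0.0))
blue_back = (2.0, 1.0, (0.0, 0.0, 1.0))
pixel, leftover = composite_by_z_order([blue_back, red_front])  # input order does not matter

glass_front = (1.0, 0.5, (1.0, 0.0, 0.0))
tinted, _ = composite_by_z_order([glass_front, blue_back])
```

With an opaque front layer the rear instance contributes nothing in the overlap, which removes the ambiguity the abstract describes for intersecting bounding boxes. A half-transparent front layer mixes the two colors instead.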

2605.21309 2026-05-21 cs.CV cs.RO version update

Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

Abhishek Dinkar Jagtap, Sanath Tiptur Sadashivaiah, Andreas Festag

Affiliations * CARISSMA Institute for Electric, COnnected, and Secure Mobility (C-ECOS), Technische Hochschule Ingolstadt; University of Applied Sciences Aschaffenburg

AI summary This paper proposes Hyper-V2X, a framework that uses hypernetworks to estimate epistemic and aleatoric uncertainty in cooperative V2X perception. A partial weight generation scheme and a V2X context embedding module condition a Bayesian hypernetwork to generate weight distributions for stochastic bird's-eye-view segmentation, improving perception reliability.

Comments Accepted for IEEE Intelligent Vehicle Symposium (IV) 2026

Details
Abstract

Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and a V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computational overhead. Our approach is architecture-agnostic and can be seamlessly integrated with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: https://github.com/abhishekjagtap1/Hyper-V2X
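
Illustrative sketch (not from the paper): one common way to separate the two uncertainty types from a stochastic model is the entropy-based decomposition, where total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average per-sample entropy, and epistemic uncertainty is their difference. Whether Hyper-V2X uses exactly this decomposition is not stated in the abstract; the function names below are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(sampled_probs):
    """Split predictive uncertainty for one BEV cell.

    `sampled_probs` holds one class-probability vector per sampled weight set.
    Returns (total, aleatoric, epistemic), with epistemic = total - aleatoric.
    """
    n = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    mean_probs = [sum(p[c] for p in sampled_probs) / n for c in range(num_classes)]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(p) for p in sampled_probs) / n
    return total, aleatoric, total - aleatoric


# Two weight samples that disagree confidently: all uncertainty is epistemic.
total, aleatoric, epistemic = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Conversely, samples that all return the same uncertain vector, such as `[0.5, 0.5]`, yield zero epistemic uncertainty and purely aleatoric uncertainty.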

2605.21308 2026-05-21 cs.CV cs.AI version update

Deformba: Vision State Space Model with Adaptive State Fusion

Hongyu Ke, Jack Morris, Yongkang Liu, Satoshi Kitai, Kentaro Oguchi, Yi Ding, Haoxin Wang

Affiliations * Department of Computer Science, Georgia State University; University of Tennessee Knoxville

AI summary This paper proposes Deformba, a context-adaptive method that dynamically augments spatial structural information while preserving the linear complexity of state space models. It supports multi-modal fusion in the manner of cross-attention and is shown to apply broadly across 2D and 3D vision tasks.

Details
Journal ref
Forty-Third International Conference on Machine Learning (ICML 2026)
Abstract

State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. Such query-based fusion is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context-adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion in the manner of cross-attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

2605.21301 2026-05-21 cs.LG cs.CV version update

Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls

Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, Pietro Gori

Affiliations * NeuroSpin; Université Paris-Saclay; CEA; LTCI Institut Polytechnique de Paris

AI summary This paper proposes a method that discovers interpretable, homogeneous disease subgroups by contrasting patients with healthy controls, and demonstrates improved subgroup estimation quality on medical imaging datasets.

Comments Accepted to Data Mining and Knowledge Discovery, ECML-PKDD 2026 Journal Track

Details
Abstract

In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation-Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease-specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on an MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: https://github.com/rlouiset/deep_ucsl.

2605.21300 2026-05-21 cs.CV version update

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Meng Shen, Minghao Wu, Deepu Rajan

Affiliations * Nanyang Technological University; Monash University

AI summary This paper reduces object hallucination in LVLMs by emphasizing image-negative tokens, proposing to adjust per-token training weights and to filter the training data in order to control hallucination.

Comments 20 pages, 10 figures, 10 tables

Details
Abstract

Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups, image-positive, image-invariant, and image-negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions than on extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating their effectiveness and general applicability.
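
Illustrative sketch (not from the paper): per-token reweighting can be pictured as a weighted mean of token-level negative log-likelihoods. The abstract does not give the actual weighting rule, so the choice below (a fixed `boost` for tokens tagged `"negative"`, weight 1.0 otherwise) and the function name `weighted_token_loss` are assumptions for this example.

```python
def weighted_token_loss(token_nlls, visual_dependence, boost=2.0):
    """Weighted mean of per-token negative log-likelihoods.

    `visual_dependence` tags each token as "positive", "invariant", or
    "negative". Tokens tagged "negative" get weight `boost`; all others get
    1.0. This rule is an assumption, not the paper's formula.
    """
    weights = [boost if tag == "negative" else 1.0 for tag in visual_dependence]
    weighted_sum = sum(w * nll for w, nll in zip(weights, token_nlls))
    return weighted_sum / sum(weights)


loss = weighted_token_loss([1.0, 2.0, 3.0], ["positive", "invariant", "negative"])
```

Because the weights only change the training loss, a scheme of this kind adds no cost at inference time, which matches the claim in the abstract.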

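2605.21280 2026-05-21 cs.CV version update

Let EEG Models Learn EEG

Yifan Wang, Yijia Ma, Wen Li, Chenyu You

Affiliations * Stony Brook University; University of Texas Health Center at Houston

AI summary This paper proposes JET, a generative framework based on conditional flow matching that produces high-quality EEG signals by directly modeling the continuous evolution of neural signals. It addresses the shortcomings of discrete denoising methods in capturing long-range temporal dependencies and preserving spectral structure, and outperforms existing methods on multiple benchmarks.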

Comments Accepted by ICML 2026

Details
Abstract

High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: https://y-research-sbu.github.io/JET/
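
Illustrative sketch (not from the paper): in the widely used linear-path form of conditional flow matching, a training point is drawn on the straight line between a noise sample and a data sample, and the network is regressed onto the constant velocity of that line. The abstract does not say which path JET uses, so the linear interpolant, the toy three-sample "EEG window", and both function names are assumptions.

```python
import random


def flow_matching_pair(noise, data, t):
    """Point on the straight noise-to-data path and its target velocity.

    Uses the linear interpolant x_t = (1 - t) * noise + t * data, whose
    velocity is constant along the path: data - noise.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    diffs = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
eeg_window = [0.5, -0.25, 1.0]  # toy stand-in for a raw EEG segment
noise = [rng.gauss(0.0, 1.0) for _ in eeg_window]
x_t, target = flow_matching_pair(noise, eeg_window, t=0.3)
```

In training, a network would receive `x_t` and `t` and be penalized with `flow_matching_loss` against `target`; sampling then integrates the learned velocity field from noise to data.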

2605.21272 2026-05-21 cs.CV cs.AI version update

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

Affiliations * Jasper Research

AI summary This paper presents the MONET dataset, which provides high-quality text-to-image data through multi-stage filtering and enrichment, lowering the barrier to large-scale reproducible research.

Details
Abstract

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approximately 104.9M image-text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.
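
Illustrative sketch (not from the paper): the exact-duplicate stage of such a pipeline is often implemented by hashing the raw image bytes and keeping the first record per digest. The record layout `(image_bytes, caption)` and the function name are assumptions; near-duplicate removal needs a perceptual or embedding-based similarity and is not covered here.

```python
import hashlib


def drop_exact_duplicates(records):
    """Keep the first (image_bytes, caption) record for each distinct image."""
    seen, kept = set(), []
    for image_bytes, caption in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((image_bytes, caption))
    return kept


unique = drop_exact_duplicates(
    [(b"img-a", "a cat"), (b"img-b", "a dog"), (b"img-a", "cat again")]
)
```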

2605.07816 2026-05-21 cs.CV version update

ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles

Thomas Gorges, Janne van der Loop, Lukas Hüttner, Linda-Sophie Schneider, Fei Wu, Mathias Seuret, Vincent Christlein

Affiliations * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Buchwissenschaft, Johannes Gutenberg-Universität Mainz

AI summary This paper presents the CircleID competition, which studies how biometric writer characteristics and physical pen features naturally entangle within minimal static traces. Its two tasks, open-set writer identification and cross-writer pen classification, evaluate how well models recognize known writers while rejecting unknown ones and classify pens across seen and unseen writers.

Details
Abstract

This paper presents CircleID, a large-scale ICDAR 2026 competition on writer identification and pen classification from scanned hand-drawn circles. The primary objective is to investigate how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. CircleID comprises two distinct tasks: (1) open-set writer identification, requiring models to recognize known writers while explicitly rejecting unknown ones, and (2) cross-writer pen classification, evaluated across both seen and unseen writers. Participants were provided with a new, controlled dataset of 46,155 tightly cropped circle images, digitized at 400 DPI and annotated for writer identity and pen type. The dataset comprises samples from 44 known and 22 unknown writers using eight different pens. Hosted on Kaggle as two separate tracks with public and private leaderboards, the competition provided participants with a ResNet baseline. In total, 389 teams (436 participants) made 3,185 submissions for the pen classification task, and 113 teams (141 participants) made 1,737 submissions for the writer identification track. The best-performing private leaderboard submissions achieved a Top-1 accuracy of 64.801% for writer identification and 92.726% for pen classification. This paper details the dataset, evaluates the winning methodologies, and analyzes the impact of out-of-distribution writers on model generalization and feature disentanglement. In this large-scale competition, CircleID establishes a new baseline for minimal-trace analysis.
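
Illustrative sketch (not from the paper): a simple baseline for open-set identification is to accept the best-scoring known writer only when its score clears a threshold, and to reject the sample as unknown otherwise. The threshold value, the score dictionary, and the function name are assumptions; the abstract does not describe how the winning entries handle rejection.

```python
def identify_writer(scores, threshold=0.5):
    """Return the best-scoring known writer, or None to reject as unknown.

    `scores` maps writer IDs to confidence values such as softmax outputs.
    """
    best_writer = max(scores, key=scores.get)
    return best_writer if scores[best_writer] >= threshold else None


accepted = identify_writer({"writer_01": 0.7, "writer_02": 0.2})
rejected = identify_writer({"writer_01": 0.35, "writer_02": 0.33})
```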

2603.24139 2026-05-21 cs.CV cs.LG version update

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye

Affiliations * School of Computer Science, Wuhan University; School of Integrated Circuits, Peking University; School of Information, Huazhong Agricultural University; Cyberspace Institute of Advanced Technology, Guangzhou University

AI summary This paper proposes a Tutor-Student reinforcement learning framework that dynamically optimizes the training curriculum to improve the robustness and generalization of deepfake detection.

Comments Accepted to CVPR 2026

Details
Journal ref
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)
Abstract

Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
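
Illustrative sketch (not from the paper): the three quantities named in the abstract (a per-sample EMA loss in the state, a 0-1 weight as the action, and a reward for incorrect-to-correct transitions) can be written down in a few lines. The reward magnitude of 1.0, the decay of 0.9, and all function names are assumptions; the PPO agent that would choose the weights is omitted.

```python
def tutor_reward(was_correct, is_correct):
    """Reward 1.0 only when a sample flips from a wrong to a right prediction."""
    return 1.0 if (not was_correct and is_correct) else 0.0


def update_ema(previous_ema, loss, decay=0.9):
    """Exponential moving average of a sample's loss (decay value is assumed)."""
    return decay * previous_ema + (1.0 - decay) * loss


def reweighted_batch_loss(losses, weights):
    """Average of per-sample losses scaled by Tutor weights clipped to [0, 1]."""
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(w * l for w, l in zip(clipped, losses)) / len(losses)


batch_loss = reweighted_batch_loss([2.0, 0.5, 1.0], [1.0, 0.0, 0.5])
```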

2603.17784 2026-05-21 cs.CV cs.LG version update

ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis

Romil Imtiaz, Dimitris K. Iakovidis

Affiliations * Department of Computer Science and Biomedical Informatics, University of Thessaly

AI summary This paper presents a multi-label gastrointestinal video analysis pipeline that combines a ResNet-50 frame classifier with anatomy-guided temporal event decoding. Class reweighting and anatomy-guided decoding improve recognition of rare pathology classes, raising temporal mAP from 0.3801 to 0.4303 on the challenge test set.

Comments ICPR 2026 RARE-VISION Competition

Details
Abstract

We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.
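
Illustrative sketch (not from the paper): two of the named ingredients can be shown in miniature. The first function derives a per-class positive weight as the negative-to-positive ratio and clips it, and the second converts frame scores to events with two thresholds so that brief dips do not fragment an event. The clip value of 10.0, the thresholds 0.6 and 0.4, and the function names are assumptions, not the authors' settings.

```python
def clipped_pos_weights(pos_counts, total, max_weight=10.0):
    """Per-class positive weight = negatives / positives, clipped to max_weight."""
    weights = []
    for pos in pos_counts:
        ratio = (total - pos) / max(pos, 1)
        weights.append(min(ratio, max_weight))
    return weights


def hysteresis_events(scores, high=0.6, low=0.4):
    """Turn frame scores into (start, end) events with inclusive frame indices.

    An event opens when a score reaches `high` and closes only when a score
    drops below `low`, which suppresses flicker around a single threshold.
    """
    events, start = [], None
    for i, score in enumerate(scores):
        if start is None and score >= high:
            start = i
        elif start is not None and score < low:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(scores) - 1))
    return events


weights = clipped_pos_weights([500, 5], total=1000)
events = hysteresis_events([0.1, 0.7, 0.5, 0.45, 0.2, 0.8])
```

Without clipping, the rare class in the example would receive a weight of 199, which is the kind of extreme value that can destabilize optimization.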

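2601.15133 2026-05-21 cs.CV cs.LG version update

Building Deep Graph Predictors with Graph Imitation Learning

André Eberhard, Gerhard Neumann, Pascal Friederich

Affiliations * Karlsruhe Institute of Technology (KIT)

AI summary This paper proposes GRAIL, a framework that uses graph imitation learning to address representation issues in graph generation, and shows experimentally that it performs well on multiple benchmarks.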

Details
Abstract

Recent years have seen substantial progress in neural generation of text, images, and audio, supported by mature training pipelines and large-scale optimization. For graphs, however, comparable progress has been more limited. We attribute this gap to graph-specific optimization and representation challenges that undermine the effectiveness of training neural networks with backpropagation and gradient descent. We argue that representing graphs on a fixed-size Euclidean grid, as is common in recently proposed models for supervised graph prediction, may not be the optimal choice in these settings. To support our view, we provide an analysis of neural graph generation methods and identify theoretical challenges that lead to pitfalls when training neural networks to produce graphs as their output. Motivated by this analysis, we introduce GRAph Imitation Learning (GRAIL), a framework for training neural networks in supervised settings in which the supervision signal is a graph. GRAIL generates graphs sequentially through a Markov decision process over embeddings of partial graphs, thereby avoiding the representation issues associated with fixed-size grid graph representations. We empirically show that GRAIL achieves competitive results on supervised graph prediction across a comprehensive suite of 18 benchmarks, matching or surpassing state-of-the-art methods in several settings.

2511.04520 2026-05-21 cs.CV version update

THEval. Evaluation Framework for Talking Head Video Generation

Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva

Affiliations * Inria Centre at Université Côte d’Azur; Shanghai AI Laboratory; da/sec - Biometrics and Security Research Group, Hochschule Darmstadt

AI summary This paper proposes a new evaluation framework that measures the quality, naturalness, and synchronization of generated talking-head videos using 8 metrics, and its extensive experiments expose the shortcomings of existing methods in generating expressiveness and artifact-free details.

Comments CVPR 2026 Findings, Project Page: https://newbyl.github.io/theval_project_page/

Details
Abstract

Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation relies primarily on a limited set of metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics spanning three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency as well as alignment with human preferences. Based on these considerations, we analyze fine-grained dynamics of the head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated from a novel real dataset that we curated to mitigate training-data bias. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset, and leaderboards will be publicly released and regularly updated with new methods to reflect progress in the field.

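2510.14737 2026-05-21 cs.CV version update

Free-Grained Hierarchical Visual Recognition

Seulki Park, Zilin Wang, Stella X. Yu

Affiliations * University of Michigan; UC Berkeley

AI summary This paper studies hierarchical visual recognition when real-world labels are incomplete and of mixed granularity. It introduces free-grained training, combining text supervision and semi-supervised learning to improve on conventional hierarchical methods under incomplete supervision, and proposes a free-grained inference mechanism that adapts the prediction depth.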

Comments Accepted to CVPR 2026. 31 pages

Details
Abstract

Hierarchical image recognition seeks to predict class labels along a semantic taxonomy, from broad categories to specific ones, typically under the tidy assumption that every training image is fully annotated along its taxonomy path. Reality is messier: a distant bird may be labeled only bird, while a clear close-up may justify bald eagle. We introduce free-grained training, where labels may appear at any level of the taxonomy and models must learn consistent hierarchical predictions from incomplete, mixed-granularity supervision. We build benchmark datasets with varying label granularity and show that existing hierarchical methods deteriorate sharply in this setting. To make up for missing supervision, we propose two simple solutions: one adds broad text-based supervision that captures visual attributes, and the other treats missing labels at specific taxonomy levels as a semi-supervised learning problem. We also study free-grained inference, where the model chooses how deep to predict, returning a reliable coarse label when a fine-grained one is uncertain. Together, our task, datasets, and methods move hierarchical recognition closer to the way labels arise in the real world.
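
Illustrative sketch (not from the paper): free-grained inference can be pictured as walking the predicted taxonomy path from coarse to fine and stopping at the first level whose confidence falls below a threshold. The stopping rule, the threshold of 0.8, and the function name are assumptions; the abstract only states that the model chooses how deep to predict.

```python
def predict_to_confident_depth(path_probs, threshold=0.8):
    """Return the deepest label prefix in which every level clears `threshold`.

    `path_probs` lists (label, probability) pairs from coarse to fine along
    the model's predicted taxonomy path.
    """
    labels = []
    for label, prob in path_probs:
        if prob < threshold:
            break
        labels.append(label)
    return labels


prediction = predict_to_confident_depth(
    [("bird", 0.98), ("raptor", 0.85), ("bald eagle", 0.55)]
)
```

In this example the model commits to the coarser label and withholds the uncertain species-level guess, mirroring the distant-bird case in the abstract.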

2509.07120 2026-05-21 cs.CV version update

Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe

Affiliations * Computer Vision Group, RWTH Aachen University

AI summary This paper proposes a block-sparse replacement for dense global attention, implemented with optimized kernels, that makes multi-view reconstruction efficient and substantially improves scalability to large image sets.

Comments Project page at https://vision.rwth-aachen.de/sparse-vggt

Details
Abstract

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\pi^3$, and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than $3\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, $\pi^3$, and MapAnything, while substantially improving scalability to large image collections.
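
Illustrative sketch (not from the paper): one simple way to build a block-sparse attention pattern without training is to mean-pool queries and keys per block, score block pairs by dot product, and keep only the top-scoring key blocks for each query block. The abstract does not describe the selection rule actually used, so the pooling heuristic, the `keep` parameter, and both function names are assumptions.

```python
def mean_pool(vectors, block_size):
    """Average consecutive groups of `block_size` vectors."""
    pooled = []
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return pooled


def select_key_blocks(queries, keys, block_size, keep):
    """For each query block, list the `keep` key blocks with the highest pooled
    dot-product score; dense attention would then run only inside those blocks."""
    q_blocks = mean_pool(queries, block_size)
    k_blocks = mean_pool(keys, block_size)
    selected = []
    for q in q_blocks:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blocks]
        ranked = sorted(range(len(k_blocks)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(ranked[:keep]))
    return selected


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = select_key_blocks(tokens, tokens, block_size=2, keep=1)
```

With `keep` fixed, the number of attended block pairs grows linearly with the number of query blocks, which is where the savings over dense attention come from.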

2411.09593 2026-05-21 eess.IV cs.AI cs.CV version update

SMILE-UHURA Challenge -- Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance Angiograms

Soumick Chatterjee, Hendrik Mattern, Marc Dörner, Alessandro Sciarra, Florian Dubost, Hannes Schnurre, Rupali Khatun, Chun-Chih Yu, Tsung-Lin Hsieh, Yi-Shan Tsai, Yi-Zeng Fang, Yung-Ching Yang, Juinn-Dar Huang, Marshall Xu, Siyu Liu, Fernanda L. Ribeiro, Saskia Bollmann, Karthikesh Varma Chintalapati, Chethan Mysuru Radhakrishna, Sri Chandana Hudukula Ram Kumara, Raviteja Sutrave, Abdul Qayyum, Moona Mazher, Imran Razzak, Cristobal Rodero, Steven Niederren, Fengming Lin, Yan Xia, Jiacheng Wang, Riyu Qiu, Liansheng Wang, Arya Yazdan Panah, Rosana El Jurdi, Guanghui Fu, Janan Arslan, Ghislain Vaillant, Romain Valabregue, Didier Dormont, Bruno Stankoff, Olivier Colliot, Luisa Vargas, Isai Daniel Chacón, Ioannis Pitsiorlas, Pablo Arbeláez, Maria A. Zuluaga, Stefanie Schreiber, Oliver Speck, Andreas Nürnberger

发表机构 * Faculty of Computer Science, Otto von Guericke University Magdeburg(奥托·冯·格里克大学马格德堡分校计算机科学学院) Data and Knowledge Engineering Group, Otto von Guericke University Magdeburg(奥托·冯·格里克大学马格德堡分校数据与知识工程小组) Human Technopole(人类技术极地) Biomedical Magnetic Resonance, Otto von Guericke University Magdeburg(生物医学磁共振,奥托·冯·格里克大学马格德堡分校) Department of Neurology, Medical Faculty, University Hospital of Magdeburg(马格德堡大学医院医学系神经科) German Centre for Neurodegenerative Diseases(德国神经退行性疾病研究中心) Centre for Behavioural Brain Sciences, Magdeburg(行为脑科学中心,马格德堡) Department of Neurology, University Hospital Zurich(苏黎世大学医院神经科) Department of Consultation-Liaison-Psychiatry and Psychosomatic Medicine, University Hospital Zurich(苏黎世大学医院咨询-联络精神病学与心身医学科) Stanford University(斯坦福大学) Translational Radiobiology, Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg(转化放射生物学,放射肿瘤学部,埃尔兰根大学医院,埃尔兰根-纽伦堡弗里德里希-亚历山大大学) National Yang Ming Chiao Tung University(阳明交通大学) School of Electrical Engineering and Computer Science, University of Queensland(昆士兰大学电气工程与计算机科学学院) Australian eHealth Research Centre, CSIRO(澳大利亚eHealth研究中心,CSIRO) National Heart and Lung Institute, Faculty of Medicine, Imperial College London(英国伦敦帝国理工学院医学系国家心脏和肺研究所) Hawkes Institute, Department of Computer Science, University College London(霍克斯研究所,伦敦大学学院计算机科学系) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates(阿布扎克穆罕默德·本·扎耶德人工智能大学) The Alan Turing Institute, London, UK(艾伦·图灵研究所,伦敦,英国) School of Computing, University of Leeds(利兹大学计算学院) Department of Computer Science at School of Informatics, Xiamen University(厦门大学信息学院计算机科学系) Manteia Technologies Co., Ltd, Xiamen, China(厦门Manteia技术有限公司) Leicester International Institute, Dalian University of Technology(大连理工大学利兹国际学院) Sorbonne Université, Institut du Cerveau - Paris Brain Institute(索邦大学,巴黎脑研究所) Centre of Formation 
and Research in Artificial Intelligence, Universidad de Los Andes, Colombia(智利洛斯安德斯大学人工智能培训与研究中心) Data Science Department, EURECOM, Sophia Antipolis, France(EURECOM数据科学系,法国索菲亚安蒂波利斯)

AI总结 该研究针对公开标注数据集不足的问题,提供了一个由7T MRI采集的时间飞跃法(ToF)血管造影标注数据集,并评估了多种深度学习方法在脑小血管分割任务中的性能。

详情
AI中文摘要

人类大脑通过复杂的血管网络获取营养和氧气。影响介观尺度小血管的病变是脑血供中的关键薄弱环节,可能导致脑小血管病等严重疾病。7特斯拉MRI系统的出现使获取更高空间分辨率的图像成为可能,从而能够显示大脑中的这类血管。然而,公开可用的标注数据集的缺乏阻碍了稳健的机器学习分割算法的发展。为此,我们组织了SMILE-UHURA挑战赛。该挑战赛与ISBI 2023同期在哥伦比亚卡塔赫纳(Cartagena de Indias)举行,旨在为相关领域的研究人员提供一个平台。SMILE-UHURA挑战赛提供了一个由7T MRI采集的时间飞跃法(Time-of-Flight,ToF)血管造影标注数据集,以填补公开标注数据集的空白。该数据集通过自动预分割与大量手动精修相结合的方式创建。本文在两个不同的数据集上对十六种提交方法和两种基线方法进行了定量和定性比较:一是与训练数据来自同一数据集的保留测试MRA(标签保密),二是另一个独立的7T ToF MRA数据集(输入体数据和标签均保密)。结果表明,大多数提交的深度学习方法在所提供的训练数据集上训练后,都取得了可靠的分割性能。在这两个数据集上,Dice分数分别最高达到0.838±0.066和0.716±0.125,平均性能最高达到0.804±0.15。

英文摘要

The human brain receives nutrients and oxygen through an intricate network of blood vessels. Pathology affecting small vessels, at the mesoscopic scale, represents a critical vulnerability within the cerebral blood supply and can lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution images, making it possible to visualise such vessels in the brain. However, the lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms. To address this, the SMILE-UHURA challenge was organised. This challenge, held in conjunction with the ISBI 2023, in Cartagena de Indias, Colombia, aimed to provide a platform for researchers working on related topics. The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was created through a combination of automated pre-segmentation and extensive manual refinement. In this manuscript, sixteen submitted methods and two baseline methods are compared both quantitatively and qualitatively on two different datasets: held-out test MRAs from the same dataset as the training data (with labels kept secret) and a separate 7T ToF MRA dataset where both input volumes and labels are kept secret. The results demonstrate that most of the submitted deep learning methods, trained on the provided training dataset, achieved reliable segmentation performance. Dice scores reached up to 0.838 $\pm$ 0.066 and 0.716 $\pm$ 0.125 on the respective datasets, with an average performance of up to 0.804 $\pm$ 0.15.
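The Dice scores quoted in the abstract measure overlap between a predicted vessel mask and the reference annotation. A minimal sketch of the standard definition, Dice = 2|P ∩ T| / (|P| + |T|), follows. The function name `dice_score`, the flat-list mask representation, and the empty-mask convention are assumptions made for illustration and are not taken from the challenge's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as vessel in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total vessel voxels in prediction plus reference.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted voxels, 3 reference voxels, 2 shared.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

Because thin vessels occupy few voxels, a handful of missed voxels changes the numerator noticeably, which is one reason Dice is usually reported together with its spread across test volumes.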

2105.09034 2026-05-21 cs.GR cs.CV 版本更新

Guided Facial Skin Color Correction

引导式面部肤色校正

Keiichiro Shirai, Tatsuya Baba, Shunsuke Ono, Masahiro Okuda, Yusuke Tatesumi, Paul Perrotin

发表机构 * Shinshu University(信州大学) The University of Kitakyushu(北九州市立大学) Tokyo Institute of Technology(东京工业大学) Doshisha University(同志社大学) The University Institute of Technology La Rochelle(拉罗谢尔大学技术学院)

AI总结 本文提出了一种面向人像照片的自动图像校正方法,通过抑制背景颜色引起的肤色变化来提高面部肤色的一致性。在人像摄影中,肤色常因光照环境(如彩色背景墙的反射光或相机闪光灯造成的过曝)而失真;若将照片与另一种背景色人工合成,这种颜色变化会更加明显,导致合成结果不自然。在该框架中,首先粗略提取面部区域并在颜色空间中校正肤色分布,然后在原始图像中对面部周围进行颜色和亮度校正,以获得不受亮度和背景颜色影响的面部颜色平衡。与传统颜色校正算法不同,最终结果通过带有引导图像的颜色校正过程获得。特别是,用于颜色校正的引导图像滤波不需要He等人提出的原始引导图像滤波方法所要求的完全对齐的引导图像。实验结果表明,该方法不仅在头像照片上,而且在自然场景照片上都能生成比传统方法更自然的结果。文中还展示了另一种应用:自动生成年鉴风格照片。

Comments 12 pages, 16 figures

详情
Journal ref
Signals, vol. 2, no. 3, pp. 540-558, 2021
AI中文摘要

本文提出了一种面向人像照片的自动图像校正方法,该方法通过抑制背景颜色引起的肤色变化来提高面部肤色的一致性。在人像照片中,肤色常因光照环境(例如彩色背景墙的反射光或相机闪光灯造成的过曝)而失真;如果将照片与另一种背景颜色人工合成,这种颜色变化会更加明显,导致合成结果不自然。在我们的框架中,首先粗略提取面部区域并在颜色空间中校正肤色分布,然后在原始图像中对面部周围进行颜色和亮度校正,以获得不受亮度和背景颜色影响的面部颜色平衡。与传统颜色校正算法不同,我们的最终结果通过带有引导图像的颜色校正过程获得。特别是,我们用于颜色校正的引导图像滤波不需要He等人提出的原始引导图像滤波方法所要求的完全对齐的引导图像。实验结果表明,我们的方法不仅在头像照片上,而且在自然场景照片上都能生成比传统方法更自然的结果。我们还展示了另一种应用:自动生成年鉴风格照片。

英文摘要

This paper proposes an automatic image correction method for portrait photographs, which promotes consistency of facial skin color by suppressing skin color changes due to background colors. In portrait photographs, skin color is often distorted due to the lighting environment (e.g., light reflected from a colored background wall and over-exposure by a camera strobe), and if the photo is artificially combined with another background color, this color change is emphasized, resulting in an unnatural synthesized result. In our framework, after roughly extracting the face region and rectifying the skin color distribution in a color space, we perform color and brightness correction around the face in the original image to achieve a proper color balance of the facial image, which is not affected by luminance and background colors. Unlike conventional algorithms for color correction, our final result is attained by a color correction process with a guide image. In particular, our guided image filtering for the color correction does not require the perfectly aligned guide image required by the original guided image filtering method proposed by He et al. Experimental results show that our method generates more natural results than conventional methods on not only headshot photographs but also natural scene photographs. We also show automatic yearbook-style photo generation as another application.

2605.21251 2026-05-21 eess.IV cs.CV 版本更新

Local-sensitive connectivity filter (ls-cf): A post-processing unsupervised improvement of the frangi, hessian and vesselness filters for multimodal vessel segmentation

局部敏感连通性滤波器(ls-cf):一种后处理的无监督改进方法,用于多模态血管分割的Frangi、Hessian和血管性滤波器

Erick O Rodrigues, Lucas O Rodrigues, João HP Machado, Dalcimar Casanova, Marcelo Teixeira, Jeferson T Oliva, Giovani Bernardes, Panos Liatsis

发表机构 * Department of Academic Informatics (DAINF), Universidade Tecnologica Federal do Parana (UTFPR)(巴拉那联邦理工大学(UTFPR)学术信息学系(DAINF)) Graduate Program of Sciences Applied to Health Products, Universidade Federal Fluminense (UFF)(弗鲁米嫩塞联邦大学(UFF)健康产品应用科学研究生项目) Institute of Technological Sciences (ICT), Universidade Federal de Itajuba (UNIFEI)(伊塔茹巴联邦大学(UNIFEI)技术科学研究所(ICT)) Department of Electrical Engineering and Computer Science, Khalifa University of Science and Technology(哈利法科技大学电气工程与计算机科学系)

AI总结 本文提出了一种无监督的多模态方法,改进Frangi滤波器的响应,实现自动血管分割。该方法计算像素级血管连续性,并引入局部容忍启发式来填补Frangi响应产生的血管不连续处,称为局部敏感连通性滤波器(LS-CF)。它在多种多模态数据集上取得了有竞争力的结果,尤其在OSIRIX血管造影数据集上,其准确率优于文献中所有最先进方法。

详情
Journal ref
Journal of Imaging 2022
AI中文摘要

视网膜血管分析是一种可用于评估眼部风险的检查。本文提出了一种无监督的多模态方法,改进Frangi滤波器的响应,实现自动血管分割。我们提出了一种滤波器,它计算像素级血管连续性,并引入局部容忍启发式来填补Frangi响应产生的血管不连续处。该方法称为局部敏感连通性滤波器(LS-CF),并与以下方法进行了比较:阈值化的Frangi滤波器响应基线、朴素连通性滤波器、结合形态学闭运算的朴素连通性滤波器,以及文献中的现有方法。该方法在多种多模态数据集上取得了有竞争力的结果。在OSIRIX血管造影数据集上,它的准确率优于文献中所有最先进方法;在IOSTAR数据集上,它优于5项已有工作中的4项;在DRIVE和STARE数据集上,它优于若干已有工作;在CHASE-DB数据集上,它优于10项已有工作中的6项,并且优于所有最先进的无监督方法。

英文摘要

A retinal vessel analysis is a procedure that can be used as an assessment of risks to the eye. This work proposes an unsupervised multimodal approach that improves the response of the Frangi filter, enabling automatic vessel segmentation. We propose a filter that computes pixel-level vessel continuity while introducing a local tolerance heuristic to fill in vessel discontinuities produced by the Frangi response. This proposal, called the local-sensitive connectivity filter (LS-CF), is compared against the baseline thresholded Frangi filter response, a naive connectivity filter, the naive connectivity filter response in combination with morphological closing, and the current approaches in the literature. The proposal was able to achieve competitive results in a variety of multimodal datasets. It was robust enough to outperform all the state-of-the-art approaches in the literature for the OSIRIX angiographic dataset in terms of accuracy and 4 out of 5 works in the case of the IOSTAR dataset, while also outperforming several works in the case of the DRIVE and STARE datasets and 6 out of 10 in the CHASE-DB dataset. For the CHASE-DB dataset, it also outperformed all the state-of-the-art unsupervised methods.

2605.21244 2026-05-21 cs.CV 版本更新

SR-Ground: Image Quality Grounding for Super-Resolved Content

SR-Ground:面向超分辨率内容的图像质量定位(Grounding)

Artem Borisov, Evgeney Bogatyrev, Khaled Abud, Dmitriy Vatolin

发表机构 * Lomonosov Moscow State University(罗蒙诺索夫莫斯科国立大学) MSU AI Center, Lomonosov Moscow State University(罗蒙诺索夫莫斯科国立大学MSU人工智能中心)

AI总结 本文提出SR-Ground数据集,用于超分辨图像中细粒度伪影分割,通过大规模众包研究生成高质量数据集,提升IQA模型性能并减少超分辨输出中的可感知伪影。

详情
AI中文摘要

超分辨率(SR)近年来发展迅速,基于扩散的模型实现了前所未有的保真度,但代价是引入了新类型的视觉伪影。现有图像质量评估(IQA)方法虽然能给出整体质量评分,但缺乏可解释性,也无法区分现代SR方法产生的不同伪影类型。为弥补这一空白,我们引入SR-Ground,一个专为超分辨图像细粒度伪影分割设计的大规模数据集。该数据集包含由多种最先进SR模型处理的图像,并带有多种伪影类别的像素级标注。我们开展了一项有1,062名参与者的大规模众包研究,用于验证和修正自动生成的分割结果,最终得到包含63,000张图像、涵盖6种不同伪影类型的高质量数据集。我们证明,在SR-Ground上训练具备定位(grounding)能力的IQA模型能显著提升下游任务的性能。此外,我们引入了一种微调流程,利用我们的定位模型减少SR输出中的可感知伪影,展示了该数据集的实用价值。

英文摘要

Super-Resolution (SR) has advanced rapidly in recent years, with diffusion-based models achieving unprecedented fidelity at the cost of introducing new types of visual artifacts. While existing Image Quality Assessment (IQA) methods provide holistic quality scores, they lack interpretability and fail to distinguish between different artifact types arising from modern SR approaches. To address this gap, we introduce SR-Ground, a large-scale dataset specifically designed for fine-grained artifact segmentation in super-resolved images. The dataset comprises images processed by a diverse set of state-of-the-art SR models, with pixel-level annotations for multiple artifact categories. We conduct a large-scale crowdsourcing study involving 1,062 participants to validate and refine automatically generated segmentations, resulting in a high-quality dataset of 63,000 images spanning 6 distinct artifact types. We demonstrate that training IQA models with grounding capabilities on SR-Ground significantly improves performance on downstream tasks. Furthermore, we introduce a fine-tuning pipeline that leverages our grounding model to reduce perceptible artifacts in SR outputs, showcasing the practical utility of our dataset.

2605.21237 2026-05-21 cs.CV cs.AI 版本更新

RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis

RePCM:区域特定和表型适应的双心室心脏运动合成

Xuan Yang, Xiaohan Yuan, Hao Li, Lingyu Chen, Yanan Liu, Lei Li

发表机构 * School of Biomedical Engineering, National University of Singapore, Singapore(新加坡国立大学生物医学工程学院) School of Automation, Southeast University, Nanjing, China(东南大学自动化学院) School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China(南京航空航天大学计算机科学与技术学院) School of Information Science and Engineering, Yunnan University, Kunming, China(云南大学信息科学与工程学院)

AI总结 本文提出RePCM方法,通过单帧双心室网格运动补全,利用区域特定和表型适应性来提升心脏运动合成的准确性,以应对心血管疾病导致的区域和疾病特异性差异。

Comments Early Accepted by MICCAI 2026. This is the author's submitted version. 10 pages, 3 figures

详情
AI中文摘要

一个心动周期内的心脏运动对于量化区域功能至关重要,并且受心血管疾病的强烈影响。由于实践中难以获得时间上密集的网格序列,我们着重利用更易获得的舒张末期(ED)帧来推断完整周期的序列。由于存在很强的区域差异和疾病特异性差异,传统方法依赖针对全局模式优化的生成模型,往往会使数据过度平滑。为了解决这个问题,我们提出了Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis(RePCM)方法,用于单帧双心室网格运动补全。在第一阶段,重建网络学习逐顶点的运动描述符,并通过聚类得到数据驱动的功能分区,提供显式的、由运动导出的区域结构。在第二阶段,Region-Specific Injection模块在条件VAE中强制执行带掩码的同步区域交换,保留局部特有的动态并限制跨区域混合。以ED形状为条件的Phenotype-Adaptive Mixture-of-Experts先验利用解剖引导的线索来建模潜在运动趋势并捕捉疾病间的差异。在涵盖不同心血管疾病的三个数据集上的实验显示,该方法在几何和功能指标上取得了一致的提升,并更好地保留了区域特定的动态。

英文摘要

Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic (ED) frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single-frame bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex-wise motion descriptors, and clustering yields a data-driven functional partition, providing an explicit motion-derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving localized dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region-specific dynamics.

2605.21207 2026-05-21 cs.CV 版本更新

PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection

PGC:用于通用人工智能生成图像检测的峰值引导校准

Xiaoyu Zhou, Jianwei Fei, Peipeng Yu, Jingchang Xie, Chong Cheng, Zhihua Xia

发表机构 * College of Cyber Security, Jinan University, Guangzhou, China(暨南大学网络空间安全学院,中国广州) Department of Information Engineering, University of Florence, Florence, Italy(佛罗伦萨大学信息工程系,意大利佛罗伦萨) School of Integrated Circuits, Guangdong University of Technology, Guangzhou, China(广东工业大学集成电路学院,中国广州)

AI总结 本文提出PGC框架,通过峰值聚焦机制聚合显著特征,以校准全局决策,从而提高对细粒度判别信号的检测能力,并在CommGen15数据集上实现了最先进的性能。

详情
AI中文摘要

生成式AI的快速发展,从GANs到现代扩散模型,导致了越来越微妙的判别线索。这些细粒度信号常常被主导的高保真图像内容(例如主体)所掩盖,限制了现有主要依赖全局表示的检测器的可靠性。为了解决这一挑战,我们提出了峰值引导校准(PGC)框架。PGC引入了一种新的策略,通过峰值聚焦机制聚合显著特征。具体而言,通过采用对峰值敏感的聚合方法,强调最判别性的局部线索,PGC利用这些关键信号来校准全局决策。这种方法恢复了在全局上下文中被淹没的细微模式。此外,为了更好地模拟现实世界威胁,我们引入了CommGen15数据集,一个包含15个商业模型样本的具有挑战性的基准。广泛实验表明,PGC在性能上达到最先进的水平。具体而言,它在我们的CommGen15数据集上将平均准确率提高了+12.3%,并在标准基准上设定了新纪录,包括GenImage(+2.1%)、AIGI(+3.5%)和UniversalFakeDetect(+1.7%)。代码可在https://github.com/xiaoyu6868/PGC上获得。

英文摘要

The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These fine-grained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance. Specifically, it improves mean accuracy by +12.3% on our CommGen15 dataset, and sets new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%). Code is available at https://github.com/xiaoyu6868/PGC.
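The contrast the abstract draws between global representations and peak-sensitive aggregation can be illustrated with a toy pooling comparison. The sketch below uses top-k mean pooling as a stand-in for a peak-focusing mechanism; this is an assumption for illustration only, since the abstract does not specify PGC's actual aggregation operator, and the names `mean_pool` and `topk_pool` are hypothetical.

```python
def mean_pool(scores):
    """Global average: every patch contributes equally."""
    return sum(scores) / len(scores)


def topk_pool(scores, k):
    """Peak-sensitive aggregation: average only the k strongest patch scores."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


# 16 patch-level "fake-ness" scores: 14 clean-looking patches, 2 with strong artifacts.
patch_scores = [0.05] * 14 + [0.9, 0.95]

global_score = mean_pool(patch_scores)      # diluted by the 14 uninformative patches
peak_score = topk_pool(patch_scores, k=2)   # dominated by the 2 artifact patches
```

In this toy case the global average sits near 0.16 while the top-2 average is above 0.9, which mirrors the abstract's point that a few discriminative local clues can be submerged when all regions are weighted equally.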

2605.21195 2026-05-21 cs.CV 版本更新

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

RankE:结合解码器协同进化的离散文本到图像生成端到端后训练

Siyong Jian, Siyuan Li, Luyuan Zhang, Zedong Wang, Xin Jin, Ying Li, Cheng Tan, Huan Wang

发表机构 * Westlake University(西湖大学) Zhejiang University(浙江大学) Tsinghua University(清华大学) Hong Kong University of Science and Technology(香港科技大学) Shanghai AI Lab(上海人工智能实验室)

AI总结 本文提出RankE,一种端到端的后训练框架,通过解码器与策略的协同进化,解决离散自回归文本到图像生成中策略优化导致的潜在协变量偏移问题,同时提升图像质量和对齐度。

详情
AI中文摘要

离散自回归(AR)文本到图像(T2I)模型将VQ分词器与自回归策略配对,而当前的后训练流程仅优化策略,VQ解码器保持冻结。以REPA-E为代表的近期扩散T2I工作表明,VAE本身就是关键的对齐瓶颈,但离散AR模型尚无类似研究。我们证明仅优化策略会引发潜在协变量偏移(Latent Covariate Shift):随着策略演化,生成的token分布偏离解码器训练时所用的真实(ground-truth)分布,导致奖励分数提升而解码图像质量下降。为解决这一不匹配,我们提出RankE,首个用于离散T2I生成的端到端后训练框架。RankE不是针对固定解码器优化策略,而是通过交替优化使两个组件协同进化:每个模块最大化基于排序的对齐目标,同时受到适合其参数空间的稳定性保持锚点的正则化。这种协同进化打破了困扰冻结解码器方法的保真度与对齐度之间的权衡:在LlamaGen-XL(775M)上,标准RL提高CLIP但恶化FID,而RankE同时改善两者(在MS-COCO 30K上FID为15.21,CLIP为33.76)。在Janus-Pro(1B)上的一致收益证实,解码器协同进化能够可靠地将奖励优化转化为像素空间的质量提升。

英文摘要

Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity--alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.

2605.21186 2026-05-21 cs.CV cs.AI 版本更新

SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection

SAM-Sode:迈向微小细菌检测的可信解释

Wanying Tan, Shuo Yan, Dazhi Huang, Yazheng Liu, Zili Shao, Rufeng Chen, Hechang Chen, Mude Shi, Tianxing Ji, Sihong Xie

发表机构 * Shenzhen University, Shenzhen, China; The Second Affiliated Hospital, Guangzhou Medical University, Guangzhou, China; The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Jilin University, Changchun, China; Guangdong ACXEL Micro & Nano Tech Co., Ltd., Guangzhou, China

AI总结 本文提出SAM-Sode框架,通过几何感知提示和双约束机制提升微小细菌检测的解释性与透明度,有效抑制背景冗余并增强决策透明度。

Comments 10 pages, 4 figures, conference paper

详情
AI中文摘要

对象检测的可解释性为临床辅助诊断提供了关键的信心支持。然而,在微小细菌检测中,传统解释方法由于目标形态特征的极端稀疏性和复杂背景的严重干扰,常面临前景边界模糊和特征归因扩散的问题。这种限制阻碍了逻辑连贯的形态证据的提供。为解决这一问题,我们提出了一种新颖的可解释人工智能(XAI)框架SAM-Sode。该框架创新性地将初始特征归因图转换为几何感知提示,利用基础模型(SAM3)的先验知识实现空间细化和形态重建。此外,我们引入基于物理意义和几何对齐的双约束机制,进行实例级去噪,生成更符合人类专家直觉的解释。在我们自行构建的具有复杂电路背景的细菌数据集(包含2,524张图像)及其他公开数据集上的实验结果表明,所提出的方法有效抑制了背景冗余,并显著增强了微小物体检测的决策透明度。

英文摘要

Interpretability in object detection provides crucial confidence support for clinical auxiliary diagnosis. However, in tiny bacteria detection, traditional explanation methods often suffer from blurred foreground boundaries and diffuse feature attribution due to the extreme sparsity of target morphological features and severe interference from complex backgrounds. Such limitations hinder the provision of logically coherent morphological evidence. To bridge this gap, we propose a novel eXplainable AI (XAI) framework, SAM-Sode. The framework innovatively transforms initial feature attribution maps into geometry-aware prompts, leveraging the prior knowledge of the foundation model (SAM3) to achieve spatial refinement and morphological reconstruction of the explanatory mappings. Furthermore, we introduce a dual-constraint mechanism based on physical significance and geometric alignment to perform instance-level denoising, generating coherent explanations that better align with human expert intuition. Experimental results on our self-constructed bacteria dataset with complex circuit backgrounds (containing 2,524 images) and other public datasets demonstrate that the proposed method effectively suppresses background redundancy and significantly enhances the decision-making transparency of tiny object detection.

2605.21171 2026-05-21 cs.CV 版本更新

FTerViT: Fully Ternary Vision Transformer

FTerViT:全三进制视觉变换器

Szymon Ruciński, Pietro Bonazzi, Engin Türetken, Simon Narduzzi, Michele Magno, Nadim Maamari

发表机构 * CSEM(瑞士电子与微技术中心) ETH Zürich(苏黎世联邦理工学院)

AI总结 本文提出了一种全三进制视觉变换器(FTerViT),通过将所有权重矩阵和归一化参数三进制化,实现了模型压缩,同时在资源受限的微控制器上实现了高效的部署。

Comments Preprint

详情
AI中文摘要

三进制视觉变换器(Ternary Vision Transformers)提供了显著的模型压缩,但目前最先进的方法仅将编码器层三进制化,而留下的补丁嵌入、归一化参数和分类头仍保持全精度。在针对资源受限处理器(如微控制器)的紧凑模型中,这些剩余的全精度组件决定了总内存占用,严重限制了部署效率和设备可行性。在本工作中,我们引入了一种完全三进制化的视觉变换器,其中所有权重矩阵和归一化参数均被三进制化(FTerViT)。为此,我们引入了两个新的操作符:具有通道缩放的三进制位卷积(TernaryBitConv2d)用于补丁嵌入,以及三进制归一化(TernaryLayerNorm)。FTerViT通过知识蒸馏进行训练,随后进行轻量级量化感知恢复阶段。我们的三进制W2A8 DeiT-III-S在384×384分辨率下达到82.43%的ImageNet-1K Top-1精度,内存占用为6.09MB(约15倍压缩,相比FP32降低2.42个点),优于先前的三进制ViT方法多达8个点。最后,我们展示了在ESP32-S3系统芯片上的双核XTensa LX7微控制器上首次实现三进制视觉变换器。通过部署FTerViT-Small(基于224×224分辨率的DeiT-III-Small,内存占用5.81MB),我们实现了79.64%的ImageNet-1K Top-1精度。

英文摘要

Ternary Vision Transformers offer substantial model compression; however, state-of-the-art methods only ternarize the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer in which all weight matrices and normalization parameters are ternarized (FTerViT). To this end, we introduce two novel operators: TernaryBitConv2d with per-channel scaling for patch embedding and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384$\times$384 resolution achieves 82.43% ImageNet-1K top-1 at 6.09 MB (${\sim}$15$\times$ compression, $-$2.42 pp vs. FP32), outperforming prior ternary ViT methods by up to 8 pp. Finally, we demonstrate the first implementation of ternary vision transformers on a dual-core Xtensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224$\times$224 resolution, 5.81 MB), we achieve 79.64% ImageNet-1K top-1 accuracy.
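To make "ternarized weights with per-channel scaling" concrete, here is a minimal sketch of threshold-based ternarization applied row by row. The 0.7 × mean|w| threshold follows the common Ternary Weight Networks heuristic and is an assumption; the abstract does not state which rule FTerViT, TernaryBitConv2d, or TernaryLayerNorm actually use, and the function names below are hypothetical.

```python
def ternarize(weights, delta_factor=0.7):
    """Map a list of floats to codes in {-1, 0, +1} plus one float scale."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_factor * mean_abs  # magnitudes at or below delta are zeroed
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # Scale = mean magnitude of the weights that survived the threshold.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def ternarize_per_channel(matrix):
    """One (codes, scale) pair per output channel (row)."""
    return [ternarize(row) for row in matrix]


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the channel scale."""
    return [c * scale for c in codes]


weights = [
    [0.9, -0.8, 0.05, -0.02],  # channel with two dominant weights
    [0.1, 0.12, -0.11, 0.0],   # channel with uniformly small weights
]
quantized = ternarize_per_channel(weights)
approx = [dequantize(codes, scale) for codes, scale in quantized]
```

Per-channel scales let the second row keep its small but meaningful weights, whereas a single scale shared with the first row would likely zero them out. Each code fits in 2 bits, so weight storage shrinks by up to 16× relative to 32-bit floats before accounting for the scales, which is the same order as the roughly 15× compression quoted in the abstract.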

2605.21157 2026-05-21 cs.CV cs.AI cs.LG cs.RO 版本更新

Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums

多光谱下无人机影像用于军事检测的比较分析

Sourov Roy Shuvo, Prajwal Panth, Rajesh Chowdhury, Sorup Chakraborty, Sudip Chakrabarty, Prasant Kumar Pattnaik

发表机构 * School of Computer Engineering KIIT Deemed to be University(KIIT大学(准大学)计算机工程学院)

AI总结 本文研究了不同光谱条件下基于无人机影像的军事目标检测问题,构建了四种数据集(灰度、热成像、夜视和Obscura Vision)来评估模型在不同环境下的性能,并训练YOLOv11-small模型在这些设置下检测目标。

Comments 6 pages, 7 figures. Accepted at the 16th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 6-11, 2025, IIT Indore. Proceedings pending publication

详情
AI中文摘要

在现代战争中,无人机正成为在各类敌对环境中收集情报和实施精确打击的重要组成部分。无人机能够从安全距离在敌对环境中实时作业,因此对监视和军事行动具有极高价值。KIIT-MiTA数据集由无人机拍摄的不同军事场景图像组成,为检测军事目标提供了基础,但未考虑各种现实场景。为此,为了评估模型在不同条件下的表现,我们创建了四种数据集:灰度、热成像、夜视和Obscura Vision,用于模拟低能见度、基于热量的成像和夜间条件等现实环境。我们训练YOLOv11-small模型并将其用于在不同设置下检测目标。本研究有助于开发面向防御和进攻任务的先进检测系统,从而提升基于无人机的作战行动的性能和可靠性。

英文摘要

In modern warfare, drones are becoming an essential part of intelligence gathering and carrying out precise attacks in different kinds of hostile environments. Their ability to operate in real-time and hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset is comprised of images of different military scenarios taken from drones, and these provide a foundation for detecting military objects, but it does not take into account the various types of real-world scenarios. With that in mind, to evaluate how the models are performing under varying conditions, four different types of datasets are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate the real-world environments such as low visibility, heat-based imagery, and nighttime conditions. The YOLOv11-small model is trained and used to detect objects across diverse settings. This research boosts the performance and reliability of drone-based operations by contributing to the development of advanced detection systems in both defensive and offensive missions.

2605.21132 2026-05-21 cs.CV 版本更新

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

SurgOnAir: 基于层次感知的实时手术视频评论

Jingyi He, Yue Zhou, Long Bai, Kun Yuan, Nassir Navab, Yuan Bi

发表机构 * Computer Aided Medical Procedures (CAMP), TU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; University of Strasbourg, France; The Chinese University of Hong Kong, Hong Kong

AI总结 本研究提出SurgOnAir,一种流式视觉-语言模型,通过层次化数据集实现对手术流程多层级的实时理解与评论生成,提升手术过程中的即时响应能力。

详情
AI中文摘要

实时理解手术流程是智能手术具身系统的基础,这类AI系统需要随着手术的进行持续感知并作出响应。在手术室中,关键决策取决于细微的、逐时刻的变化,例如精细的器械运动和不断演变的组织状态,即使轻微的感知延迟也可能限制辅助效果或危及安全。然而,现有方法仍是离线的,或只在粗粒度时间尺度上工作,只有在处理完视频片段后才生成描述,无法即时反应。为此,我们提出SurgOnAir,一种流式视觉-语言模型,它按顺序处理帧,不访问未来信息,并在视觉输入到达时逐步生成叙述token。SurgOnAir实现了细粒度的帧到token生成,能够即时响应不断变化的手术动态。基于我们整理的层次化数据集SurgOnAir-11k(涵盖动作级、步骤级和阶段级监督),该模型被训练为生成多层级的文本响应,以反映手术过程的内在层次结构。此外,模型会生成特殊的过渡token来显式标记状态变化,使SurgOnAir能够在关键工作流程转变发生时捕捉并提示这些转变。实验表明,SurgOnAir通过单一的视觉-语言模型统一了手术流程多个层次上的流式处理,实现了实时理解,并生成更优且具有层次感知的叙述。代码和数据集将公开。

英文摘要

Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

2605.21131 2026-05-21 cs.CV 版本更新

UniT: Unified Geometry Learning with Group Autoregressive Transformer

UniT: 基于群自回归变换器的统一几何学习

Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

发表机构 * Intelligent Transportation Thrust of the Systems Hub, The Hong Kong University of Science and Technology (GZ), Guangzhou, P.R.China(香港科技大学(广州)系统枢纽智能交通学域,中国广东省广州市) The National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an, P.R.China(人机混合增强智能全国重点实验室,西安交通大学,中国陕西省西安市) Applied Science, Amazon.com, Inc., USA(亚马逊公司应用科学部,美国)

AI总结 本文提出UniT模型,通过群自回归变换器统一了几何感知中的多种能力,包括在线感知、离线重建、多模态融合、长视界扩展和度量尺度估计,并引入了适应性几何损失以提升跨场景的度量尺度泛化能力。

Comments Submitted to IEEE T-PAMI

详情
AI中文摘要

近期的前馈模型显著推进了从传感器观测推断稠密3D结构的几何感知。然而,其核心能力仍分散在多个互不兼容的范式中,包括在线感知、离线重建、多模态融合、长时程可扩展性和度量尺度估计。我们提出了UniT,一种基于新颖的组自回归变换器(Group Autoregressive Transformer)的统一模型,将这些看似不同的能力重新表述在单一框架内。其关键思想是把成组的传感器观测作为基本的自回归单元,并以无锚点、尺度自适应的方式预测相应的点图。更具体地说,在线和离线设置中的各种视角配置被自然地统一到同一个组自回归过程中。通过改变组的大小,在线模式以单帧组进行多步自回归,而离线模式在一次前向传递中聚合一个多帧组。同时,队列式KV缓存机制保证了长时程下自回归内存有界。这得益于无锚点的关系建模降低了对早期帧的长程依赖,从而可以即时丢弃过时的记忆。为了提高跨场景的度量尺度泛化能力,该框架进一步引入了尺度自适应几何损失。它将相对几何约束与部分绝对尺度项耦合,隐式地正则化全局尺度,并促使模型从尺度不变几何逐步过渡到度量尺度解。结合用于融合辅助模态的专用模态注意力模块,UniT在统一几何感知上达到了最先进的性能,这在涵盖七个代表性任务的十个基准上得到了验证。

英文摘要

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
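The bounded-memory behaviour attributed to the queue-style KV cache can be sketched with a fixed-length queue: each autoregressive step appends the keys and values of one observation group, and the oldest group is evicted once capacity is reached. The class name `QueueKVCache`, the `max_groups` parameter, and the string placeholders for tensors are hypothetical and chosen for illustration; the abstract does not describe the actual cache layout.

```python
from collections import deque


class QueueKVCache:
    """FIFO cache holding the key/value entries of the most recent groups only."""

    def __init__(self, max_groups):
        # deque with maxlen silently drops the oldest entry when full.
        self._entries = deque(maxlen=max_groups)

    def append(self, keys, values):
        """Store the keys and values produced for one observation group."""
        self._entries.append((keys, values))

    def context(self):
        """Entries the next autoregressive step may attend to, oldest first."""
        return list(self._entries)

    def __len__(self):
        return len(self._entries)


# Online mode in this sketch: one single-frame group per autoregressive step.
cache = QueueKVCache(max_groups=3)
for step in range(5):
    cache.append(f"k{step}", f"v{step}")

oldest_kept = cache.context()[0]  # groups 0 and 1 have been evicted
```

Memory stays constant regardless of sequence length because eviction is unconditional. As the abstract notes, this is only safe when predictions do not depend on a fixed early anchor frame, which is what the anchor-free formulation is meant to provide.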

2605.21130 2026-05-21 cs.CV 版本更新

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

VersusQ:用于通用视频质量评估的成对边距推理

Shibei Meng, Binxin Yang, Yuan Liu, Jiexuan Zhang, Zhengyao Lv, Hubery Yin, Qiang Xu

发表机构 * The Chinese University of Hong Kong(香港中文大学) WeChat Vision, Tencent Inc.(腾讯公司微信视觉团队) Beijing Normal University(北京师范大学) Peking University(北京大学) The University of Hong Kong(香港大学)

AI总结 本文提出VersusQ,一种基于成对边距推理的框架,通过直接比较视频来缓解绝对尺度校准偏差,实现跨域的视频质量评估。

详情
AI中文摘要

大型多模态模型(LMMs)在视频质量评估中展现出潜力,但大多数方法仍为每个视频预测一个绝对分数。这种逐点(pointwise)监督往往把感知质量与数据集特定的校准混在一起,包括标注协议、评分习惯和分数分布。因此,学到的评分规则可能在某个基准内表现良好,却难以迁移到未见过的领域。我们认为,相对比较只关注感知差异而非数据集特定的评分习惯,因而能够缓解绝对尺度的校准偏差。因此,我们提出了VersusQ,一种完全由直接比较驱动的成对边距推理框架。具体而言,VersusQ对两个视频进行基于LMM的比较,推理它们在视觉和时间质量上的差异,并预测一个带符号的连续边距,同时刻画偏好选择和差异程度。此外,为了将可解释的比较理由与细粒度的数值差异对齐,我们引入了Margin-Coupled GRPO,它联合优化基于rollout的关系推理和连续边距回归。在多个公开VQA基准上的大量实验表明,VersusQ实现了最先进的性能、强大的跨域泛化能力,以及在异构评估场景下可靠的细粒度排序。

英文摘要

Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose VersusQ, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.
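A signed continuous margin carries two pieces of information: its sign gives the preferred video and its magnitude gives the size of the gap. One simple way to turn a set of such pairwise margins into a ranking is to average each video's signed margins, as sketched below. This aggregation rule, the function name `scores_from_margins`, and the toy margin values are assumptions for illustration; the abstract does not say how VersusQ derives rankings from its predictions.

```python
def scores_from_margins(n_items, margins):
    """Average signed margin per item.

    `margins` maps (i, j) to a signed value that is positive when item i
    is judged better than item j and negative otherwise.
    """
    totals = [0.0] * n_items
    counts = [0] * n_items
    for (i, j), margin in margins.items():
        totals[i] += margin   # i gains what it wins by
        totals[j] -= margin   # j loses the same amount
        counts[i] += 1
        counts[j] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy margins for three videos: 0 beats 1 clearly, 2 beats 1 mildly, 0 beats 2.
margins = {(0, 1): 0.8, (1, 2): -0.3, (0, 2): 0.4}
scores = scores_from_margins(3, margins)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
```

Because only differences enter the computation, the resulting scores are defined up to an additive offset, which is consistent with the abstract's argument that relative comparisons avoid committing to a dataset-specific absolute scale.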

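2605.21123 2026-05-21 cs.CV cs.LG Version update

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

Kesong Li, Yixuan Xu, Kuo-kun Tseng, Weiyi Lu, Kan Liu, Tao Lan

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology; Alibaba Group

AI summary: This paper proposes Linear-DPO. It derives a generalized DPO objective covering both diffusion and flow matching through a unified reverse-time SDE framework, argues that the standard DPO objective is suboptimal for text-to-image generation, and reports qualitative and quantitative experiments showing gains on diffusion and flow-matching models.

Comments: Code and models are available at https://github.com/Whynot0101/Linear-DPO. Work done during an internship at Alibaba Group.

Details
AI abstract (translated from Chinese)

Direct Preference Optimization (DPO) has succeeded in aligning large language models but still faces challenges in text-to-image generation. Existing studies are limited to denoising diffusion models, overlook flow matching, and suffer from an objective mismatch when DPO designed for discrete NLP tasks is applied to regression-based generative tasks. This paper derives a generalized DPO objective that covers both diffusion and flow matching through a unified reverse-time SDE framework, and shows from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. We therefore propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility function and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and a flow-matching model (SD3-Medium) show that our method outperforms existing baselines.

English abstract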

Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks. In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and a flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.
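
The gradient argument in this abstract can be illustrated with a small sketch. It is not the paper's objective: it only contrasts the derivative of the standard sigmoid-based loss, -log sigmoid(m), with that of a linear utility, -m, as a function of a scalar preference margin m, and adds a generic EMA parameter update. The function names and the decay value are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_dpo_grad(margin: float) -> float:
    """d/dm of -log(sigmoid(m)) = -(1 - sigmoid(m)); it shrinks toward 0 as m grows."""
    return -(1.0 - sigmoid(margin))


def linear_utility_grad(margin: float) -> float:
    """d/dm of -m; the gradient stays constant whatever the margin is."""
    return -1.0


def ema_update(reference, policy, decay=0.999):
    """Generic exponential moving average of parameters (decay is a hypothetical value)."""
    return [decay * r + (1.0 - decay) * p for r, p in zip(reference, policy)]


if __name__ == "__main__":
    for m in (0.0, 2.0, 6.0):
        print(m, sigmoid_dpo_grad(m), linear_utility_grad(m))
```

The sigmoid gradient fades once a pair is already well separated, whereas the linear one keeps a constant magnitude, which is one way to read the "sustained" wording in the abstract.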

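2605.21121 2026-05-21 cs.CV cs.GR Version update

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

Hanxiao Sun, Mingxin Yang, Shuhui Yang, Zebin He, Xintong Han, Hongbo Fu, Chunchao Guo, Wenhan Luo

Affiliations: The Hong Kong University of Science and Technology; Tencent Hunyuan

AI summary: This paper proposes ROAR-3D, which upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A view router and a dual-stream attention design enable efficient multi-view 3D generation, improving generation quality and supporting test-time view scaling.

Details
AI abstract (translated from Chinese)

Single-image-to-3D generative models can now produce high-quality geometry, but conditioning on a single view inevitably leaves unseen regions ambiguous. Multi-view conditioning reduces this ambiguity, yet existing methods either require fixed canonical viewpoints or rely on external reconstruction modules, which incur heavy training costs and limit generation quality. We observe that pretrained single-view models already have strong 2D-to-3D grounding that can be reused for multi-view conditioning. A closer analysis, however, shows that their conditioning mechanism entangles orientation control with geometry transfer, and these two functions conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a dedicated path for geometric enrichment. An orientation perturbation strategy ensures that the auxiliary path learns orientation-independent geometry transfer. These components add very few trainable parameters and negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.

English abstract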

Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.
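
As a rough illustration of what "assigns each 3D latent token to its most relevant view" could look like, the sketch below routes each token to the view with the highest score. Scoring by a dot product between a token feature and one pooled feature per view is an assumption; the abstract does not say how the router computes relevance, and the real router is presumably learned.

```python
def route_tokens(token_feats, view_feats):
    """Return, for each token, the index of the view with the highest dot-product score.

    token_feats: list of feature vectors, one per 3D latent token (hypothetical layout).
    view_feats:  list of feature vectors, one pooled vector per input view (hypothetical).
    """
    assignments = []
    for token in token_feats:
        scores = [sum(a * b for a, b in zip(token, view)) for view in view_feats]
        # Hard routing: pick the single best-scoring view for this token.
        assignments.append(max(range(len(scores)), key=scores.__getitem__))
    return assignments
```

Because the assignment is computed per token, adding more views at test time only lengthens `view_feats`; nothing in this toy routine depends on a fixed view count.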

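2605.21112 2026-05-21 cs.CV Version update

RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding

Weiyi Xiong, Bing Zhu

Affiliations: School of Automation Science and Electrical Engineering, Beihang University

AI summary: This paper proposes RCGDet3D, which enhances radar feature encoding rather than relying on complex multi-modal fusion strategies, achieving higher accuracy at real-time speed in 3D object detection and setting a new benchmark for real-time deployment.

Details
AI abstract (translated from Chinese)

Thanks to its low cost and robustness, 4D automotive radar is indispensable for autonomous driving, but the sparsity of its point clouds makes 3D object detection difficult. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment because computation on dense feature maps is heavy. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work finds that simply enhancing radar feature extraction can match or exceed the performance of elaborate fusion modules while remaining real-time. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) from RadarGaussianDet3D, with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems and then unifies them in Bird's-Eye View (BEV) space; decoupling the coordinate transformation from representation learning significantly improves geometric consistency and reduces learning difficulty. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing radar features that are more geometrically accurate and semantically richer. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.

English abstract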

4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird's-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.

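2605.21099 2026-05-21 cs.CV Version update

R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound

Yuanhan Wang, Yifei Chen, Beining Wu, Mingxuan Liu, Xiaotian Hu, Chunbo Jiang, Yijin Li, Changmiao Wang, Feiwei Qin, Qiyuan Tian

Affiliations: Tsinghua University; Hangzhou Dianzi University; Shenzhen Research Institute of Big Data

AI summary: This paper proposes R2AoP, a framework that achieves stable Angle of Progression estimation through structurally informed segmentation and confidence-guided geometric modeling, and adds a lightweight geometry-reliable test-time adaptation strategy to improve performance under heterogeneous acquisition conditions.

Comments: 11 pages, 4 figures, accepted by MICCAI 2026

Details
AI abstract (translated from Chinese)

Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objectively assessing labor progression, yet it remains highly sensitive to imaging noise, boundary ambiguity, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation with confidence-guided geometric modeling to obtain stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points on the AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks show consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at https://github.com/baiyou1234/R2AoP.

English abstract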

Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at https://github.com/baiyou1234/R2AoP.

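2605.21090 2026-05-21 cs.CV Version update

TextSculptor: Training and Benchmarking Scene Text Editing

Yiheng Lin, Siyu Jiao, Xiaohan Lan, Wei Zhou, Qi She, Fei Yu, Heyun Chen, Zhengwei Wang, Jinghuan Chen, Moran Li, Yingchen Yu, Zijian Feng, Yao Zhao, Yunchao Wei, Yujie Zhong

Affiliations: Beijing Jiaotong University; Bytedance

AI summary: This paper presents TextSculptor, a framework that builds a large-scale dataset and a benchmark to address the scarcity of high-quality training data and the lack of standardized evaluation in scene text editing, improving the performance of open-source models.

Details
AI abstract (translated from Chinese)

Recent advances in multimodal large language models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. Scene text editing nevertheless remains challenging, because the model must modify textual content precisely while preserving visual realism and the integrity of non-target regions. Current open-source models still lag behind proprietary systems, mainly because high-quality training data is scarce and standardized benchmarks for text editing are lacking. To address these problems, we present TextSculptor, a comprehensive framework for data construction and evaluation in scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Using this pipeline, we build TextSculpt-Data, a large-scale dataset of 3.2M training samples: 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, which covers four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.

English abstract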

Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.

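2605.21075 2026-05-21 cs.CV cs.LG Version update

SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

Nassim Ait Ali Braham, Aaron Banze, Conrad M. Albrecht, Julien Mairal, Jocelyn Chanussot, Xiao Xiang Zhu

Affiliations: Chair of Data Science in Earth Observation; Technical University of Munich; Remote Sensing Technology Institute; German Aerospace Center (DLR); Department of Aerospace Engineering; University of the Bundeswehr Munich; LEAP Munich Center for Machine Learning (MCML); Univ. Grenoble Alpes; Inria; CNRS; Grenoble INP; LJK

AI summary: This paper proposes SpectralEarth-FM, a hierarchical transformer for multisensor Earth observation input that jointly processes hyperspectral imagery and lower-channel observations. It is pretrained with a JEPA-style objective on the newly curated SpectralEarth-MM dataset and achieves the best results on hyperspectral downstream tasks and standard EO benchmarks.

Details
AI abstract (translated from Chinese)

Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, so that HSI and lower-channel observations can be processed jointly. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2 and Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR over common geographic footprints. The dataset comprises approximately 2M globally distributed locations, 25M georeferenced patches, and more than 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views of the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and on standard EO benchmarks following the PANGAEA protocol, achieving the best results in both evaluation settings.

English abstract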

Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.

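2605.21072 2026-05-21 cs.CV Version update

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Siao Tang, Xinyin Ma, Gongfan Fang, Xingyi Yang, Xinchao Wang

Affiliations: National University of Singapore; The Hong Kong Polytechnic University

AI summary: This paper addresses quantization of autoregressive video diffusion models (ARVDs) and proposes Q-ARVD, a framework that handles unbalanced frame-wise quantization sensitivity and heterogeneous outlier patterns in weights to improve model efficiency.

Comments: Code: https://github.com/tsa18/Q-ARVD

Details
AI abstract (translated from Chinese)

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite this potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, which makes model quantization a natural direction for improving efficiency. Quantization of ARVDs, however, has received little study. Our empirical analysis shows that directly applying quantization schemes developed for standard diffusion transformers to ARVDs performs poorly and reveals quantization behavior that differs from that of bidirectional diffusion models. In this paper, we identify two key challenges in quantizing ARVDs. (C1) Highly unbalanced frame-wise quantization sensitivity: error accumulation during autoregressive generation can severely skew quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights: weight distributions contain pronounced outlier channels whose patterns vary widely across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To handle the unbalanced frame-wise sensitivity, Q-ARVD adds a final-quality-aware frame-weighting mechanism to the quantization objective. (S2) To keep heterogeneous outliers from degrading performance, Q-ARVD introduces outlier-aware adaptive dual-scale quantization, which automatically detects whether outlier channels are present in any layer and how many there are, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.

English abstract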

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
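
To make the idea of isolating outlier channels behind a second quantization scale concrete, here is a minimal sketch. It is not the Q-ARVD algorithm: the detection rule (a channel's maximum magnitude exceeding `ratio` times the median channel maximum), the value `ratio=4.0`, and the function names are all assumptions chosen for illustration.

```python
from statistics import median


def quantize(values, scale, bits=8):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def scale_for(max_abs, qmax):
    """Map the largest magnitude onto the largest integer level (1.0 if all zeros)."""
    return max_abs / qmax if max_abs > 0 else 1.0


def dual_scale_quantize(channels, ratio=4.0, bits=8):
    """channels: list of per-channel weight lists.

    Channels whose max |w| exceeds ratio * median(channel maxima) are treated as
    outliers and share one scale; all remaining channels share a second, finer scale.
    """
    qmax = 2 ** (bits - 1) - 1
    maxima = [max(abs(w) for w in ch) for ch in channels]
    threshold = ratio * median(maxima)
    is_outlier = [m > threshold for m in maxima]
    normal_max = max((m for m, o in zip(maxima, is_outlier) if not o), default=0.0)
    outlier_max = max((m for m, o in zip(maxima, is_outlier) if o), default=0.0)
    scales = {False: scale_for(normal_max, qmax), True: scale_for(outlier_max, qmax)}
    dequantized = [quantize(ch, scales[o], bits) for ch, o in zip(channels, is_outlier)]
    return dequantized, is_outlier
```

With a single shared scale, one large-magnitude channel would stretch the quantization step for every channel; separating it keeps the step for the normal channels small.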

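2605.21061 2026-05-21 cs.CV cs.AI cs.RO Version update

Grounding Driving VLA via Inverse Kinematics

Junsung Park, Hyunjung Shim

Affiliations: Korea Advanced Institute of Science and Technology; KimJaeChul AI Graduate School

AI summary: This paper redesigns Driving VLA in the style of an inverse kinematics solver to address the neglect of visual tokens in trajectory prediction. Adding next visual state prediction and an inverse kinematics network improves visual grounding and trajectory planning performance.

Details
AI abstract (translated from Chinese)

Existing Driving VLAs largely ignore their visual tokens when predicting trajectories, a phenomenon we attribute to a structurally ill-posed task formulation rather than to insufficient training. We show that trajectory recovery, viewed through the lens of inverse kinematics, requires both the current and the future visual state as boundary conditions; existing VLAs supply only the former, which pushes the model toward shortcut predictions based on ego status and text commands. To address this, we redesign Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective, which requires the LLM to predict the future visual scene, provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) takes only the current and future visual states as input, suppressing reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and, on the closed-loop NAVSIM-v2 and nuScenes benchmarks, reaches trajectory planning performance comparable to VLAs at the 7B to 8B scale. Further analysis shows that the improvement comes from a recovered ability to exploit visual features, with the effect most pronounced in dynamic driving situations such as turning.

English abstract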

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens, a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B to 8B VLAs, which are more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

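2605.21059 2026-05-21 cs.CV cs.LG Version update

Multimodal LLMs under Pairwise Modalities

Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University

AI summary: This paper proposes a method for training multimodal large language models from pairwise modalities. Using a theoretical analysis and a representation learning framework, it achieves cross-modal alignment and recomposition and improves cross-modal performance.

Details
AI abstract (translated from Chinese)

Although multimodal large language models (MLLMs) have achieved impressive results, their training typically relies on jointly curated multimodal data, which requires substantial human effort to build multi-way aligned datasets and limits scalability across domains. In this work, we explore training with only multiple paired modalities as a surrogate for the full joint multimodal distribution. We first provide a theoretical analysis of the conditions under which representations are identifiable when only pairwise modalities are observed. Building on this analysis, we propose a representation learning framework that aligns latent representations across modalities using only pairwise data. The framework has two stages: latent representation alignment and cross-modal recomposition. In the first stage, we learn a shared latent space across modalities through self-modal reconstruction and pairwise contrastive learning, and we introduce an inductive bias into the contrastive learning process through partial alignment and minimal latent specification. In the second stage, we integrate the encoders of newly introduced modalities with the decoders of the pretrained modalities to enable cross-modal transfer and generation. We evaluate the method by adding 3D point cloud and tactile modalities to pretrained MLLMs using three modality pairs, and show that learning an aligned latent representation space yields strong cross-modal performance.

English abstract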

Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by leveraging only multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable when only pairwise modalities are observed. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. In the first stage, we learn the shared latent space across modalities through both self-modal reconstruction and pairwise contrastive learning. We also incorporate an inductive bias into the contrastive learning process through partial alignment and minimal latent specification. In the second stage, we integrate the encoders of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by adding 3D point cloud and tactile modalities to pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.

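2605.21042 2026-05-21 cs.CV Version update

Dynamic Video Generation: Shaping Video Generation Across Time and Space

Shikang Zheng, Jingkai Huang, Jiacheng Liu, Guantao Chen, Lixuan, Yuqi Lin, Peiliang Cai, Linfeng Zhang

Affiliations: Shanghai Jiao Tong University; South China University of Technology; Tsinghua University

AI summary: This paper proposes DVG, a framework that jointly allocates computation across time and space and automatically selects content-aware acceleration strategies, achieving near-lossless acceleration for video generation.

Details
AI abstract (translated from Chinese)

Diffusion models have achieved remarkable results in video generation, but their iterative denoising process is computationally expensive because a large number of tokens is processed at each timestep. Progressive resolution sampling has recently emerged as a promising acceleration method that lowers the latent resolution in early stages. Extending it to video generation remains challenging, however: the additional temporal dimension introduces spatio-temporal demands that differ from video to video, and compressing only one dimension often yields limited acceleration or degraded quality. We therefore propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space and automatically selects content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5 and 18 times when combined with distillation, which shows its potential as a key component of today's large-scale efficient video generation systems. Our code is in the supplementary material and will be released on GitHub.

English abstract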

Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in the supplementary material and will be released on GitHub.

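2605.21032 2026-05-21 cs.CV Version update

Towards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation

Bowyn Tan, Yutong Xie, Bai Huang, Fan Luo, Xiao Li, Naizheng Wang, Yang Guan, Shengbo Eben Li

Affiliations: Tsinghua University; Meituan; Central University of Finance and Economics

AI summary: This paper proposes an information-geometric diagnostic framework that addresses the credit assignment dilemma 3DGS methods face when modeling spatial and temporal parameters together. It introduces Orthogonal Projected Gradient (OPG) and a temporal regularization strategy to improve the physical consistency of 4D scene reconstruction.

Comments: 20 pages, 4 figures

Details
AI abstract (translated from Chinese)

High-fidelity street scene reconstruction is essential for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are the two fundamental capabilities needed for closed-loop training. Existing 3DGS methods and their 4D extensions, however, fail to achieve both at once. To bridge this gap, we establish an information-geometric diagnostic framework, which reveals that the limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent components and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage and then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, a Temporal Regularization Strategy further refines the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene stays physically consistent in closed-loop simulation. Extensive experiments show that our method not only maintains stable NVS capability but also performs better on traditional observation-reproducing metrics, which indirectly reflect the ability to model temporal dynamics.

English abstract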

High-fidelity street scene reconstruction is pivotal for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are two fundamental capabilities to facilitate closed-loop training. However, existing 3DGS methods and their 4D extensions fail to simultaneously achieve both. To bridge this gap, we establish an information-geometric diagnostic framework, revealing that this limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this issue, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage, then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, a Temporal Regularization Strategy is proposed to further refine the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene remains physically consistent in closed-loop simulation. Extensive experiments show that our method not only maintains stable NVS capabilities but also achieves superior performance in traditional observation-reproducing metrics, which indirectly reflect the capability of modeling temporal dynamics.
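
The phrase "restricts temporal updates to the spatial null space" can be pictured as a gradient projection. The sketch below removes from a gradient every component that lies in the span of a set of protected directions, using Gram-Schmidt orthonormalization. How the paper actually constructs the protected subspace is not stated in the abstract, so `spatial_directions` and both function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt; vectors that are numerically linearly dependent are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad, spatial_directions):
    """Return grad with every component inside span(spatial_directions) removed."""
    g = list(grad)
    for u in orthonormalize(spatial_directions):
        c = dot(g, u)
        g = [gi - c * ui for gi, ui in zip(g, u)]
    return g
```

An update taken along the projected gradient cannot move the parameters along any protected direction, which is the algebraic isolation the abstract alludes to.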

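2605.21002 2026-05-21 cs.CR cs.CV cs.CY cs.MM Version update

Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts

Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Nurana Abdullayeva

Affiliations: Department of Military Studies, Försvarshögskolan (Swedish Defence University); School of Law, ADA University

AI summary: This paper presents a unified evidentiary framework that maps cryptographic content provenance, robust statistical watermarking, and zero-knowledge attestation to the proof requirements of each legal regime, and provides a public benchmark and model annexes as a reproducible reference pipeline for lawyers, engineers, and operators.

Comments: 13 pages, 4 figures, 10 tables. Submitted to IEEE Transactions on Information Forensics and Security

Details
AI abstract (translated from Chinese)

Generative artificial intelligence can now synthesize photorealistic imagery, audio, and video at a cost that defeats traditional forensic intuition. The legal consequences span three regimes that have so far been studied in isolation: international operational law, domestic procedure, and product regulation. This article presents a unified evidentiary framework that maps cryptographic content provenance, robust statistical watermarking, and zero-knowledge attestation to the proof requirements of each regime. We define a five-tier threat model spanning naive regeneration, adversarial laundering, cross-model regeneration, active watermark removal, and insider provenance forgery. We release a public benchmark of 12000 generated items across image, audio, and video modalities, processed under six laundering pipelines for a total of 72000 evaluation samples. We evaluate four representative schemes and report the true positive rate at a fixed false positive rate, the robustness area under the curve, the computational overhead, and a regime-conditioned legal sufficiency score. We translate empirical detection bounds into legal sufficiency thresholds for command decisions under the law of armed conflict, for criminal and civil admissibility under domestic procedure, and for persistence audits under the European Union Artificial Intelligence Act and analogous regimes. The result is a reproducible reference pipeline, a public benchmark, and model annexes that lawyers, engineers, and operators can use together.

English abstract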

Generative artificial intelligence now synthesizes photorealistic imagery, audio, and video at a cost that defeats traditional forensic intuition. The legal consequences span three regimes studied so far in isolation: international operational law, domestic procedure, and product regulation. This article presents a unified evidentiary framework that maps cryptographic content provenance, robust statistical watermarking, and zero knowledge attestation to the proof requirements of each regime. We define a five tier threat model spanning naive regeneration, adversarial laundering, cross model regeneration, active watermark removal, and insider provenance forgery. We release a public benchmark of 12000 generated items across image, audio, and video modalities under six laundering pipelines for 72000 evaluation samples. We evaluate four representative schemes and report true positive rate at fixed false positive rate, robustness area under the curve, computational overhead, and a regime conditioned legal sufficiency score. We translate empirical detection bounds into legal sufficiency thresholds for command decisions under the law of armed conflict, for criminal and civil admissibility under domestic procedure, and for persistence audits under the European Union Artificial Intelligence Act and analogous regimes. The result is a reproducible reference pipeline, a public benchmark, and model annexes that lawyers, engineers, and operators can deploy together.
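
The sample count quoted in this abstract is consistent: 12000 items times 6 laundering pipelines gives 72000 evaluation samples. The headline metric, true positive rate at a fixed false positive rate, can be sketched as follows. This is a generic empirical computation, not the paper's evaluation code; the score convention (higher means "watermark detected") and the function name are assumptions.

```python
def tpr_at_fpr(neg_scores, pos_scores, target_fpr):
    """Find the lowest threshold whose empirical FPR on negatives is <= target_fpr
    and return (threshold, TPR on positives). A sample is flagged when score >= threshold.
    """
    candidates = sorted(set(neg_scores) | set(pos_scores), reverse=True)
    best = (float("inf"), 0.0)  # a threshold that flags nothing
    for thr in candidates:
        fpr = sum(s >= thr for s in neg_scores) / len(neg_scores)
        if fpr > target_fpr:
            break  # FPR only grows as the threshold drops, so stop here
        tpr = sum(s >= thr for s in pos_scores) / len(pos_scores)
        best = (thr, tpr)
    return best


# Consistency check on the benchmark size stated in the abstract.
assert 12000 * 6 == 72000
```

Fixing the false positive rate first and then reading off the detection rate mirrors how an evidentiary threshold would be set: the tolerated rate of wrongly flagged content is chosen in advance.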

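2605.21001 2026-05-21 cs.CV Version update

DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars

Daniel Eskandar, Berna Kabadayi, Garvita Tiwari, Gerard Pons-Moll

Affiliations: University of Tübingen; Tübingen AI Center; Max Planck Institute for Intelligent Systems; Max Planck Institute for Informatics; Zuse School ELIZA

AI summary: This paper proposes DAMA, which uses a dedicated representation and reconstruction method to produce physically plausible clothed avatars with a controllable multi-layer structure, clean garment separation, and explicit stacking control.

Details
AI abstract (translated from Chinese)

Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model the clothed human as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, which leads to ambiguous garment boundaries and no control over stacking or layer order. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X mesh faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), DAMA achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation also supports user-defined garment reordering and fast conversion of body-conforming garments into simulation-ready meshes. Project page: https://danieleskandar.github.io/dama/

English abstract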

Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: https://danieleskandar.github.io/dama/
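
The parameterization "barycentric in-plane coordinates and a positive normal offset" has a simple geometric reading, sketched below for a single triangle. This is an illustrative assumption about the geometry only: the function name is hypothetical, and the sketch assumes counter-clockwise vertex order so that the computed normal points away from the body.

```python
def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def anchored_position(v0, v1, v2, bary, offset):
    """Point on triangle (v0, v1, v2) given by barycentric weights, lifted along the unit normal."""
    if offset <= 0:
        raise ValueError("offset must be positive so the point sits above the surface")
    b0, b1, b2 = bary
    if abs(b0 + b1 + b2 - 1.0) > 1e-9:
        raise ValueError("barycentric weights must sum to 1")
    normal = cross(sub(v1, v0), sub(v2, v0))
    length = sum(c * c for c in normal) ** 0.5
    normal = [c / length for c in normal]
    return [b0 * p0 + b1 * p1 + b2 * p2 + offset * n
            for p0, p1, p2, n in zip(v0, v1, v2, normal)]
```

Under this reading, a larger offset places a point farther from the body along the face normal, so ordering offsets is one plausible way to express which garment layer sits on top.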

2605.20997 2026-05-21 cs.CV cs.AI cs.LG physics.comp-ph 版本更新

Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

基于TanDEM-X和Landsat数据的混合机器学习模型用于森林高度估计

Islam Mansour, Ronny Haensch, Irena Hajnsek, Konstantinos Papathanassiou

发表机构 * German Aerospace Center (DLR)(德国航空航天中心(DLR)) Institute of Environmental Engineering, ETH Zürich(环境工程研究所,苏黎世联邦理工学院)

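AI summary: This paper proposes a hybrid approach that combines machine learning with a physical model, using TanDEM-X interferometric coherence measurements and optical Landsat data to improve forest height estimation. Extending the feature space reduces height/structure and baseline/terrain-slope ambiguities. Experiments show reductions of 13.5% in RMSE and 16.6% in MAE relative to the original hybrid model.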

详情
AI中文摘要

将机器学习(ML)与物理模型(PM)结合,已成为从遥感数据中检索地球物理参数的一种有前途的方法。在此背景下,一种用于从TanDEM-X干涉相干测量中估计森林高度的ML模型最近被提出,该模型通过物理模型约束学习过程。虽然所选特征用于训练和反演以确保解决方案的物理一致性,但它们无法解决数据中的所有高度/结构和基线/地形坡度模糊性。为改进这一点,提出通过扩展特征空间加入光学Landsat数据,以提供关于森林类型或结构的补充信息。扩展的模型被应用于几处Gabon的Lopé国家公园的TanDEM-X数据,并与空中LiDAR测量进行评估。结果表明,与原始混合模型相比,RMSE和MAE分别减少了13.5%和16.6%,证实了多光谱输入的附加价值。

英文摘要

Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, an ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed that constrains the learning process through a PM. While the features used for training and inversion were selected to ensure the physical consistency of the solutions, they could not resolve all height/structure and baseline/terrain-slope ambiguities in the data. To improve on this, an extension of the feature space with optical Landsat data is proposed, which can provide complementary information on forest type or structure. The extended model is applied to and validated on several TanDEM-X acquisitions over the Gabonese Lopé national park site and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.
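The reported RMSE and MAE reductions are relative changes between two models scored against the same reference heights. A minimal sketch of that computation follows. The height values are invented for illustration and are not data from the paper.

```python
import math

def rmse(pred, ref):
    """Root-mean-square error between predictions and reference values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def mae(pred, ref):
    """Mean absolute error between predictions and reference values."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Hypothetical forest heights in metres.
lidar = [20.0, 35.0, 42.0, 28.0]          # reference (e.g. airborne LiDAR)
original_model = [24.0, 30.0, 45.0, 31.0]
extended_model = [22.0, 32.0, 44.0, 30.0]

mae_drop = relative_reduction(mae(original_model, lidar), mae(extended_model, lidar))
rmse_drop = relative_reduction(rmse(original_model, lidar), rmse(extended_model, lidar))
```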

2605.20973 2026-05-21 cs.CV 版本更新

Towards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines

向地下矿山3D点云中的集成岩支可视化迈进

Dibyayan Patra, Simit Raval, Pasindu Ranasinghe, Bikram Banerjee, Ismet Canbulat

发表机构 * School of Minerals and Energy Resources Engineering, University of New South Wales(新南威尔士大学矿物与能源资源工程学院) School of Surveying and Built Environment, University of Southern Queensland(南方昆士兰大学测绘与环境工程学院)

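AI summary: This paper presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. A unified workflow covering structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation yields an integrated 3D visualisation of discontinuity planes and bolt vectors for assessing their spatial intersections and geometric relationships. A complementary stereographic analysis evaluates the overall geometric effectiveness of the bolting.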

详情
AI中文摘要

地下矿山中岩支的有效性取决于安装的岩钉与周围岩体结构特征之间的相互作用。然而,断层特征化和岩钉识别通常被视为单独的任务,限制了它们在集成支持评估中的价值。本文提出了一种自动化框架,用于利用地下矿山开掘的3D点云进行集成岩支可视化。该框架将结构映射、岩钉识别、断层面拟合和岩钉方向估计整合到一个统一的工作流中,该工作流针对准确性和计算效率进行了优化。输出用于生成拟合的断层面和岩钉向量的集成3D可视化,从而能够直接评估其空间交集和几何关系。此外,还进行了互补的立体分析,以评估断层极和岩钉方向的整体锚固几何有效性,相对于映射的结构特征。此外,岩钉级别的质量指标,包括暴露的突出长度和偏离局部顶板法线的程度,也进行了可视化,以支持安装质量的评估。所提出的框架在真实的地下金属矿扫描上进行了演示,在中等规模的点云中产生了准确的结构映射和岩钉识别结果。总体而言,本研究提供了一个实用的步骤,朝着无需手动测量或额外现场数据采集的自动化、集成的岩支有效性地质力学评估。

英文摘要

The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.
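One of the bolt-level quality metrics named in the abstract, deviation from the local roof normal, reduces to an angle between two direction vectors. The sketch below assumes both directions are already estimated and treats them as undirected axes. The function name and the sample vectors are hypothetical.

```python
import math

def deviation_deg(bolt_dir, roof_normal):
    """Angle in degrees between a bolt axis and the local roof normal."""
    dot = sum(b * n for b, n in zip(bolt_dir, roof_normal))
    bolt_len = math.sqrt(sum(b * b for b in bolt_dir))
    normal_len = math.sqrt(sum(n * n for n in roof_normal))
    # abs() ignores which way each vector points; min() guards against rounding above 1.
    cos_angle = min(1.0, abs(dot) / (bolt_len * normal_len))
    return math.degrees(math.acos(cos_angle))

# Hypothetical bolt tilted halfway between the y and z axes, roof normal along z.
angle = deviation_deg((0.0, 1.0, 1.0), (0.0, 0.0, 1.0))
```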

2605.20971 2026-05-21 cs.CV cs.AI cs.CR 版本更新

Comparative Evaluation of Deep Learning Models for Fake Image Detection

深度学习模型在虚假图像检测中的比较评估

Akhitha Pakala, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed

发表机构 * University of East London(东伦敦大学)

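AI summary: This study compares four pretrained CNN architectures for fake image detection under a unified preprocessing and training pipeline. VGG16 achieves the highest accuracy, while EfficientNetB0 is more sensitive to fake images but less reliable on real ones. The study points to balanced datasets, advanced augmentation, and fairness-aware training as requirements for reliable fake image detection systems.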

Comments Accepted at ICCIIoT26 and waiting to be indexed

详情
Journal ref
6th International Conference on Computational Intelligence & Internet of Things (ICCIIoT), 2026
AI中文摘要

随着基于GAN的图像篡改技术日益复杂,数字取证面临重大挑战。本研究比较了四个预训练的CNN架构(VGG16、ResNet50、EfficientNetB0和XceptionNet)在虚假图像检测中的性能,使用统一的预处理和训练流程。通过调整大小、归一化和增强来解决类别不平衡问题并提高泛化能力。模型评估使用了准确性、精确率、召回率、F1分数和ROC-AUC。VGG16在准确性上达到91%,XceptionNet、ResNet50和EfficientNetB0分别达到90%。EfficientNetB0对虚假图像的敏感性更强,但在真实图像上的可靠性较低,反映了由不平衡驱动的偏差。局限性包括数据集不平衡、过拟合和解释性有限,这些因素影响了跨域鲁棒性。本研究提供了一个可重复的基准,并强调了平衡数据集、高级增强和公平性意识训练的必要性,以开发可靠的虚假图像检测系统。

英文摘要

The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures (VGG16, ResNet50, EfficientNetB0, and XceptionNet) for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.
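The trade-off described for EfficientNetB0, high sensitivity to fakes with lower reliability on real images, shows up when recall and specificity are read side by side. The sketch below computes the standard metrics from a confusion matrix with "fake" as the positive class. The counts are invented for illustration and are not results from the study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics, with 'fake' as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity to fake images
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # reliability on real images
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: most fakes are caught, but many real images are flagged.
m = binary_metrics(tp=90, fp=30, fn=10, tn=70)
```

With these made-up counts, recall is 0.9 while specificity is only 0.7, a gap that a single accuracy figure of 0.8 would hide.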

2605.20965 2026-05-21 cs.CV cs.AI 版本更新

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

在不遗忘的情况下寻找正确的视觉证据:通过层间视觉注意力差异减轻LVLMs中的幻觉

Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China(东南大学计算机科学与工程学院) School of Artificial Intelligence, Shenzhen University, Shenzhen, China(深圳大学人工智能学院) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China(深圳大学计算机科学与软件工程学院) Engineering, South China University of Technology, Guangzhou, China(华南理工大学工程学院) National Engineering Laboratory for Big Data Systems Computing Technology, Shenzhen University, Shenzhen, China(深圳大学大数据系统计算技术国家工程实验室) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用重点实验室(东南大学),中华人民共和国教育部)

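AI summary: This paper proposes a hallucination mitigation method for LVLMs based on Inter-Layer Visual Attention Discrepancy (ILVAD), which strengthens attention to the correct visual evidence during generation to reduce visual forgetting.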

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)在广泛的视觉-语言任务上表现出色。尽管有进展,它们仍然容易产生幻觉,生成与视觉内容不一致的响应。在本工作中,我们发现LVLMs在对正确的视觉证据关注不足时容易产生幻觉,并在生成过程中逐渐遗忘它。我们实证发现,尽管LVLMs整体对视觉证据关注不足,但在特定层中表现出对正确视觉证据的敏感性,存在显著的层间差异。受此观察启发,我们提出了一种新的幻觉缓解方法,通过层间视觉注意力差异(ILVAD)增强视觉证据。具体来说,我们从早期生成的token到视觉token在各层中获取注意力权重,并识别被反复激活作为视觉证据的token,形成显著性图。然后通过显著性图在生成过程中增强对视觉证据的注意力,以减少视觉遗忘。此外,我们利用显著性图获得生成文本对视觉证据的注意力分数,以选择并强调强烈基于视觉证据的文本token。我们的方法是无训练的,即插即用。在五个最近发布的模型上进行的多个基准评估表明,我们的方法可以在不同架构的LVLMs上一致地缓解幻觉。代码可在https://github.com/ytx-ML/ILVAD上获得。

英文摘要

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.
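The saliency-map step, finding visual tokens that are repeatedly activated across layers, can be loosely sketched as a vote over layers. This is only an illustration under stated assumptions: a token counts as "activated" in a layer when it is among that layer's top-k attended visual tokens, and saliency is the fraction of layers that activate it. The abstract does not give the actual ILVAD rule, and the attention values below are made up.

```python
def saliency_from_layers(attn_by_layer, top_k):
    """Fraction of layers in which each visual token ranks in the top-k by attention."""
    num_tokens = len(attn_by_layer[0])
    counts = [0] * num_tokens
    for weights in attn_by_layer:
        ranked = sorted(range(num_tokens), key=lambda i: weights[i], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1
    return [c / len(attn_by_layer) for c in counts]

# Hypothetical attention from generated tokens to 4 visual tokens in 3 layers.
attn = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.50, 0.40, 0.05],
    [0.30, 0.35, 0.25, 0.10],
]
saliency = saliency_from_layers(attn, top_k=2)
```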

2605.20963 2026-05-21 cs.CV 版本更新

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

面向现实世界的无人机检测:一个新的多光谱数据集UAVNet-MS和一个新方法

Yihang Luo, Jun Chen, Chao Xiao, Yingqian Wang, Zhaoxu Li, Qiang Ling, Xu He, Nuo Chen, Gaowei Guo, Hongge Li, Miao Li, Longguang Wang, Yulan Guo, Li Liu, Wei An, Zhijie Chen

发表机构 * College of Electronic Science and Technology, National University of Defense Technology(电子科学与技术学院,国防科技大学) Aviation University of Air Force(空军航空大学) Sun Yat-sen University(中山大学)

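AI summary: This paper introduces UAVNet-MS, a new multispectral dataset, and MFDNet, a new method, for fine-grained small-UAV detection, addressing the degradation of conventional RGB-based systems at small scales.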

Comments submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

详情
AI中文摘要

无人飞行器(UAV)的普及催生了对精确UAV监测的迫切需求。现有的基于RGB的系统依赖于空间线索,在小尺度下退化,特别是在高类型相似性、目标杂波模糊和低对比度的情况下。多光谱成像(MSI)编码了材料感知的光谱签名,但基于MSI的细粒度小UAV检测仍因缺乏专用数据集而被忽视。我们引入了UAVNet-MS,这是首个用于细粒度小UAV检测的多光谱数据集,包含15,618个时间同步的RGB-MSI数据立方体(1440x1080),带有边界框注释。该数据集具有挑战性的小对象(93.7% <= 32²像素,平均18²像素,约0.02%图像面积)在低对比度下。我们提出MFDNet,一种双流基线方法,解决数组诱导的视差和空间-光谱融合。在RGB-only、MSI-only和RGB+MSI协议下,对20种检测器的广泛评估表明,MFDNet在最佳RGB-only方法上实现了+6.2%的AP50提升,证明光谱线索提供了超越空间线索的互补材料证据。本文为多光谱UAV监测研究提供了基础数据集、强大基线和基准。

英文摘要

The proliferation of unmanned aerial vehicles (UAVs) has created urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to a lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% <= 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows that MFDNet achieves a +6.2% AP50 improvement over the best RGB-only methods, demonstrating that spectral cues provide complementary material evidence beyond spatial cues. This work provides a foundational dataset, a strong baseline, and a benchmark for multispectral UAV monitoring research.
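The "~0.02% image area" figure follows directly from the stated average object size and frame resolution. A small check, using only the numbers quoted in the abstract (the helper name is hypothetical):

```python
def area_fraction_percent(side_px, width, height):
    """Share of the frame covered by a square object of the given side length, in percent."""
    return 100.0 * side_px ** 2 / (width * height)

# Average object (18 x 18 px) and the small-object threshold (32 x 32 px) in a 1440 x 1080 frame.
average_share = area_fraction_percent(18, 1440, 1080)
threshold_share = area_fraction_percent(32, 1440, 1080)
```

The average object covers 324 of 1,555,200 pixels, roughly 0.02% of the frame, and even an object at the 32 x 32 threshold stays below 0.07%.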

2605.20961 2026-05-21 cs.CV 版本更新

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

保留、揭示、扩展:基于区域感知的4D视频编辑

Zhangchi Hu, Wenzhang Sun, Xiangchen Yin, Jiahui Yuan, Chunfeng Wang, Hao Li, Kun Zhan, Xiaoyan Sun

发表机构 * University of Science and Technology of China(中国科学技术大学) Li Auto Inc.(利汽车公司)

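AI summary: This paper proposes PREX, a region-aware framework for 4D video editing that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles, reducing region-structured failures while maintaining visual quality and 4D edit control.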

Comments 23 pages, 13 figures

详情
AI中文摘要

现有的4D驱动视频扩散模型主要针对合理生成,但忠实的4D编辑需要在合成遮挡或视外内容时保留源观测区域。我们识别出证据角色不匹配问题:可靠的源支持证据、不可靠的渲染提示和不支持的区域在单一条件信号中交织,导致保留漂移、鬼影和不稳定的外推。我们提出PREX(保留、揭示、扩展),一个区域感知框架,根据观测支持和场景范围将目标时空体积分解为保留、揭示和扩展角色。PREX通过校准置信度构建观测支持的外观提示,并通过区域感知适配器注入到冻结的视频扩散骨干网络中,通过代理任务训练而无需配对编辑视频。我们进一步引入PREBench,一个诊断基准,包含精心编辑、区域角色掩码和人类对齐的指标,补充了全局视频质量和4D控制评估。实验表明,PREX在减少区域结构失败的同时,保持了强大的视觉质量和4D编辑控制能力。项目页面:https://ricepastem.github.io/PREX-Open

英文摘要

Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open

2605.20955 2026-05-21 cs.CV 版本更新

DrawMotion: Generating 3D Human Motions by Freehand Drawing

DrawMotion: 通过自由手绘生成3D人体动作

Tao Wang, Lei Jin, Zhihua Wu, Qiaozhi He, Jiaming Chu, Yu Cheng, Junliang Xing, Jian Zhao, Shuicheng Yan, Li Wang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) University of Science and Technology of China(中国科学技术大学) NLP Lab, School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院自然语言处理实验室) National University of Singapore(新加坡国立大学) Tsinghua University(清华大学) The Institute of AI (TeleAI), China Telecom(中国电信人工智能研究院) Northwestern Polytechnical University(西北工业大学)

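AI summary: This study proposes DrawMotion, a diffusion-based framework that generates 3D human motions from freehand drawings and text conditions, reducing user time and aligning the generated motions with user intent.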

详情
AI中文摘要

文本到动作生成,即通过文本描述生成人体动作,面临用户难以通过文本精确表达意图的挑战。为了解决这一问题,本文介绍了DrawMotion,一种高效的扩散基框架,适用于多条件场景。DrawMotion基于传统文本条件和新的手绘条件生成动作,分别提供语义和空间控制。具体而言,我们从三个方面解决细粒度动作生成任务:1) 自由手绘条件。为了准确捕捉用户意图而不需繁琐的文本输入,我们开发了算法自动在不同数据集格式中生成手绘简笔画;2) 多条件融合。我们提出了一个多条件模块(MCM),整合到扩散过程中,使模型能够利用所有可能的条件组合,同时比传统方法减少计算复杂性;3) 训练自由引导。值得注意的是,DrawMotion中的MCM确保其中间特征位于连续空间中,允许分类器引导梯度更新特征,从而使生成的动作与用户意图对齐,同时保持保真度。定量实验和用户研究表明,自由手绘方法在生成与想象一致的动作时,可将用户时间减少约46.7%。代码、演示和相关数据可在https://github.com/InvertedForest/DrawMotion上公开获取。

英文摘要

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.

2605.20942 2026-05-21 cs.CV 版本更新

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

连接结构与语言:基于图的视觉推理用于自动驾驶道路理解

Lena Wild, Katie Z Luo, Marco Pavone

发表机构 * KTH Royal Institute of Technology(皇家理工学院) TRATON Stanford University(斯坦福大学) NVIDIA(英伟达)

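AI summary: This paper proposes the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation, addressing the trade-off between structural precision and semantic flexibility in road understanding for autonomous driving.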

详情
AI中文摘要

车道几何、拓扑和交通元素关系的结构化道路理解是安全自动驾驶的基础。尽管视觉-语言模型(VLMs)提供了有前途的语义灵活性,但它们缺乏精确道路推理所需的几何和关系基础。相反,传统模块化系统,如HD地图和拓扑道路图,提供了结构精度,但保持了语义刚性。为弥合这一差距,我们引入了结合道路子基质(CRS),一种基于图的框架,使几何道路结构和开放词汇语义能够在单一表示中联合执行。CRS能够通过递归图查询自动生成具有组合复杂性和语言多样性的问答对,辅以一种“免费基础”机制,确保逻辑可追溯到特定地图元素,并通过程序提取的推理链监督轨迹。我们证明了最先进的VLMs,包括大型闭源模型,在结构化道路推理上表现显著不足,但训练一个仅需20到80个CRS增强场景的2亿或4亿参数小模型,即可在不同深度的组合推理任务中获得稳定的提升。通过可验证的推理轨迹分析模型行为,揭示了失败模式的系统性转变:尽管基线模型在关系场景理解上失败,CRS训练的模型将失败减少到属性识别,表明道路理解的主要瓶颈不是模型规模,而是缺乏结构化监督。

英文摘要

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.
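The idea of generating question-answer pairs through recursive graph queries, with each answer traceable to specific map elements, can be sketched on a toy lane graph. Everything below is hypothetical: the graph, the question template, and the field names are illustrative and do not reflect the CRS data format.

```python
# Toy successor graph: each lane maps to the lanes that directly follow it (hypothetical).
LANE_GRAPH = {
    "lane_a": ["lane_b"],
    "lane_b": ["lane_c", "lane_d"],
    "lane_c": [],
    "lane_d": [],
}

def reachable(graph, start, depth):
    """Recursively collect lanes reachable from `start` within `depth` successor hops."""
    if depth == 0:
        return set()
    found = set()
    for nxt in graph[start]:
        found.add(nxt)
        found |= reachable(graph, nxt, depth - 1)
    return found

def make_qa(graph, start, depth):
    """Build one question-answer pair plus the map elements that justify the answer."""
    answer = sorted(reachable(graph, start, depth))
    question = f"Which lanes can be reached from {start} within {depth} successor hops?"
    return {"question": question, "answer": answer, "grounding": [start] + answer}

qa = make_qa(LANE_GRAPH, "lane_a", 2)
```

Raising `depth` yields compositionally deeper questions from the same graph, and the `grounding` list records which elements the answer depends on.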

2605.20941 2026-05-21 cs.CV cs.GR cs.HC 版本更新

PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

PaintCopilot: 将绘画建模为自主的艺术延续

Yunge Wen, Yuancheng Shen, Paul Pu Liang

发表机构 * MIT Media Lab(MIT媒体实验室) New York University(纽约大学)

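AI summary: This paper presents PaintCopilot, a neural painting assistant that models painting as open-ended autoregressive artistic behavior. It predicts future strokes from the evolving canvas state and prior brushstroke history without a target image, unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference image.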

详情
AI中文摘要

我们提出了PaintCopilot,一种协作式神经绘画助手,将绘画建模为一种开放性自回归的艺术行为,该行为基于不断演变的画布状态和先前笔触历史,而无需目标图像。与现有神经绘画方法不同,后者将绘画建模为向预定参考图像的像素重建,PaintCopilot直接从学习到的艺术动态中预测未来的笔触,类似于大型语言模型通过先前上下文继续文本序列。该框架提出了三个互补的模型:基于ViT的目标预测器,通过部分画布观察推断艺术家意图;自回归的下一笔预测器,通过流匹配生成时间上连贯的笔触;以及基于VAE的区域采样器,可按需合成语义本地化的笔触序列。基于三种可微分的笔触表示(硬圆、笔尖和2D高斯),系统支持四种交互工作流程:优化历史、笔触完成、区域修复和动态笔刷。通过与专业艺术家的案例研究,我们证明PaintCopilot能够实现流畅的协作绘画工作流程,在创作过程中艺术家和AI不断交替控制。

英文摘要

We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.

2605.20940 2026-05-21 cs.CV 版本更新

3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

3D重建与知识蒸馏以改进多视角图像模型以探索小麦籽粒体积估计

Olivia Zumsteg, Jannis Widmer, Yann Bourdé, Norbert Kirchgessner, Andreas Hund, Lukas Roth, Paraskevi Nousi

发表机构 * ETH Zurich(苏黎世联邦理工学院) Swiss Data Science Center(瑞士数据科学中心)

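AI summary: This paper proposes a hybrid 2D-3D approach that uses knowledge distillation during training so that inference needs images only. A rigid-invariant point cloud network built on distance-based histogram features is combined with the proposed multi-view image-based regulated Transformer (RT) in an ensemble, whose knowledge is then transferred to a purely image-based model through feature-based or label-based distillation, improving the accuracy and efficiency of wheat spike volume estimation.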

Comments 8 pages, 6 figures (Appendix: 4 pages, 5 figures)

详情
AI中文摘要

准确估计小麦籽粒体积对于产量成分分析和压力耐受性评估至关重要,但基于现场的测量仍然具有挑战性。主动3D传感方法如光检测和测距(LiDAR)或飞行时间(ToF)对植物运动敏感或不适合户外条件,而3D重建计算成本高。直接2D图像处理可提供计算优势,但基于图像的模型缺乏显式几何信息。因此,我们提出了一种混合2D-3D方法,在训练过程中进行知识蒸馏,同时允许高效的图像-only推理。首先,我们训练一个基于距离直方图特征的刚性不变点云网络,以获得姿态鲁棒的几何表示。然后,我们将3D模型与所提出的多视角图像基于调节Transformer(RT)结合到集成架构中。最后,我们通过基于特征或标签的蒸馏将集成知识转移到纯图像学生模型中。两个蒸馏的RTs将非蒸馏RT的均方绝对误差(MAE)从654.31 mm³降低到639.93 mm³和644.62 mm³,并将相关性从0.76提高到0.77和0.82。同时,推理时间从160 ms减少到每粒籽1.4 ms。蒸馏进一步减轻了体积依赖性偏差,并使图像模型的潜在表示向几何感知的形状转变。我们的结果表明,2D Transformer的3D指导训练能够实现高通量田间表型分析中可扩展且高效的籽粒体积估计。

英文摘要

Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach that applies knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ of the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.
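Distance-based features are pose-robust because rotating or translating a point cloud leaves every pairwise distance unchanged. The sketch below demonstrates this with a plain histogram of pairwise distances, which is one simple descriptor of that kind. The abstract does not specify the paper's exact features, and the point cloud, bin count, and helper names are hypothetical.

```python
import math

def distance_histogram(points, bins, max_dist):
    """Histogram of all pairwise Euclidean distances, clipped into `bins` equal-width bins."""
    hist = [0] * bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            k = min(int(d / max_dist * bins), bins - 1)
            hist[k] += 1
    return hist

def rotate_z_and_shift(p, angle, shift):
    """Rigid transform: rotate about the z-axis, then translate."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])

# Tiny hypothetical point cloud and a rigidly moved copy of it.
cloud = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3), (1, 2, 3)]
moved = [rotate_z_and_shift(p, 0.7, (5, -2, 1)) for p in cloud]

h_original = distance_histogram(cloud, bins=4, max_dist=5.0)
h_moved = distance_histogram(moved, bins=4, max_dist=5.0)
```

Both histograms are identical, so a network fed such features sees the same input regardless of how the spike was oriented during capture.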

2605.20922 2026-05-21 cs.LG cs.AI cs.CV 版本更新

Winfree Oscillatory Neural Network

Winfree振荡神经网络

Jiawen Dai, Yue Song

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Qi Zhi Institute(上海启智研究院) College of AI, Tsinghua University(清华大学人工智能学院)

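AI summary: This paper proposes WONN, an oscillatory neural network based on generalized Winfree dynamics. It evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible, hierarchical interaction mechanisms, and achieves competitive performance and parameter efficiency on image recognition and complex reasoning tasks.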

Comments Project page: https://jiawen-dai.github.io/WONN_Project_Page/

详情
AI中文摘要

振荡和同步被认为是表示和计算中的基本要素。然而,现有的基于同步动力学的机器学习方法大多局限于特定领域,如物体发现,缺乏在标准视觉基准或逻辑推理任务中的扩展性证据。我们提出Winfree振荡神经网络(WONN),一种基于广义Winfree动力学的动态神经架构。WONN通过结构化的振荡交互在流形$(S^1)^d$上进化表示,结合基于相位的归纳偏置与灵活且层次化的交互机制,这些机制可以是固定的三角函数映射或可学习的神经网络。我们在图像识别和复杂推理任务上评估了WONN,包括CIFAR、ImageNet、Maze-hard和Sudoku。在这些领域中,WONN实现了具有竞争力或优越性能的成果,并且具有强参数效率。特别是,WONN是目前已知第一个能够与ImageNet-1K竞争的基于同步的振荡架构。此外,在Maze-hard上,WONN仅使用前状态-of-the-art模型1%的参数就达到了80.1%的准确率。这些结果表明,结构化的振荡动力学为传统神经架构提供了一种可扩展且参数高效的替代方案。

英文摘要

Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.
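For orientation, the classical Winfree model that the architecture generalizes couples each oscillator's phase to a population-wide pulse: $\dot\theta_i = \omega_i + \kappa\, Q(\theta_i)\, \frac{1}{N}\sum_j P(\theta_j)$. The sketch below takes one explicit Euler step with the textbook choices $P(\theta) = 1 + \cos\theta$ and $Q(\theta) = -\sin\theta$. It illustrates the classical model only. WONN's learnable and hierarchical interactions are not reproduced here, and the step size and coupling strength are arbitrary.

```python
import math

def winfree_step(theta, omega, kappa, dt):
    """One Euler step of the classical Winfree model; phases are kept on the circle."""
    n = len(theta)
    # Mean pulse emitted by the population, P(theta) = 1 + cos(theta).
    pulse = sum(1.0 + math.cos(t) for t in theta) / n
    # Each oscillator responds through its sensitivity Q(theta) = -sin(theta).
    return [
        (t + dt * (w + kappa * (-math.sin(t)) * pulse)) % (2.0 * math.pi)
        for t, w in zip(theta, omega)
    ]

# Three oscillators with identical natural frequencies (arbitrary values).
theta = [0.0, 1.0, 2.5]
omega = [1.0, 1.0, 1.0]
for _ in range(100):
    theta = winfree_step(theta, omega, kappa=0.5, dt=0.01)
```

Wrapping each phase modulo $2\pi$ is what keeps the state on the torus $(S^1)^d$ rather than in unconstrained Euclidean space.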

2605.20910 2026-05-21 cs.CV 版本更新

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

FlowLong: 通过流形约束的 Tweedie 匹配实现推理时的长视频生成

Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye

发表机构 * KAIST(韩国科学技术院) Amazon(亚马逊)

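AI summary: This paper proposes a training-free, inference-time method for long video generation. Videos are generated over overlapping sliding windows, and manifold-constrained Tweedie matching keeps the overlap regions temporally consistent.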

Comments Project Page: https://flowlong-video.github.io/

详情
AI中文摘要

扩展视频扩散模型的生成时间范围仍然是一个长期且重要的挑战。现有的无训练方法分为两类:双向模型的扩展,这些模型紧密耦合到特定架构,且在长范围内质量下降;以及自回归模型,这些模型由于暴露偏差积累漂移误差,倾向于生成重复的运动模式。为了解决这些问题,我们提出了一种新颖但简单的推理时长视频生成方法,该方法对架构不敏感且不需要额外训练。我们的方法通过重叠滑动窗口生成长视频,其中相邻窗口预测的干净样本通过Tweedie匹配融合,以强制重叠区域的流形约束和时间一致性。随后,随机早期阶段采样通过在高噪声阶段每次Tweedie匹配校正后注入新鲜噪声,同步每个窗口的轨迹,然后过渡到确定性ODE采样以保持细粒度的视觉保真度。应用于各种视频生成模型,我们的方法生成的视频长度是原窗口长度的数倍,同时在时间和视觉质量上优于无训练和自回归基线,并且进一步扩展到音频视频联合生成和文本到3DGS,无需微调。

英文摘要

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
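The overlapping sliding-window layout can be sketched with a deliberately simplified stand-in: each window contributes a predicted clean value per frame, and frames covered by two windows take the plain average. This uniform averaging is an assumption made for illustration and is not the paper's Tweedie matching rule, which the abstract does not spell out. Frames are reduced to single numbers to keep the example short.

```python
def blend_windows(windows, overlap):
    """Stitch equally sized windows into one sequence, averaging frames where windows overlap."""
    stride = len(windows[0]) - overlap
    total = stride * (len(windows) - 1) + len(windows[0])
    acc = [0.0] * total   # running sum of predictions per frame
    cnt = [0] * total     # number of windows covering each frame
    for w_idx, win in enumerate(windows):
        for k, value in enumerate(win):
            acc[w_idx * stride + k] += value
            cnt[w_idx * stride + k] += 1
    return [a / c for a, c in zip(acc, cnt)]

# Two 4-frame windows sharing 2 frames (hypothetical per-frame predictions).
video = blend_windows([[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]], overlap=2)
```

Two 4-frame windows with a 2-frame overlap yield 6 frames, and only the shared frames are changed by the blend.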

2605.20908 2026-05-21 cs.CV 版本更新

SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches

SynCB:一种基于协同概念的模型,具有概念与互补神经分支之间的动态路由

Tores Julie, Sun Rémy, Sassatelli Lucile, Ancarani Elisa, Wu Hui-Yin, Precioso Frédéric

发表机构 * CNRS(法国国家科学研究中心) Inria(法国国家信息与自动化技术研究院) I3S(信息科学与系统研究所)

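AI summary: This study proposes SynCB, a synergy concept-based model whose routing module dynamically selects between a concept-based branch and a complementary neural branch, improving task accuracy and responsiveness to human intervention.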

详情
AI中文摘要

基于概念(CB)的模型提供了可解释性和支持测试时的人工干预,而标准神经网络(NN)提供了强大的任务性能但透明性较低。先前的工作探索了将概念和其他表示结合的混合公式以提高准确性,但通常以牺牲人工干预为代价。我们引入了协同概念模型(SynCB)框架,该框架结合了CB分支和互补神经分支,并且有一个可训练的路由模块,可以动态选择每个输入使用的分支。与以往模型不同,SynCB保持两个分支独立,并通过路由模块协调它们。此外,两个分支都是联合学习的,允许互补神经分支和CB分支通过它们的共同骨干进行信息共享。为了提高对干预的响应性,我们进一步引入了测试时的干预策略和相应的损失。在五个数据集和CB基准上,SynCB始终在任务准确性和对人工干预的响应性上取得更高的成绩,比全神经基线高3.9个百分点,比最强竞争对手的干预性能高6.43个百分点。

英文摘要

Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of support for human interventions. We introduce the Synergy Concept-Based Model (SynCB) framework, which combines a CB branch with a complementary neural branch and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and the CB branch through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.
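Per-input routing between two distinct branches, as opposed to fusing their outputs, can be sketched as a hard selection. The threshold rule, the toy branches, and all names below are assumptions for illustration. In SynCB the router is trainable and its actual form is not described in the abstract.

```python
def route_predict(x, concept_branch, neural_branch, router, threshold=0.5):
    """Return (prediction, branch_name) from whichever branch the router selects for x."""
    if router(x) >= threshold:
        return concept_branch(x), "concept"
    return neural_branch(x), "neural"

# Toy stand-ins (hypothetical): a rule over one concept, a black-box guess, a confidence score.
concept_branch = lambda x: "bird" if x["has_wings"] else "other"
neural_branch = lambda x: x["nn_guess"]
router = lambda x: x["concept_confidence"]

confident = {"has_wings": True, "nn_guess": "plane", "concept_confidence": 0.9}
uncertain = {"has_wings": True, "nn_guess": "plane", "concept_confidence": 0.2}

result_confident = route_predict(confident, concept_branch, neural_branch, router)
result_uncertain = route_predict(uncertain, concept_branch, neural_branch, router)
```

Keeping the branches separate means that, whenever the concept branch is selected, editing a concept value changes the prediction directly, which is the property that fused predictions tend to weaken.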

2605.20904 2026-05-21 cs.CV 版本更新

JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026

JFAA:EgoVis 2026 EPIC-KITCHENS-100 动作预见挑战的技术报告

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室) Shandong Jianzhu University(山东建筑大学)

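AI summary: This paper proposes JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 action anticipation task. A frozen encoder and predictor extract observed context features and near-future latent tokens, and a lightweight attentive probe is trained to predict verb, noun, and action logits. A field-aware ensemble improves robustness, and JFAA takes first place in the EgoVis 2026 EPIC-KITCHENS-100 Action Anticipation Challenge.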

Comments The champion solution for the EPIC-KITCHENS-100 Action Anticipation Challenge at the CVPR EgoVis Workshop 2026

详情
AI中文摘要

我们提出JFAA,一种基于JEPA的未来动作预见方法,用于EPIC-KITCHENS-100(EK-100)动作预见任务。受V-JEPA 2.1的表示学习和未来预测能力的启发,JFAA使用冻结的编码器和预测器来提取观察上下文特征和近未来潜在标记。然后训练一个轻量级的注意力探针,使用单独的任务查询来预测动词、名词和动作的日志。为了提高鲁棒性,我们进一步构建了一个字段感知的集成模型,使每个输出字段都能受益于其最可靠的候选者。在官方挑战服务器上的实验结果表明,JFAA在EgoVis 2026 EK-100动作预见挑战中取得第一名。我们的代码将在https://github.com/CorrineQiu/JFAA上发布。

英文摘要

We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.

2605.20901 2026-05-21 cs.CV cs.AI Updated version

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This report presents VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation challenge at EgoVis 2026. It combines object-centric spatial detection with short-horizon temporal context, injecting the temporal representation into the detection pathway through feature modulation and ROI-level context fusion to make predictions more robust.

Comments The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

Abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

2605.20892 2026-05-21 cs.CV Updated version

FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

Enhui Yu, Junhui Li, Ruitong Lu, Jialu Li, Youshan Zhang

Affiliations: University of Science and Technology Liaoning; Chuzhou University; Yeshiva University

AI summary: This paper proposes FruitEnsemble, a two-stage dynamic inference framework that addresses the generalization limits of fine-grained fruit classification. An MLLM acts as an expert arbiter to raise accuracy, and the framework reaches 70.49% classification accuracy.

Comments 10 pages, 6 figures, submitted to CVPR 2026

Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2026
Abstract

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
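The confidence-gated routing described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: only the Top-3 pool and the 0.6 threshold come from the abstract. The function names, the `arbitrate` callback standing in for the MLLM, the example weights, and the choice of "top fused probability" as the ensemble confidence are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

CONF_THRESHOLD = 0.6  # from the abstract: below this, the MLLM arbitrates
TOP_K = 3             # from the abstract: Top-3 candidate pool


def weighted_ensemble(probs: Sequence[Dict[str, float]],
                      weights: Sequence[float]) -> Dict[str, float]:
    """Weighted average of per-model class probabilities.
    The weights are assumed to come from validation calibration."""
    total = sum(weights)
    fused: Dict[str, float] = {}
    for p, w in zip(probs, weights):
        for label, value in p.items():
            fused[label] = fused.get(label, 0.0) + (w / total) * value
    return fused


def predict(probs: Sequence[Dict[str, float]],
            weights: Sequence[float],
            arbitrate: Callable[[List[str]], str]) -> str:
    """Return the ensemble's top class when it is confident enough;
    otherwise hand the Top-K candidates to the arbiter."""
    fused = weighted_ensemble(probs, weights)
    candidates = sorted(fused, key=fused.get, reverse=True)[:TOP_K]
    if fused[candidates[0]] >= CONF_THRESHOLD:
        return candidates[0]
    return arbitrate(candidates)  # hypothetical stand-in for the MLLM call


model_probs = [
    {"lime": 0.50, "lemon": 0.40, "citron": 0.10},
    {"lime": 0.30, "lemon": 0.60, "citron": 0.10},
]
# A trivial arbiter that just picks the second candidate, for demonstration.
choice = predict(model_probs, [1.0, 1.0], arbitrate=lambda c: c[1])
```

Here the fused top probability is 0.5 for "lemon", which is below the threshold, so the decision is delegated to the arbiter. Only low-confidence samples pay the cost of the extra call.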

2605.20891 2026-05-21 cs.CV Updated version

HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction

Huayi Wang, Haochao Ying, Yuyang Xu, Qiyao Zheng, jun wang, Cheng Zhang, Ying Sun, Jian Wu

Affiliations: Zhejiang University; Xinjiang University; Hangzhou City University; Sun Yat-sen University Cancer Center

AI summary: This paper proposes HDMoE, a hierarchical decoupling-fusion mixture-of-experts framework that integrates multimodal medical data to improve cancer survival prediction, addressing the weak feature decoupling and fusion of earlier methods.

Comments 12 pages, HDMoE has been accepted by KDD 2026 AI for Sciences Track

Abstract

Multimodal survival prediction is a crucial yet challenging task that demands the integration of multimodal medical data (e.g., Whole Slide Images (WSIs) and genomic profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, existing methods have two shortcomings: (1) they fail to reduce redundant information in the modality features before decoupling, which degrades both feature decoupling and fusion; (2) they lack the ability to model fine-grained relationships among features and to capture local information interactions between intra- and inter-modality features. To address these issues, we propose a Hierarchical Decoupling-Fusion Mixture-of-Experts (HDMoE) framework with two levels of MoE and Random Feature Reorganization (RFR) modules. In the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. In addition, we design two RFR modules, one following each level of MoE, to finely fuse intra- and inter-modality features, which helps the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) dataset and three public TCGA datasets confirm the effectiveness of the proposed method. Code is available at https://github.com/ZJUMAI/HDMoE.

2605.20889 2026-05-21 cs.CV Updated version

Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video

Hiroyuki Deguchi, Ryosuke Hori, Kotaro Amaya, Tsubasa Maruyama, Mitsunori Tada, Hideo Saito

Affiliations: Keio University; National Institute of Advanced Industrial Science and Technology

AI summary: This paper proposes the MapMonoEgo framework, which uses a pre-scanned 3D point cloud to obtain globally consistent human pose estimates from a single monocular camera, and introduces the AIST-Living dataset. The method is shown to be useful for everyday monitoring tasks without specialized hardware.

Comments Accepted at ICIP 2026, Project page: https://deguchihiroyuki.github.io/Map-Mono-Ego-Project/

Abstract

Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user's absolute location within the environment remains a challenge. Existing methods primarily focus on relative motion from an initial position and tend not to account for the wearer's absolute location within an environment. Furthermore, inherent scale ambiguity in monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework that achieves globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce the AIST-Living dataset, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, proving its utility for practical monitoring tasks without specialized hardware.

2605.20867 2026-05-21 cs.MA cs.CV Updated version

ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

Yingjia Xu, Jiulong Wu, Bowen Zhang, Baokui Guo, Siyuan Chai, Min Cao

Affiliations: School of Computer Science and Technology, Soochow University; Baidu Inc.; Zhipu AI

AI summary: This paper proposes ProCrit, a framework for multimodal sarcasm detection that performs self-elicited multi-perspective reasoning with critic-guided revision. It removes the reliance of existing methods on fixed perspectives by generating multi-perspective analyses per sample and optimizing the proposal and critic agents together.

Abstract

Multimodal sarcasm detection requires reasoning over cross-modal incongruities between literal expression and intended meaning, yet the specific analytical perspectives needed vary across samples due to the diversity of sarcastic mechanisms. While recent methods make this analytical process explicit, they still rely on fixed, predefined perspectives that operate independently under hand-crafted routing rules. We argue that multimodal sarcasm detection instead calls for self-elicited multi-perspective reasoning, where a model autonomously generates the perspectives needed for each sample and progressively integrates them into a coherent analysis. To realize this goal, we propose ProCrit, a Proposal-Critic two-agent framework with a proposal agent for multi-perspective reasoning and a critic agent for external evaluation and targeted revision guidance. First, to overcome the lack of process-level supervision in existing sarcasm datasets, ProCrit synthesizes process-level reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, and the resulting multi-role trajectories are flattened into sequences that preserve cross-perspective dependencies while enabling efficient autoregressive generation. Second, to improve reasoning reliability, ProCrit adopts a draft-critique-revise paradigm in which an independent critic identifies reasoning deficiencies and provides targeted natural-language feedback for directed revision. Finally, we develop a mutual-refinement training framework that jointly optimizes proposal drafting and feedback-guided revision via dual-stage reinforcement learning, while refining the critic agent according to the actual effectiveness of its feedback. Experiments on three widely used benchmarks demonstrate the effectiveness of ProCrit.

2605.20839 2026-05-21 cs.CV cs.LG Updated version

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

Jeffrey Wang, Jonathan Gregory, Grigorios G. Chrysos

Affiliations: University of Wisconsin-Madison

AI summary: This paper proposes activation-free polynomial alternatives for image recognition within MetaFormer-style vision models and reports strong performance of the polynomial modules on multiple datasets.

Comments Accepted to ICML 2026

Abstract

Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.
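To illustrate how a Hadamard product can stand in for a pointwise activation, the sketch below builds a degree-2 "polynomial MLP" in plain Python. It is a generic example of the idea and not the PolyNeXt block: the layer shape, the names `poly_mlp`, `W1`, `W2`, `W_out`, and the toy weights are all assumptions.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def hadamard(a: Vector, b: Vector) -> Vector:
    """Elementwise (Hadamard) product of two equal-length vectors."""
    return [u * v for u, v in zip(a, b)]


def poly_mlp(x: Vector, W1: Matrix, W2: Matrix, W_out: Matrix) -> Vector:
    """y = W_out ((W1 x) * (W2 x)): two linear branches multiplied elementwise.
    No ReLU or GELU is applied; each output is a degree-2 polynomial in x."""
    return matvec(W_out, hadamard(matvec(W1, x), matvec(W2, x)))


# Toy weights, chosen only to keep the arithmetic easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.0, 1.0], [1.0, 1.0]]
W_out = [[1.0, -1.0]]

y = poly_mlp([2.0, 3.0], W1, W2, W_out)          # [-9.0]
y_scaled = poly_mlp([4.0, 6.0], W1, W2, W_out)   # [-36.0]
```

Because this toy layer has no bias terms, it is homogeneous of degree 2: doubling the input multiplies the output by four, which a purely linear layer could not do. Stacking such blocks raises the polynomial degree without any activation function.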

2605.20838 2026-05-21 cs.CV cs.AI Updated version

USV: Towards Understanding the User-generated Short-form Videos

Haoyue Cheng, Su Xu, Liwei Jin, Wayne Wu, Chen Qian, Limin Wang

Affiliations: State Key Laboratory for Novel Software Technology

AI summary: This paper presents the USV dataset of user-generated short-form videos for high-level semantic video understanding, defines topic recognition and video-text retrieval tasks on it, and proposes two baselines, MMF-Net and VTCL.

Abstract

Several large-scale video datasets have been published in recent years and have advanced the field of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries, without extra manual verification or trimming. Although video understanding has made notable progress in recent years, most works focus on instance-level recognition, which is not sufficient for learning representations of the high-level semantic information in videos. Therefore, we further establish two tasks on USV: topic recognition and video-text retrieval. We propose two unified and effective baseline methods, Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle topic recognition and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is https://usvdataset.github.io.

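2605.20837 2026-05-21 cs.CV cs.AI Updated version

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang

Affiliations: School of Architecture, Tsinghua University

AI summary: This paper presents ArchSIBench, a benchmark for architectural spatial intelligence grounded in architecture, cognitive science, and psychology. With 17 fine-grained subtasks and 3,000 question-answer pairs, it evaluates a range of VLMs on architectural perception, reasoning, navigation, transformation, and configuration, and finds that most models still trail architecturally trained human evaluators in spatial transformation and configuration reasoning.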

Comments 51 pages

Abstract

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.

2605.20827 2026-05-21 cs.CV Updated version

HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction

Yaoyao Yue, Jérôme Schmid, Xiaoshuang Li, Eduardo Delamare, Jinman Kim

Affiliations: School of Computer Science, the University of Sydney; Geneva School of Health Sciences, HES-SO University of Applied Sciences and Arts Western Switzerland; Department of Computer Science and Engineering, Shanghai Jiao Tong University; Sydney Dental School, Faculty of Medicine and Health, The University of Sydney

AI summary: This paper proposes the HyDAR-Pano3D framework, which reduces the ambiguity of panoramic-radiograph-to-CBCT reconstruction by disentangling anatomical recovery. Experiments show that it outperforms baselines in PSNR, SSIM, and Dice score and recovers clinically relevant anatomical structures.

Comments 10 pages

Abstract

The panoramic radiograph (PR) is widely used in routine dental care, but it inherently provides only a two-dimensional (2D) projection of complex three-dimensional (3D) craniofacial anatomy. Most existing learning-based methods attempt to computationally recover this 3D information by directly regressing native cone-beam computed tomography (CBCT) volumes from PR. However, this direct mapping requires the model to simultaneously learn common anatomical structures and patient-specific morphological variations. This entangled formulation makes the ill-posed 2D-to-3D inverse problem highly ambiguous, often producing over-smoothed reconstructions with blurred anatomical boundaries. To address this, we propose HyDAR-Pano3D, a two-stage framework that reformulates PR-to-CBCT reconstruction as a disentangled anatomical recovery problem. In Stage 1, a dual-encoder network integrates radiographic features with SAM-derived semantic priors to reconstruct an arch-normalized canonical volume. In Stage 2, an Anatomical Restoration Network predicts a prior-constrained structured deformation field to map this canonical volume back to the native space, restoring individual morphological variations. Experiments on three large-scale datasets show that HyDAR-Pano3D significantly outperforms baseline methods (p < 0.05), achieving a 25.76 dB PSNR, 85.70% SSIM, and an 83.83% overall anatomical Dice score. The synthesized volumes successfully support downstream segmentation of whole teeth (82.4% Dice) and the inferior alveolar canal (72.2% Dice), demonstrating that our disentangled approach preserves clinically relevant structures to enable robust anatomy-aware assessment when CBCT data is unavailable.

2605.20822 2026-05-21 cs.CV Updated version

TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection

Jiae Yoon, Ue-Hwan Kim

Affiliations: Department of AI Convergence, Gwangju Institute of Science and Technology (GIST)

AI summary: This paper proposes TERDNet, a Transformer encoder-recurrent decoder network for scene change detection that improves accuracy and robustness through multi-level feature extraction, a feature fusion module, a recurrent decoder, and an upsampling module.

Comments 8 pages, 4 figures. Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet's potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is available at https://github.com/AutoCompSysLab/TERDNet.

2605.20821 2026-05-21 cs.CV cs.RO Updated version

VSCD: Video-based Scene Change Detection in Unaligned Scenes

Jiae Yoon, Ue-Hwan Kim

Affiliations: Department of AI Convergence, Gwangju Institute of Science; GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science

AI summary: This work proposes VSCD, video-based scene change detection for unaligned scenes, which produces a pixel-wise change mask for each query frame. A multi-reference model aligns reference features through local patch correspondence and fuses per-candidate change features to decode high-resolution masks, outperforming existing image- and video-based baselines.

Comments 18 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications -- visual surveillance and object incremental learning.

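2605.20820 2026-05-21 cs.CV Updated version

AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting

Zhaojie Zeng, Yuesong Wang, Yawei Luo, Tao Guan

Affiliations: School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Technology, Zhejiang University

AI summary: This paper proposes AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass and removes per-image test-time optimization. A stage-wise residual architecture progressively predicts additional Gaussian primitives from reconstruction residuals, and an explicit stage-control mechanism activates new primitives only in under-reconstructed regions. A Predict-Optimize-Distill training strategy stabilizes multi-stage prediction, yielding more efficient image reconstruction.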

Comments preprint version

Abstract

2D Gaussian splatting provides an efficient explicit representation for image reconstruction, but existing methods still require costly per-image iterative optimization or rely on handcrafted priors for primitive allocation. We present AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, eliminating per-image test-time optimization. AIR adopts a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, together with an explicit Stage Control mechanism that activates new primitives only in under-reconstructed regions. A Predict-Optimize-Distill training strategy stabilizes multi-stage prediction by distilling short-horizon optimized Gaussian increments back into the predictor. The stabilized predictor is then jointly finetuned across stages and equipped with an image-adaptive quantizer for compact Gaussian storage. Experiments on Kodak and DIV2K show that AIR achieves better reconstruction quality than representative Gaussian-based baselines while reducing encoding time to 160-300 ms. Code: https://github.com/whoiszzj/AIR.git

2605.20818 2026-05-21 cs.CV Updated version

OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

Yisen Feng, Leigang Qu, Haoyu Zhang, Qiaohui Chu, Meng Liu, Xuemeng Song, Weili Guan, Liqiang Nie

Affiliations: Harbin Institute of Technology (Shenzhen); National University of Singapore; Pengcheng Laboratory; Shandong Jianzhu University; Southern University of Science and Technology

AI summary: This report proposes a reranking framework based on a multimodal large language model (MLLM) for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge 2026. It combines candidate segments from the existing localization model OSGNet with the video-language reasoning of an MLLM to localize temporal segments more accurately.

Comments Champion solution for the Natural Language Queries and GoalStep tracks of the Ego4D Challenge at the CVPR EgoVis Workshop 2026

Abstract

In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments in long untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that leverages the strong video-language reasoning capability of a multimodal large language model (MLLM) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from the existing localization model OSGNet, and then employ the MLLM to select the segment that best matches the given query, thereby refining the final prediction. Our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at https://github.com/iLearn-Lab/CVPR25-OSGNet.

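2605.20808 2026-05-21 cs.CV Updated version

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

Jinjin Zhang, Xiefan Guo, Di Huang

Affiliations: Beihang University

AI summary: This paper proposes Spatial Gram Alignment (SGA), which uses the representation priors of vision foundation models while preserving the generative capacity of LDMs. It resolves the conflict between generation quality and structural integrity in ultra-high-resolution image synthesis and achieves high-quality text-to-image synthesis.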

Comments Technical Report

Abstract

Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
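The difference between direct patch-wise distillation and aligning self-similarities can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the SGA loss from the paper: it builds cosine-similarity Gram matrices over patch features and compares them with a mean squared difference. All function names and the toy features are hypothetical.

```python
import math
from typing import List

Features = List[List[float]]  # one feature vector per spatial patch


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def gram(feats: Features) -> List[List[float]]:
    """N x N matrix of cosine similarities between all pairs of patches."""
    unit = [normalize(v) for v in feats]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def gram_alignment_loss(gen: Features, prior: Features) -> float:
    """Mean squared difference between the two self-similarity matrices.
    Only the number of patches must match; feature widths may differ."""
    if len(gen) != len(prior):
        raise ValueError("both feature maps need the same number of patches")
    g, p = gram(gen), gram(prior)
    n = len(g)
    return sum((g[i][j] - p[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


prior = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Same spatial structure, but rotated by 90 degrees and scaled by 2.
rotated = [[0.0, 2.0], [-2.0, 0.0], [-2.0, 2.0]]
# All patches identical: the spatial structure is lost.
collapsed = [[1.0, 0.0, 0.0]] * 3

loss_same_structure = gram_alignment_loss(rotated, prior)
loss_collapsed = gram_alignment_loss(collapsed, prior)
```

The rotated features would incur a large penalty under a direct feature-matching loss, yet their Gram loss is essentially zero because the pairwise relationships between patches are unchanged. This is one way to see why a self-similarity constraint leaves the generator's own feature space largely free while still penalizing structural mismatch, as in the collapsed case.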

2605.20807 2026-05-21 cs.CV 版本更新

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

通过中间结构预测分解主体驱动的图像生成

Hanzhong Guo, Yizhou Yu

发表机构 * School of Computing and Data Science, The University of Hong Kong(计算与数据科学学院,香港大学)

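AI summary: The paper proposes a two-stage framework for subject-driven text-to-image generation that first predicts a Canny map and then renders the final image conditioned on the source appearance and the predicted structure, with the aim of preserving high-frequency identity details such as logos, patterns, and text. An automatic pipeline builds a 100k-pair text-aware dataset, and the experiments suggest that intermediate structural prediction improves high-fidelity subject-driven generation.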

详情
AI中文摘要

主体驱动的文本到图像生成仍然难以保留诸如logo、图案和文本等高频率身份细节。现有方法通常直接在RGB空间中操作,这在大规模编辑下常导致细节退化。我们提出了一种两阶段框架,通过首先预测Canny图,然后基于源外观和预测的结构生成最终图像。为提高文本处理能力,我们进一步引入了一个全自动流程,构建了一个包含10万对文本感知数据集,并确保跨视角文本一致性。实验包括基于GPT-4.1的评估和知识蒸馏研究,结果表明在选定基线之上有明显提升,并表明中间结构预测是实现高保真主体驱动生成的有效途径。我们的数据集和代码将向公众开放。

英文摘要

Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.

2605.20804 2026-05-21 cs.CV cs.LG 版本更新

OlmoEarth v1.1: A more efficient family of OlmoEarth models

OlmoEarth v1.1: 一个更高效的OlmoEarth模型家族

Gabriel Tseng, Yawen Zhang, Favyen Bastani, Henry Herzog, Joseph Redmon, Hadrien Sablon, Piper Wolters, Patrick Alan Johnson, Christopher Wilhelm, Patrick Beukema

发表机构 * Allen Institute for AI(艾伦人工智能研究所)

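AI summary: The paper presents an improved OlmoEarth model family whose training and inference changes substantially reduce compute cost while maintaining overall model performance.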

详情
AI中文摘要

我们介绍了OlmoEarth家族的一系列改进。这些改进使我们在训练过程中减少了计算成本(训练Base模型所需的GPU小时减少了1.7倍),并在Sentinel-2任务中推理时减少了MACs(2.9倍),同时保持了模型的整体性能。所有训练代码均在github.com/allenai/olmoearth_pretrain上提供。

英文摘要

We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training ($1.7\times$ reduction in GPU hours required to train our Base models) and inference ($2.9\times$ reduction in MACs on Sentinel-2 tasks), while maintaining the models' overall performance. All training code is available at github.com/allenai/olmoearth_pretrain.

2605.20780 2026-05-21 cs.LG cs.CV 版本更新

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

学习物理中的推理:通过表征对齐打破科学扩散中的捷径学习

Haozhe Jia, Pengyu Yin, Wenshuo Chen, Shaofeng Liang, Lei Wang, Bowen Tian, Xiucheng Wang, Nanqian Jia, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Shandong University(山东大学) LimX Dynamics Technology Co., Ltd.(逐际动力科技有限公司) Xidian University(西安电子科技大学) Peking University(北京大学) Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)(江苏省产业技术研究院深度感知技术研究所) Griffith University(格里菲斯大学)

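AI summary: The paper proposes REPA-P, a teacher-free framework that aligns intermediate features with physical states through first-principles PDE residuals. It targets the shortcut learning that arises under shifted boundary conditions when the intermediate representations of physics-informed diffusion models are left unconstrained. Across four PDE tasks it reports faster convergence, lower physics residuals, and better out-of-distribution robustness.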

详情
AI中文摘要

物理信息扩散模型通常只在最终输出上施加PDE约束,使中间表示不受约束,并在边界条件发生偏移时容易出现捷径学习。我们引入了REPA-P,一种无需教师、与架构无关的框架,利用第一性原理残差将中间特征与物理状态对齐。REPA-P在选定的层上附加轻量级的1×1投影头,将隐藏激活解码为物理量,并在训练过程中施加PDE残差损失。这些投影头在推理时被丢弃,因此不引入任何额外开销。在达西流、拓扑优化、静电势和湍流槽道流四个PDE任务上,REPA-P将收敛速度最多提升2倍,将物理残差最多降低66.4%,将分布外鲁棒性最多提升49.3%,并且在U-Net和Diffusion Transformer骨干网络上均取得一致的收益。消融实验表明,仅监督少量中间层即可获得大部分收益,并且与输出级物理损失互补。代码可在[https://github.com/Hxxxz0/REPA-P](https://github.com/Hxxxz0/REPA-P)获取。

英文摘要

Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce **REPA-P**, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing **zero overhead**. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at [https://github.com/Hxxxz0/REPA-P](https://github.com/Hxxxz0/REPA-P).
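
To make the idea of a projection head plus a PDE residual loss concrete, here is a minimal sketch under stated assumptions. It uses the Laplace equation as a stand-in PDE (the paper's tasks are Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow), a plain per-pixel linear map as the 1×1 head, and a 5-point finite-difference stencil. The names `project` and `laplace_residual_loss` are hypothetical and do not come from the REPA-P code.

```python
def project(hidden, weights):
    """Per-pixel linear map from C hidden channels to one physical scalar (a 1x1 'head')."""
    return [[sum(w * c for w, c in zip(weights, pixel)) for pixel in row] for row in hidden]

def laplace_residual_loss(field, h=1.0):
    """Mean squared 5-point Laplacian over the interior grid points."""
    rows, cols = len(field), len(field[0])
    total, count = 0.0, 0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            lap = (field[i + 1][j] + field[i - 1][j] + field[i][j + 1]
                   + field[i][j - 1] - 4.0 * field[i][j]) / (h * h)
            total += lap * lap
            count += 1
    return total / count

# Toy hidden activations: two channels holding x^2 and y^2 at each grid point.
hidden = [[(float(x * x), float(y * y)) for x in range(5)] for y in range(5)]

harmonic = project(hidden, weights=(1.0, -1.0))     # u = x^2 - y^2 satisfies Laplace
non_harmonic = project(hidden, weights=(1.0, 1.0))  # u = x^2 + y^2 has Laplacian 4

loss_ok = laplace_residual_loss(harmonic)
loss_bad = laplace_residual_loss(non_harmonic)
```

The same hidden features give a zero residual with one head and a residual of 16 with another, so in a training setup the gradient of this loss would push both the head and the underlying features toward physically consistent states. The head is only needed to compute the loss, which is consistent with the abstract's statement that the heads are discarded at inference.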

2605.20777 2026-05-21 cs.CV 版本更新

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

AttriStory: 基于扩散模型的视觉叙事中细粒度属性实现

Manogna Sreenivas, Rohit Kumar, Soma Biswas

发表机构 * Indian Institute of Science(印度科学研究院)

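AI summary: The paper introduces AttriStory, a benchmark for fine-grained attribute realization in visual storytelling, together with a plug-and-play latent optimization module that operates during the early denoising steps. Its AttriLoss objective strengthens cross-attention alignment for the desired attribute-object pairs so that attributes are localized more accurately.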

Comments Accepted at CVPR AIStory Workshop, 2026

详情
AI中文摘要

基于扩散模型的视觉叙事在保持叙事场景中角色一致性方面取得了显著进展。然而,一个关键的差距仍然存在:尽管这些方法确保角色在不同场景中保持一致,但它们没有系统的方法来确保生成图像中诸如服装颜色和纹理等细粒度属性得到忠实呈现。为此,我们引入了AttriStory基准,通过大型语言模型收集了200个跨场景故事,涵盖10种不同的艺术风格。每个场景都包含详细的属性规范,以实现丰富的视觉叙事。进一步,为了解决属性实现问题,我们提出了一种插件式的潜在优化模块,在早期去噪步骤中操作,当模型建立结构和语义内容时。我们通过AttriLoss目标实现这一点,该目标旨在最大化所需属性-对象对的交叉注意力图的对齐度,同时抑制虚假关联,引导模型正确定位属性。这种方法与现有的一致性机制正交,能够无缝集成到当前的故事生成流程中,而无需进行架构修改。我们的实验表明,AttriLoss在所有基线中都实现了持续的改进。这项工作将属性实现定位为视觉叙事的一个独立且互补的维度,与角色一致性并列,推动该领域向细粒度属性控制的故事生成发展。项目页面:https://manogna-s.github.io/attristory/

英文摘要

Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods keep a character consistent across scenes, they provide no systematic way to ensure that fine-grained attributes, such as the color and texture of clothing and accessories, are faithfully rendered in the generated images. Toward this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using a Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through the AttriLoss objective, designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements when AttriLoss is incorporated across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project page: https://manogna-s.github.io/attristory/

2605.20766 2026-05-21 cs.CV 版本更新

Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection

Diffuse to Detect: 基于伪标签扩散的双级样本再平衡点监督红外小目标检测

Zhu Liu, Yuanhang Yao, Ping Qian, Zihang Chen, Risheng Liu

发表机构 * School of Software Technology, Dalian University of Technology, Dalian, China(大连理工大学软件学院)

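AI summary: The paper presents a more adaptive and stable framework for point-supervised infrared small-target detection. Exploiting the intrinsic consistency between thermal radiation patterns and heat diffusion, a physics-induced annotation strategy expands single-point labels into reliable pseudo-masks, and a bi-level dual-update framework jointly optimizes detector weights, sample weights, and diffusion parameters to strengthen supervision and alleviate sample imbalance.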

详情
AI中文摘要

点监督已成为解决红外小目标检测密集标注问题的可扩展解决方案,但其性能受限于两个耦合的瓶颈:在杂乱、低对比度的红外图像中伪标签演化的不稳定性以及严重的样本分布不平衡。本文提出了一种更适应且稳定的框架来解决这些问题。利用热辐射模式与热扩散的内在一致性,我们提出了一种物理诱导的标注策略,将单点标签扩展为可靠的伪掩码。为进一步增强监督并缓解样本不平衡,我们开发了双级双更新框架,联合优化检测器权重、样本权重和扩散参数。一个元分类器动态预测样本级损失权重,而一个可微扩散模块通过检测反馈细化伪标签,使训练与超参数优化之间实现自适应交互。在多个数据集上的广泛实验表明,该方法实现了五倍的标注加速,优越的检测精度,并在仅使用30%训练数据时表现出可比的性能,验证了该方法的效率和实用性。我们的代码可在https://github.com/yuanhang-yao/diffuse-to-detect获取。

英文摘要

Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at https://github.com/yuanhang-yao/diffuse-to-detect.
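
The step of expanding a single-point label into a pseudo-mask by heat diffusion can be sketched as follows. This is an illustrative toy, not the authors' module: it runs plain isotropic diffusion from the annotated pixel and thresholds the result relative to its peak, whereas the paper describes a differentiable diffusion module whose parameters are refined with detection feedback. The function name and the parameters `steps`, `k`, and `ratio` are assumptions.

```python
def diffuse_point_label(shape, seed, steps=2, k=0.2, ratio=0.3):
    """Spread unit heat from `seed` with explicit diffusion, then threshold into a mask."""
    rows, cols = shape
    u = [[0.0] * cols for _ in range(rows)]
    u[seed[0]][seed[1]] = 1.0
    for _ in range(steps):
        nxt = [row[:] for row in u]
        for i in range(rows):
            for j in range(cols):
                # Out-of-range neighbours reuse the centre value (zero-flux border).
                up = u[i - 1][j] if i > 0 else u[i][j]
                down = u[i + 1][j] if i < rows - 1 else u[i][j]
                left = u[i][j - 1] if j > 0 else u[i][j]
                right = u[i][j + 1] if j < cols - 1 else u[i][j]
                nxt[i][j] = u[i][j] + k * (up + down + left + right - 4.0 * u[i][j])
        u = nxt
    peak = max(max(row) for row in u)
    return [[value >= ratio * peak for value in row] for row in u]

# One annotated point in the middle of a 7x7 patch grows into a 3x3 pseudo-mask.
mask = diffuse_point_label(shape=(7, 7), seed=(3, 3))
```

The explicit update is stable only for `k` up to 0.25 on a 2D grid. In this toy, `steps`, `k`, and `ratio` jointly control how far the mask grows, which loosely plays the role of the diffusion parameters that the paper optimizes.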

2605.20760 2026-05-21 cs.CV 版本更新

SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation

SpineContextResUNet: 一种计算高效的残差U-Net用于脊柱CT分割

K S Nithurshen, Saurabh J. Shigwan

发表机构 * Shiv Nadar University(希夫·纳达尔大学)

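AI summary: The paper proposes SpineContextResUNet, a computationally efficient 3D Residual U-Net for rapid spinal localization. A lightweight Context Block reduces compute requirements while maintaining accuracy, which makes the model suitable for resource-constrained environments.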

Comments 2 Figures, 3 Tables

详情
AI中文摘要

自动分割CT扫描中的脊柱是病理评估和手术规划的前提。然而,最先进的方法,尤其是基于Transformer或大规模集成(ensemble)的方法,需要大量GPU资源,阻碍了其在资源受限环境或边缘设备上的临床应用。为此,我们引入了SpineContextResUNet,一种计算高效的3D残差U-Net,用于快速脊柱定位。我们的架构整合了一个轻量级的上下文块(Context Block),该块使用并行多膨胀率卷积来捕捉长距离解剖依赖,同时避免了循环神经网络(RNN)的高延迟和自注意力机制的内存开销。在两个公开基准VerSe2020和CTSpine1K上的广泛验证表明,我们的模型分别取得了88.17%和88.13%的Dice分数。为了评估严格硬件约束下的性能,我们将模型与一个缩小后的瓶颈版SwinUNETR进行了比较,后者被缩放到与我们约1.7M的硬件占用相当。受限的Transformer在有限数据条件下因缺乏空间归纳偏置而性能严重下降,而我们基于CNN的方法仍保持了高精度。关键的是,TotalSegmentator等重型基线在商用硬件(Intel Core i5,8GB RAM)上因内存耗尽而失败,而我们的模型仍能稳健地完成推理,使其成为床旁(point-of-care)诊断以及在Nvidia Jetson Orin Nano等边缘平台上部署的可行方案。

英文摘要

Automated segmentation of the vertebral column in Computed Tomography (CT) scans is a prerequisite for pathological assessment and surgical planning. However, state-of-the-art methods, particularly those based on Transformers or large-scale ensembles, demand substantial GPU resources, creating a barrier for clinical adoption in resource-constrained environments or on edge devices. To address this, we introduce SpineContextResUNet, a computationally efficient 3D Residual U-Net designed for rapid spinal localization. Our architecture integrates a lightweight Context Block that employs parallel multi-dilated convolutions to capture long-range anatomical dependencies without the high latency of Recurrent Neural Networks (RNNs) or the memory overhead of Self-Attention mechanisms. Extensive validation on two public benchmarks, VerSe2020 and CTSpine1K, demonstrates that our model achieves Dice scores of 88.17% and 88.13%, respectively. To evaluate performance under strict hardware constraints, we compared our model against a bottlenecked SwinUNETR scaled to match our ~1.7M hardware footprint. While the constrained Transformer suffers severe performance degradation due to a lack of spatial inductive biases in a limited-data regime, our CNN-based approach successfully maintains high accuracy. Crucially, whereas heavy baselines like TotalSegmentator fail due to memory exhaustion on commodity hardware (Intel Core i5, 8GB RAM), our model performs robust inference, making it a viable solution for point-of-care diagnostics and deployment on edge platforms like the Nvidia Jetson Orin Nano.
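
The way parallel multi-dilated convolutions widen the receptive field can be shown with a small 1D sketch. It is a stand-in for the 3D Context Block, not a reproduction of it: the dilation rates (1, 2, 4), the all-ones kernel, and the choice to sum the branches are assumptions made for illustration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Same'-size 1D convolution with zero padding and a centred, dilated kernel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = i + (k - half) * dilation
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out

def context_block(signal, kernel=(1.0, 1.0, 1.0), dilations=(1, 2, 4)):
    """Run the same 3-tap kernel at several dilation rates in parallel and sum the branches."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]

# Response to a single impulse shows which offsets one block can see.
impulse = [0.0] * 11
impulse[5] = 1.0
response = context_block(impulse)
reach = [i - 5 for i, v in enumerate(response) if v != 0.0]  # [-4, -2, -1, 0, 1, 2, 4]
```

With three 3-tap branches the block reaches four positions to either side in a single layer, using nine weights in total. This is the kind of long-range context at low cost that the abstract contrasts with recurrent layers and self-attention.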

2605.20758 2026-05-21 cs.AI cs.CV cs.LG cs.RO 版本更新

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

面向组合奖励下流模型的冲突感知加法引导

Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh

发表机构 * Smart Systems Institute, National University of Singapore, Singapore(新加坡国立大学智能系统研究所) Faculty of Computing, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机学院) School of Computing, National University of Singapore, Singapore(新加坡国立大学计算机学院)

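AI summary: The paper proposes Conflict-Aware Additive Guidance ($g^\text{car}$) for steering flow models under compositional rewards. The method dynamically detects and resolves gradient conflicts among multiple constraints, which corrects off-manifold drift and improves generation fidelity.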

Comments Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

推理时引导采样将生成过程解释为可控轨迹,从而无需微调即可引导最先进的扩散模型和流模型。这提供了一种简单灵活的方式,将外部约束(如代价函数或预训练验证器)注入受控生成。然而,现有方法在同时组合多个约束时往往失效,导致偏离真实数据流形。在本工作中,我们找出了这种离流形漂移的根本原因,并发现近似误差会随梯度不一致程度的加剧而显著增大。基于这些发现,我们提出了冲突感知加法引导($g^\text{car}$),一种轻量且可学习的方法,它通过动态检测并化解梯度冲突来主动纠正离流形漂移。我们在多个领域验证了 $g^\text{car}$ 的有效性,涵盖合成数据集、图像编辑,以及面向规划与控制的生成式决策。结果表明,$g^\text{car}$ 能有效纠正离流形漂移,在生成保真度上超越基线方法,且计算开销很小。代码可在 https://github.com/yuxuehui/CAR-guidance 获取。

英文摘要

Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.
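
What detecting and resolving a gradient conflict between two reward gradients could look like is sketched below. The abstract does not specify the resolution rule, and $g^\text{car}$ is described as learnable, so this sketch swaps in a fixed PCGrad-style projection from the multi-task learning literature as a stand-in. It should not be read as the authors' method, and `resolve_conflict` is a hypothetical name.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def resolve_conflict(g1, g2):
    """Return (combined guidance, conflict flag) for two reward gradients."""
    d = dot(g1, g2)
    if d >= 0.0:
        # Aligned or orthogonal gradients: plain additive guidance.
        return [x + y for x, y in zip(g1, g2)], False
    # Conflict: remove from each gradient its component along the other one.
    p1 = [x - d / dot(g2, g2) * y for x, y in zip(g1, g2)]
    p2 = [y - d / dot(g1, g1) * x for x, y in zip(g1, g2)]
    return [a + b for a, b in zip(p1, p2)], True

aligned, flag_a = resolve_conflict([1.0, 0.0], [1.0, 1.0])   # no conflict: plain sum
fixed, flag_c = resolve_conflict([1.0, 0.0], [-1.0, 1.0])    # negative dot product
```

In the conflicting case the naive sum would be `[0.0, 1.0]`, which makes no progress on the first reward. The projected combination `[0.5, 1.5]` has a non-negative inner product with both original gradients, so neither reward is pushed backward by the combined step.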

2605.20743 2026-05-21 cs.CV cs.CL 版本更新

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

Draw2Think: 通过约束引擎交互增强几何推理

Juncheng Hu, Jiawei Du, Xin Zhang, Joey Tianyi Zhou

发表机构 * National University of Singapore(新加坡国立大学) Centre for Frontier AI Research, Agency for Science, Technology and Research(科技研究局前沿人工智能研究中心) Institute of High Performance Computing, Agency for Science, Technology and Research(科技研究局高性能计算研究所)

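AI summary: Draw2Think recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine, which makes the intermediate geometric states verifiable and improves reasoning accuracy.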

详情
AI中文摘要

视觉-语言模型在解决几何问题时准确性不断提高,但其中间状态仍然保持在潜在空间中且不可验证:文本推理或绘图代码中表达的关系无法保证约束满足的配置能实现它。我们发现现有的基于渲染像素或单次脚本的外部化方法无法提供精确的、每一步的几何保证。通过代数定义强制几何关系从而填补了这一差距:工作空间变成一个经过约束检查的动态画布。我们提出了Draw2Think框架,该框架将几何推理从潜在空间推断转换为与GeoGebra约束引擎的代理交互。在提出-绘制-验证循环中,Draw2Think将假设外部化到可执行画布上,测量精确的几何量,并将结构化的观察反馈给模型,使后续推理从由共享工作空间支撑的检查画布状态开始。这种外部化使两个属性可以分别审计:模型级别的构造保真度(画布是否实现了预期的配置)和引擎级别的测量保真度(来自画布约束的精确值和关系)。在构造、结果和渲染评估中,Draw2Think构建的画布在GeoGoal上通过95.9%的谓词级别和84.0%的严格问题级别构造检查,改进了平面/实体基准测试的结果准确性,最高提高了4.1%/16.4%,并在GenExam-math上达到了68.2%/90.5%的严格/宽松渲染分数。项目页面可在https://draw2think.github.io/上找到。

英文摘要

Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/

2605.20738 2026-05-21 cs.CV 版本更新

STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection

STAR-IOD: 基于尺度解耦拓扑对齐与伪标签细化的遥感增量目标检测

Yaoteng Zhang, Qing Zhou, Junyu Gao, Qi Wang

发表机构 * School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China(计算机科学学院,西北工业大学,西安710072,中国) School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China(人工智能学院,光学与电子学(iOPEN),西北工业大学,西安710072,中国)

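AI summary: The paper proposes STAR-IOD for remote sensing incremental object detection. A Subspace-decoupled Topology Distillation module aligns inter-class topological relationships and mitigates the intra-class representation discrepancies caused by scale variation, while a Clustering-driven Pseudo-label Generator dynamically identifies class-specific thresholds to compensate for missing old-class annotations. On DIOR-IOD and DOTA-IOD the method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP, respectively.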

Comments STAR-IOD was accepted by ISPRS Journal of Photogrammetry and Remote Sensing

详情
AI中文摘要

遥感影像通常以连续数据流的形式出现。传统检测器在学习新类别时往往会遗忘之前学习的类别;因此,研究遥感增量目标检测(RS-IOD)具有重要意义。然而,现有方法大多忽视了遥感场景中普遍存在的类别内尺度变化,这削弱了知识迁移和旧知识保留的有效性。此外,RS-IOD还受到标注缺失的影响,导致模型将旧类实例误分类为背景。为了解决这些挑战,我们提出了一种新的框架STAR-IOD。首先,我们引入了子空间解耦拓扑蒸馏(STD)模块,以转移结构知识,显式对齐类别间拓扑关系,并缓解由尺度变化引起的类别内表示差异。此外,我们引入了聚类驱动伪标签生成器(CPG),这是一个即插即用模块,利用K-Means聚类动态识别类别特定阈值,从而保证真正阳性目标与背景噪声之间的准确区分,并缓解旧类标注缺失问题。我们还构建了两个遥感增量目标检测数据集,DIOR-IOD和DOTA-IOD,以促进RS-IOD的研究。广泛的实验表明,我们的方法在DIOR-IOD和DOTA-IOD数据集上分别以1.7%和2.1%的mAP优于现有方法,有效缓解了灾难性遗忘,同时在基础类和新类上保持了强劲的检测性能。代码和数据集已发布在:https://github.com/zyt95579/STAR-IOD。

英文摘要

Remote sensing imagery typically arrives in the form of continuous data streams. Traditional detectors often forget previously learned categories when learning new ones; therefore, research on Remote Sensing Incremental Object Detection (RS-IOD) is of great significance. However, existing methods largely overlook the intra-class scale variations prevalent in remote sensing scenes, which undermines the effectiveness of knowledge transfer and old knowledge preservation. Moreover, RS-IOD also suffers from missing annotations, which cause the model to misclassify old-class instances as background. To address these challenges, we propose a novel framework, STAR-IOD. First, we introduce a Subspace-decoupled Topology Distillation (STD) module to transfer structural knowledge, explicitly aligning inter-class topological relationships and mitigating intra-class representation discrepancies induced by scale shifts. Furthermore, we introduce the Clustering-driven Pseudo-label Generator (CPG), a plug-and-play module that leverages K-Means clustering to dynamically identify class-specific thresholds, thereby separating true positive targets from background noise more accurately and alleviating the issue of missing annotations for old classes. We also construct two Remote Sensing Incremental Object Detection datasets, DIOR-IOD and DOTA-IOD, to facilitate research on RS-IOD. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP on DIOR-IOD and DOTA-IOD, respectively, effectively alleviating catastrophic forgetting while preserving strong detection performance on both base and novel classes. The code and dataset are released at: https://github.com/zyt95579/STAR-IOD.
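
The idea of using K-Means to find class-specific pseudo-label thresholds can be sketched as follows. This is a hedged illustration rather than the CPG module itself: it assumes two clusters per class over the old model's confidence scores and takes the midpoint between the two cluster centres as the threshold, neither of which is stated in the abstract. The class names and scores are made up for the example.

```python
def two_means_threshold(scores, iters=20):
    """Split 1D confidence scores into two clusters; return the midpoint of the centres."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        low = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high = [s for s in scores if abs(s - lo) > abs(s - hi)]
        if not high:  # all scores identical: nothing to separate
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return (lo + hi) / 2.0

# Hypothetical old-model confidences for two classes with different score ranges.
scores_by_class = {
    "airplane": [0.10, 0.15, 0.20, 0.80, 0.85, 0.90],
    "bridge": [0.05, 0.10, 0.15, 0.40, 0.45, 0.50],
}
thresholds = {name: two_means_threshold(s) for name, s in scores_by_class.items()}
pseudo_labels = {name: [s for s in scores_by_class[name] if s >= thresholds[name]]
                 for name in scores_by_class}
```

A single global cut-off of 0.5 would keep only one of the three confident "bridge" detections, whereas the per-class thresholds (about 0.5 and 0.275) keep all three confident detections in each class. That difference is the motivation for adapting the threshold per class.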

2605.20737 2026-05-21 cs.CV 版本更新

Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

通过语言先验解决无监督3D点云分割中的长尾歧义

Siqi Wei, Hongbin Xu, Feng Xiao, Tian Lan, Chun Li, Ming Li, Qiuxia Wu

发表机构 * South China University of Technology(华南理工大学) Bytedance(字节跳动) Tsinghua University(清华大学) Shenzhen MSU-BIT University(深圳北理莫斯科大学) Guangming Laboratory(光明实验室)

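AI summary: The paper proposes LangTail, a framework that uses the balanced world knowledge in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. It builds multi-level associations between language-derived semantic priors and visually underrepresented minor classes to strengthen their representations, and reports gains on ScanNet-v2, S3DIS, and nuScenes.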

Comments In submission. The code will be released at: https://github.com/Whisky0129/langtail_official

详情
AI中文摘要

现有的无监督3D点云分割方法主要依赖于纯视觉相似性基于聚类的学习范式,这存在一个根本性限制:长尾歧义。在这样的范式中,次要类别的特征会被主导簇持续吸收,导致预测严重不平衡。为了解决这个问题,我们提出了LangTail,一种语言引导的分层学习框架,利用语言模型中编码的平衡世界知识来缓解无监督3D分割中的长尾歧义。关键思想是建立语言衍生语义先验与视觉上不常见的次要类别之间的多级关联,从而补偿纯粹视觉聚类对主导类别的偏关注。具体来说,LangTail首先从语言模型中构建实体级语义先验,捕捉跨类别的平衡和细粒度世界知识。这些先验通过对比对齐注入到分层聚类框架中。这引导多粒度语义结构的形成,并防止次要类别被主导簇吸收,从而为不常见的类别产生更具判别性的表示。在ScanNet-v2、S3DIS和nuScenes上进行的大量实验表明,LangTail在ScanNet-v2、S3DIS和nuScenes上分别比现有方法提高了+13.5、+12.9和+8.9 mIoU。这些结果证明了语言先验在提升3D点云中少数类别表示的有效性。代码将在:https://github.com/Whisky0129/langtail_official发布。

英文摘要

Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, i.e., +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: https://github.com/Whisky0129/langtail_official.

2605.20733 2026-05-21 cs.CV 版本更新

Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

Sketch2MinSurf: 通过视觉-语言引导从手绘草图生成可编辑的极小曲面

Wenda Wang, Anqi Liu, Junqi Yang, Lei He, Luying Wang, Jiachen Lu, Weixin Huang

发表机构 * School of Architecture, Tsinghua University(清华大学建筑学院) Department of Architecture, National University of Singapore(新加坡国立大学建筑系)

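AI summary: The study proposes Sketch2MinSurf, which combines vision-language guidance with geometric optimization to generate smooth, editable 3D surfaces from hand-drawn sketches. A spatial-topological encoding and the Sketch2MinSurf Structural Loss (S2MS-Loss) jointly constrain geometric reconstruction and topological coherence.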

Comments 22 pages, 16 figures, includes appendix

详情
AI中文摘要

将手绘草图转换为结构化的3D几何体仍然具有挑战性,因为非欧几里得曲面的表示和拓扑一致性维护困难。现有的生成模型如GANs、NeRFs和扩散架构往往无法直接生成可编辑的流形用于下游设计流程。我们提出了Sketch2MinSurf,一种结合视觉-语言和几何优化的混合框架,通过将视觉-语言引导与最小曲面理论相结合,从手绘草图生成平滑且可编辑的3D曲面。我们的方法核心是一种空间-拓扑编码,将几何表示为节点坐标和实/虚拟边骨架的元组,使在生成过程中能够实现稳定的拓扑控制。我们进一步引入了Sketch2MinSurf结构损失函数(S2MS-Loss),一种奖励调制的目标,联合约束几何重建和拓扑一致性。在100个草图的测试集上,Sketch2MinSurf实现了0.844的拓扑相似度得分,优于现有的草图到形状基线。生成的流形可以直接编辑且没有非流形伪影。一所大学的公共艺术装置展示了该方法在人类意图驱动的3D形式生成中的潜力。数据集和代码可在https://anonymous.4open.science/r/Sketch2MinSurf/上获取。

英文摘要

Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.

2605.20732 2026-05-21 cs.CV 版本更新

Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

深度注意力重加权:CNN中的后处理注意力特征聚合以解纠缠核心与伪相关特征

Kin Whye Chew, Jingxian Wang

发表机构 * National University of Singapore(新加坡国立大学)

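AI summary: The paper proposes Deep Attention Reweighting (DAR), a post-hoc attention-based feature aggregation module that replaces the Global Average Pooling layer in CNNs. This reduces the entanglement of core and spurious features and lowers the model's reliance on spurious correlations.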

Comments Under review. 26 pages, 7 figures

详情
AI中文摘要

卷积神经网络(CNNs)经常利用数据集中的伪相关性,学习到表面上具有预测性但与因果无关的特征,导致泛化能力差和公平性问题。深度特征重加权(DFR)是一种后处理技术,通过在目标数据集上重新训练分类头来减少已训练模型对伪相关性的依赖。然而,我们发现DFR从根本上受限于只能在纠缠特征上操作,这限制了其在放大核心特征的同时抑制伪特征的能力。我们将这种纠缠追溯到普遍使用的全局平均池化(GAP)层,该层不加区分地将空间上彼此分离的核心特征和伪特征压缩成单一表示。为了解决这个问题,我们提出了深度注意力重加权(DAR),一种基于注意力的后处理聚合模块,它替换GAP层并与分类头一起重新训练。DAR对特征图上的各空间位置计算自适应权重,从而能够在特征被压缩为纠缠表示之前选择性地抑制伪特征。在多种数据集、指标和消融实验中,DAR始终优于DFR,表明我们基于注意力的聚合方法缓解了GAP引起的纠缠,并减少了对伪相关性的依赖。

英文摘要

Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model's reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.
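
The contrast between Global Average Pooling and an attention-weighted aggregation over spatial locations can be sketched in a few lines. This is an illustrative toy rather than the DAR module: it assumes a single query vector scored against each location by a dot product and normalized with a softmax, and the names and the two-location feature map are hypothetical.

```python
import math

def global_average_pool(feats):
    """Uniform mean over spatial locations (GAP)."""
    n = len(feats)
    return [sum(f[c] for f in feats) / n for c in range(len(feats[0]))]

def attention_pool(feats, query):
    """Softmax-weighted mean over spatial locations, scored by a dot product with `query`."""
    scores = [sum(q * x for q, x in zip(query, f)) for f in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * f[c] for w, f in zip(weights, feats)) for c in range(len(feats[0]))]

# Two locations: channel 0 fires on the object (core), channel 1 on the background (spurious).
feats = [[1.0, 0.0], [0.0, 1.0]]
gap = global_average_pool(feats)              # mixes both locations equally
uniform = attention_pool(feats, [0.0, 0.0])   # a zero query reduces to GAP
focused = attention_pool(feats, [5.0, -5.0])  # a query favouring the core location
```

With a zero query the attention weights are uniform and the result equals GAP, so GAP is a special case of this aggregation. With a trained query, the background location can be down-weighted before the spatial collapse, which a classifier head retrained on already-pooled features cannot do.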

2605.20728 2026-05-21 cs.CV 版本更新

Early High-Frequency Injection for Geometry-Sensitive OOD Detection

早期高频注入用于几何敏感的分布外(OOD)检测

Chuanjie Cheng, Ningkang Peng, Chenxi Liu, Yifan He, Peirong Ma, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学)

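AI summary: Using a band-wise MMD^2 analysis, the paper shows that high-frequency input bands matter for geometry-sensitive OOD detection and proposes EIHF, an input-side intervention that improves detection on CIFAR-100 and ImageNet-100 while revealing a limitation on the scene-centric Places shift.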

详情
AI中文摘要

事后域外检测器在训练后对logits或特征进行评分,其成功依赖于表示中已编码的几何结构。我们通过跨CE、SimCLR、SupCon和域外导向表示方法PALM的带宽MMD^2分析重新审视这一假设。在我们的诊断中,低频输入带诱导更弱的ID/OOD特征差异,而高频带倾向于提供更强的分离性。这一观察促使提出EIHF,一种输入侧干预方法,在第一次卷积之前暴露高频证据而不改变训练目标。EIHF在几何敏感的域外检测中表现最强:在匹配的训练和评分设置下,它重塑类条件特征几何并减少ID/OOD马哈拉诺斯距离重叠。在CIFAR-100和ImageNet-100上的实验显示,在CIFAR-100上获得提升,在ImageNet-100上获得最佳的平均FPR95和次佳的平均AUROC,同时揭示了在场景中心Places迁移上的局限性。代码可在https://anonymous.4open.science/r/EIHF获得。

英文摘要

Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at https://anonymous.4open.science/r/EIHF.

2605.20727 2026-05-21 cs.CV 版本更新

GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels

GAMR: 带虚拟异常合成的几何感知流形正则化用于噪声标签学习

Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Peirong Ma, Xichen Yang, Weiguang Qu, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学) Nanjing University of Chinese Medicine(南京中医药大学)

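AI summary: The paper proposes a geometry-aware manifold regularization approach that actively synthesizes virtual outlier samples to reshape feature-space geometry for learning with noisy labels. The regularization makes hard samples easier to distinguish from noisy ones and yields more robust representations.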

详情
AI中文摘要

深度神经网络(DNNs)在处理噪声标签时会遭受显著的性能下降,主要由于过度拟合错误标记的数据。当前主流方法试图通过在训练过程中被动过滤干净样本来缓解这一问题。然而,在受噪声破坏的特征空间中,简单的样本过滤难以区分具有挑战性的样本和噪声样本,从而成为模型性能的瓶颈。我们首次强调了主动重塑特征空间几何在学习噪声数据中的根本重要性。我们提出了一种新颖的几何感知流形正则化范式,其核心思想是通过主动合成虚拟异常样本来显式构建数据流形之间的能量屏障。通过施加促进类内紧凑性和类间分离的几何约束,该方法增强了难样本与噪声样本之间的可区分性,从而学习到更鲁棒的表示。我们的正则化机制具有高度的通用性,其有效性不依赖于任何关于噪声模式的先验假设。它可以作为独立机制集成到现有的样本选择框架中,提供更强的鲁棒性以应对多样的噪声环境。实验表明,我们的范式在多个基准上,包括CIFAR-10,均实现了超越当前最先进(SOTA)方法的性能,特别是在更具挑战性的不对称噪声条件下表现尤为突出。此外,该范式显著增强了模型在Out-of-Distribution(OOD)检测方面的能力,确保了在开放世界场景中更高的可靠性和安全性。

英文摘要

Deep neural networks (DNNs) experience significant performance degradation when processing noisy labels, primarily due to overfitting on mislabeled data. Current mainstream approaches attempt to mitigate this issue by passively filtering clean samples during training. However, simple sample filtering within feature spaces degraded by noise struggles to distinguish between challenging samples and noisy samples, creating a bottleneck for model performance. We highlight for the first time the fundamental importance of actively reshaping feature space geometry for learning from noisy data. We propose a novel Geometry-aware Manifold Regularization Paradigm whose core idea is to explicitly construct energy barriers between data manifolds by actively synthesizing virtual outlier samples. By imposing geometric constraints that promote intra-class compactness and inter-class separation, this approach enhances the discriminability between hard and noisy samples, leading to the learning of more robust representations. Our regularization mechanism exhibits high universality, with effectiveness independent of any prior assumptions about noise patterns. It can be integrated as a standalone mechanism into existing sample selection frameworks, providing stronger robustness against diverse noisy environments. Experiments demonstrate that our paradigm achieves performance surpassing current state-of-the-art (SOTA) methods on multiple benchmarks, including CIFAR-10, with particularly pronounced advantages under more challenging asymmetric noise conditions. Furthermore, this paradigm significantly enhances the model's capability in Out-of-Distribution (OOD) detection, ensuring superior reliability and safety for deployment in open-world scenarios.

2605.20725 2026-05-21 cs.CV Updated version

Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label

整体可靠性传播:解耦标注与预测以实现鲁棒的噪声标签

Jingyang Mao, Ningkang Peng, Yanhui Gu

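Affiliations: Nanjing Normal University

AI summary: This paper proposes Holistic Reliability Propagation (HRP), which improves robustness to noisy labels by decoupling the reliability of the annotation from the reliability of the model prediction. Bilevel meta-learning produces two batch-normalized scalars per sample, one for the given label and one for the pseudo-label, and HRP routes them to different objectives, improving average accuracy on synthetic and real-world benchmarks.

AI Chinese abstract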

在多媒体分类中,使用噪声标签学习时通常将外部注释和模型预测合并为一个可靠性权重,尽管这两个来源可能因不同的原因失效。我们相反地估计解耦的可靠性:双层元学习为每个样本生成两个批次标准化标量,alpha用于给定标签,beta用于伪标签,而不将它们限制为总和为一。整体可靠性传播(HRP)然后将它们路由到不同的目标,使用可靠性感知的Mixup和全局门控在输入分支上,以及beta门控的伪标签正例在对比分支上。在合成和现实世界基准上,HRP在强基线之上提高了平均准确率,并在最高噪声率下保持竞争力。

English abstract

Learning with noisy labels in multimedia classification often combines external annotations and model predictions into a single reliability weight, even though the two sources can fail for different reasons. We instead estimate disentangled reliabilities: bilevel meta-learning produces two batch-normalized scalars per sample, alpha for the given label and beta for the pseudo-label, without constraining them to sum to one. Holistic Reliability Propagation (HRP) then routes them to different objectives, using reliability-aware Mixup with global gating on the input branch and beta-gated pseudo-label positives on the contrastive branch. On synthetic and real-world benchmarks, HRP improves average accuracy over strong baselines and remains competitive at the highest noise rates.

2605.20717 2026-05-21 cs.NE cs.AR cs.CV eess.IV Updated version

E-ReCON: An Energy- and Resource-Efficient Precision-Configurable Sparse nvCIM Macro for Conventional and Spiking Neural Edge Inference

E-ReCON:一种能量和资源高效、精度可配置的稀疏nvCIM宏单元,用于传统和脉冲神经边缘推理

Ankit Kumar Tenwar, Mukul Lokhande, Santosh Kumar Vishvakarma

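Affiliations: Dept of Science and Technology (DST), Govt of India; MeitY/SMDP-C2S

AI summary: This paper presents E-ReCON, a 16 Kb energy- and resource-efficient digital compute-in-memory (DCIM) macro built on a compact 3T1R ReRAM bitcell for conventional and spiking neural network edge inference. A novel interleaved 10T/28T adder tree reduces transistor count and power consumption, and the macro, implemented in 65 nm CMOS, achieves low latency, high throughput, and high energy efficiency across several neural network models.

AI Chinese abstract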

本文提出E-ReCON,一种基于紧凑型3T1R ReRAM位单元的16 Kb高能效、省资源的数字存内计算(DCIM)宏单元,用于边缘AI推理。所提出的位单元面积仅为0.85 um^2,并支持可靠的基于AND的存内乘法,适用于传统卷积神经网络(CNN)和脉冲神经网络(SNN)工作负载。为降低累加开销,本文引入了一种新型交错式10T/28T加法器树,与传统的基于28T RCA的设计相比,晶体管数量和功耗分别减少了37%和28%。该宏单元采用65 nm CMOS工艺、在1.2 V下实现,最小延迟为0.48 ns,吞吐量为2.31-3.1 TOPS,能效最高达419 TOPS/W。在LeNet-5、AlexNet和CNN-8模型上评估时,该宏单元分别在MNIST/A-Z、CIFAR10和SVHN数据集上达到97.81%、93.23%和96.51%的准确率。此外,40%的剪枝保留了原始准确率的近99.8%,同时减少了MAC操作和计算周期。对于面向SNN的工作负载,所提出的AND型位单元能以较低的开关活动高效支持脉冲-权重乘法,其中2A2W配置在CIFAR-10、CIFAR-100和ImageNet-1K数据集上的VGG-8、VGG-16和ResNet-18网络中取得了接近FP32基线的准确率。与此前基于ADC的ReRAM-CIM设计相比,所提出的架构将延迟和能效改善了近30-40%,同时在完整的PVT和ReRAM变化条件下保持稳健运行。总体而言,E-ReCON为下一代边缘AI、物联网、生物医学传感和神经形态应用提供了一个可扩展、低延迟、高能效的nvCIM平台。

English abstract

This work presents E-ReCON, a 16 Kb energy and resource-efficient digital compute-in-memory (DCIM) macro based on a compact 3T1R ReRAM bitcell for edge-AI inference. The proposed bitcell occupies only 0.85 um^2 and supports reliable AND-based in-memory multiplication for both conventional convolutional neural network (CNN) and spiking neural network (SNN) workloads. To reduce accumulation overhead, a novel interleaved 10T/28T adder tree is introduced, reducing transistor count and power consumption by 37% and 28%, respectively, compared to a conventional 28T RCA-based design. Implemented in 65 nm CMOS at 1.2 V, the proposed macro achieves a minimum latency of 0.48 ns, throughput of 2.31-3.1 TOPS, and energy efficiency of up to 419 TOPS/W. When evaluated on LeNet-5, AlexNet, and CNN-8 models, the macro achieves 97.81%, 93.23%, and 96.51% accuracy on MNIST/A-Z, CIFAR10, and SVHN datasets, respectively. In addition, 40% pruning preserves nearly 99.8% of the original accuracy while reducing MAC operations and computation cycles. For SNN-oriented workloads, the proposed AND-type bitcell efficiently supports spike-weight multiplication with low switching activity, where the 2A2W configuration achieves accuracy close to the FP32 baseline across VGG-8, VGG-16, and ResNet-18 networks on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. Compared to prior ADC-based ReRAM-CIM designs, the proposed architecture improves latency and energy efficiency by nearly 30-40% while maintaining robust operation under full PVT and ReRAM variability. Overall, E-ReCON provides a scalable, low-latency, and energy-efficient nvCIM platform for next-generation edge-AI, IoT, biomedical sensing, and neuromorphic applications.
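As a rough illustration of the AND-based in-memory multiplication mentioned in this abstract, the Python sketch below decomposes a multi-bit multiply into single-bit AND operations whose partial products are shifted and summed, which is the job an adder tree performs in a digital CIM macro. The function names and the default 2-bit width (chosen to mirror the 2A2W configuration) are illustrative assumptions, not the authors' circuit.

```python
def and_multiply(activation: int, weight: int, bits: int = 2) -> int:
    """Multiply two unsigned `bits`-wide integers using only AND, shift, and add."""
    total = 0
    for i in range(bits):                # activation bit position
        a_bit = (activation >> i) & 1
        for j in range(bits):            # weight bit position
            w_bit = (weight >> j) & 1
            # A single-bit product is a logical AND; its place value is 2**(i + j).
            total += (a_bit & w_bit) << (i + j)
    return total


def and_mac(activations, weights, bits: int = 2) -> int:
    """Multiply-accumulate over a vector; the sum stands in for the adder tree."""
    return sum(and_multiply(a, w, bits) for a, w in zip(activations, weights))
```

For a spiking workload the activation is a single bit, so each product reduces to one AND per weight bit, which is consistent with the low switching activity the abstract describes.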

2605.20713 2026-05-21 cs.CV cs.AI cs.LG Updated version

SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

SAVER:选择性所需视觉证据用于多模态信息提取

Miaobo Hu, Shuhao Hu, Bokun Wang, Rui Chen, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: This work proposes SAVER, a framework that consults visual evidence selectively to improve multimodal named entity recognition and relation extraction, reducing computational overhead while improving accuracy.

AI Chinese abstract

社交媒体中的多模态信息抽取颇具挑战,因为一条帖子可能附带多张与文本弱相关、冗余甚至具有误导性的图像。在这种情况下,始终开启的多模态融合既浪费计算资源,又可能放大虚假的视觉线索。核心挑战在于,针对每个候选片段或已标记的实体对,判断是否需要参考视觉信息;若需要,还要确定哪一小部分图像能够提供可信的证据。我们提出SAVER,一种按需选择视觉信息的框架,用于多模态命名实体识别(MNER)和多模态关系抽取(MRE)。SAVER使用共形可定位性门控(Conformal Groundability Gate,CGG)在MNER中估计片段级的视觉可定位性,在MRE中根据两个已标记实体推导实体对级的激活,并在留出数据集上通过带有Clopper-Pearson上界的共形式校准过程来确定激活阈值。门控被激活时,一个子模(submodular)相关性-多样性选择器会跨图像选出紧凑的证据子集,再由Set Transformer进行聚合。一个受能量模型启发的联合评分头将文本、可选的视觉证据、文本-图像一致性和稀疏路由结合起来,用于实体类型判定或关系分类。实验表明,与强大的纯文本基线和始终开启的多模态基线相比,SAVER持续提升F1,同时降低AURC,在固定风险水平下提高激活覆盖率,并减少FLOPs和P90延迟。

English abstract

Multimodal information extraction (IE) in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper-Pearson upper bounds. When activated, a submodular relevance-diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text-image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.
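To make the threshold-calibration step more concrete, here is a minimal Python sketch of a one-sided Clopper-Pearson upper bound and one way it could be used to pick a gate threshold on a held-out split. The bound itself is the standard exact binomial bound; the `calibrate_threshold` procedure, its parameters, and the choice to scan every observed score are assumptions made for illustration and may differ from what SAVER actually does (in particular, this sketch applies no correction for testing several thresholds).

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) upper confidence bound on a binomial proportion."""
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection works because binom_cdf decreases in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def calibrate_threshold(scores, errors, risk=0.1, alpha=0.05):
    """Lowest gate threshold whose activated subset has an error-rate
    upper bound <= risk on the held-out split, or None if there is none.

    scores: hypothetical gate scores; errors: 1 if using vision hurt, else 0.
    """
    for t in sorted(set(scores)):
        picked = [e for s, e in zip(scores, errors) if s >= t]
        if picked and clopper_pearson_upper(sum(picked), len(picked), alpha) <= risk:
            return t
    return None
```

Choosing the lowest qualifying threshold maximizes how often vision is consulted while keeping the bounded risk under the target, which matches the "activation coverage at a fixed risk level" framing in the abstract.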

2605.20682 2026-05-21 cs.CV Updated version

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

IndusAgent: 通过智能工具增强开放词汇工业异常检测

Rongbin Tan, Fangfang Lin, Zhenlong Yuan, Min Qiu, Kejin Cui, Mengmeng Wang, Yi Wang, Zijian Song, Zhiyuan Wang, Jiyuan Wang, Yue Wang, Shuhan Song, Huawei Cao

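Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; Santa Clara University; LongCat Team; Independent Researcher; New York University; Sun Yat-sen University; Nanyang Technological University; Stanford University; University of Chinese Academy of Sciences, Beijing, China

AI summary: This paper proposes IndusAgent, a tool-augmented agentic framework for open-vocabulary industrial anomaly detection. The model is fine-tuned on Indus-CoT, a dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, and the framework achieves state-of-the-art zero-shot performance on five benchmarks, supporting its robustness and generalization.

AI Chinese abstract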

多模态大语言模型(MLLMs)在连接视觉感知和文本推理方面表现出色,能够跨多样化的工业场景实现零样本理解。然而,其在开放词汇工业异常检测(IAD)中的性能常受限于领域不匹配的推理和幻觉的结构推断。为了解决这些挑战,我们提出了IndusAgent,一种工具增强的智能框架用于开放词汇IAD。具体而言,我们首先构建了Indus-CoT,一个整合了全局视觉观测、高分辨率局部片段和专家正常性先验的结构化数据集,为在严格工业检查轨迹上微调模型提供监督。在此基础上,IndusAgent动态协调一组外部工具,包括动态区域裁剪、高频特征增强和先验检索,从而使代理能够主动解决视觉歧义并分离细微异常。此外,我们引入了一个门控强化学习目标,联合优化异常分类、定位准确性、异常类型推理和高效的工具使用,确保工具调用仅在有益时发生。在五个工业异常基准测试上(包括MVTec-AD、VisA、MPDD、DTD和SDD)的广泛评估表明,IndusAgent在所有现有方法中实现了最先进的零样本性能,验证了我们的鲁棒性和泛化能力。

English abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.

2605.20680 2026-05-21 cs.CV Updated version

DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions

DarkShake-DVS: 低光和相机抖动条件下基于事件的人体动作识别

Jiaqi Chen, Qinfu Xu, Liyuan Pan

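Affiliations: Beijing Institute of Technology

AI summary: This paper proposes EIS-HAR, a method that pairs an event camera with an inertial measurement unit: a non-linear warping module reduces motion blur, and a recognition module extracts spatiotemporal features. It also introduces DarkShake-DVS, a benchmark dataset for human action recognition under low-light and 6-DoF camera-motion conditions.

Comments 8 pages, 7 figures

AI Chinese abstract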

人体动作识别(HAR)是计算机视觉中的一项基础任务,在现实世界中应用广泛。实际部署往往涉及低光环境和不受约束的6-DoF相机运动,这些条件会降低视觉质量、破坏时间连贯性,并削弱现有方法的可靠性。事件相机具有很高的低光灵敏度和微秒级时间分辨率,与惯性测量单元(IMU)配合使用,是一种很有前景的解决方案。然而,当前研究面临两个关键挑战:缺少同时涵盖低光条件、6-DoF运动和同步IMU数据的基准;以及缺少有效的运动补偿技术。为此,我们提出事件-IMU稳定HAR(EIS-HAR),它包含两个模块。第一个是EIS模块,通过非线性变形(warping)函数减少运动模糊,重建经过运动补偿的输入。第二个是HAR模块,采用四阶段混合架构,高效提取时空特征以实现准确的动作识别。为缓解数据稀缺,我们引入DarkShake-DVS,这是首个大规模基于事件的HAR基准,包含18,041个在低光和剧烈6-DoF运动条件下采集的真实世界片段,并附有同步的IMU数据。在三个数据集上的大量实验表明,EIS-HAR始终优于最先进(state-of-the-art)的方法。

English abstract

Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 real-world clips captured under low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

2605.20676 2026-05-21 cs.CV Updated version

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

VISTAQA: 评估联合视觉问答与像素级证据

Mozhgan Nasr Azadani, Yimu Wang, Yongpeng Zhu, Lihong Chen, Milan Ganai, Sean Sedwards, Marco Pavone, Krzysztof Czarnecki

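Affiliations: University of Waterloo; Stanford University; NVIDIA

AI summary: This paper introduces VISTAQA, a benchmark that jointly evaluates free-form answer correctness and pixel-level evidence grounding in visual question answering. Its GROVE metric rewards answers only when they are both correct and aligned with the visual evidence; experiments show that existing systems score poorly under GROVE, revealing a substantial gap between answer accuracy and visual evidence alignment.

AI Chinese abstract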

建立模型预测与支持它们的视觉证据之间的清晰联系对于多模态推理的透明性和可靠性至关重要,但当前的多模态大语言模型(MLLM)评估并未明确强制这种对齐。现有的基准评估要么单独评估文本答案的正确性,要么单独评估像素级定位,使推理与定位的耦合成为一个开放性挑战。我们介绍了VISTAQA,一个用于联合评估自由回答正确性和像素级证据定位的全面基准。VISTAQA包含1,157个专家整理的样本,涵盖六种任务类型和六个视觉领域,从直接感知到组合和关系推理。VISTAQA要求模型不仅要正确回答,还要提供精确的分割掩码以支持其答案。它还包含有幻觉意识的例子,其中不存在有效的视觉证据。为了支持这种增强的评估,我们引入了GROVE,一个统一的评估指标,通过每样本几何均值结合文本准确性与定位质量,确保两者都不能补偿对方的不足。在接地意识模型和混合管道与通用MLLM的全面实验中,即使最强的系统在GROVE下也表现有限,突显了回答准确性和视觉证据对齐之间的显著差距。

English abstract

Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.
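The per-sample geometric mean behind GROVE can be illustrated with a few lines of Python. This is only a sketch of the aggregation idea: the abstract does not specify how textual accuracy or grounding quality are scored, so both inputs are assumed here to be numbers in [0, 1] (for instance a correctness score and a mask IoU), and the function name is hypothetical.

```python
from math import sqrt


def grove_like_score(answer_scores, grounding_scores):
    """Mean over samples of sqrt(answer_score * grounding_score).

    Both inputs are assumed to be per-sample scores in [0, 1].
    """
    if len(answer_scores) != len(grounding_scores):
        raise ValueError("need one answer score and one grounding score per sample")
    per_sample = [sqrt(a * g) for a, g in zip(answer_scores, grounding_scores)]
    return sum(per_sample) / len(per_sample)
```

With a geometric mean, a sample that is answered perfectly but grounded on the wrong pixels scores 0, whereas an arithmetic mean would still award it 0.5. That is the sense in which neither dimension can compensate for the other.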

2605.20669 2026-05-21 cs.CV Updated version

GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection

GSA-YOLO: 一种通过结构稀疏性和自适应知识蒸馏实现高效率的实时X射线安全检查框架

Jiahao Kong

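Affiliations: SDU-ANU Joint Science College; Shandong University

AI summary: This paper proposes GSA-YOLO, a framework that uses structured sparsity and adaptive knowledge distillation to improve detection robustness and inference efficiency for real-time X-ray security inspection, balancing accuracy against efficiency.

Comments 41 pages, 8 figures, submitted to Scientific Reports

AI Chinese abstract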

X射线安全检查需要准确实时检测违禁物品,但现有模型往往难以平衡严重遮挡、复杂杂乱和严格速度要求的挑战。为克服这些挑战,本文提出GSA-YOLO,一种基于YOLOv8n架构的新型轻量框架,专门设计以增强检测鲁棒性和推理效率。GSA-YOLO通过三个核心组件策略性整合结构稀疏性和自适应知识转移:Group Lasso(GL)应用于网络颈部以实现鲁棒的特征提取;Sparse Structure Selection(SSS)应用于检测头以实现显著的模型瘦身;以及自适应知识蒸馏(Ada-KD)机制以实现全面的准确率恢复。这种整合方法协同增强了特征表示,同时修剪冗余通道,最大化模型效率而不牺牲性能。在HiXray和PIDray数据集上的严格评估证实了GSA-YOLO的全面能力,实现了领先的推理速度189.62 FPS,伴随计算成本从8.7G降至8.0G。关键的是,GSA-YOLO在HiXray和PIDray上分别实现了mAP50:95结果0.531和0.679,分别比基线提高了2.4%和1.8%。与其他模型相比,GSA-YOLO在保持计算效率的同时表现出更高的准确性,使其成为实际X射线安全检查的有前景的解决方案。

English abstract

X-ray security inspection requires accurate real-time detection of prohibited items, but existing models often struggle to balance the challenges of severe occlusion, complex clutter, and strict speed requirements. To overcome these challenges, this paper proposes GSA-YOLO, a novel lightweight framework built upon the YOLOv8n architecture, specifically engineered to enhance detection robustness and inference efficiency. GSA-YOLO strategically integrates structured sparsity and adaptive knowledge transfer through three core components: Group Lasso (GL) applied to the network neck for robust feature extraction; Sparse Structure Selection (SSS) applied to the detection head for significant model slimming; and an Adaptive Knowledge Distillation (Ada-KD) mechanism for comprehensive accuracy recovery. This integrated approach synergistically enhances feature representation while pruning redundant channels, maximizing model efficiency without sacrificing performance. Rigorous evaluations on the HiXray and PIDray datasets confirm GSA-YOLO's comprehensive capability, achieving a leading inference speed of 189.62 FPS, accompanied by a reduction in computational cost from 8.7G to 8.0G. Crucially, GSA-YOLO secures mAP50:95 results of 0.531 and 0.679 on HiXray and PIDray, demonstrating 2.4% and 1.8% improvements over the baseline, respectively. Compared to other models, GSA-YOLO exhibits enhanced accuracy while maintaining computational efficiency, making it a promising solution for practical X-ray security inspection.
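For readers unfamiliar with Group Lasso, the sketch below shows the standard penalty in plain Python: each group of weights (for example, all weights belonging to one channel) contributes its L2 norm, scaled by the square root of the group size, so whole groups are pushed toward zero together and can then be pruned. The grouping, the `lam` value, and the pruning tolerance are illustrative assumptions; the abstract does not say how GSA-YOLO defines its groups or thresholds.

```python
from math import sqrt


def group_lasso_penalty(weight_groups, lam=1e-3):
    """lam * sum over groups of sqrt(group size) * L2 norm of the group."""
    penalty = 0.0
    for group in weight_groups:
        l2_norm = sqrt(sum(w * w for w in group))
        penalty += sqrt(len(group)) * l2_norm
    return lam * penalty


def prunable_groups(weight_groups, tol=1e-6):
    """Indices of groups whose L2 norm has collapsed below `tol`."""
    return [
        index
        for index, group in enumerate(weight_groups)
        if sqrt(sum(w * w for w in group)) < tol
    ]
```

Because the norm is taken per group rather than per weight, the penalty yields structured sparsity (entire channels removed) instead of scattered zero weights, which is what translates into real savings in computation.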

2605.20667 2026-05-21 cs.CV Updated version

LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

LER-YOLO: 一种可靠性感知的专家路由方法用于对齐不准确的RGB-红外无人机检测

Liming Hou, Yueping Peng, Hexiang Hao, Ji Wang, Xuekai Zhang, Wei Tang, Zecong Ye, Xin Ying, Yubo He

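Affiliations: Engineering University of PAP; Unit Command Department, Officers College of PAP

AI summary: This work proposes LER-YOLO, a reliability-aware sparse mixture-of-experts method for UAV detection in misaligned RGB-infrared remote-sensing pairs. An Uncertainty-Aware Target Alignment module and a Reliability-Guided Sparse MoE Fusion module make cross-modal interaction more reliable.

Comments 17 pages, 6 figures, 8 tables

AI Chinese abstract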

检测RGB-红外遥感对中的小型无人驾驶航空器仍然具有挑战性,因为目标尺度小、背景杂乱以及异构传感器之间的空间不对齐。现有的双模检测器通常对齐或融合特征,但未评估局部跨传感器对应关系的可靠性,导致不匹配伪影传播到检测头。为此,我们提出了LER-YOLO,一种可靠性感知的稀疏混合专家框架,用于对齐不准确的RGB-红外无人机检测。LER-YOLO首先引入了一个不确定性感知的目标对齐模块,将可见特征重新采样到红外参考,并估计空间可靠性图。此可靠性先验随后被可靠性引导的稀疏MoE融合模块使用,以从RGB主导、红外主导和交互融合专家中自适应选择k个专家,从而在抑制不可靠融合的同时实现可信的跨模态交互。在公共MBU基准上,使用YOLOv5s家族协议进行实验,结果显示LER-YOLO在三个独立种子下达到89.7±0.2%的AP50,最佳结果为89.9%。广泛的消融实验、参数匹配比较、合成位移评估和复杂度分析表明,收益主要来自可靠性引导的专家路由,而非增加模型容量。

English abstract

Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7 ± 0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.
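The idea of selecting k experts under a reliability prior can be sketched as ordinary top-k gating with one extra term. In the Python sketch below, the expert names, the gate logits, and especially the rule that adds log(reliability) to the fusion expert's logit are hypothetical choices made for illustration; the abstract does not describe how LER-YOLO injects the reliability map into the router.

```python
from math import exp, log


def route_top_k(logits, reliability, k=2, eps=1e-6):
    """Pick k experts and softmax-normalize their weights.

    logits: dict with keys "rgb", "ir", "fusion" (hypothetical gate outputs).
    reliability: value in [0, 1]; low values penalize the fusion expert
                 (an assumed rule, not the paper's).
    """
    adjusted = dict(logits)
    adjusted["fusion"] += log(max(reliability, eps))
    # Keep the k highest-scoring experts, then renormalize over them only.
    top = sorted(adjusted, key=adjusted.get, reverse=True)[:k]
    peak = max(adjusted[name] for name in top)
    exps = {name: exp(adjusted[name] - peak) for name in top}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}
```

With a reliability of 1.0 the fusion logit is unchanged; as reliability drops, the fusion expert falls out of the top k and the router falls back to the single-modality experts.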

2605.20659 2026-05-21 cs.CV cs.LG Updated version

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

RoPeSLR: 3D RoPE驱动的稀疏低秩注意力用于高效的扩散变换器

Yuxi Liu, Zekun Zhang, Yixiang Cai, Renjia Deng, Yutong He, Kun Yuan

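Affiliations: Peking University; University of Electronic Science and Technology of China; Beijing University of Posts and Telecommunications

AI summary: This study proposes RoPeSLR, a 3D RoPE-guided sparse-low-rank attention framework that addresses the high attention complexity of long-sequence generation in Diffusion Transformers. By decoupling attention into a high-frequency semantic spike set and an extremely low-rank background continuum, it achieves sub-quadratic sparsity and sub-linear rank growth, making it well suited to ultra-long video inference.

AI Chinese abstract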

扩散变换器(DiTs)已革新了高保真视频生成,但其$\mathcal{O}(L^2)$的注意力复杂度对长序列合成构成了重大瓶颈。尽管近期的稀疏线性注意力混合体旨在缓解这一问题,但其在极端稀疏性下性能严重下降,这是因为“RoPE困境”:标准线性注意力无法保持3D旋转位置嵌入(RoPE)的正交相对位置结构,从而消除了关键的距离意识。为了解决这个问题,我们提出了RoPeSLR,一种3D RoPE引导的稀疏低秩注意力框架。我们建立,根据经验证实的假设,DiT注意力流形可以解耦为一个高频率语义尖峰集(受限于$\mathcal{O}(L^{3/2})$稀疏性)和一个极低秩($\mathcal{O}(d_h \log L)$)背景连续体。受这一结构先验的指导,RoPeSLR摒弃标准线性注意力,采用具有可学习3D绝对位置嵌入(PE)注入的头级低秩参数化,无缝合成长距离相对距离衰减。通过保证子二次稀疏性和子线性秩增长,RoPeSLR特别适合扩展到超长视频推理。广泛的评估验证了这种可扩展优势:在90%稀疏性下,RoPeSLR在Wan2.1-1.3B上实现高达10倍的FLOPs减少,并在HunyuanVideo-13B的超长100K+ token序列上提供2.26倍的端到端推理加速,同时保持接近无损的生成保真度(平均VBench退化低于1.3%)

English abstract

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3% average VBench degradation).
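The "relative-position structure" that the abstract says linear attention fails to preserve can be checked numerically on the smallest possible case. The Python sketch below applies a rotary embedding to a single 2-D feature pair along one axis and shows that the query-key score depends only on the position offset. The single frequency `theta` and the one-axis setup are simplifying assumptions; 3D RoPE in video models applies the same rotation separately along the temporal and two spatial axes over many feature pairs.

```python
from math import cos, sin


def rotate(vec, position, theta=0.1):
    """Rotate a 2-D feature pair by an angle proportional to its position."""
    x, y = vec
    angle = position * theta
    return (x * cos(angle) - y * sin(angle), x * sin(angle) + y * cos(angle))


def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of a rotated query and a rotated key."""
    qx, qy = rotate(q, q_pos, theta)
    kx, ky = rotate(k, k_pos, theta)
    return qx * kx + qy * ky
```

Shifting both positions by the same amount leaves the score unchanged, because the two rotations compose into a single rotation by the offset. Softmax attention inherits this property directly from the dot product, which is why an approximation that replaces the dot product has to handle position information some other way.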

2605.20651 2026-05-21 cs.CV Updated version

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

凝视细节:用于OCTA视网膜血管分割的局部敏感增强

Tuopusen Huang, Ding Ma, Xiangqian Wu

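Affiliations: Faculty of Computing

AI summary: This paper proposes LSENet, which introduces three new modules to address the vessel discontinuities and loss of detail caused by low local contrast in OCTA vessel segmentation. Experiments show state-of-the-art performance on several public datasets with fewer parameters.

AI Chinese abstract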

现有的OCTA血管分割深度学习框架大多基于U-Net架构,但大多数方法仅关注整体表示,难以处理OCTA特有的低局部对比度问题,导致血管断续和细节丢失。为此,我们提出LSENet,基于U-Net架构引入三个核心创新模块:为解决血管断续问题,引入补丁信息增强模块(PIE),用补丁级注意力替代标准跳接连接;为缓解细节丢失问题,提出多尺度特征融合模块(MFF),通过从原始输入和前一层提取可解释特征,为PIE模块提供丰富多尺度信息;最后设计连接性细化解码器(CRD),通过最终卷积层的大核减少碎片化。在三个公开数据集(OCTA-500、ROSE-1和ROSSA)上的实验表明,所提LSENet在性能上达到最佳,且参数更少。

English abstract

Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture, which serves as the foundation for most current designs. However, most of these methods focus only on holistic representation, struggling to address the problem of low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core innovative modules: To address vessel discontinuities, we introduce the Patch Information Enhance module (PIE), which replaces standard skip connections to execute patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion module (MFF) is proposed to feed the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) is designed to refine features from all levels and utilize a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that our proposed LSENet achieves state-of-the-art performance while requiring fewer parameters.

2605.20645 2026-05-21 cs.CV Updated version

Seeing Through Fog: Towards Fog-Invariant Action Recognition

穿透雾气:迈向雾不变的动作识别

Enqi Liu, Liyuan Pan, Zhi Gao, Lingzhi Li, Qing Li

Affiliations: Beijing Institute of Technology, Beijing, China; Beijing Institute for General Artificial Intelligence, Beijing, China; Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China

AI summary: This paper presents the FogAct benchmark dataset and the FogNet model to address action recognition in foggy conditions. FogNet, a two-stream CLIP model, extracts fog-invariant semantic information to improve recognition performance under fog.

AI Chinese abstract

雾天条件在现实应用中很常见;然而,现有动作识别方法通常假设有利的天气和高质量的视频输入。在雾天,不可预测的可见性降级和对比度降低会阻碍语义线索的提取,给当前的动作识别方法带来重大挑战。在本文中,我们通过采用两种策略来缓解雾天条件下动作识别的问题。首先,我们提出了FogAct,这是第一个雾状动作识别基准数据集,由使用立体相机系统拍摄的配对干净和雾天视频组成。该数据集涵盖10个场景和55个动作类别,包含近10000个视频片段。其次,我们提出了FogNet,一种两流CLIP模型,该模型发现隐藏在降质视频背后的雾不变的语义信息。FogNet通过清洁视频的指导学习雾视频的稳健表示,有效捕捉清洁和雾天视频之间的共享结构和运动线索。在FogAct和三个其他流行数据集上的广泛实验表明,我们的方法在与最先进(SOTA)方法相比时具有竞争性性能。我们的FogAct和FogNet可在我们的项目页面上找到。

English abstract

Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. FogAct and FogNet are available on our project page.

2605.20640 2026-05-21 cs.CV cs.AI Updated version

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

帕累托优化的肖像生成:用于对齐、真实性和美学的视觉对齐文本监督

Yunlong Wang, Jinjin Shi, Wenbin Gao, Xuran Xu, Runyu Shi, Ying Huang

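AI summary: This paper proposes a feature supervision method for Multimodal Diffusion Transformers (MM-DiT). A lightweight cross-modal alignment mechanism implicitly extracts multi-granularity vision-aligned text representations to improve text-image alignment, photorealism, and aesthetic quality, yielding joint improvements that push the Pareto frontier.

AI Chinese abstract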

文本到图像扩散模型在生成人类肖像时往往面临严重的三重困境:文本-图像对齐、逼真度和人类感知的美学之间相互抑制。监督微调(SFT)是一种有效提升图像生成逼真度的方法,但通常会导致过度拟合训练数据集、破坏预训练图像先验并降低对齐或美学质量。为突破这一瓶颈,我们提出了一种多模态扩散变换器(MM-DiT)的特征监督范式。具体而言,我们引入了一种轻量级的跨模态对齐机制,隐式地从SigLIP 2中提取多粒度的视觉对齐文本表示,并在训练阶段将监督应用于MM-DiT的图像分支,且无额外的推理开销。我们的方法在保持基模型原有泛化能力的同时,注入了视觉对齐的文本指导,避免了SFT导致的退化。此外,我们的方法直接从预训练的视觉基础模型中挖掘隐含的多粒度美学信号,以优化人类感知的美学。在MM-DiT上的广泛实验表明,我们的方法推动了Pareto前沿,并在文本-图像对齐、逼真度和人类感知的美学方面实现了协同改进。

English abstract

Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.
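The phrase "pushes the Pareto frontier" has a precise meaning that a short Python sketch can pin down. Each model is represented by a tuple of scores where higher is better on every axis (here, hypothetically, alignment, photorealism, and aesthetics); a model is on the frontier if no other model is at least as good on all axes and strictly better on one. The score tuples in the usage note are made up for illustration and are not results from the paper.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Points that are not dominated by any other point, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, with hypothetical baselines `(0.8, 0.6, 0.5)` and `(0.7, 0.7, 0.5)`, both sit on the frontier because each wins on one axis. A new model scoring `(0.9, 0.7, 0.6)` dominates both, so the frontier moves outward rather than trading one quality for another.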

2605.20626 2026-05-21 cs.CL cs.AI cs.CV Updated version

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

基于检索的长上下文翻译用于文化图像描述:佛罗里达大学Gators参加2026年美洲自然语言处理共享任务的提交

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

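Affiliations: University of Florida

AI summary: This paper presents a retrieval-augmented long-context translation approach to cultural image captioning. A two-stage pipeline generates a Spanish intermediate caption and then produces the target-language caption with retrieval-augmented many-shot prompting, substantially improving captioning for Bribri, Guaraní, and Orizaba Nahuatl; the submission was the overall winner of the shared task.

AI Chinese abstract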

我们介绍佛罗里达大学Gators团队提交给AmericasNLP 2026共享任务的系统,该任务面向原住民语言的文化图像描述。我们的两阶段流程先使用Qwen2.5-VL生成西班牙语中间描述,再使用Gemini 2.5 Flash通过检索增强的多示例(many-shot)提示生成目标语言描述。在开发集评估中,我们在Bribri、Guaraní和Orizaba Nahuatl的描述生成上分别比共享任务基线提升了164.1%、131.7%和122.6%;在测试集评估中,Bribri和Orizaba Nahuatl两种语言仍保持超过150%的提升。我们发现,检索的效果高度依赖于语言,仅在拥有大规模领域内语料时才有益;合成数据增强为开发集上Guaraní的性能提升贡献了约28 chrF++。我们的提交是该共享任务的总冠军,并在目标语言描述的人工评估中位列五份入围提交中的第二名。

English abstract

We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

2605.20624 2026-05-21 cs.CV cs.AI cs.LG Updated version

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

用自回归扩散模型加速视频逆问题求解器

Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

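Affiliations: KAIST; EverEx

AI summary: This paper proposes the Autoregressive Video Inverse problem Solver (AVIS), which uses an autoregressive diffusion model to restore video in a streaming manner, sharply reducing initial latency and raising throughput while maintaining high restoration quality. An accelerated variant, AVIS Flash, achieves still higher throughput and a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

Comments Project page is available here: https://avis-project.github.io/

AI Chinese abstract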

扩散模型为零样本视频逆问题提供了强大的先验知识,但其实时部署受到两个效率问题的阻碍:由整体视频恢复引起的高初始延迟,以及由于在像素空间中多次VAE传递以强制测量一致性导致的低吞吐量。为克服这些限制,我们提出了自回归视频逆问题求解器(AVIS)。AVIS框架利用自回归视频扩散模型以流式方式恢复视频,自然地消除了延迟瓶颈。具体而言,AVIS通过测量一致性的估计初始化反向扩散,减少了所需的采样步骤。与领先的非自回归求解器相比,AVIS将初始延迟从114秒减少到4秒,并将吞吐量从0.71提高到1.18 FPS,同时实现更优的恢复质量。我们进一步引入了一个高度加速的变体,称为AVIS Flash,该变体仅在第一个片段上强制测量一致性。AVIS Flash在单个RTX 4090 GPU上将吞吐量提高到5.91 FPS,同时保持竞争性的性能,并实现有利的效率-性能权衡,为实时部署铺平道路。

English abstract

Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.
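The efficiency figures quoted in this abstract are easier to compare as ratios. The short Python calculation below derives them from the reported numbers only; the variable names are ours. One caveat is stated as an assumption: the abstract names the hardware (a single RTX 4090) only for the AVIS Flash figure, so the last ratio is like-for-like only if the other figures were measured on the same setup.

```python
def speedup(before: float, after: float) -> float:
    """How many times smaller `after` is than `before` (used for latency)."""
    return before / after


# Initial latency drops from 114 s to 4 s.
latency_speedup = speedup(114.0, 4.0)

# Throughput in frames per second: baseline 0.71, AVIS 1.18, AVIS Flash 5.91.
avis_throughput_gain = 1.18 / 0.71
flash_throughput_gain = 5.91 / 0.71  # assumes comparable hardware

print(f"latency: {latency_speedup:.1f}x lower")
print(f"AVIS throughput: {avis_throughput_gain:.2f}x higher")
print(f"AVIS Flash throughput: {flash_throughput_gain:.2f}x higher")
```

In other words, the streaming design cuts time-to-first-output by about 28.5 times, while most of the throughput gain comes from the Flash variant, which enforces measurement consistency only on the first chunk.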

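2605.20610 2026-05-21 cs.CV cs.AI Updated version

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

Gene Tangtartharakul, Katherine R. Storrs

Affiliations: School of Psychology, University of Auckland

AI summary: This paper studies expert tuning and representation in vision Mixture-of-Experts models. It trains sparsely-gated convolutional MoE models with a contrastive objective and uses tools from visual neuroscience to analyse expert specialisation, finding that an animate-inanimate distinction dominates expert partitioning and that experts are tuned to broader, continuous visual and semantic dimensions.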

Comments 21 Pages, 6 Main Figures, 1 Table

Abstract

Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

2605.20607 2026-05-21 cs.LG cs.CV cs.RO Updated version

Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

Romeo Valentin, Olivia Beyer Bruvik, Marc R. Schlichting, Mykel J. Kochenderfer

Affiliations: Stanford Intelligent Systems Laboratory, Stanford University, Stanford, CA, USA

AI summary: This paper proposes a learning-assurance path for a vision-based landing system. Showing that the model separates content from style in its situation representation yields interpretable, concrete evidence, and a new runtime assurance method monitors the model's situation representation.

Comments 10 pages, 4 figures

Abstract

EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model's predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model's situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model's situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.

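2605.20600 2026-05-21 cs.CV Updated version

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Yunming Ye

Affiliations: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory

AI summary: This paper proposes HeadKV, a framework that assigns different cache budgets to attention heads according to their locality bias, improving the efficiency and memory use of autoregressive image generation, and designs a Stratified Token Eviction strategy to preserve long-range information.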

Comments Under review

Abstract

Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.
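The idea of giving locality-biased heads a smaller cache budget can be illustrated with a small sketch. This is not the HeadKV implementation: the locality measure (attention mass on the most recent positions of a 1D sequence), the proportional split, and the names `locality_score` and `allocate_budgets` are all assumptions made for illustration, since the abstract does not specify how head types are identified or how budgets are computed.

```python
def locality_score(attn_rows, window):
    """Mean fraction of attention mass on the `window` most recent keys.

    attn_rows holds one head's attention distributions for a few early
    query tokens; each row covers all key positions seen so far.
    """
    fractions = [sum(row[-window:]) / sum(row) for row in attn_rows]
    return sum(fractions) / len(fractions)


def allocate_budgets(scores, total_budget, min_budget):
    """Split `total_budget` cache slots across heads (hypothetical rule).

    Heads with a low locality score (broad attention) receive more slots;
    every head keeps at least `min_budget` slots.
    """
    breadth = [1.0 - s for s in scores]
    spare = total_budget - min_budget * len(scores)
    norm = sum(breadth) or 1.0  # avoid dividing by zero if all heads are local
    return [min_budget + int(spare * b / norm) for b in breadth]


# Two toy heads observed on early tokens: one local, one broad.
local_head = [[0.0, 0.0, 0.1, 0.9], [0.0, 0.0, 0.1, 0.9]]
broad_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

scores = [locality_score(h, window=2) for h in (local_head, broad_head)]
budgets = allocate_budgets(scores, total_budget=64, min_budget=8)
print(scores, budgets)
```

Because the scores are computed once from early tokens and then reused, a scheme of this kind needs no extra training, which matches the property the abstract highlights.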

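2605.20588 2026-05-21 cs.CL cs.CV Updated version

Direct Translation between Sign Languages

Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang

Affiliations: Oregon State University

AI summary: This paper proposes direct sign-to-sign translation, using back-translation to generate synthetic sign-sign pairs. The approach avoids the error propagation and information loss of the conventional cascade approach and achieves higher translation quality and faster inference on multiple sign language datasets.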

Abstract

The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique to the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% higher BLEU-4) while achieving a roughly 2.3x speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.
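For readers unfamiliar with the geometric metric named in the abstract, the following sketch shows one common way to compute a DTW-aligned MPJPE between two pose sequences of different lengths. It is an illustration under assumptions, not the authors' evaluation code: the abstract does not define the exact variant, so the choice of frame-wise MPJPE as the DTW cost and normalization by alignment-path length are assumptions, and the function names are hypothetical.

```python
import math


def mpjpe(frame_a, frame_b):
    """Mean per-joint position error: average Euclidean distance between
    corresponding joints of two frames (each a list of 3D points)."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)


def dtw_mpjpe(seq_a, seq_b):
    """Align two pose sequences with dynamic time warping, using frame-wise
    MPJPE as the local cost, and return the mean cost along the best path."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]   # accumulated cost
    steps = [[0] * (m + 1) for _ in range(n + 1)]    # path length so far
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = mpjpe(seq_a[i - 1], seq_b[j - 1])
            # Best predecessor among match, insertion, and deletion moves.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = prev_cost + d
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


# Toy single-joint sequences: the second repeats its first frame, so DTW
# absorbs the timing difference; the third is offset by one unit along z.
seq_a = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
seq_b = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
seq_c = [[(0.0, 0.0, 1.0)], [(1.0, 0.0, 1.0)]]
same_motion_error = dtw_mpjpe(seq_a, seq_b)
offset_error = dtw_mpjpe(seq_a, seq_c)
print(same_motion_error, offset_error)
```

The alignment step matters because a predicted signing sequence can have a different length or pace than the reference while still being geometrically close.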

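2605.20584 2026-05-21 cs.CV Updated version

QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

Dishanika Denipitiyage, Aruna Seneviratne, Suranga Seneviratne

Affiliations: University of Sydney; University of New South Wales

AI summary: This paper presents QwenSafe, a vision-language model that automatically identifies Apple-defined content rating descriptors (CRDs) by jointly reasoning over app metadata and screenshots. It introduces the metadata2CRD data-construction pipeline and uses Direct Preference Optimization (DPO) to improve prediction accuracy; in experiments, QwenSafe significantly outperforms existing models on binary CRD classification.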

Abstract

Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlight the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.

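2605.20576 2026-05-21 cs.CV Updated version

$Δ$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

Chia-Hsiang Kao, Cong Phuoc Huynh, Chien-Yi Wang, Noranart Vesdapunt, Stefan Stojanov, Bharath Hariharan, Oleksandr Obiednikov, Ning Zhou

Affiliations: Cornell University; Amazon

AI summary: This paper proposes the $Δ$YNAMICS framework, which uses language as a unified representation of rigid-body dynamics. It generates scene configurations for physics simulation as structured text and improves generalization by combining natural-language motion reasoning with optical-flow input. On the CLEVRER dataset it achieves a segmentation IoU 7x that of existing VLMs, and it transfers well to a new dataset.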

Comments Accepted to CVPR 2026. Project page: https://iandrover.github.io/2026_dynamics

Abstract

Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $Δ$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $Δ$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $Δ$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.

2605.20569 2026-05-21 cs.CV Updated version

End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking

Xu Han, Mohammad Aminul Islam, Lei Wang, Zekun Long, Guanmanyi Fu, Wangshu Cai, Kuldip K. Paliwal, Jun Zhou

Affiliations: School of Information and Communication Technology, Griffith University, Australia; School of Engineering and Built Environment, Griffith University, Australia; School of Environment and Science, Griffith University, Australia

AI summary: This paper proposes an end-to-end material-aware tracking framework that jointly optimizes material decomposition and target localization, using a weighted target-oriented unmixing loss to align material representations with localization accuracy, so that hyperspectral tracking becomes more robust under appearance ambiguity, illumination change, and background clutter.

Abstract

Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at https://github.com/han030927/E2EMPT.

2605.20551 2026-05-21 cs.CV cs.AI cs.RO Updated version

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

Affiliations: University College London; Karlsruhe Institute of Technology; Hunan University; Shenzhen University

AI summary: This paper proposes the Weighted Aggregated Descriptor (WeiAD) and the token pruning framework WeiToP to improve the accuracy and efficiency of visual place recognition, allowing the accuracy-efficiency balance of feature extraction to be adjusted on demand.

Abstract

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.

2605.20549 2026-05-21 cs.CV Updated version

MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

Santiago Galella, Pamela Osuna-Vargas, Maren Wehrheim, Martina G. Vilas, Gemma Roig, Matthias Kaschube

Affiliations: FIAS & Institute of Computer Science, Goethe University Frankfurt; Mila & Department of Biology, York University; Institute of Computer Science, Goethe University Frankfurt

AI summary: This paper presents the MAPS dataset for studying vision model behavior in a controlled 3D scene space. A regression-based sensitivity analysis of 20 models' reliance on scene factors finds that camera distance and elevation are the main drivers of recognition failure, and that modern CNNs and transformers show similar sensitivity profiles.

Comments 33 pages, 20 figures

Abstract

Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

2605.20544 2026-05-21 cs.RO cs.CV Updated version

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

Doguhan Yeke, Elif Su Temirel, Ananth Shreekumar, Brandon Lee, Dongyan Xu, Z Berkay Celik

Affiliations: Purdue University; Bilkent University

AI summary: This paper presents RoboAbstention, a framework for benchmarking abstention in embodied robotic agents. It generates abstention instructions grounded in images from five robotics datasets, evaluates several frontier VLMs on abstention, and explores methods for improving abstention performance.

Abstract

Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.

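2605.20543 2026-05-21 cs.CV Updated version

Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation

Huan Huang, Michele Esposito, Chen Zhao

Affiliations: Department of Computer Science, Kennesaw State University; Department of Cardiology, Medical University of South Carolina

AI summary: This paper proposes Uncertainty-Guided Conservative Propagation (UGCP), a module for structured inference in vessel segmentation that performs several logit-space update steps through local prediction interactions, improving the Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance while reducing vessel disconnections and improving structural consistency.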

Comments Pattern Recognition submission. 35 pages, 6 figures

Abstract

Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local prediction interactions. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at https://github.com/chenzhao2023/UGC_PR.

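2605.20185 2026-05-21 cs.GR cs.CV Updated version

PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars

Julian Kaltheuner, Jan Spindler, Sina Kitz, Patrick Stotko, Reinhard Klein

Affiliations: University of Bonn

AI summary: This paper presents PiG-Avatar, which uses a parametric body model only for kinematic transport and represents the avatar as Gaussians in a volumetric canonical space governed by a continuous neural field, decoupling the representation from template topology and enabling high-fidelity reconstruction of complex clothing geometry and layered surfaces.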

Abstract

Existing Gaussian avatar methods typically parameterize geometry on a body-template surface, which entangles the avatar's representation space with the template's deformation space and limits the capture of layered, off-body, and non-rigid clothing geometry. We present PiG-Avatar, which addresses this limitation by using the parametric body model solely for kinematic transport, while representing the avatar as Gaussians anchored in a volumetric canonical space governed by a continuous neural field. This decouples representation from template topology, avoiding the geometric constraints of surface-based parameterizations. Kinematic coherence is maintained through 3D barycentric anchor transport, which guides motion without constraining geometry and allows anchors to deviate freely from the template surface, yielding dense, stable temporal surface correspondences by construction. To make this unconstrained formulation tractable, we introduce dual-level spatially coherent optimization, combining Sobolev-preconditioned neural-field updates with a novel KNN-based preconditioning of canonical anchor geometry. Together, these mechanisms induce an emergent self-organization of anchor density: anchors migrate toward regions of high curvature, appearance variation, and non-coherent motion without explicit heuristics. As a result, complex clothing geometry and layered surfaces emerge as natural, high-fidelity outputs. This single representation further supports hierarchical reconstruction across multiple levels of detail, with coarse-level supervision propagating to finer levels through the shared field and coupled anchor graph. On established benchmarks featuring subjects with complex clothing and challenging non-rigid motion, PiG-Avatar achieves state-of-the-art rendering quality, generalizes robustly to imperfect body model initialization, and renders in real time across all detail levels.

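2605.19776 2026-05-21 cs.CV Updated version

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

Yuanpei Zhao, Jie Lin, Chao Zhang, Yilin Wang, Mao Li, Chenhui Li, Jie Hou, Tangjie Lv

Affiliations: Sichuan University; NetEase Fuxi AI Lab; East China Normal University

AI summary: This paper presents the PPaint benchmark, which fuses expert preference and rating data to improve image aesthetic assessment models, and uses self-distillation to obtain more accurate aesthetic scores in a single inference pass, outperforming open-source baselines and coming within 0.04 SRCC of the closed-source Gemini-3.1-Pro.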

Comments 27 pages, 7 pages

详情
AI中文摘要

成对偏好和点状评分是图像审美评估(IAA)的两种主要标注协议,但现有基准仅采用其中一种,未能在受控条件下测量其互补性。我们引入PPaint,一种匹配双协议基准,在五个审美维度上,15名领域专家(每类5名)对150幅中国画进行双协议标注,通过本地密集偏好设计收集45,900个成对专家判断,同时匹配评分。匹配设计揭示了互补优势:偏好产生更一致的顺序排名,而评分锚定了绝对分数尺度。通过两种独立的偏好到评分方法融合两种信号,得到融合的专家真实数据,使两种构造收敛到几乎相同的分数。同样的偏好到评分原则也适用于无标签VLM训练。PSDistill通过Elo参考池将VLM的成对判断转换为校准的伪分数,并通过置信度加权排名优化训练相同的VLM,生成单次推理的审美评分器。在单个绘画类别上训练,蒸馏后的Qwen3-VL-8B在所有三个类别上将均值SRCC从0.504提升到0.709,优于所有开源基线,包括专用审美模型ArtiMuse,并在单次推理成本下与闭源Gemini-3.1-Pro相差0.04 SRCC,跨领域转移在APDDv2上进一步验证。我们将发布完整的PPaint数据集和训练代码。

英文摘要

Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.
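The preference-to-score idea can be illustrated with a plain Elo update. The sketch below only shows the general Elo mechanism and is not the PSDistill procedure: the K-factor of 32, the initial rating of 1500, the 400-point scale, the number of passes, and all function names are assumptions chosen for the example. The abstract does not say how the reference pool is built or how scores are calibrated.

```python
def expected_score(r_a, r_b, scale=400.0):
    """Probability that item A is preferred over item B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def elo_update(r_a, r_b, a_wins, k=32.0):
    """Return the updated (r_a, r_b) after one pairwise judgment (zero-sum)."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def ratings_from_judgments(items, judgments, init=1500.0, k=32.0, epochs=20):
    """judgments is a list of (winner, loser) pairs; returns item -> rating."""
    ratings = {item: init for item in items}
    for _ in range(epochs):  # several passes, purely an illustrative choice
        for winner, loser in judgments:
            ratings[winner], ratings[loser] = elo_update(
                ratings[winner], ratings[loser], True, k
            )
    return ratings


# Hypothetical judgments over three images: A beats B, B beats C, A beats C.
pseudo_scores = ratings_from_judgments(
    ["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")]
)
```

The resulting ratings are only ordinal-consistent pseudo-scores. Mapping them onto an absolute rating scale would need a separate calibration step, which is the role the abstract assigns to the ratings.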

2605.19649 2026-05-21 cs.CV 版本更新

CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

无需CAD的基于NeRF的航天器姿态估计器学习方法

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

发表机构 * Department of Electrical Engineering (ELEN), ICTEAM, UCLouvain(电子工程系(ELEN),ICTEAM,鲁汶大学) Department of Electrical Engineering (ESAT), KU Leuven(电子工程系(ESAT),鲁汶大学) Department of Mechanical Engineering (MECH), KU Leuven(机械工程系(MECH),鲁汶大学) Aerospacelab(航天实验室)

AI总结 本文提出了一种基于NeRF的图像增强方法,使航天器姿态估计器的学习不再依赖大量CAD渲染图像,仅需几十到几百张真实图像即可训练出准确的姿态估计器,同时提升了对实际轨道条件的鲁棒性。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

航天器姿态估计网络需要数万张CAD渲染图像进行训练。这种对合成CAD数据的依赖(i)限制了其在具有可靠几何先验的目标上的应用,排除了不合作或文档不全的航天器,(ii)由于不现实的光照和材料外观导致对真实轨道条件的泛化能力差。本文介绍了一种基于NeRF的图像增强方法,使学习航天器姿态估计器仅需几十到几百张图像。该方法通过几何一致的视角和外观增强生成大量多样化的数据集。这个增强的数据集使无需CAD模型或大规模合成数据集即可训练出准确的目标特定姿态估计器。实验表明,我们的方法支持仅用25到400张真实图像训练出准确的姿态估计器,即使在严重的光照变化下也是如此。当应用于大型CAD基于的合成数据集时,基于NeRF的增强也增强了域外泛化能力,提高了对真实轨道条件的鲁棒性。

英文摘要

Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with a reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundreds of images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied to large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

2605.19624 2026-05-21 cs.CV cs.AI 版本更新

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

面向组件的结构保持风格迁移用于卫星视觉Sim2Real数据构建

Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao

发表机构 * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology(机器人系统国家重点实验室,哈尔滨工业大学)

AI总结 本文提出了一种面向组件的结构保持风格迁移框架,用于卫星视觉的合成到真实数据构建,通过提取真实图像的部件级风格代码并注入到合成图像中,从而提高标注保持的卫星视觉Sim2Real数据生成效果。

详情
AI中文摘要

对于基于相机的卫星视觉感知,Sim2Real数据构建需要图像接近真实域传感器外观同时保留来自模拟的注释。具有可靠姿态标签和组件级遮罩的卫星目标的真实传感器图像难以大规模获取,而合成渲染提供精确的几何注释但存在明显的外观差距。本文提出了一种面向组件的结构保持风格迁移框架用于卫星视觉的合成到真实数据构建。该方法通过校准的真实获取、基于ArUco的相机姿态测量、CAD渲染和组件遮罩构建弱配对的真实-合成样本。然后从未标记的真实图像中提取部件级真实域风格代码,并通过遮罩对齐调节将其注入到对应的合成卫星区域中。为了保持生成图像对下游传感器数据监督的可用性,对抗训练与局部对比一致性、自正则化和边缘保持约束相结合。实验在5000张渲染的卫星图像和100张在校准设置下拍摄的真实图像上进行。真实图像提供目标域外观参考和最终评估图像,而下游的GDRNet姿态估计器仅在合成或翻译的合成图像上进行训练。与代表性图像翻译基线相比,所提方法实现了最小的图像分布差异,FID为54.32,KID为0.048。当翻译数据用于在目标域适应设置下训练GDRNet时,ADD通过率提高到0.260,AUC提高到0.611。这些结果表明,组件级外观迁移可以提高标注保持的卫星视觉Sim2Real数据生成效果。

英文摘要

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real-synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.
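For readers unfamiliar with the ADD pass rate reported above: ADD is commonly computed as the mean distance between object model points transformed by the ground-truth pose and by the estimated pose, and a prediction passes when that distance is below a fraction of the object diameter. The sketch below assumes a threshold of 10% of the diameter and uses hypothetical function names. The abstract does not state the threshold or the evaluation code the authors use.

```python
import math


def transform(R, t, p):
    """Apply a 3x3 rotation R (nested lists) and a translation t to point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def add_error(points, R_gt, t_gt, R_est, t_est):
    """Mean Euclidean distance between model points under the two poses."""
    total = 0.0
    for p in points:
        total += math.dist(transform(R_gt, t_gt, p), transform(R_est, t_est, p))
    return total / len(points)


def add_pass_rate(samples, points, diameter, frac=0.1):
    """Fraction of (R_gt, t_gt, R_est, t_est) samples with ADD < frac * diameter.

    frac=0.1 is an assumed, commonly used threshold, not one taken from the paper.
    """
    passed = sum(1 for s in samples if add_error(points, *s) < frac * diameter)
    return passed / len(samples)


# Toy usage: one exact estimate and one estimate shifted by 0.5 along x.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
samples = [
    (I3, [0.0, 0.0, 0.0], I3, [0.0, 0.0, 0.0]),
    (I3, [0.0, 0.0, 0.0], I3, [0.5, 0.0, 0.0]),
]
rate = add_pass_rate(samples, model, diameter=2.0)
```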

2605.18860 2026-05-21 cs.LG cs.CV 版本更新

Spectral structural distortion reveals redundant neurons in neural networks

谱结构扭曲揭示神经网络中的冗余神经元

Yongyu Wang

AI总结 本文提出了一种基于谱结构扭曲的神经元冗余判定方法,通过分析神经网络层变换前后的关系结构,识别可移除的神经元并保持任务性能。

详情
AI中文摘要

过度参数化的神经网络通常包含许多可移除的神经元,但什么使神经元冗余仍不明确。现有剪枝标准通常依赖局部量如权重大小、激活强度或梯度敏感性,但这些指标对神经元在层变换中结构作用的洞察有限。本文表明,神经元冗余可通过在层间表示变换中参与谱结构扭曲的程度来表征。对于训练好的网络的每个隐藏层,我们记录预激活和后激活的隐藏状态,将神经元视为图节点,构建描述神经元层面关系结构的输入侧和输出侧图。然后我们定义了一个谱结构重要性分数,测量每个神经元对这两个关系结构之间主导图谱扭曲的贡献。参与度低的神经元被视为结构冗余并通过迭代剪枝过程移除,在每次结构变化后重新计算分数。在中间剪枝轮次中不进行参数更新;在达到目标参数减少后,对紧凑模型应用一次恢复微调阶段。直接消融分析和在传统神经网络、编码器-only Transformer 和解码器-only 语言模型上的实验表明,这种图谱标准能够识别可移除的神经元和 Transformer 单元,同时在压缩后保持任务性能。这些结果表明,神经冗余不仅仅是小权重或弱激活的结果,而是可以通过在层间关系结构谱扭曲中的弱参与来理解。

英文摘要

Overparameterized neural networks often contain many removable neurons, yet what makes a neuron redundant remains poorly understood. Existing pruning criteria commonly rely on local quantities such as weight magnitude, activation strength, or gradient sensitivity, but these measures provide limited insight into the structural role of a neuron in the transformation performed by a layer. Here we show that neuronal redundancy can be characterized by weak participation in the spectral structural distortion induced by layer-wise representation transformations. For each hidden layer of a trained network, we record pre-activation and post-activation hidden states, model neurons as graph nodes, and construct input-side and output-side graphs that describe neuron-level relational structure before and after the layer transformation. We then define a spectral structural importance score that measures the contribution of each neuron to the dominant graph-spectral distortion between these two relational structures. Low-participation neurons are treated as structurally redundant and removed through an iterative pruning process in which scores are recomputed after each structural change. No parameter updates are performed during intermediate pruning rounds; after the target parameter reduction is reached, a single recovery fine-tuning stage is applied to the compact model. Direct ablation analysis and experiments across conventional neural networks, encoder-only Transformers, and decoder-only language models show that this graph-spectral criterion identifies removable neurons and Transformer units while preserving task performance after compression. These results suggest that neural redundancy is not merely a consequence of small weights or weak activations, but can be understood through weak participation in the spectral distortion of layer-wise relational structure.

2605.18736 2026-05-21 cs.CV 版本更新

Spectral Progressive Diffusion for Efficient Image and Video Generation

频域渐进扩散用于高效图像和视频生成

Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出了一种频域渐进扩散框架,通过在预训练扩散模型的去噪轨迹中逐步提高分辨率,实现高效的图像和视频生成,同时改进了效率和质量。

Comments Project website at https://howardxiao.ca/speed

详情
AI中文摘要

扩散模型已被证明可以在频域中隐式地自回归地生成视觉内容,其中低频分量在去噪过程中早期生成,而高频细节仅在后期时间步出现。这种结构为高效的生成提供了自然机会,因为对噪声主导的高频分量进行高分辨率计算几乎冗余。我们提出了频域渐进扩散,这是一种通用框架,它在预训练扩散模型的去噪轨迹中逐步增加分辨率。为此,我们开发了一种频域噪声扩展机制,并从模型的功率谱中推导出最优的分辨率计划。我们的框架支持无训练加速,并且提供了一种新的微调配方,进一步提高了效率和质量。我们在最先进的预训练图像和视频生成模型上实现了显著的加速,同时保持了视觉质量。

英文摘要

Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model's power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.

2605.18678 2026-05-21 cs.CV cs.AI 版本更新

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Lance:通过多任务协同实现统一多模态建模

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

发表机构 * Intelligent Creation Lab, ByteDance(字节跳动智能创作实验室)

AI总结 本文提出Lance,一种轻量级的原生统一模型,支持图像和视频的多模态理解、生成和编辑。该模型通过协同多任务训练的实用范式实现统一多模态建模,基于统一上下文建模和解耦能力路径两个核心原则,通过双流混合专家架构实现联合上下文学习并解耦理解和生成路径。

Comments 34 pages, 14 figures, 10 tables, homepage url: https://lance-project.github.io , code url: https://github.com/bytedance/Lance

详情
AI中文摘要

我们提出了Lance,一种轻量级的原生统一模型,支持图像和视频的多模态理解、生成和编辑。与依赖模型容量扩展或文本-图像主导设计不同,Lance通过协同多任务训练探索统一多模态建模的实用范式。其基于两个核心原则:统一上下文建模和解耦能力路径。具体而言,Lance从头开始训练,并在共享交错的多模态序列上采用双流混合专家架构,实现联合上下文学习的同时解耦理解和生成路径。我们进一步引入模态感知的旋转位置编码以减轻异构视觉标记之间的干扰并提升跨任务对齐。在训练过程中,Lance采用分阶段的多任务训练范式,结合能力导向的目标和自适应数据调度,以加强语义理解和视觉生成性能。实验结果表明,Lance在图像和视频生成方面显著优于现有开源统一模型,同时保留了强大的多模态理解能力。该模型的主页可在https://lance-project.github.io上访问。

英文摘要

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

2605.18447 2026-05-21 cs.CV 版本更新

NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty

基于NeRF的在轨航天器单目影像重建:在光照变化和姿态不确定性下的应用

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

发表机构 * ICTEAM, UCLouvain(ICTEAM,法语鲁汶大学) ESAT, KU Leuven(ESAT,鲁汶大学) MECH, KU Leuven(机械工程系,鲁汶大学) Aerospacelab(航天实验室) UCLouvain(法语鲁汶大学)

AI总结 本文提出一种基于NeRF的方法,通过引入图像特定的外观嵌入和姿态修正项,提升在光照变化和姿态不确定性下的航天器重建鲁棒性,验证了其在离线重建中的有效性,并展示了其在在线重建中的潜力。

Comments (under review)

详情
AI中文摘要

自主接近和临近操作围绕非合作、未知航天器是主动碎片清除和在轨服务任务的关键。此类操作的关键组成部分是从一组2D图像中离线重建目标的3D模型。这项任务具有挑战性,因为有两个主要因素:首先,在轨光照条件表现出显著的变异性,并且随时间迅速变化。其次,图像中的姿态信息不准确,导致3D重建的不确定性。为克服这些挑战,我们提出扩展Neural Radiance Fields,引入每图像的自由度:一个可学习的外观嵌入,捕捉每张图像特定的光照条件,以及一个图像特定的姿态修正项,以细化其噪声姿态标签,提高图像间的3D一致性。这些参数增加了极小的复杂性,因为它们与NeRF联合学习,但显著提高了对光照变化和姿态不准确性的鲁棒性。我们在三个代表在轨操作的图像集中验证了我们的方法,证明了其在离线重建中的有效性,并突显了其在在线重建中的适用性,这在该领域是一个开放性问题。

英文摘要

Autonomous rendezvous and proximity operations around uncooperative, unknown spacecraft are critical for active debris removal and on-orbit servicing missions. A key component of such operations is the offline reconstruction of a 3D model of the target from a set of 2D images. This task is challenging due to two main factors. First, in-orbit illumination conditions exhibit considerable variability and change rapidly over time. Second, the inaccuracy of pose information in the images results in 3D reconstruction uncertainty. To overcome these challenges, we propose to extend Neural Radiance Fields with per-image degrees of freedom: a learnable appearance embedding that captures the illumination conditions specific to each image, and an image-specific pose correction term that refines its noisy pose label to increase 3D consistency across images. These parameters add minimal complexity, as they are learned jointly with the NeRF, yet they substantially improve robustness to illumination variability and pose inaccuracies. We validate our approach on three image sets representative of in-orbit operations, demonstrating its effectiveness for offline reconstruction and highlighting its suitability for online reconstruction, an open problem in the field.

2605.17946 2026-05-21 cs.AI cs.CV cs.LG 版本更新

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch: 一种面向游戏垂直领域的多模态知识密集型短视频帧搜索基准

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

发表机构 * Kuaishou Technology(快手科技)

AI总结 本文提出SVFSearch,首个针对中文游戏领域短视频帧搜索的多模态知识密集型基准,通过5000个四选一测试示例和4198个辅助训练示例,评估了从直接问答到计划-行动-重新计划代理等多种方法在短视频帧搜索中的性能。

详情
AI中文摘要

多模态大语言模型越来越多地被用作代理的骨干,以理解多模态输入、计划检索操作、调用外部工具并推理由检索信息得出的结论。然而,现有的基准很少评估在短视频应用中的这种能力,其中暂停的帧通常在视觉上具有歧义性,回答需要垂直的、长尾的和快速发展的领域知识。我们引入了SVFSearch,这是首个针对中文游戏领域短视频帧搜索的开放基准。SVFSearch包含5,000个四选一测试示例和4,198个辅助训练示例,每个示例都围绕一个暂停的游戏场景展开,来自真实的短视频片段。为了支持公平且可重复的评估,SVFSearch提供了一个冻结的离线检索环境,包括一个游戏领域文本语料库、一个主题链接的图像画廊以及文本、图像和多模态检索接口,避免了对不受控的网络搜索API的依赖。我们评估了从直接问答和RAG工作流程到计划-行动-重新计划代理和学习搜索模型在内的代表性范式。结果揭示了模型单独回答、实际代理搜索和 oracle 知识之间的巨大差距:最好的开源直接问答模型达到66.4%,最好的实际代理达到79.1%,而 oracle 知识达到95.4%。进一步分析揭示了视觉定位、检索质量、证据基础推理和工具使用行为中的瓶颈,包括过度检索、只回答捷径和检索诱导的误导。

英文摘要

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

2605.17472 2026-05-21 cs.CV 版本更新

Weighted Reverse Convolution for Feature Upsampling

加权反卷积用于特征上采样

Wentong Li, Zhiyuan Qi, Zichen Zhao, Kai Zhang, Lei Zhang

发表机构 * Nanjing University of Aeronautics and Astronautics(南京航空航天大学) The Hong Kong Polytechnic University(香港理工大学) Nanjing University(南京大学)

AI总结 本文提出加权反卷积(WRC),从逆问题的角度重新审视视觉基础模型中的特征上采样,通过空间自适应的逆操作提升高层视觉描述符的密度,从而在需要细粒度定位、密集预测和点对应的任务中提升性能。

Comments 18 pages, 7 figures, code:https://github.com/PolyU-VCLab/WRC

详情
AI中文摘要

预训练的视觉基础模型(VFMs)提供强大的语义表示,但其补丁级特征本质上是粗略的,限制了在需要细粒度定位、密集预测和点对应的任务中的有效性。在本文中,我们从逆问题的角度重新审视VFMs中的特征上采样,并提出加权反卷积(WRC),一种空间自适应的逆操作,用于密集化高层视觉描述符。具体来说,我们将特征上采样公式化为加权Tikhonov正则化最小二乘问题,其中空间变化的权重在每个空间位置调节数据保真度和先验强度。这使得WRC能够适应空间变化的特征特性,从而在保留关键结构的同时减轻过平滑问题。此外,WRC保留了一个高效、完全可微的闭合形式FFT解,使其成为一种实用的上采样操作符。在轻量级自监督密集化框架中集成后,WRC在各种下游基准测试中一致提高了密集特征质量,包括分割、深度估计、视频对象分割、对象发现和关键点对应,同时保持高计算效率。

英文摘要

Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of an inverse problem and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
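To make the phrase "closed-form FFT solution" concrete, the sketch below solves the standard unweighted Tikhonov-regularized deconvolution problem, minimizing ||k (*) x - y||^2 + lam * ||x||^2 for a 1D circular convolution (*). Its frequency-domain solution is X = conj(K) * Y / (|K|^2 + lam). This is the textbook uniform-weight case, not WRC itself: the spatially varying weights described in the abstract are omitted, and the naive DFT helper, the function names, the kernel, and the parameter lam are assumptions made for illustration.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse applies the 1/n factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for f in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_conv(k, x):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(k[j] * x[(i - j) % n] for j in range(n)) for i in range(n)]


def tikhonov_deconv(y, k, lam):
    """Closed-form minimizer of ||k (*) x - y||^2 + lam * ||x||^2 via the DFT."""
    Y, K = dft(y), dft(k)
    X = [Kf.conjugate() * Yf / (abs(Kf) ** 2 + lam) for Kf, Yf in zip(K, Y)]
    return [v.real for v in dft(X, inverse=True)]


# Toy usage: blur a signal with a kernel whose spectrum has no zeros, then invert.
signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 0.5, 2.0]
kernel = [0.6, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2]
blurred = circular_conv(kernel, signal)
recovered = tikhonov_deconv(blurred, kernel, lam=1e-9)
smoothed = tikhonov_deconv(blurred, kernel, lam=1.0)
```

With a very small lam the blur is inverted almost exactly, while a larger lam shrinks every frequency component and yields a smoother, lower-energy estimate. A spatially weighted variant would trade these two behaviours off per location rather than globally.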

2605.16962 2026-05-21 cs.CV cs.AI 版本更新

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

OmniVL-Guard Pro: 一个增强工具的代理用于综合视觉-语言防伪

Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China(合肥工业大学计算机科学与信息工程学院) Wuhan University, Wuhan, China(武汉大学) Lab for Intelligence and visiON (LION)(智能与视觉实验室) Xi'an Jiaotong University(西安交通大学)

AI总结 该研究提出OmniVL-Guard Pro,一种增强工具的代理,用于综合视觉-语言防伪,通过整合多种工具环境和引入新的强化学习方法,实现了开放世界中的线索驱动推理,并在多个任务上达到了最先进的性能。

Comments 29 pages

详情
AI中文摘要

现有的视觉-语言伪造检测和定位方法基于封闭世界范式,假设模型可以单独完成验证。然而,自包含的MLLMs受限于有限的参数知识、静态训练语料和有限的感知分辨率,在动态开放世界防伪中存在实际限制,特别是在需要外部线索的实时事件验证和需要对局部篡改进行细致审查的伪造分割中。为了解决这些限制,我们从扩大自包含模型转向超越它。我们提出了OmniVL-Guard Pro,一种增强工具的代理,将统一的防伪从封闭世界预测扩展到开放世界的线索驱动推理。OmniVL-Guard Pro整合了一个涵盖实时事件搜索、局部裁剪和缩放、边缘异常筛查、人脸检测、视频帧提取以及SAM3基于分割的工具环境。为了生成高质量的工具推理轨迹,我们引入了树状结构的自进化工具轨迹生成,通过种子引导、无引导的自我进化和弱提示的硬样本合成生成多样化的轨迹,产生Full-Spectrum Tool Reasoning (FSTR)数据集用于训练。我们进一步提出了Checker-Guided Agentic Reinforcement Learning (CGARL),它为过程级监督提供,以惩罚那些答案正确但推理扭曲的情况。广泛的实验表明,OmniVL-Guard Pro在各种任务上实现了最先进的性能,并表现出强大的零样本泛化能力。FSTR数据集和OmniVL-Guard Pro的代码将在https://github.com/shen8424/OmniVL-Guard-Pro公开发布。

英文摘要

Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics, particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

2605.16530 2026-05-21 cs.CV 版本更新

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

SWoMo:用于白内障手术模拟的神经符号世界模型

Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay

发表机构 * Technical University Darmstadt(达姆施塔特工业大学) Carl Zeiss AG(蔡司股份有限公司) AICM, Medical Faculty of Heidelberg University(海德堡大学医学院)

AI总结 本文提出SWoMo,一种用于白内障手术模拟的神经符号世界模型,通过分离运动生成与视觉真实性,结合规则基模拟器和场景图表示来建模运动动态和工具-组织交互,同时使用扩散模型生成逼真的视觉效果,从而提升手术模拟的真实性和临床适用性。

详情
AI中文摘要

现实手术模拟在培训初学者外科医生和开发自主代理方面起着至关重要的作用。世界模型可以通过根据当前观察和手术动作预测未来患者状态,将此类模拟环境扩展到真实且多样的程序中。然而,当前最先进的方法往往无法满足临床应用所需的关键标准,包括视觉真实性、物理基础的交互以及模拟超出训练分布的场景的能力。因此,我们引入SWoMo,一种用于白内障手术模拟的神经符号世界模型,该模型将运动生成与视觉真实性解耦。符号组件包括基于规则的模拟器和场景图表示,用于建模运动动态和工具-组织交互,而扩散模型则生成逼真的视觉外观,包括纹理和组织变形。我们提出了一种逆配对策略,通过在模拟器中重建真实的手术视频以获得配对的模拟和真实视频,然后用于训练我们的视频扩散模型,以实现反向的仿真到现实的翻译目标。我们的实验表明,与先前工作相比,既有定性也有定量的改进。我们证明,我们的模拟器进一步满足了关键标准,包括对未见交互几何的泛化、下游阶段检测的改进以及无监督的视频风格迁移。代码、数据和模型权重可在:https://ssharvienkumar.github.io/SWoMo/上获取。

英文摘要

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

2605.15876 2026-05-21 cs.CV 版本更新

Unlocking Dense Metric Depth Estimation in VLMs

解锁VLMs中的密集度量深度估计

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

发表机构 * Zhejiang University(浙江大学) Tencent Hunyuan LLM(腾讯混元大模型) HKUST(香港科技大学) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 本文提出DepthVLM,一种将单个VLM转换为原生密集几何预测器的简单有效框架,同时保持其多模态能力。通过在LLM主干上附加轻量级深度头,并在统一的视觉-文本监督范式下进行训练,DepthVLM能够在单次前向传递中生成高分辨率深度图和语言输出。此外,还引入了一个统一的室内-室外度量深度基准,实验表明DepthVLM在推理效率、复杂3D空间推理等方面均优于现有VLMs和纯视觉模型。

Comments Project Page: https://depthvlm.github.io/

详情
英文摘要

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

2605.14417 2026-05-21 cs.RO cs.CV 版本更新

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

在身体移动之前:为语言条件的人形控制学习预见性关节意图

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) LimX Dynamics Technology Co., Ltd.(LimX动态技术有限公司) Shandong University(山东大学) Data61/CSIRO Griffith University(格里菲斯大学) Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)(深度感知技术研究院,江苏省工业技术研究院(JITRI))

AI总结 该研究提出DAJI框架,通过学习语言生成与闭环控制之间的预见性关节意图接口,解决语言条件人形机器人中预见未来物理转换的需求,实现了在HumanML3D风格生成和BABEL任务中的高性能表现。

详情
AI中文摘要

自然语言是人形机器人的直观接口,但流式全身控制需要既能立即执行、又能预见未来物理转换的控制表示。现有语言条件人形系统通常生成低级跟踪器必须反应性修复的运动学参考,或使用隐式/动作策略,其输出不显式编码即将发生的接触变化、支撑转移和平衡准备。我们提出DAJI(Dynamics-Aligned Joint Intent),一个分层框架,学习语言生成与闭环控制之间的预见性关节意图接口。DAJI-Act通过学生驱动的回放将具备未来感知能力的教师蒸馏为可部署的扩散动作策略,而DAJI-Flow自回归地从语言和意图历史生成未来意图块。实验表明,DAJI在预见性隐式学习、单指令生成和流式指令跟随中表现优异,在HumanML3D风格生成中达到94.42%的回放成功率,在BABEL任务中达到0.152的子序列FID。

英文摘要

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

2605.14382 2026-05-21 cs.CV cs.GR cs.MM 版本更新

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Delta Forcing:交互式自回归视频生成中的信任区域引导

Yuheng Wu, Xiangbo Gao, Tianhao Chen, Xinghao Chen, Qing Yin, Zhengzhong Tu, Dongman Lee

发表机构 * Texas A&M University(德克萨斯A&M大学) University of Washington(华盛顿大学)

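AI summary This paper proposes Delta Forcing, which constrains unreliable teacher supervision within an adaptive trust region to improve the consistency of autoregressive video generation while preserving responsiveness to new events.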

详情
AI中文摘要

交互式实时自回归视频生成对于内容创作和世界建模等应用至关重要,其中视觉内容必须适应动态变化的事件条件。一个基本挑战在于在反应性和稳定性之间取得平衡:模型必须迅速响应新事件,同时在长时间范围内保持时间一致性。现有方法将双向模型蒸馏为自回归生成器,并进一步通过流式长调优进行适应,但往往在条件变化后仍会出现持续漂移。我们发现原因在于条件偏差,其中教师可能提供与条件对齐但轨迹无关的指导,使生成偏向于局部有效但全局不一致的模式。受信任区域策略优化的启发,我们提出Delta Forcing,一种简单而有效的框架,它将不可靠的教师监督限制在适应性信任区域内。具体而言,Delta Forcing从教师和生成器轨迹之间的潜在delta估计转移一致性,并利用它来平衡教师监督与单调连续性目标。这抑制了不可靠的教师诱导的偏移,同时保持对新事件的响应性。广泛的实验表明,Delta Forcing在提高一致性的同时保持了事件的响应性。

英文摘要

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

2605.14201 2026-05-21 cs.RO cs.CV 版本更新

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

MAPLE:基于潜在空间的多智能体交互用于端到端自动驾驶

Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai

发表机构 * Qualcomm AI Research(高通人工智能研究院)

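AI summary This paper proposes MAPLE, a framework that performs reactive multi-agent rollouts in the latent space of a vision-language-action model to address the brittleness of models trained under traditional imitation learning when they are evaluated in closed loop. By combining supervised fine-tuning with reinforcement learning and diversity rewards, it enables scalable closed-loop training without external simulators and improves the robustness of end-to-end autonomous driving systems.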

Comments 19 pages, 9 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型在端到端运动规划中表现出色,但在闭环设置中由于训练基于传统模仿学习框架而显得脆弱。现有的闭环监督方法缺乏可扩展性且无法完全建模反应式环境。我们提出MAPLE,一种新的框架,用于在VLA模型的潜在空间中进行动态驾驶场景的反应式多智能体滚动。主体车辆和附近交通代理在多步时间范围内独立控制,同时对场景中的其他代理具有反应性,从而实现闭环训练。MAPLE包含两个训练阶段:(1)基于真实轨迹的潜在滚动监督微调,随后是(2)具有全局和代理特定奖励的强化学习,这些奖励鼓励安全、进展和交互真实感。我们进一步提出多样性奖励,鼓励模型生成可能不在记录驾驶数据中存在的规划行为。值得注意的是,我们的闭环训练框架具有可扩展性,且无需外部模拟器,这些模拟器计算成本高且视觉保真度有限。MAPLE在Bench2Drive上实现了最先进的驾驶性能,并展示了可扩展的闭环多智能体交互,为鲁棒的端到端自动驾驶系统提供了支持。

英文摘要

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

2605.13475 2026-05-21 cs.CV 版本更新

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

FedHPro: 通过梯度匹配实现联邦超原型学习

Huan Wang, Jun Shen, Haoran Li, Zhenyu Yang, Jun Yan, Ousman Manjang, Yanlong Zhai, Di Wu, Guansong Pang

发表机构 * School of Computing and Information Technology, University of Wollongong, Wollongong, Australia(计算与信息技术学院,卧龙岗大学,澳大利亚) School of Computing and Information Systems, Singapore Management University, Singapore, Singapore(计算与信息系统学院,新加坡管理大学,新加坡) Monash University, Australia(莫纳什大学,澳大利亚) Macquarie University, Australia(麦考瑞大学,澳大利亚) Beijing Institute of Technology, China(北京理工大学,中国) La Trobe University, Australia(拉筹伯大学,澳大利亚)

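AI summary This paper proposes FedHPro, a framework that introduces hyper-prototypes and gradient matching to improve inter-class separability and intra-class consistency in federated learning. Experiments show state-of-the-art performance on several benchmark datasets.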

Comments 23 pages, ICML 2026 Camera-ready Version

详情
AI中文摘要

联邦学习(FL)能够在保护隐私的同时实现分布式客户端的协同训练。为了增强FL的泛化能力,基于原型的FL受到关注,因为共享的全局原型为对齐客户端特定的局部原型提供了语义锚点。然而,现有方法通过平均局部原型或细化全局锚点来更新全局原型,这通常导致客户端间的语义漂移,从而产生不一致的全局信号。为了解决这个问题,我们引入了超原型,由一组可学习的全局类别级原型定义,以在客户端间保留底层语义知识。超原型通过梯度匹配进行优化,以对齐从客户端真实样本中直接提取的类别相关特征,而不是原型级描述符。我们进一步提出了FedHPro,一个联邦超原型学习框架,以利用超原型通过互对比学习和客户端特定的边距来促进类别间分离度,同时通过一致性惩罚促进类别内均匀性。在多样化的异构场景中的全面实验表明,1)超原型产生更一致的全局信号,2)FedHPro在多个基准数据集上达到最优性能。代码可在https://github.com/mala-lab/FedHPro获取。

英文摘要

Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at https://github.com/mala-lab/FedHPro.

2605.13081 2026-05-21 cs.CV 版本更新

PRA-PoE: Robust Multimodal Alzheimer's Diagnosis with Arbitrary Missing Modalities

PRA-PoE: 基于任意缺失模态的鲁棒多模态阿尔茨海默病诊断

Guangqian Yang, Ye Du, Wenlong Hou, Qian Niu, Shujun Wang

发表机构 * Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China(生物医学工程系,香港理工大学,香港特别行政区,中国) Department of Technology Management for Innovation, The University of Tokyo, Japan(创新技术管理系,东京大学,日本) Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China(数据科学与人工智能系,香港理工大学,香港特别行政区,中国)

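AI summary This work proposes PRA-PoE, a framework that uses prototype-anchored representation alignment and an uncertainty-aware Product of Experts fusion mechanism to address the representation shift caused by missing modalities in multimodal learning, improving diagnostic robustness and accuracy across missingness patterns.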

Comments Early accepted by MICCAI 2026

详情
AI中文摘要

缺失模态在真实世界阿尔茨海默病评估中普遍存在,对多模态学习构成重大挑战,尤其是在训练和部署时观测模态子集的分布不同时。这种缺失模式不匹配会引发跨模态子集的条件表示偏移。现有方法依赖于隐式插补或模态合成,往往无法显式建模模态可用性和不确定性,导致过度依赖合成特征、鲁棒性降低和不确定性估计不准确。为了解决这些限制,我们提出PRA-PoE,一种不完整多模态学习框架,配备了原型锚定表示对齐(PRA)和不确定性感知专家(UA-PoE)融合机制。首先,PRA使用可学习的全局原型和可用性条件化标记来编码模态可用性,区分观测与缺失模态,重新合成缺失模态的特征,并自适应地细化观测表示以对齐跨模态子集的潜在空间,目标是在不同缺失模式下减少表示偏移。其次,UA-PoE将每个模态建模为高斯专家并执行闭式产品专家融合,其中不确定性较高的专家会通过较低的精度自动降权,从而提高不确定性可靠性。我们通过在临床现实协议下训练使用自然缺失数据并在所有非空模态组合上测试来评估PRA-PoE。PRA-PoE在所有非空模态子集上均优于现有最佳方法,实现了在ADNI数据集上的平均准确率5.4%的相对提升,在OASIS-3数据集上的平均F1值达到10.9%的相对提升。

英文摘要

Missing modalities are prevalent in real-world Alzheimer's disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.
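
The closed-form fusion step described for UA-PoE can be illustrated with the textbook product of Gaussians. The sketch below is only an assumption-laden illustration, not the authors' implementation: it treats each modality as a one-dimensional Gaussian `(mean, variance)`, marks a missing modality with `None`, and omits any prior expert. The function name `fuse_gaussian_experts` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Expert = Tuple[float, float]  # (mean, variance) of one modality's Gaussian


def fuse_gaussian_experts(experts: Sequence[Optional[Expert]]) -> Expert:
    """Closed-form product of 1-D Gaussian experts.

    A `None` entry marks a missing modality and is skipped (an assumption
    made for this sketch). Each expert contributes its precision
    (1 / variance), so an uncertain expert is down-weighted automatically.
    """
    total_precision = 0.0
    weighted_mean_sum = 0.0
    for expert in experts:
        if expert is None:
            continue
        mean, variance = expert
        if variance <= 0.0:
            raise ValueError("variance must be positive")
        precision = 1.0 / variance
        total_precision += precision
        weighted_mean_sum += precision * mean
    if total_precision == 0.0:
        raise ValueError("at least one modality must be observed")
    fused_mean = weighted_mean_sum / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


# Hypothetical usage: a confident expert, an uncertain one, and a missing one.
fused_mean, fused_variance = fuse_gaussian_experts([(0.0, 1.0), (4.0, 4.0), None])
```

In this toy call the fused mean sits closer to the confident expert's mean, and the fused variance is smaller than either input variance, which is the down-weighting behaviour the abstract describes.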

2605.10830 2026-05-21 cs.CV cs.LG 版本更新

Predicting 3D structure by latent posterior sampling

通过潜在后验采样预测3D结构

Azmi Haider, Dan Rosenbaum

发表机构 * Department of Computer Science(计算机科学系) University of Haifa(海法大学) Department of Computational Science(计算科学系)

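AI summary This paper proposes a probabilistic modeling approach that combines a NeRF representation with diffusion models to predict 3D structure accurately from different types of observations (single-view, multi-view, noisy images, sparse pixels, and sparse depth data).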

详情
AI中文摘要

生成模型在2D图像和神经场表示在3D场景中的显著成就提供了一个有吸引力的机会,将两种方法的优势结合起来。在本工作中,我们提出了一种方法,将基于NeRF的3D场景表示与扩散模型的概率建模和推理相结合。我们将3D重建视为一个具有内在不确定性的感知问题,从而可以受益于概率推断方法。核心思想是将3D场景表示为一个随机的潜在变量,我们可以学习其先验分布,并在给定一组观测数据的情况下进行后验推断。我们通过扩散模型的分数推理方法进行后验采样,并结合从重建模型计算出的似然项(包括体渲染)。我们通过两阶段过程训练模型:首先训练重建模型并自动解码潜在表示以处理3D场景的数据集,然后在潜在空间上训练扩散模型的先验。通过使用模型从后验中生成样本,我们证明了各种3D重建任务可以执行,根据所使用的输入观测类型不同。我们展示了从单视角、多视角、噪声图像、稀疏像素和稀疏深度数据的重建。这些观测在提供的场景信息量上有所不同,我们展示了我们的方法能够建模与每个任务相关的不同水平的内在不确定性。我们的实验表明,这种方法产生了一种全面的方法,能够准确地从各种观测类型中预测3D结构。

英文摘要

The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.
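
The combination of a learned prior with a likelihood term follows from Bayes' rule applied to scores. As a hedged reminder, with notation that is assumed here rather than taken from the paper ($z_t$ for the scene latent at noise level $t$, $y$ for the observations):

```latex
\nabla_{z_t} \log p(z_t \mid y)
  = \underbrace{\nabla_{z_t} \log p(z_t)}_{\text{prior score (diffusion model)}}
  + \underbrace{\nabla_{z_t} \log p(y \mid z_t)}_{\text{likelihood term (reconstruction model)}}
```

The identity itself is exact; how the likelihood term is approximated at each noise level is a design choice that the abstract does not specify.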

2605.10603 2026-05-21 cs.CV 版本更新

Segment Anything with Robust Uncertainty-Accuracy Correlation

具有鲁棒不确定性和准确性相关性的分割任何东西

Hongyou Zhou, Marc Toussaint, Ling Shao, Zihan Ye

发表机构 * Technical University of Berlin(柏林工业大学) University of Chinese Academy of Sciences(中国科学院大学)

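AI summary This paper proposes RUAC, a segmentation method that adds a lightweight uncertainty head and adversarial training to improve pixel-wise uncertainty estimation under appearance and deformation shifts, improving both segmentation quality and uncertainty-accuracy correlation.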

Comments ICML 2026

详情
AI中文摘要

尽管在零样本性能方面表现强劲,SAM在域转移下不可靠,因为Mask级置信度混淆(MCC),其中基于IoU的单个掩码分数无法反映边界附近的像素级可靠性。受神经网络中纹理偏置捷径与人类视觉中以形状为中心的处理之间的对比启发,我们将域外变化建模为外观转移和非刚性变形,这些共同压力校准。我们提出Segment Anything with Robust Uncertainty-Accuracy Correlation(RUAC)以在外观和变形转移下实现鲁棒的像素级不确定性估计。RUAC添加了一个轻量级的不确定性头,通过联合扰动纹理和几何的协作风格-变形攻击进行训练,并应用不确定性-准确性对齐以确保在对抗性扰动下不确定性仍能一致地突出错误像素。在23个零样本领域中,RUAC提高了分割质量和更忠实的不确定性,具有更强的不确定性-准确性相关性。项目页面:https://hongyouzhou.github.io/ruac/.

英文摘要

Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://hongyouzhou.github.io/ruac/.

2605.10181 2026-05-21 cs.CV cs.AI 版本更新

A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

机器学习与深度学习在分布外检测中的比较研究

Jihyeon Baek, Seunghoon Lee, Gitaek Kwon, Doohyun Park

发表机构 * VUNO Inc.(VUNO公司)

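AI summary This paper compares traditional machine learning and deep learning on out-of-distribution detection and finds that lightweight machine learning methods match the accuracy of deep learning at substantially lower computational cost, making them suitable for tasks of limited visual complexity.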

Comments Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

详情
AI中文摘要

分布外检测(OOD)对于构建可靠的人工智能系统至关重要,因为无法信任产生无效输入输出的模型。尽管深度学习(DL)通常被认为优于传统机器学习(ML),但医学影像数据通常是在标准化协议下获取的,导致在OOD检测任务中图像变化相对受限。这促使在该设置下直接比较ML和DL方法。两种方法在包含超过60,000张视网膜和非视网膜图像的开放数据集上进行了评估,涵盖多种分辨率。两种方法在内部和外部验证集上均实现了AUROC为1.000和准确性在0.999至1.000之间的结果,显示出相当的检测性能。然而,ML方法在保持等同准确性的同时,表现出显著更低的端到端延迟,表明具有更大的计算效率。这些结果表明,对于视觉复杂度有限的OOD检测任务,轻量级ML方法可以实现DL级别的性能,但计算成本显著降低,支持实际应用场景的部署。

英文摘要

Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.

2605.10165 2026-05-21 cs.CV cs.AI 版本更新

Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation

通过标准化损失聚合进行任务无关的噪声标签检测

Inhyuk Park, Doohyun Park

发表机构 * VUNO Inc.(VUNO公司)

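AI summary This paper proposes SLA, a task-agnostic noisy-label detection method that quantifies label reliability by aggregating standardized cross-validation losses. Experiments show that SLA outperforms the hard-counting baseline at all noise levels and converges faster under low noise ratios, supporting efficient re-annotation and improved dataset reliability.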

Comments Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

详情
AI中文摘要

由于观察者差异和模糊案例,大规模医学影像数据集中的噪声标签很常见。我们提出了一种统计上站得住且任务无关的框架,即标准化损失聚合(SLA),用于在样本层面检测噪声标签。SLA通过在重复交叉验证运行中聚合标准化的折叠级验证损失来量化标签可靠性。这种公式将离散的硬计数方案泛化为一个连续估计器,能够捕捉性能偏差的频率和幅度,从而产生可解释且统计上稳定的噪声分数。在公共视网膜数据集上的实验表明,SLA在所有噪声水平下均优于硬计数基线,并在低噪声比情况下收敛速度显著加快,尤其是在细微损失变化具有信息量的情况下。具有高SLA分数的样本指示可能模糊或错误标注的案例,从而指导高效的重新标注,提高任何分类任务的数据集可靠性。

英文摘要

Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.
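
The aggregation idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: losses are z-scored within each validation fold, and a sample's score is the mean of its z-scores over repeated cross-validation runs. The function name `sla_scores` and the input layout (`losses[r][i]` and `fold_ids[r][i]` for run `r`, sample `i`) are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def sla_scores(losses: Sequence[Sequence[float]],
               fold_ids: Sequence[Sequence[int]]) -> List[float]:
    """Average fold-standardized validation loss per sample.

    losses[r][i]   : validation loss of sample i in cross-validation run r
    fold_ids[r][i] : validation fold that sample i belonged to in run r
    A higher score suggests a more suspicious label (assumed convention).
    """
    n_samples = len(losses[0])
    totals = [0.0] * n_samples
    for run_losses, run_folds in zip(losses, fold_ids):
        # Group sample indices by the fold they were validated in.
        by_fold: Dict[int, List[int]] = {}
        for idx, fold in enumerate(run_folds):
            by_fold.setdefault(fold, []).append(idx)
        for members in by_fold.values():
            fold_losses = [run_losses[i] for i in members]
            mu = mean(fold_losses)
            sigma = pstdev(fold_losses)
            for i in members:
                # A fold with zero spread carries no ranking information.
                z = 0.0 if sigma == 0.0 else (run_losses[i] - mu) / sigma
                totals[i] += z
    return [total / len(losses) for total in totals]


# Hypothetical toy data: sample 3 has a consistently high loss.
toy_losses = [[0.1, 0.2, 0.1, 2.0], [0.2, 0.1, 0.2, 1.8]]
toy_folds = [[0, 0, 1, 1], [0, 1, 0, 1]]
scores = sla_scores(toy_losses, toy_folds)
```

Unlike a hard count of how often a sample exceeds a threshold, the continuous score retains the magnitude of each deviation, which is the property the abstract highlights.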

2605.09586 2026-05-21 cs.CV cs.RO 版本更新

DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

DeformMaster: 一个用于从视频中生成变形物体交互物理-神经世界模型

Can Li, Zhoujian Li, Ren Li, Jie Gu, Lei Lei, Jingmin Chen, Lei Sun

发表机构 * Nankai University(南开大学) Zhejiang University(浙江大学) Southern University of Science and Technology(南方科技大学) Rightly Robotics, A4X(Rightly Robotics,A4X) University of Science and Technology of China(中国科学技术大学)

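AI summary This work proposes DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into a unified dynamics-and-appearance framework for deformable objects. It preserves structured physical rollout, uses a neural residual to compensate for unmodeled effects, and drives high-fidelity 4D appearance. Experiments show that it outperforms existing methods in dynamics prediction and appearance rendering.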

Comments Project page: https://can-lee.github.io/deformmaster-web/

详情
AI中文摘要

世界模型用于变形物体应恢复不仅几何和外观,还应包含底层物理动态、交互基础和材料行为。从真实视频中学习此类模型具有挑战性,因为变形的线性、平面和体积物体在高维变形、噪声交互和复杂材料响应下演变。因此,模型必须从视觉观测中推断物理状态,通过新交互推进,并以高视觉保真度渲染结果。我们提出了DeformMaster,一种视频衍生的交互物理-神经世界模型,将真实交互视频转化为统一动态-外观框架中的变形物体在线交互模型。DeformMaster保留了结构化的物理推演,同时利用神经残差补偿未建模效应,将稀疏手部运动作为分布式合规执行器用于手-连续体交互,用空间变化的本构专家表示材料响应,并从预测的物理演变中驱动高保真4D外观。在真实世界变形物体序列上的实验表明,DeformMaster能够推演未来动态并渲染动态外观,优于现有最先进基线,同时支持新动作推演、材料参数变化和动态新视角合成。项目页面:https://can-lee.github.io/deformmaster-web/

英文摘要

World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as a distributed compliant actuator for hand-continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis. Project page: https://can-lee.github.io/deformmaster-web/

2605.08858 2026-05-21 cs.CV 版本更新

ProDG: Prototypes for Data-Free Generative Post-Hoc Explainability

ProDG:用于无数据后置可解释性的原型

Piotr Borycki, Magdalena Trędowicz, Jacek Tabor, Łukasz Struski, Przemysław Spurek

发表机构 * Jagiellonian University(雅盖隆大学) IDEAS Research Institute(IDEAS研究所) Centre for Credible AI(可信AI中心) Warsaw University of Technology(华沙理工大学)

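AI summary This paper proposes ProDG, a data-free post-hoc explainability framework that uses generative models to synthesize pure, high-fidelity prototypes directly from a frozen model's weights, removing the dependency on any external data and providing robust visual interpretability for privacy-sensitive domains.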

详情
AI中文摘要

基于原型的前置可解释性方法通过利用直观的'这看起来像那'推理范式提供高度准确的解释。另一方面,后置模型可以在不依赖底层数据集或需要昂贵神经网络重新训练的情况下解释单个图像的预测。最近的方法成功解决了原型网络的重新训练问题。然而,它们仍然面临一个根本限制:它们需要访问数据子集(例如测试或验证集)来搜索并提取视觉原型。在本文中,我们解决了这一问题,并引入了ProDG:用于无数据后置可解释性的生成原型,一种新的框架,利用生成模型直接从冻结模型的权重中合成纯、高保真的原型,完全消除了对任何外部数据的依赖。通过在无数据XAI领域建立新的前沿,ProDG为隐私敏感领域解锁了稳健的视觉可解释性,其中原始数据受到严格限制或根本无法访问。项目页面:https://github.com/piotr310100/ProDG

英文摘要

Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive "this looks like that" reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model's weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: https://github.com/piotr310100/ProDG

2605.05405 2026-05-21 cs.CV 版本更新

Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response

通过联合嵌入实现零样本卫星图像检索:应用于危机响应

James Walsh, William Fawcett, Grace Colverd, Raúl Ramos-Pollán

发表机构 * Trillium Technologies University of Cambridge(剑桥大学) Universidad de Antioquia(Antioquia大学)

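AI summary This paper proposes GeoQuery, a system that enables natural-language querying at global scale through a two-stage semantic and visual search, without paired data or large-scale compute. It uses natural-language embeddings of a subset of global data and optimizes the description-generation prompt so that distances in the text-embedding space correlate with distances in the frozen CLAY visual-embedding space, reaching 31.6% accuracy within 50 km on 76 disaster-location queries.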

详情
AI中文摘要

地球观测档案的语义搜索仍具挑战性。视觉基础模型如CLAY能生成丰富的卫星图像嵌入,但缺乏用于直观查询所需的自然语言基础,而对遥感CLIP式模型的完整对比训练需要配对数据和计算资源,这些在全球范围内不可用。为允许全球范围内的自然语言查询,我们提出GeoQuery,一种零样本检索系统,通过两阶段语义和视觉搜索绕过数据和计算限制,利用部分全球数据的自然语言嵌入。我们不训练联合编码器,而是为100,000个代理子集的全球Sentinel-2瓦片生成语言描述,并优化描述生成提示,使生成的文本嵌入空间中的距离与冻结CLAY视觉嵌入空间中的距离相关联。查询分为两个阶段,首先在代理子集上进行文本相似度搜索,然后在全球CLAY嵌入中进行视觉最近邻搜索。在76个灾难地点查询中,包括英国洪水、美国野火和美国干旱,GeoQuery在50公里内达到31.6%的准确率,其中洪水表现最强(50%在50公里内),因为地形特征由RGB嵌入良好捕获。在名为\ECHO{}的危机响应系统中部署,GeoQuery在布里斯班2025年 Cyclone Alfred期间识别出易受灾区域,下游洪水模拟重现了历史模式。提示对齐的代理为EO基础模型与操作检索之间提供了一个实用的桥梁,当完整对比训练不可行时。

英文摘要

Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. To allow natural language querying at global scales, we present GeoQuery, a zero-shot retrieval system that sidesteps data and compute constraints through a two-stage semantic and visual search, leveraging a natural language embedding of a subset (proxy) of global data. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings. On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6% accuracy within 50 km, with the strongest performance on floods (50% within 50 km) where terrain features are well captured by RGB embeddings. Deployed within a crisis response system called ECHO, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
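
The two-stage query resolution can be sketched with toy vectors. Everything in this block is an assumption made for illustration: the embeddings are hand-written stand-ins rather than CLAY or text-model outputs, cosine similarity is an assumed metric, only the single best proxy tile is carried into stage two, and the name `two_stage_search` is hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def two_stage_search(query_text_emb: Vector,
                     proxy_text_embs: Dict[str, Vector],
                     proxy_visual_embs: Dict[str, Vector],
                     global_visual_embs: Dict[str, Vector],
                     top_k: int = 3) -> List[str]:
    """Stage 1: text search over the proxy subset.
    Stage 2: visual nearest-neighbour search over all tiles."""
    # Stage 1: proxy tile whose description embedding best matches the query.
    best_proxy = max(
        proxy_text_embs,
        key=lambda tile: cosine(query_text_emb, proxy_text_embs[tile]),
    )
    # Stage 2: use that tile's visual embedding as the anchor for global search.
    anchor = proxy_visual_embs[best_proxy]
    ranked = sorted(
        global_visual_embs,
        key=lambda tile: cosine(anchor, global_visual_embs[tile]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical toy index with two proxy tiles and three global tiles.
proxy_text = {"p_flood": [1.0, 0.0], "p_fire": [0.0, 1.0]}
proxy_visual = {"p_flood": [1.0, 0.0, 0.0], "p_fire": [0.0, 1.0, 0.0]}
global_visual = {"g1": [0.9, 0.1, 0.0], "g2": [0.0, 1.0, 0.0], "g3": [0.7, 0.0, 0.7]}
hits = two_stage_search([0.9, 0.1], proxy_text, proxy_visual, global_visual, top_k=2)
```

The design point this illustrates is that text is only ever compared against the small proxy subset, while the global archive is searched purely in the frozen visual-embedding space.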

2605.04128 2026-05-21 cs.GR cs.AI cs.CL cs.CV cs.LG 版本更新

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

JoyAI-Image: 激活统一多模态理解和生成中的空间智能

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

发表机构 * Joy Future Academy, JD(joy未来学院,京东)

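AI summary This paper presents JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. The model couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), letting perception and generation interact through a shared multimodal interface. A scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments show that JoyAI-Image achieves state-of-the-art or highly competitive performance on understanding, generation, long-text rendering, and editing benchmarks. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning moves the model beyond general visual competence toward stronger spatial intelligence.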

Comments Code: https://github.com/jd-opensource/JoyAI-Image

详情
AI中文摘要

我们提出了JoyAI-Image,一种统一的多模态基础模型,用于视觉理解、文本到图像生成和指令引导的图像编辑。JoyAI-Image将空间增强的多模态大语言模型(MLLM)与多模态扩散Transformer(MMDiT)结合,允许感知和生成通过共享的多模态接口进行交互。围绕此架构,我们构建了一个可扩展的训练配方,结合了统一指令微调、长文本渲染监督、空间 grounded 数据以及通用和空间编辑信号。该设计使模型具备广泛的多模态能力,同时增强了几何感知推理和可控视觉合成。在理解、生成、长文本渲染和编辑基准上的实验表明,JoyAI-Image实现了最先进的或高度竞争的性能。更重要的是,增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力,向更强的空间智能发展。这些结果表明,统一视觉模型在下游应用如视觉-语言-动作系统和世界模型中具有前景。

英文摘要

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

2604.27505 2026-05-21 cs.CV 版本更新

Leveraging Verifier-Based Reinforcement Learning in Image Editing

利用基于验证器的强化学习进行图像编辑

Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang

发表机构 * School of Computing and Data Science, The University of Hong Kong(计算与数据科学学院,香港大学) Center for Embodied AI and Computer Vision, Shenzhen Loop Area Institute(具身人工智能与计算机视觉中心,深圳Loop Area研究院)

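AI summary This paper proposes Edit-R1, a framework that addresses the lack of a robust reward model for image editing by building a reasoning-based verifier reward model (RRM). The RRM breaks an instruction into distinct principles and evaluates the edited image against each one to produce a fine-grained reward. Experiments show that it outperforms existing models as an editing-specific reward model.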

详情
AI中文摘要

尽管强化学习从人类反馈(RLHF)已成为文本到图像生成的关键范式,但其在图像编辑中的应用仍鲜有研究。关键瓶颈在于缺乏适用于所有编辑任务的稳健通用奖励模型。现有编辑奖励模型通常仅提供总体评分而无详细检查,忽视了不同指令要求,导致奖励偏差。为此,我们主张从简单的评分器转向推理验证器。我们引入Edit-R1框架,构建基于推理链(CoT)的验证器奖励模型(RRM)并用于下游图像编辑。Edit-RRM将指令分解为不同的原则,将编辑后的图像与每个原则进行评估,并将这些检查汇总成可解释、细粒度的奖励。为了构建此类RRM,我们首先应用监督微调(SFT)作为“冷启动”生成CoT奖励轨迹。然后,我们引入组对比偏好优化(GCPO),一种利用人类配对偏好数据强化点状RRM的强化学习算法。在构建RRM后,我们使用GRPO训练编辑模型,利用此非可微但强大的奖励模型。大量实验表明,我们的Edit-RRM在图像编辑特定任务中优于强大的VLMs如Seed-1.5-VL和Seed-1.6-VL,并观察到明显的扩展趋势,性能从3B到7B参数持续提升。此外,Edit-R1为编辑模型如FLUX.1-kontext带来增益,凸显了其在提升图像编辑任务中的有效性。

英文摘要

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.

2604.27375 2026-05-21 cs.CV 版本更新

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

VeraRetouch: 一个轻量级的全微分框架用于多任务推理照片修复

Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan

发表机构 * Zhejiang University(浙江大学) vivo BlueImage Lab(vivo BlueImage实验室) University of Chinese Academy of Sciences(中国科学院大学) Zhejiang Lab(浙江实验室)

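AI summary This paper proposes VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. It uses a 0.5B vision-language model and a fully differentiable retouch renderer to enable end-to-end pixel-level training, and introduces the AetherRetouch-1M+ dataset and the DAPO-AE reinforcement learning strategy to improve retouching performance and generalization.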

详情
AI中文摘要

推理照片修复已获得显著关注,要求模型分析图像缺陷、提供推理过程并执行精确的修复增强。然而,现有方法常依赖非微分的外部软件,导致优化障碍,并存在参数冗余和泛化能力有限的问题。为解决这些问题,我们提出了VeraRetouch,一个轻量级且全微分的多任务照片修复框架。我们采用一个0.5B视觉-语言模型(VLM)作为核心智能,根据指令和场景语义制定修复计划。此外,我们开发了一个全微分的修复渲染器,取代外部工具,通过解耦控制潜在变量实现直接端到端的像素级训练。为克服数据稀缺,我们引入了AetherRetouch-1M+,第一个百万级的专业修复数据集,通过新的逆降级工作流程构建。此外,我们提出DAPO-AE,一种强化学习后训练策略,以增强自主审美认知。大量实验表明,VeraRetouch在多个基准上实现了最先进的性能,同时保持显著更小的模型规模,支持移动部署。我们的代码和模型已公开在https://github.com/OpenVeraTeam/VeraRetouch。

英文摘要

Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.

2604.21060 2026-05-21 cs.CV 版本更新

Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images

基于临床信息的儿童脑肿瘤全切片病理图像分类建模

Joakim Nguyen, Jian Yu, Jinrui Fang, Nicholas Konz, Tianlong Chen, Sanjay Krishnan, Chandra Krishnan, Ying Ding, Hairong Wang, Ankita Shukla

发表机构 * Dept. of Computer Science(计算机科学系) University of Texas at Austin(德克萨斯大学奥斯汀分校) School of Information(信息学院) Dell Children's Medical Center(德尔儿童医学中心) Dept. of OREI(OREI部门) University of Nevada, Reno(内华达大学里诺分校)

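AI summary: This paper proposes a clinically informed contrastive learning framework that improves fine-grained classification of pediatric brain tumor whole-slide images under limited data and class imbalance.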

Comments Accepted at the IEEE International Conference on Healthcare Informatics (ICHI), 2026

详情
AI中文摘要

准确诊断儿童脑肿瘤,从组织病理学开始,对深度学习提出了独特的挑战,包括严重的数据稀缺性、类别不平衡以及不同诊断亚型之间细微的形态学重叠。尽管病理基础模型在片段级表示学习方面取得了进展,但其在有限数据下有效适应弱监督的儿童脑肿瘤分类仍待探索。在本文中,我们引入了一种专家指导的对比微调框架,用于从全切片图像(WSI)中进行儿童脑肿瘤诊断。我们的方法将对比学习整合到滑动级别的多实例学习(MIL)中,以在下游微调过程中显式正则化滑动级别的表示几何。我们提出了一个通用的监督对比设置以及一个结合临床信息的硬负样本变体,旨在针对诊断上易混淆的亚型。通过在现实中的低样本和类别不平衡条件下对儿童脑肿瘤WSI分类进行全面实验,我们证明对比微调在细粒度诊断区分上产生了可测量的改进。我们的实验分析揭示了不同对比策略之间的互补优势,专家指导的硬负样本促进了更紧凑的类内表示和改进的类间分离。本文强调了在数据稀缺的儿童病理学设置中显式塑造滑动级别表示对于鲁棒细粒度分类的重要性。

英文摘要

Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.
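As a rough illustration of the supervised contrastive objective described in the abstract above, the following Python sketch computes a SupCon-style loss over slide-level embeddings. The `confusable` set and the `hard_weight` factor are assumptions introduced here to mimic clinically informed hard negatives; the abstract does not say how the authors weight or sample them.

```python
import math

def supcon_loss(embeddings, labels, temperature=0.1,
                confusable=None, hard_weight=2.0):
    """SupCon-style loss over slide embeddings (illustrative sketch).

    confusable: set of frozenset({label_a, label_b}) pairs treated as hard
    negatives. Both this argument and hard_weight are hypothetical.
    """
    confusable = confusable or set()
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    def sim(i, j):
        return sum(a * b for a, b in zip(normed[i], normed[j])) / temperature

    total, anchors = 0.0, 0
    n = len(labels)
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        denom = 0.0
        for j in range(n):
            if j == i:
                continue
            pair = frozenset((labels[i], labels[j]))
            # Up-weight negatives from confusable subtype pairs (assumption).
            weight = hard_weight if pair in confusable else 1.0
            denom += weight * math.exp(sim(i, j))
        total += -sum(sim(i, p) - math.log(denom) for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

Up-weighting a confusable pair enlarges the denominator for those anchors, so the loss pushes their embeddings apart more strongly than ordinary negatives.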

2604.12239 2026-05-21 cs.CV eess.IV 版本更新

Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

基于标准化车牌字体的单目车辆距离估计

Manognya Lokesh Reddy, Zheng Liu

Affiliations: Department of Computer and Information Science, University of Michigan-Dearborn; Department of Industrial and Manufacturing Systems Engineering, University of Michigan-Dearborn

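AI summary: This paper proposes a vehicle distance estimation method that uses the standardized typography of United States license plates as passive fiducial markers. Explicit geometric priors resolve scale ambiguity without training data or active illumination, yielding robust estimates of distance and relative velocity for collision warning.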

Comments 21 pages, 12 figures

详情
AI中文摘要

准确的车辆间距离估计是高级驾驶辅助系统(ADAS)和自动驾驶的核心。尽管LiDAR和雷达提供高精度,但其高成本限制了在大众市场车辆中的广泛应用。基于单目相机的估计提供了低成本的替代方案,但存在根本性的尺度模糊问题。最近的单目深度学习方法取得了显著成果,但需要昂贵的监督训练,存在领域偏移,并且生成的预测难以在安全关键部署中认证。本文提出了一种框架,利用美国标准化车牌字体作为被动标记进行度量测距,通过显式的几何先验知识解决尺度模糊问题,无需任何训练数据或主动照明。首先,一个四方法并行车牌检测器在全汽车照明范围内实现了稳健的车牌阅读。其次,一个三阶段状态识别引擎融合光学字符识别文本匹配、多设计颜色评分和轻量级神经网络分类器,在所有环境条件下提供稳健的识别。第三,混合深度融合与逆方差加权和在线尺度对齐,结合一维常速卡尔曼滤波器,提供平滑的距离、相对速度和时间到碰撞用于碰撞预警。在受控静态数据集上的基线验证重现了字符高度测量的2.3%系数变异和与先前工作中的车牌宽度方法相比距离估计方差减少了36%。

英文摘要

Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing optical character recognition text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation on a controlled static dataset reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work.
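The geometric idea in the abstract above can be sketched in a few lines of Python: a pinhole-camera range from a known character height, inverse-variance fusion of several range estimates, and a time-to-collision value. This is a minimal sketch under stated assumptions, not the authors' pipeline. The function names are hypothetical, and the real character height must come from the relevant plate standard (the 0.07 m in the usage note is only a placeholder).

```python
def pinhole_distance(focal_px, real_height_m, pixel_height):
    """Range from the pinhole model: Z = f * H / h (f and h in pixels)."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height

def fuse_inverse_variance(estimates):
    """Fuse (value, variance) pairs; each weight is 1 / variance."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    fused = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    return fused, 1.0 / total  # fused value and its variance

def time_to_collision(distance_m, closing_speed_mps):
    """Seconds until contact; infinite when the gap is not closing."""
    if closing_speed_mps <= 0:
        return float("inf")
    return distance_m / closing_speed_mps
```

For example, `pinhole_distance(1000, 0.07, 7)` gives a range of about 10 m under the placeholder character height. In the fusion step, the less certain estimate contributes proportionally less.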

2604.11530 2026-05-21 cs.CV cs.AI 版本更新

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

超越注意力分数:基于SVD的视觉令牌修剪用于高效视觉-语言模型

Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

发表机构 * anoncvlab(匿名计算机视觉实验室)

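AI summary: This paper proposes SVD-Prune, which decomposes the vision token feature matrix with SVD and selects the top-k tokens by statistical leverage score, maintaining strong performance under extreme vision token budgets and outperforming prior pruning methods.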

详情
AI中文摘要

视觉-语言模型(VLMs)通过联合处理视觉和文本信息革新了多模态学习。然而,由于处理长序列视觉令牌的高计算和内存需求,它们面临显著挑战。许多现有方法依赖于局部启发式方法,如注意力分数或令牌范数。然而,这些标准存在位置偏见和信息分散的问题,限制了它们在高修剪比率下保留本质内容的能力,导致在视觉细节丰富的图像上性能下降。为了解决这些问题,我们提出了SVD-Prune,一种训练免费、即插即用的令牌修剪方法,基于奇异值分解。它分解视觉令牌特征矩阵,并利用统计杠杆分数选择顶级令牌,确保仅保留对主导全局方差贡献最大的令牌。实验表明,SVD-Prune在极端视觉令牌预算下始终优于现有修剪方法,即使在32和16个视觉令牌的情况下也能保持强劲性能。

英文摘要

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-k tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.
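To make the selection rule in the abstract above concrete, here is a small NumPy sketch of leverage-score token selection. It is an illustration under assumptions rather than the SVD-Prune implementation: the `rank` parameter, the absence of centering, and the function name `svd_prune` are choices made here, since the abstract does not give these details.

```python
import numpy as np

def svd_prune(tokens, keep, rank):
    """Keep the `keep` tokens with the largest rank-`rank` leverage scores.

    tokens: array of shape (n_tokens, dim).
    Returns the kept token indices (sorted) and all leverage scores.
    """
    # Thin SVD: rows of U describe each token in the singular basis.
    U, S, Vt = np.linalg.svd(tokens, full_matrices=False)
    # Leverage score of token i = squared norm of row i of the top-`rank`
    # left singular vectors, i.e. its share of the dominant subspace.
    leverage = np.sum(U[:, :rank] ** 2, axis=1)
    top = np.argsort(-leverage)[:keep]
    return np.sort(top), leverage
```

The leverage scores over a rank-r subspace sum to r, so they can be read as how the dominant directions are distributed across tokens.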

2604.11071 2026-05-21 cs.CV cs.AI cs.LG 版本更新

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

轻量级低光照图像增强 via 分布归一化预处理和深度卷积U-Net

Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi

发表机构 * Sony Semiconductor Solutions Corporation(索尼半导体解决方案公司)

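AI summary: This paper proposes a lightweight two-stage framework for low-light image enhancement that combines distribution-normalizing preprocessing with a depthwise U-Net, using far fewer parameters than existing methods while achieving competitive perceptual quality.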

Comments Technical report for the NTIRE 2026 Efficient Low-Light Image Enhancement Challenge (CVPR 2026 Workshops), 3rd place solution

详情
AI中文摘要

我们提出了一种轻量级两阶段框架,用于低光照图像增强(LLIE),该框架在参数远少于现有方法的情况下实现了具有竞争力的感知质量。我们的方法结合了冻结算法的预处理与一个完全由深度卷积构成的紧凑型U-Net。预处理通过提供互补的亮度校正视图来归一化输入分布,使可训练网络能够专注于残差颜色校正。我们的方法在CVPR 2026 NTIRE高效低光照图像增强挑战中获得了第三名。我们进一步提供了扩展的基准测试和消融实验以证明我们方法的通用有效性。

英文摘要

We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 3rd place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.
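The parameter savings from depthwise-separable convolutions mentioned above follow from simple counting, sketched below in Python. The layer sizes used in the note are arbitrary examples, not values taken from the paper.

```python
def conv_params(k, c_in, c_out, bias=False):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def depthwise_separable_params(k, c_in, c_out, bias=False):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise
```

For a 3 x 3 layer mapping 64 to 128 channels, the standard convolution has 73,728 weights and the separable version has 8,768, roughly an 8.4x reduction.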

2604.10245 2026-05-21 cs.CV physics.med-ph 版本更新

Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

基于迭代3D/2D肝脏配准的预热启动强化学习

Hanyuan Zhang, Lucas He, Zijie Cheng, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson

发表机构 * UCL Hawkes Institute, University College London(伦敦大学学院UCL霍克斯研究所) Medtronic plc.(美敦力公司)

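AI summary: This paper proposes a discrete-action reinforcement learning framework for registering preoperative CT to intraoperative laparoscopic video. The feature encoder is warm-started from a supervised pose estimation network, which provides stable geometric features and faster convergence for more efficient registration.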

Comments Laparoscopic Liver Surgery, Augmented Reality, Image Registration, Reinforcement Learning

详情
AI中文摘要

术前CT与术中腹腔镜视频之间的配准在增强现实(AR)导航中微创手术中起着关键作用。基于学习的方法最近在配准误差上实现了与基于优化的方法相当的性能,同时提供了更快的推理速度。然而,许多监督方法会产生粗略的对齐,需要额外的基于优化的细化,从而增加推理时间。我们提出了一种离散动作强化学习(RL)框架,将CT到视频的配准视为一个序列决策过程。一个共享的特征编码器从CT渲染和腹腔镜帧中提取表示,通过从监督姿态估计网络预热启动以提供稳定的几何特征和更快的收敛。一个RL策略头学习在六个自由度上选择刚体变换,并决定何时停止迭代。在公开的腹腔镜数据集上的实验表明,我们的方法实现了平均目标配准误差(TRE)为15.70毫米,与监督方法结合优化相当,同时实现了更快的收敛。所提出的基于RL的公式使自动、高效的迭代配准成为可能,而无需手动调整步长或停止标准。这种离散框架为未来连续动作和可变形配准模型在手术AR应用中的实际基础提供了支持。

英文摘要

Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.
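The sequential formulation above can be pictured as a loop over a small discrete action set: a plus or minus step on each of six degrees of freedom, plus a stop action. The Python sketch below is only illustrative. The step size, the additive treatment of rotation parameters, and the `oracle_policy` stand-in (which knows the target pose) are all assumptions; in the paper the actions come from a learned RL policy head.

```python
STEP = 1.0  # hypothetical step size per action (mm or degrees)

# Twelve signed unit steps along tx, ty, tz, rx, ry, rz, plus "stop".
ACTIONS = [(axis, sign * STEP) for axis in range(6) for sign in (1, -1)] + ["stop"]

def register(pose, policy, max_iters=200):
    """Apply discrete pose updates until the policy chooses to stop."""
    pose = list(pose)
    for iteration in range(max_iters):
        action = policy(pose)
        if action == "stop":
            return pose, iteration
        axis, delta = action
        pose[axis] += delta  # simplification: parameters are updated additively
    return pose, max_iters

def oracle_policy(target):
    """Stand-in for a learned policy: step along the worst axis, then stop."""
    def policy(pose):
        errors = [t - p for t, p in zip(target, pose)]
        axis = max(range(6), key=lambda i: abs(errors[i]))
        if abs(errors[axis]) < STEP / 2:
            return "stop"
        return (axis, STEP if errors[axis] > 0 else -STEP)
    return policy
```

Because stopping is itself an action, no hand-tuned convergence threshold is needed at inference time once the policy is learned.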

2604.06750 2026-05-21 cs.CV 版本更新

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

视觉-语言模型对连续驾驶场景的理解有多好?一项敏感性研究

Roberto Brusnicki, Mattia Piccinini, Johannes Betz

Affiliations: Professorship of Autonomous Vehicle Systems; TUM School of Engineering and Design

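AI summary: Through a systematic sensitivity analysis of vision-language models (VLMs) on sequential driving scenes, this paper shows how input configurations affect model performance. Even top models reach only 57% accuracy, below the 65% achieved by humans under similar constraints, exposing clear gaps in understanding vehicle dynamics and temporal relations.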

Comments 8 pages, 5 figures

详情
AI中文摘要

视觉-语言模型(VLMs)越来越多地被提出用于自动驾驶任务,但它们在连续驾驶场景中的性能仍然缺乏充分的描述,尤其是在输入配置如何影响其能力方面。我们介绍了VENUSS(VLM评估在理解连续场景上),这是一个用于系统敏感性分析VLM在连续驾驶场景中性能的框架,为未来研究建立了基准。基于现有数据集,VENUSS从驾驶视频中提取时间序列,并在自定义类别中生成结构化评估。通过在2,600多个场景上比较25多个现有VLMs,我们揭示了即使顶级模型也只能达到57%的准确率,这并未达到在相似约束下人类的65%表现,并暴露了显著的能力差距。我们的分析表明,VLMs在静态物体检测方面表现优异,但在理解车辆动态和时间关系方面则存在困难。VENUSS提供了首次系统性的VLM敏感性分析,专注于输入图像配置(如分辨率、帧数、时间间隔、空间布局和展示模式)如何影响连续驾驶场景中的性能。补充材料可在https://TUM-AVS.github.io/VENUSS/上获得。

英文摘要

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://TUM-AVS.github.io/VENUSS/.

2604.01341 2026-05-21 cs.CV q-bio.NC 版本更新

Perceptual misalignment of texture representations in convolutional neural networks

卷积神经网络中纹理表示的感知偏差

Ludovica de Paolis, Fabio Anselmi, Alessio Ansuini, Eugenio Piasini

Affiliations: Neuroscience Area, International School for Advanced Studies (SISSA), Trieste, Italy; Department of Mathematics, Informatics and Geosciences, Università degli Studi di Trieste, Trieste, Italy; Department of Data Engineering, Area Science Park, Trieste, Italy

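AI summary: This paper studies how well texture representations in convolutional neural networks align with human perceptual content. It finds no connection between conventional measures of a CNN's quality as a model of the visual system and its alignment with human texture perception, suggesting that texture perception involves mechanisms distinct from those captured by CNNs trained on object recognition.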

详情
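AI abstract (translated from Chinese)

The mathematical modeling of visual textures dates back to Julesz's intuition that human texture perception rests on local correlations between image features. An influential approach to texture analysis and generation extends this notion to linear correlations between the nonlinear features of convolutional neural networks (CNNs), which are compiled into Gram matrices. Since CNNs are often used as models of the visual system, it is natural to ask whether these "texture representations" align with the perceptual content of textures, and whether CNNs regarded as better models of the visual system also have more human-like texture representations. The study quantifies the perceptual content captured by feature correlations across a diverse pool of CNNs and compares it with each model's Brain-Score alignment with the mammalian visual system. It finds no connection between the two, and concludes that texture perception involves mechanisms distinct from those modeled by CNNs trained on object recognition, possibly depending on the integration of contextual information.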

英文摘要

Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we quantify the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
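The Gram-matrix texture descriptor referred to above is short enough to write out. The Python sketch below computes channel-by-channel feature correlations for one layer. Normalizing by the number of spatial positions is a common convention assumed here, not something the abstract specifies.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of one layer's feature maps.

    features: list of C channels, each a flat list of N spatial activations.
    Returns a C x C matrix with G[i][j] = sum_n F[i][n] * F[j][n] / N.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]
```

Because the sum runs over spatial positions, the result discards where features occur and keeps only which features co-occur, which is what makes it a texture statistic.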

2603.27747 2026-05-21 cs.CV cs.AI 版本更新

AI-Powered Facial Mask Removal Is Not Suitable For Identification

基于AI的面部遮挡去除并不适合识别

Emily A Cooper, Hany Farid

Affiliations: Herbert Wertheim School of Optometry & Vision Science, University of California, Berkeley; School of Information, University of California, Berkeley

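AI summary: This paper evaluates the efficacy and risks of AI-powered facial unmasking, assessing whether the resulting faces can be reliably matched to true identities.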

详情
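AI abstract (translated from Chinese)

Crowd-sourced online criminal investigations have recently used generative AI to enhance low-quality visual evidence. In one high-profile case, social media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, which fueled a widespread misidentification. Prompted by this and similar incidents, the authors carried out a large-scale analysis of the efficacy and risks of commercial AI-powered facial unmasking, focusing on whether the resulting faces can be reliably matched to true identities.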

英文摘要

Recently, crowd-sourced online criminal investigations have used generative AI to enhance low-quality visual evidence. In one high-profile case, social media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a widespread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.

2603.27309 2026-05-21 cs.GR cs.CV 版本更新

MeshTailor: Cutting Seams via Generative Mesh Traversal

MeshTailor: 通过生成网格遍历进行剪裁缝线

Xueqi Ma, Xingguang Yan, Congyue Zhang, Hui Huang

Affiliations: Shenzhen University; Simon Fraser University

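AI summary: This paper presents MeshTailor, the first mesh-native generative framework for synthesizing edge-aligned seams on 3D surfaces. Unlike prior optimization-based or extrinsic learning-based methods, MeshTailor operates directly on the mesh graph, eliminating projection artifacts and fragile snapping heuristics. It introduces ChainingSeams, a hierarchical serialization of the seam graph that orders chains coarse-to-fine from global structural cuts down to local details, together with a dual-stream encoder that fuses topological and geometric context. Using this representation, the MeshTailor Transformer traces seams vertex by vertex within local neighborhoods through an autoregressive pointer layer. Evaluations show more coherent and structurally regular seam layouts than recent optimization-based and learning-based baselines.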

详情
AI中文摘要

我们提出了MeshTailor,首个基于网格的生成框架,用于在3D表面上合成边缘对齐的缝线。与以往基于优化或外在学习的方法不同,MeshTailor直接在网格图上操作,消除了投影伪影和脆弱的 snapping 策略。我们引入ChainingSeams,一种层次化的缝线图序列化,按从全局结构切割到局部细节的粗到细方式对链进行排序,并引入了双流编码器以融合拓扑和几何上下文。利用这种层次化表示和双流顶点嵌入,我们的MeshTailor Transformer 使用自回归指针层在局部邻域内逐顶点追踪缝线。广泛的评估表明,与最近的基于优化和学习的基线相比,MeshTailor生成的缝线布局更加连贯和结构规整。

英文摘要

We present MeshTailor, the first mesh-native generative framework for synthesizing edge-aligned seams on 3D surfaces. Unlike prior optimization-based or extrinsic learning-based methods, MeshTailor operates directly on the mesh graph, eliminating projection artifacts and fragile snapping heuristics. We introduce ChainingSeams, a hierarchical serialization of the seam graph that orders chains from global structural cuts down to local details in a coarse-to-fine manner, and a dual-stream encoder that fuses topological and geometric context. Leveraging this hierarchical representation and dual-stream vertex embeddings, our MeshTailor Transformer utilizes an autoregressive pointer layer to trace seams vertex-by-vertex within local neighborhoods. Extensive evaluations show that MeshTailor produces more coherent and structurally regular seam layouts compared to recent optimization-based and learning-based baselines.

2603.14184 2026-05-21 cs.CV cs.AI 版本更新

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

更深入的思考,更弱的目标:理解并缓解多模态大语言模型推理过程中感知障碍

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Huawei Technologies(华为技术)

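AI summary: This paper studies the visual perceptual impairment that multimodal large language models exhibit during reasoning and proposes a training-free Visual Region-Guided Attention framework that selects visual heads and reweights their attention to steer the model toward question-relevant regions, improving visual grounding and reasoning accuracy.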

详情
AI中文摘要

多模态大语言模型(MLLMs)在进行扩展推理模式时常常出现感知障碍,特别是在视觉问答(VQA)任务中。我们识别出注意力分散是根本原因:在多步推理过程中,模型的视觉注意力变得分散并远离与问题相关区域,实际上“失去焦点”于视觉输入。为了更好地理解这一现象,我们分析了MLLMs的注意力图,并观察到推理提示显著减少了回答问题关键区域的注意力。我们进一步发现模型对图像标记的总体注意力与图像内注意力的空间分散性之间存在强相关性。基于这一见解,我们提出了一个无需训练的视觉区域引导注意力(VRGA)框架,该框架根据熵-聚焦准则选择视觉头部并重新加权其注意力,从而有效引导模型在推理过程中关注与问题相关区域。在视觉-语言基准上的广泛实验表明,我们的方法有效缓解了感知退化,从而在视觉定位和推理准确性方面取得改进,同时提供了可解释的见解,说明MLLMs如何处理视觉信息。

英文摘要

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.
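One way to picture an entropy-based head selection followed by attention reweighting is sketched below in Python. The abstract does not define the entropy-focus criterion, so the rule used here (pick the lowest-entropy heads, then multiply attention inside a region mask by a gain and renormalize) is an assumption made for illustration. All function names and the `gain` parameter are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy of one head's attention over image tokens."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_focused_heads(head_attention, k):
    """Indices of the k heads whose image attention is least dispersed."""
    ranked = sorted(range(len(head_attention)),
                    key=lambda h: attention_entropy(head_attention[h]))
    return sorted(ranked[:k])

def reweight(weights, region_mask, gain=2.0):
    """Boost attention on masked (question-relevant) tokens, then renormalize."""
    boosted = [w * (gain if m else 1.0) for w, m in zip(weights, region_mask)]
    total = sum(boosted)
    return [b / total for b in boosted]
```

A uniform head has the maximum entropy, log(N) for N tokens, so dispersed attention of the kind the abstract describes shows up as entropy values close to that bound.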

2603.08235 2026-05-21 cs.CV cs.AI 版本更新

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

探索深度学习与超宽场成像用于糖尿病视网膜病变和黄斑水肿

Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo González de Rivera, Ruben Vera-Rodriguez, Julian Fierrez

Affiliations: BiometricsAI, Universidad Autónoma de Madrid, Madrid, Spain; Department of Mathematics, Universidad de Las Palmas de Gran Canaria, Spain; HCTLab Research Group, Universidad Autónoma de Madrid, Madrid, Spain

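AI summary: This paper studies deep learning with ultra-widefield imaging for detecting diabetic retinopathy and macular edema, benchmarking multiple deep learning models on a public dataset and exploring the potential of feature-level fusion and frequency-domain representations.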

Comments 6 pages, 4 figures, 2 tables

详情
AI中文摘要

糖尿病视网膜病变(DR)和糖尿病黄斑水肿(DME)是导致成年劳动力失明的主要原因之一。传统方法主要依赖标准彩色视网膜摄影(CFP)进行检测。然而,最近的超宽场成像(UWF)相比CFP提供了更宽的视野。受此启发,本文探讨了最新深度学习(DL)方法和UWF成像在三个临床相关任务上的应用:i)UWF图像质量评估,ii)可参考糖尿病视网膜病变(RDR)的识别,iii)DME的识别。使用公开的UWF4DR挑战数据集(作为MICCAI 2024会议的一部分发布),我们评估了DL模型在空间(RGB)和频域中的表现,包括流行的卷积神经网络(CNNs)以及最近的视觉变换器(ViTs)和基础模型。此外,我们还探索了最终的特征级融合以提高鲁棒性。最后,我们还利用Grad-CAM分析DL模型的决策,提高可解释性。我们的方法在所有架构中均实现了稳定强劲的性能,凸显了新兴ViTs和基础模型的竞争力,以及特征级融合和频域表示在UWF分析中的潜力。

英文摘要

Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

2602.24138 2026-05-21 cs.CV cs.AI 版本更新

Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

多模态最优传输用于手术机器人中的无训练时序分割

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

发表机构 * Dept. of Robotics, Mohamed bin Zayed University of AI(机器人系,Mohamed bin Zayed人工智能大学)

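AI summary: This paper proposes TASOT, an annotation-free framework for surgical temporal segmentation that fuses visual cues with temporally aligned textual descriptions within a unified unbalanced Gromov-Wasserstein optimal transport objective, achieving substantial improvements on multiple public surgical datasets.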

详情
AI中文摘要

自动化识别手术阶段和步骤是机器人辅助手术中术中决策支持、工作流自动化和技能评估的基本能力。现有方法要么依赖大规模标注手术数据集,要么需要昂贵的领域特定预训练,这限制了它们在不同机器人平台和临床环境中的实际部署。在本文中,我们提出TASOT(文本增强的动作分割最优传输),一种无需任务特定标注或手术领域预训练的手术时序分割框架。TASOT扩展了动作分割最优传输(ASOT)公式,通过结合直接从输入视频生成的时间对齐文本描述,在统一的不平衡Gromov-Wasserstein最优传输目标下融合视觉和语义线索。视觉表示使用DINOv3提取,而由视觉-语言模型生成的时间描述通过CLIP编码并时间对齐到单个帧,为传输成本提供互补的语义结构。我们在三个公开手术数据集和四个基准设置上评估了TASOT,涵盖腹腔镜和机器人手术程序,显示出显著优于最强的零样本基线:在Cholec80上+18.9 F1,在AutoLaparo上+33.7,在StrasByPass70上+23.7,在BernByPass70上+4.5。这些结果表明,在机器人环境中可以实现细粒度的手术工作流理解,而无需手动训练标注或手术特定的预训练流程,为实际的机器人手术系统提供了一种有前景的替代方案。

英文摘要

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.
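For readers unfamiliar with optimal transport as a segmentation tool, the Python sketch below assigns frames to action clusters with plain entropic (Sinkhorn) transport over a fused cost. This is deliberately a simpler, swapped-in technique: TASOT uses an unbalanced Gromov-Wasserstein objective, which is not reproduced here. The linear fusion weight `lam` and both function names are assumptions.

```python
import math

def fuse_costs(visual, text, lam=0.5):
    """Blend visual and text cost matrices (frames x actions); lam is assumed."""
    return [[(1 - lam) * cv + lam * ct for cv, ct in zip(rv, rt)]
            for rv, rt in zip(visual, text)]

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Balanced entropic OT plan between frame masses a and action masses b."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Taking the argmax of each row of the returned plan gives a per-frame label. The marginal constraint on `b` is what discourages assigning every frame to a single action.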

2602.18532 2026-05-21 cs.CV cs.AI cs.RO 版本更新

VLANeXt: Recipes for Building Strong VLA Models

VLANeXt: 构建强大VLA模型的配方

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

发表机构 * S-Lab, Nanyang Technological University(南洋理工大学S实验室) SenseTime Research(商汤研究) Sun Yat-sen University(中山大学) Shanghai Jiao Tong University(上海交通大学)

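AI summary: This paper reexamines the VLA design space under a unified framework and evaluation setup, systematically analyzing foundational components, perception essentials, and action modelling perspectives. It distills 12 key findings, proposes VLANeXt, a simple yet effective VLA model that outperforms existing methods on the LIBERO and LIBERO-plus benchmarks, and releases an easy-to-use codebase.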

Comments Accepted in ICML 2026, Project Page: https://dravenalg.github.io/VLANeXt/

详情
AI中文摘要

在大基础模型兴起之后,视觉-语言-动作模型(VLAs)应运而生,利用视觉语言模型的强大视觉和语言理解能力进行通用目的策略学习。然而,当前VLA领域仍处于碎片化和探索阶段。尽管许多团队提出了各自的VLA模型,但训练协议和评估设置的一致性不足,使得难以确定哪些设计选择真正重要。为了使这一发展领域更具结构化,我们重新审视VLA设计空间,基于类似RT-2的简单VLA基线,系统地分析了三个维度:基础组件、感知要素和动作建模视角。从这项研究中,我们提炼出12项关键发现,共同构成了构建强大VLA模型的实用配方。该探索的成果是一种简单而有效的模型VLANeXt,它在LIBERO和LIBERO-plus基准测试中优于现有方法,并在现实世界实验中表现出色。我们还发布了一个统一且易于使用的代码库,以重现我们的发现、探索设计空间并基于共享基础开发新的VLA变体。代码库可在https://github.com/DravenALG/VLANeXt上获得。

英文摘要

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

2602.16608 2026-05-21 cs.CL cs.AI cs.CV cs.LG 版本更新

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

可解释的人工智能:面向Transformer模型的上下文感知分层集成梯度方法

Melkamu Abay Mersha, Jugal Kalita

Affiliations: College of Engineering and Applied Science, University of Colorado Colorado Springs

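AI summary: This paper proposes the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework for explaining the decisions of Transformer models. It computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that capture supportive and opposing evidence and trace the hierarchical flow of relevance through the Transformer layers.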

详情
AI中文摘要

Transformer模型在多个领域和任务中实现了最先进的性能,然而其深层表示使得预测难以解释。现有的可解释性方法依赖于最终层的属性,只能捕捉局部token级属性或全局注意力模式,缺乏对token间依赖关系和结构组件的上下文感知能力。它们还无法捕捉相关性如何在层之间演变以及结构组件如何影响决策。为了解决这些限制,我们提出了上下文感知分层集成梯度(CA-LIG)框架,一种统一的层次属性框架,该框架在每个Transformer块内计算分层集成梯度,并将这些token级属性与类特定的注意力梯度融合。这种整合产生了带有符号和上下文敏感性的属性图,能够捕捉支持和反对的证据,同时追踪Transformer层中的相关性层次流动。我们评估了CA-LIG框架在多样化的任务、领域和Transformer模型家族中的表现,包括使用BERT进行情感分析和长多类文档分类,使用XLM-R和AfroLM在低资源语言设置中进行仇恨言论检测,以及使用Masked Autoencoder Vision Transformer模型进行图像分类。在所有任务和架构中,CA-LIG提供了更忠实的属性,显示出对上下文依赖的更强敏感性,并产生了更清晰、更语义连贯的可视化结果,优于现有可解释性方法。这些结果表明,CA-LIG提供了更全面、上下文感知和可靠的Transformer决策解释,推动了深度神经网络的实用可解释性和概念理解。

英文摘要

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and Transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder Vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
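The building block named in the abstract above, Integrated Gradients, can be written compactly. The Python sketch below implements the plain input-level version with a midpoint Riemann sum. It does not reproduce the layer-wise computation or the fusion with attention gradients that CA-LIG adds, and `grad_fn` is a hypothetical callable standing in for a model's gradient.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over [0, 1] of df/dx_i.

    grad_fn(point) must return the gradient of the model output at `point`.
    The path is the straight line from `baseline` to `x`.
    """
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        grads = grad_fn(point)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grads)]
    return [d * a for d, a in zip(diffs, avg_grad)]
```

A useful sanity check is completeness: the attributions should sum to the difference between the model output at `x` and at the baseline. The attributions are signed, which is how supportive and opposing evidence can be told apart.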

2602.11499 2026-05-21 cs.CV 版本更新

What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

如果智能体能够想象?通过生成强化开放词汇人-物交互理解

Zhenlong Yuan, Yue Wang, Dapeng Zhang, Kejin Cui, Rui Chen, Jing Tang, Lei Sun, Hongwei Yu, Chengxuan Qian, Xiangxiang Chu, Shuo Li, Yuyin Zhou

发表机构 * Dream-X Team(Dream-X团队) Stanford University(斯坦福大学) National University of Singapore(新加坡国立大学) Independent Researcher(独立研究者) Case Western Reserve University(凯斯西储大学) UC Santa Cruz(加州大学圣克鲁兹分校)

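AI summary: This paper proposes ImagineAgent, a framework that uses generative world modeling and tool-augmented reinforcement learning to address cross-modal hallucinations and limited viewpoints in open-vocabulary human-object interaction understanding, achieving efficient and robust reasoning.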

详情
AI中文摘要

多模态大语言模型在连接视觉和文本推理方面展现出有前景的能力,但其在开放词汇人-物交互(OV-HOI)中的推理能力受到跨模态幻觉和图像视角有限的限制。为此,我们提出ImagineAgent,一种整合认知映射、工具增强强化学习(RL)和生成式世界建模的智能体框架,以实现稳健的OV-HOI理解。具体而言,我们首先提出一个创新的CoT数据集hicodet-6K用于监督微调(SFT),通过将感知实体结构化为交互对,有效弥合感知到认知的差距,实现全面预测。随后,我们开发了一个多模态工具库,集成了在线检索、图像裁剪和生成式建模,使智能体能够动态增强推理,利用领域特定工具解决推理中的视觉-语义模糊性和幻觉问题。此外,我们引入生成模型重建替代视角,使智能体能够在有限视角下进行‘想象’。最后,我们提出一个复合奖励机制,共同优化预测准确性和工具效率。在SWIG-HOI和HICO-DET数据集上的评估表明,我们的方法在仅需36.7%的训练数据相比现有方法的情况下实现了最先进的性能,验证了我们的鲁棒性、经验有效性和效率。

英文摘要

Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and limited viewpoints of images. To address this, we propose ImagineAgent, an agentic framework that integrates cognitive mapping, tool-augmented reinforcement learning (RL), and generative world modeling for robust OV-HOI understanding. Specifically, we first propose an innovative CoT dataset named hicodet-6K for supervised fine-tuning (SFT), which effectively bridges the perception-to-cognition gap by structuring perceived entities into interaction pairs for comprehensive predictions. Subsequently, we develop a multimodal tool library integrating online retrieval, image cropping, and generative modeling, enabling the agent to dynamically augment reasoning with domain-specific tools to resolve visual-semantic ambiguities and hallucinations during inference. Moreover, we incorporate a generative model to reconstruct alternative viewpoints, enabling the agent to 'imagine' under limited viewpoints. Finally, we propose a composite reward mechanism to jointly optimize prediction accuracy and tool efficiency. Evaluations on both SWIG-HOI and HICO-DET datasets demonstrate that our method achieves state-of-the-art performance while requiring merely 36.7% of the training data compared to existing methods, validating our robustness, empirical effectiveness and efficiency.

2602.06862 2026-05-21 cs.CV 版本更新

Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

参数作为专家:通过动态参数路由适应视觉模型

Meng Lou, Stanley Yu, Yizhou Yu

发表机构 * The University of Hong Kong(香港大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出ParaX方法,通过动态参数路由机制实现视觉模型的高效微调,以生成更定制化和强大的特征表示,从而在多种视觉识别任务中取得优越性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

利用参数高效微调(PEFT)来适应预训练视觉模型仍然具有挑战性,因为其目标是在少量可训练参数的情况下实现与完整微调相当的性能。当应用于复杂的密集预测任务时,现有方法存在局限性,包括输入无关的建模和冗余的跨层表示。为此,我们提出了ParaX,一种新的适配器式方法,其特征是简单的混合专家(MoE)架构。具体而言,我们引入了共享专家中心,其中每个专家都是可训练的参数矩阵。在前向传递过程中,网络中的每个ParaX模块通过简单的动态参数路由机制动态生成针对当前模块的权重矩阵,该机制选择性地聚合相应专家中心的参数矩阵。ParaX模块中的动态权重矩阵通过输入依赖的方式实现低秩适应,从而生成更加定制化和强大的特征表示。此外,由于多个网络层的ParaX模块共享相同的专家中心,它们通过促进隐含的跨层特征交互来提高特征多样性。广泛的实验结果表明,ParaX在多种视觉识别任务中均表现出色。代码已公开发布:https://github.com/LMMMEng/ParaX。

英文摘要

Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose ParaX, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each ParaX module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in ParaX modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since ParaX modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experimental results demonstrate the superiority of ParaX across diverse visual recognition tasks. Code is publicly released at: https://github.com/LMMMEng/ParaX.
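
The routing idea described above can be sketched as follows. This is a hedged illustration, assuming a softmax gate computed from the input and a plain weighted sum over the expert matrices; the actual ParaX router, its selection rule, and its low-rank structure are not specified in the abstract, and every name here (`router`, `experts`, `dynamic_layer`) is hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def route_weights(x, router):
    """Input-dependent gate (assumption): one score per expert, softmax-normalized."""
    return softmax(matvec(router, x))


def aggregate_experts(experts, gates):
    """Blend the shared expert matrices into one dynamic weight matrix."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(g * e[r][c] for g, e in zip(gates, experts)) for c in range(cols)]
        for r in range(rows)
    ]


def dynamic_layer(x, experts, router):
    """Apply the input-specific weight matrix to x; also return the gates."""
    gates = route_weights(x, router)
    weight = aggregate_experts(experts, gates)
    return matvec(weight, x), gates
```

Because the expert list is passed in rather than owned by the layer, several layers can call `dynamic_layer` with the same `experts` and their own `router`, which mirrors the shared expert center described in the abstract.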

2602.04876 2026-05-21 cs.CV 版本更新

PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

PerpetualWonder: 长时程动作条件4D场景生成

Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出PerpetualWonder,一种混合生成模拟器,能够从单张图像生成长时程、动作条件的4D场景。该方法通过引入真正的闭环系统,解决了现有方法因物理状态与视觉表示解耦而无法让生成式修正更新底层物理的问题,实现了对物理动态和外观的同时校正。

Comments Project website: https://johnzhan2023.github.io/PerpetualWonder/

详情
AI中文摘要

我们介绍了PerpetualWonder,一种混合生成模拟器,能够从单张图像生成长时程、动作条件的4D场景。现有工作无法完成此任务,因为其物理状态与视觉表示相互解耦,导致生成式修正无法更新底层物理以供后续交互使用。PerpetualWonder通过引入首个真正的闭环系统来解决这一问题。它采用一种新颖的统一表示,在物理状态与视觉基元之间建立双向链接,使生成式修正能够同时校正动态和外观。它还引入了一种稳健的更新机制,从多个视角收集监督以消除优化歧义。实验表明,从单张图像出发,PerpetualWonder能够成功模拟由长时程动作引发的复杂多步交互,并保持物理合理性和视觉一致性。

英文摘要

We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.

2602.01273 2026-05-21 cs.CV 版本更新

Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Q-DiT4SR: 面向真实世界图像超分辨率的细节保留扩散变换器量化探索

Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Huawei(华为)

AI总结 本文提出Q-DiT4SR,一种专门针对基于扩散变换器的真实世界图像超分辨率的后训练量化框架,通过引入层次化SVD和方差感知时空混合精度方法,在保持细节的同时实现高效的模型压缩和加速。

Comments Accepted to ICML 2026. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR

详情
AI中文摘要

近年来,扩散变换器(DiTs)已被用于真实世界图像超分辨率(Real-ISR),能够生成高质量的纹理,但其沉重的推理负担阻碍了实际部署。尽管后训练量化(PTQ)是一种有前景的加速方案,但现有超分辨率量化方法大多集中于U-Net架构,而通用的DiT量化通常针对文本到图像任务设计。直接将这些方法应用于基于DiT的超分辨率模型会导致局部纹理严重退化。因此,我们提出了Q-DiT4SR,这是首个专门针对基于DiT的Real-ISR的PTQ框架。我们提出了H-SVD,一种层次化SVD,它在匹配的参数预算下将全局低秩分支与局部分块秩-1分支相结合。我们进一步提出了方差感知时空混合精度:VaSMP基于率失真理论以无数据方式分配跨层权重位宽,而VaTMP通过动态规划(DP)以极少的校准调度跨扩散时间步的层内激活精度。在多个真实世界数据集上的实验表明,我们的Q-DiT4SR在W4A6和W4A4设置下均实现了SOTA性能。值得注意的是,W4A4量化配置将模型大小减少了5.8倍,并将计算操作减少了6.14倍。我们的代码和模型将在https://github.com/xunzhang1128/Q-DiT4SR上提供。

英文摘要

Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by 6.14$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.

2601.18577 2026-05-21 cs.CV cs.LG 版本更新

Self-Refining Video Sampling

自精炼视频采样

Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, Sung Ju Hwang

AI总结 本文提出了一种自精炼视频采样方法,将预训练的视频生成器用作其自身的精炼器,无需外部验证器或额外训练,在推理时实现迭代的内循环精炼,提高了运动连贯性和物理对齐性。

Comments ICML 2026. Project page: https://agwmon.github.io/self-refine-video/

详情
AI中文摘要

现代视频生成器仍难以处理复杂的物理动态,往往达不到物理真实感。现有方法通过外部验证器或在增强数据上额外训练来解决这一问题,但计算成本高,且在捕捉细粒度运动方面仍然有限。在本工作中,我们提出了自精炼视频采样,一种简单的方法,将在大规模数据集上预训练的视频生成器用作其自身的精炼器。通过将生成器解释为去噪自编码器,我们能够在推理时实现迭代的内循环精炼,而无需任何外部验证器或额外训练。我们进一步引入了一种不确定性感知的精炼策略,根据自一致性选择性地精炼区域,从而防止过度精炼引起的伪影。在最先进的视频生成器上进行的实验显示,运动连贯性和物理对齐性有显著提升,与默认采样器和基于引导的采样器相比,获得了超过70%的人类偏好。

英文摘要

Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.

2601.04068 2026-05-21 cs.CV cs.AI 版本更新

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

注意生成细节:面向视频扩散模型的直接局部化细节偏好优化

Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Alibaba Group - Taobao & Tmall Group(阿里巴巴集团-淘宝 & 天猫集团)

AI总结 本文提出LocalDPO,一种新的后训练框架,通过从真实视频中构建局部偏好对,并在时空区域层面优化对齐,以提高视频生成的质量和人类偏好评分。

Comments Accepted by CVPR 2026

详情
AI中文摘要

将文本到视频的扩散模型与人类偏好对齐对于生成高质量视频至关重要。现有的直接偏好优化(DPO)方法依赖于多样本排序和任务特定的批评模型,这效率低下且常导致模糊的全局监督。为了解决这些限制,我们提出了LocalDPO,一种新的后训练框架,该框架从真实视频中构建局部偏好对,并在时空区域层面进行优化。我们设计了一个自动化流程,高效地收集偏好对数据,通过单次提示推理生成偏好对,消除了对外部批评模型或人工标注的需求。具体来说,我们将高质量的真实视频作为正样本,并通过局部随机时空掩码来生成对应的负样本,仅使用冻结的基模型恢复被掩码的区域。在训练过程中,我们引入了区域感知的DPO损失,将偏好学习限制在被损坏的区域以实现快速收敛。在Wan2.1和CogVideoX上的实验表明,LocalDPO在视频保真度、时间连贯性和人类偏好评分方面优于其他后训练方法,建立了更高效和精细的视频生成器对齐范式。代码可在https://github.com/1170300714/Local-DPO上获得。

英文摘要

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference data, generating each preference pair with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.
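
To make the idea of restricting preference learning to corrupted areas concrete, here is a minimal sketch under stated assumptions: it uses a diffusion-style DPO objective in which the implicit reward is the reduction in denoising error relative to a frozen reference model, and it averages errors only over masked locations. The exact LocalDPO loss is not given in the abstract; the function names, the flat per-location error lists, and the `beta` default are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def masked_mean(values, mask):
    """Average only over locations where the mask is set (the corrupted region)."""
    kept = [v for v, m in zip(values, mask) if m]
    if not kept:
        raise ValueError("mask selects no locations")
    return sum(kept) / len(kept)


def region_dpo_loss(err_pol_w, err_ref_w, err_pol_l, err_ref_l, mask, beta=1.0):
    """Region-restricted DPO-style loss (illustrative assumption).

    err_*_w / err_*_l: per-location squared denoising errors of the policy (pol)
    and the frozen reference (ref) on the preferred (w) and rejected (l) sample.
    """
    gap_w = masked_mean(err_pol_w, mask) - masked_mean(err_ref_w, mask)
    gap_l = masked_mean(err_pol_l, mask) - masked_mean(err_ref_l, mask)
    # The loss falls when the policy fits the preferred sample better than the
    # reference does, and fits the rejected sample worse, inside the mask.
    return -log_sigmoid(-beta * (gap_w - gap_l))
```

With a policy identical to the reference the loss equals log 2, and errors outside the mask never affect it, which is the property the region-aware formulation relies on.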

2512.13402 2026-05-21 cs.CV cs.AI 版本更新

End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

End2Reg: 面向脊柱手术无标记配准的任务特定分割学习

Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer, Carol C. Hasler, Maria Licci

发表机构 * Department of Biomedical Engineering, University of Basel, Allschwil, Switzerland(巴塞尔大学生物医学工程系,瑞士Allschwil) Department of Orthopedics, University Children’s Hospital, Basel, Switzerland(巴塞尔大学儿童医院骨科部,瑞士Basel)

AI总结 本文提出End2Reg,一种端到端深度学习框架,通过联合优化分割和配准,无需分割标签和手动步骤,从而提高脊柱手术中无标记导航的精度。

Comments Early Accepted MICCAI 2026. Code and interactive visualizations: https://lorenzopettinari.github.io/end-2-reg/

详情
AI中文摘要

脊柱手术中的术中导航需要毫米级的精度。目前,这通过高辐射的术中成像和骨锚定标记实现,但这些标记具有侵入性且会干扰手术流程。无标记RGB-D配准方法提供了一种有前途的替代方案。然而,现有方法依赖弱分割标签来分离相关解剖结构,可能在配准过程中传播误差。我们提出了End2Reg,一种端到端深度学习框架,通过联合优化分割和配准,消除了对分割标签和手动步骤的需要。网络学习针对配准优化的任务特定分割掩码,仅由配准目标引导,而无需显式的分割监督。End2Reg在离体和在体基准测试中实现了最先进的性能,将中位目标配准误差降低了32%,将平均均方根误差降低了61%,同时在部分遮挡下保持稳健性能。消融结果证实,端到端优化显著提高了配准精度。总体而言,End2Reg朝着完全自动化的无标记术中导航迈进。代码和交互式可视化可在 https://lorenzopettinari.github.io/end-2-reg/ 上找到。

英文摘要

Intraoperative navigation in spine surgery demands millimeter-level accuracy. Currently, this is achieved through radiation-intensive intraoperative imaging and bone-anchored markers that are invasive and disrupt surgical workflow. Markerless RGB-D registration methods offer a promising alternative. However, existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, potentially propagating errors through the registration process. We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for segmentation labels and manual steps. The network learns task-specific segmentation masks optimized for registration, guided solely by the registration objective without explicit segmentation supervision. End2Reg achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% and mean Root Mean Square Error by 61%, while maintaining robust performance under partial occlusions. Ablation results confirm that end-to-end optimization significantly improves registration accuracy. Overall, End2Reg advances towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.

2512.09806 2026-05-21 cs.CV cs.AI 版本更新

CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

CHEM: 估计和理解深度学习在图像处理中的幻觉

Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck

发表机构 * Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) University of Tromsø(特罗姆瑟大学) German Aerospace Center (DLR)(德国航天中心) Foundation for Research and Technology Hellas (FORTH)(希腊研究与技术基金会)

AI总结 本文提出CHEM方法,用于量化和表征图像重建模型中的幻觉 artifacts,通过小波和shearlet表示定位幻觉区域,并利用 conformalized quantile regression 评估幻觉水平,同时分析U-shaped网络为何容易产生幻觉预测。

详情
AI中文摘要

基于深度学习的方法最近在图像重建问题中取得了显著成功。然而,挑战出现了,因为这些方法可能会生成不真实的 artifacts 或幻觉,这可能干扰安全关键场景中的分析。本文介绍了一个框架,用于量化和表征图像重建模型中的幻觉 artifacts。所提出的方法称为 Conformal Hallucination Estimation Metric (CHEM),能够识别模型预测中的幻觉易发区域。它利用小波和shearlet表示在图像特征层面定位这些区域,并使用 conformalized quantile regression 以分布无关的方式评估幻觉水平。提供了理论分析,表征了CHEM对幻觉 artifacts 的灵敏度及其与均方误差的关系。基于这些见解并采用基于逼近理论的观点,我们研究了为何U-shaped网络,广泛用于图像重建的架构,倾向于产生易受幻觉影响的预测。我们在天文图像去卷积中使用CANDELS数据集(如U-Net、SwinUNet和Learnlets)以及在自然图像超分辨率中使用DIV2K数据集(如DRUNet、Unfolded DRS、RAM和DPS)上评估了所提出方法的有效性。

英文摘要

Deep learning-based methods have recently achieved significant success in image reconstruction problems. However, challenges have emerged, as these methods may generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a framework for quantifying and characterizing hallucinated artifacts in image reconstruction models. The proposed method, termed the Conformal Hallucination Estimation Metric (CHEM), enables the identification of hallucination-prone regions in model predictions. It leverages wavelet and shearlet representations to localize such regions at the level of image features, and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. A theoretical analysis is provided, characterizing the sensitivity of CHEM to hallucinated artifacts and its relationship to the mean squared error. Building on these insights and adopting a viewpoint grounded in approximation theory, we investigate why U-shaped networks, widely used architectures for image reconstruction, tend to produce hallucination-prone predictions. We assess the effectiveness of the proposed approach on astronomical image deconvolution using the CANDELS dataset with architectures such as U-Net, SwinUNet, and Learnlets, and on natural image super-resolution using the DIV2K dataset with models such as DRUNet, Unfolded DRS, RAM, and DPS.
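
The distribution-free step mentioned above, conformalized quantile regression, can be sketched in a few lines. This is the generic split-conformal calibration applied to scalar targets, shown under the assumption that lower and upper quantile predictions are already available; how CHEM applies it to wavelet and shearlet coefficients is not detailed in the abstract, and the function names are illustrative.

```python
import math


def cqr_margin(q_lo, q_hi, y, alpha):
    """Calibrate a margin on held-out data for conformalized quantile regression.

    q_lo, q_hi: predicted lower/upper quantiles on the calibration set.
    y: true calibration targets. alpha: target miscoverage rate.
    """
    # Conformity score: how far each target falls outside its predicted band
    # (negative when the target lies strictly inside the band).
    scores = sorted(max(lo - yi, yi - hi) for lo, hi, yi in zip(q_lo, q_hi, y))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only a trivial band is valid.
        return float("inf")
    return scores[rank - 1]


def cqr_interval(lo, hi, margin):
    """Widen (or shrink, if margin is negative) a predicted band by the margin."""
    return lo - margin, hi + margin
```

Under exchangeability of calibration and test points, the widened band covers a new target with probability at least 1 - alpha, regardless of how accurate the underlying quantile predictors are.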

2512.09447 2026-05-21 cs.RO cs.CV 版本更新

Query-Calibrated Segmental Admission for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments

查询校准的分段准入:用于重复环境中与描述符无关的激光雷达回环闭合

Jaehyun Kim, Seungwon Choi, Wonseok Kang, Tae-Wan Kim

发表机构 * Department of Naval Architecture and Ocean Engineering(造船与海洋工程系)

AI总结 该研究提出了一种与描述符无关的稀疏回环准入策略,用于在重复环境中保持位姿图稳定,通过校准查询级的分段假设并验证代表性配对来减少错误回环因子的准入,从而提高回环闭合的精度和稳定性。

Comments 8 pages, 3 figures

详情
AI中文摘要

结构重复的环境会产生视觉上合理但存在混叠的LiDAR回环候选者,当这些候选者被作为回环因子加入图中时,可能会破坏位姿图优化。我们提出了一种名为查询校准分段准入(QCSA)的策略,这是一种面向图稳定性的稀疏回环准入政策。该策略通过与硬负样本对比对短描述符分段进行评分,校准哪些查询级的分段假设能达到几何关系,并通过广义迭代最近点(G-ICP)验证代表性配对。我们在SNU图书馆数据集(SNULib)和HeLiPR重叠路线上评估了该方法。在SNULib上对七种LiDAR描述符家族进行汇总分析,QCSA将插入的回环因子减少了3.8倍,将因子精度从0.542提高到0.717,并显著降低了每组查询的误入率。在更稀疏的图中,它保持了可比的平均绝对轨迹误差(ATE)并大幅降低了最坏序列ATE与密集Top1+G-ICP相比,从1.064降至0.778米。这些结果支持了所提出的回环准入层在重混叠的同时定位与建图(SLAM)中的应用。我们的实现和数据集将在:https://github.com/wanderingcar/snu_library_dataset上发布。

英文摘要

Structurally repetitive environments produce visually plausible but aliased LiDAR loop candidates that can destabilize pose-graph optimization when admitted as loop factors. We propose Query-Calibrated Segmental Admission (QCSA), a descriptor-agnostic sparse loop-admission policy for graph-stability-oriented insertion. The policy scores short descriptor segments against hard negatives, calibrates which query-level segment hypotheses reach geometry, and inserts representative pairs validated by Generalized Iterative Closest Point (G-ICP). We evaluate it on the SNU Library Dataset (SNULib) and HeLiPR overlap routes. Aggregated over seven LiDAR descriptor families on SNULib, QCSA reduces inserted loop factors by 3.8 times, raises factor precision from 0.542 to 0.717, and sharply lowers false admissions per query group. With this sparser graph, it maintains comparable mean absolute trajectory error (ATE) and substantially reduces worst-sequence ATE versus dense Top1+G-ICP, from 1.064 to 0.778 m. The aggregate mean and worst-sequence ATE remain lower than the odometry-only reference. Under a matched factor budget, QCSA also attains lower trajectory error than SeqSLAM and sparse Top1+G-ICP selections. Fixed-transfer validation on HeLiPR, with no route-specific tuning, likewise suppresses hard-negative admissions. These results support the proposed admission layer for aliasing-heavy simultaneous localization and mapping (SLAM). Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.

2510.23538 2026-05-21 cs.AI cs.CL cs.CV cs.SE 版本更新

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

JanusCoder: 向代码智能的视觉-程序化界面迈进

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

发表机构 * The University of Hong Kong(香港大学) Shanghai AI Laboratory(上海人工智能实验室) Nanjing University(南京大学) Carnegie Mellon University(卡内基梅隆大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本文提出JanusCoder,一种面向代码智能的视觉-程序化界面,通过构建大规模多模态代码数据集和统一模型,实现从文本指令、视觉输入或两者结合生成代码,展示了其在文本和视觉编程任务中的优越性能。

Comments ICLR 2026 Camera Ready Version, with code and data available

详情
AI中文摘要

神经代码智能的范围正在迅速扩展,从基于文本的源代码扩展到程序生成的丰富视觉输出。这种视觉维度对于高级应用如灵活的内容生成和精确的可视化程序驱动编辑至关重要。然而,进展受到高质量多模态代码数据稀缺的阻碍,这源于合成和质量评估的挑战。为了解决这些挑战,我们从数据和建模的角度做出贡献。我们首先引入了一个完整的合成工具包,利用数据模态之间的相互协同效应,高效地生成涵盖标准图表到复杂交互式网页UI和代码驱动动画的大型高质量语料库。利用该工具包,我们构建了JanusCode-800K,目前最大的多模态代码语料库。这推动了我们模型JanusCoder和JanusCoderV的训练,建立了从文本指令、视觉输入或两者结合生成代码的视觉-程序化界面。我们的统一模型不同于现有方法,后者为孤立任务构建专门模型。在文本导向和视觉导向的编程任务上的大量实验表明,JanusCoder系列的性能优越,我们的7B到14B规模模型接近甚至超过了商业模型的性能。此外,广泛的分析提供了将程序逻辑与其视觉表达和谐统一的关键见解。我们的代码和检查点可在https://github.com/InternLM/JanusCoder上获得。

英文摘要

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

2510.18034 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

VLMs能否解锁语义异常检测?一个结构化推理的框架

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

发表机构 * Professorship of Autonomous Vehicle Systems TUM School of Engineering Design, Technical University of Munich Munich, Germany

AI总结 本文提出SAVANT框架,通过结构化推理方法提升VLM在语义异常检测中的性能,实现对自动驾驶场景中罕见异常情况的更准确识别。

Comments 8 pages, 5 figures

详情
AI中文摘要

自动驾驶系统仍然对长尾的稀有、分布外语义异常极度脆弱。尽管VLMs已显现为感知的有前途工具,但其在异常检测中的应用仍然主要局限于提示专有模型,限制了可靠性、可重复性和部署可行性。为解决这一差距,我们引入SAVANT(语义异常验证/分析工具包),一种新的模型无关推理框架,将异常检测重新表述为分层语义一致性验证。通过应用SAVANT的两阶段流程——结构化场景描述提取和多模态评估,现有VLMs在输入图像中检测异常驾驶场景的得分得到提升。我们的方法取代了随意提示,通过语义感知推理,将基于VLM的检测转化为四个语义领域之间的原则性分解。我们证明,在平衡的现实驾驶场景集上,应用SAVANT可将VLM的绝对召回率提高约18.5%,相比提示基线。此外,这一增益使大规模注释成为可能:利用我们框架内的最佳专有模型,我们自动标注了约10,000张现实世界图像,具有高置信度。我们使用由此产生的高质量数据集来微调一个7B开源模型(Qwen2.5-VL)以执行单次异常检测,达到90.8%的召回率和93.8%的准确率,超越所有评估模型,同时在接近零成本的情况下实现本地部署。通过将结构化语义推理与可扩展的数据整理相结合,我们为自动驾驶系统中的语义异常检测数据稀缺问题提供了实用的解决方案。补充材料:https://TUM-AVS.github.io/SAVANT/.

英文摘要

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.

2510.17269 2026-05-21 cs.CV cs.AI 版本更新

FineVision: Open Data Is All You Need

FineVision: 你只需要开放数据

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

发表机构 * Hugging Face Technical University of Munich(慕尼黑工业大学) Stanford University(斯坦福大学)

AI总结 本文提出FineVision,一个包含2400万样本的高质量数据集,通过半自动化流程整合了200多个来源,通过严格的数据清洗和人工审核确保数据质量,训练基于该数据集的模型在广泛评估中表现更优,推动数据驱动的视觉语言模型研究。

详情
AI中文摘要

视觉语言模型(VLMs)的进步受到碎片化、不一致和受污染的公共数据集的阻碍。我们引入了FineVision,一个精心收集、整理和统一的2400万样本数据集,是最大的开放资源。我们通过半自动化、人机协作的流程将超过200个来源整合为185个子集:自动化处理大量数据和模式映射,而审核员检查映射并抽查输出以验证注释的忠实消费、适当的格式和多样性以及安全性;问题会触发针对性的修复和重新运行。该流程进一步在源内和跨源之间应用严格的去重,并针对66个公共基准进行去污染。FineVision还包含具有统一动作空间的代理/GUI任务;审核员验证模式并检查样本轨迹以确认可执行性。在广泛评估套件中,基于FineVision训练的模型始终优于基于现有开放混合数据训练的模型,凸显了规模、数据卫生和平衡自动化与人工监督的好处。我们发布该数据集和整理工具以加速数据驱动的VLM研究。

英文摘要

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

2510.09833 2026-05-21 cs.CV 版本更新

Post Processing of image segmentation using Conditional Random Fields

利用条件随机场对图像分割进行后处理

Aashish Dhawan, Pankaj Bodani, Vishal Garg

发表机构 * Dept. of Computer Science & Engineering(计算机科学与工程系) Space Applications Center(空间应用中心) JMIETI(JMIETI学院) ISRO(印度空间研究组织)

AI总结 本文研究了如何通过条件随机场提升图像分割结果的清晰度,分析了不同CRF类型在低质量卫星图像和高质量航拍照片上的表现,评估了不同方法的优缺点。

详情
Journal ref Proc. 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom), pp. 147-151, 2019
AI中文摘要

图像分割过程的输出通常由于卫星图像的低质量特征而不够清晰。本研究旨在寻找合适的条件随机场(CRF)以提高分割图像的清晰度。我们首先尝试了不同类型的CRF,并研究它们为何适合或不适合我们的目的。我们在两个不同的数据集上评估了我们的方法——具有低质量特征的卫星图像和高质量的航拍照片。在研究过程中,我们尝试了各种CRF,找出在图像上表现最佳的CRF,并将我们的结果与这些数据集进行比较,以展示不同方法的陷阱和潜力。

英文摘要

The output of the image segmentation process is often unclear because satellite images have low-quality features. The purpose of this study is to find a suitable Conditional Random Field (CRF) to achieve better clarity in a segmented image. We started with different types of CRFs and studied why they are or are not suitable for our purpose. We evaluated our approach on two different datasets: satellite imagery with low-quality features and high-quality aerial photographs. During the study we experimented with various CRFs to find which CRF gives the best results on images, and compared our results on these datasets to show the pitfalls and potential of different approaches.

2510.08482 2026-05-21 cs.CV cs.CL 版本更新

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

视觉象征性挑战:在手语形式-意义映射上评估视觉-语言模型

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

发表机构 * Multimodal Language Department(多模态语言部门) Max Planck Institute for Psycholinguistics(马克斯·普朗克心理语言学研究所) Department of Linguistics(语言学系) Boğaziçi University(海峡大学) Donders Institute for Brain Cognition and Behaviour(唐德斯脑、认知与行为研究所) Radboud University(拉德堡德大学) Department of Linguistics and Communication(语言学与沟通系) University of Birmingham(伯明翰大学)

AI总结 本文提出一个新颖的视频基准测试,用于评估视觉-语言模型在手语形式-意义映射上的表现,通过心理语言学测量来评估三种任务:语音学手语形式预测、透明度和渐进象征性评分,并发现模型在语音形式预测上表现较好但整体仍低于人类表现。

详情
AI中文摘要

象征性,即语言形式与意义之间的相似性,在手语中普遍存在,为视觉 grounding 提供了自然的测试环境。对于视觉-语言模型(VLMs),挑战在于从动态的人类运动中恢复这种本质的映射,而非静态上下文。我们引入了视觉象征性挑战,一个新颖的基于视频的基准测试,将心理语言学测量适应于评估 VLMs 在三个任务上的表现:(i)语音学手语形式预测(例如,手形、位置),(ii)透明度(从视觉形式推断意义),以及(iii)渐进象征性评分。我们评估了13种最先进的VLMs在零样本和少样本设置下在荷兰手语上的表现,并将其与人类基线进行比较。在语音形式预测上,VLMs恢复了一些手形和位置细节,但表现仍低于人类;在透明度上,它们与人类基线相差甚远;只有顶级模型与人类象征性评分有中等相关性。有趣的是,语音形式预测能力更强的模型更能与人类象征性判断相关联,表明它们对视觉基础结构有共同的敏感性。我们的发现验证了这些诊断任务,并推动了以人类为中心的信号和具身学习方法,用于建模象征性和改进多模态模型中的视觉 grounding。

英文摘要

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

2510.00520 2026-05-21 cs.CV 版本更新

CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

CardioBench: 心脏超声基础模型是否能超越实验室?

Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出CardioBench,一个用于评估心脏超声基础模型的基准,通过统一多个公开数据集,评估不同模型在零样本、探测和对齐协议下的性能,揭示通用模型在功能任务上表现优异,但细粒度区分任务上存在不足。

详情
AI中文摘要

基础模型正在重塑医学影像,但其在心脏超声中的应用仍然有限,受制于对私有数据集的依赖,限制了可重复的比较。心脏超声具有独特的挑战,包括噪声采集、高帧冗余和有限的多样化公开数据集。为了解决这个问题,我们引入了CardioBench,一个全面的心脏超声基础模型基准。具体而言,CardioBench将八个公开可用的数据集统一为一个标准化的套件,涵盖四个回归和五个分类任务,覆盖功能、结构、诊断和视图识别终点。利用这一框架,我们评估了几种领先的基座模型,包括心脏专用、生物医学和通用编码器,在一致的零样本、探测和对齐协议下。我们的分析显示,尽管通用编码器转移良好,往往接近探测,但在视图分类和细微病理识别等细粒度区分任务上表现不佳。结果表明,能够捕捉心脏时间动态的模型在功能任务上表现最佳,而基于检索的方法在跨数据集的泛化上更加一致。通过发布预处理、分割和公开评估流程,CardioBench建立了可重复的参考点,以指导未来心脏超声和可能其他医学影像基础模型的架构设计。

英文摘要

Foundation models are reshaping medical imaging, yet their application in echocardiography remains limited, hindered by a heavy reliance on private datasets that prevent reproducible comparison. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited diverse public datasets. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography foundation models. Specifically, CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. Leveraging this framework, we evaluate several leading foundation models, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our analysis reveals that while general-purpose encoders transfer well and often close the gap with probing, they struggle significantly with fine-grained distinctions like view classification and subtle pathology recognition. Results indicate that models capturing temporal cardiac dynamics perform best on functional tasks, while retrieval-based approaches generalize more consistently across datasets. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point to guide the architectural design of future echocardiography and possibly other medical imaging foundation models.

2509.17931 2026-05-21 cs.CV physics.med-ph 版本更新

Multi-needle Localization for Pelvic Seed Implant Brachytherapy based on Tip-handle Detection and Matching

基于尖端-柄检测与匹配的盆腔种子植入近距离放射治疗多针定位

Zhuo Xiao, Fugen Zhou, Jingjing Wang, Chongyu He, Bo Liu, Haitao Sun, Zhe Ji, Yuliang Jiang, Junjie Wang, Qiuwen Wu

Affiliations * Image Processing Center, Beihang University; Department of Radiation Oncology, Peking University Third Hospital; Department of Radiation Oncology, Duke University Medical Center

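AI summary This paper proposes a tip-handle detection and matching approach to multi-needle localization in intraoperative CT images. Using an anchor-free network and a greedy matching and merging method, it achieves higher precision and F1 score on a dataset of 100 patients than a segmentation-based nnUNet baseline, offering a more robust and accurate solution for needle localization in complex clinical scenarios.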

详情
AI中文摘要

在术中CT图像中实现准确的多针定位对于优化盆腔种子植入近距离放射治疗中的种子放置至关重要。然而,由于图像对比度差和针管粘附,这一任务具有挑战性。本文提出了一种新颖的方法,将针定位重新框架为尖端-柄检测与匹配问题,以克服这些困难。提出了一种基于HRNet的锚点自由网络,用于提取多尺度特征,并通过解耦分支进行热图回归和极角预测,准确检测针尖和柄。为了将检测到的尖端和柄关联为个体针,提出了一种贪心匹配与合并(GMM)方法,该方法设计用于解决具有约束条件的不平衡分配问题(UAP-C)。GMM方法通过迭代选择最可能的尖端-柄对并基于距离度量进行合并,以重建3D针路径。在100名患者的数据集上评估,所提方法表现出优越的性能,其精度和F1分数优于使用nnUNet模型的基于分割的方法,从而为复杂临床场景中的针定位提供了更稳健和准确的解决方案。

英文摘要

Accurate multi-needle localization in intraoperative CT images is crucial for optimizing seed placement in pelvic seed implant brachytherapy. However, this task is challenging due to poor image contrast and needle adhesion. This paper presents a novel approach that reframes needle localization as a tip-handle detection and matching problem to overcome these difficulties. An anchor-free network, based on HRNet, is proposed to extract multi-scale features and accurately detect needle tips and handles by predicting their centers and orientations using decoupled branches for heatmap regression and polar angle prediction. To associate detected tips and handles into individual needles, a greedy matching and merging (GMM) method designed to solve the unbalanced assignment problem with constraints (UAP-C) is presented. The GMM method iteratively selects the most probable tip-handle pairs and merges them based on a distance metric to reconstruct 3D needle paths. Evaluated on a dataset of 100 patients, the proposed method demonstrates superior performance, achieving higher precision and F1 score compared to a segmentation-based method utilizing the nnUNet model, thereby offering a more robust and accurate solution for needle localization in complex clinical scenarios.
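
The greedy association step can be illustrated with a minimal sketch. This is not the authors' GMM implementation: the pairing score here is plain Euclidean distance between tip and handle centers, `max_dist` is a hypothetical constraint, the coordinates are made up, and the merging stage is omitted.

```python
import math

def greedy_match(tips, handles, max_dist):
    """Greedily pair tips with handles by ascending distance (illustrative only)."""
    # Collect every candidate pair that satisfies the distance constraint,
    # sorted so the closest (assumed most plausible) pair comes first.
    pairs = sorted(
        (math.dist(t, h), i, j)
        for i, t in enumerate(tips)
        for j, h in enumerate(handles)
        if math.dist(t, h) <= max_dist
    )
    used_tips, used_handles, matches = set(), set(), []
    for _, i, j in pairs:
        # Accept a pair only if neither endpoint has been assigned yet.
        if i not in used_tips and j not in used_handles:
            matches.append((i, j))
            used_tips.add(i)
            used_handles.add(j)
    return matches

# Hypothetical detections: two tips, three handles (unbalanced case).
tips = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
handles = [(10.0, 0.0, 50.0), (0.0, 1.0, 50.0), (40.0, 40.0, 50.0)]
matches = greedy_match(tips, handles, max_dist=60.0)
```

The third handle stays unmatched, which is the kind of leftover an unbalanced assignment has to tolerate.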

2509.14165 2026-05-21 cs.CV cs.AI 版本更新

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

令牌去哪了?在高分辨率下的STEP中理解剪枝行为

Michal Szczepanski, Martyna Poreba, Karim Haroun

Affiliations * Université Paris-Saclay, CEA, List; I3S, Université Côte d’Azur, CNRS

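AI summary This paper proposes STEP, a framework that combines dynamic patch merging with token pruning to cut computational cost and raise throughput in high-resolution semantic segmentation while largely preserving accuracy.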

详情
Journal ref
SN Computer Science 2026
AI中文摘要

视觉变换器(ViTs)在语义分割任务中实现了最先进的性能,但受到高计算和内存成本的限制。为了解决这一问题,我们提出了STEP(SuperToken和Early-Pruning),一种混合的令牌减少框架,结合动态补丁合并和令牌剪枝,以提高效率而不显著牺牲准确性。STEP的核心是dCTS,一个轻量级的CNN基政策网络,能够灵活地合并为超补丁。编码器块也集成了早期退出,以移除高置信度的超令牌,从而降低计算负载。我们在高分辨率语义分割基准上评估了我们的方法,包括高达1024x1024像素的图像,并显示当仅应用dCTS时,令牌数量可以比标准的16x16像素补丁方案减少2.5倍。这在使用ViT-Large作为骨干时,导致计算成本减少2.6倍,吞吐量增加3.4倍。应用完整的STEP框架进一步提高效率,达到计算复杂度减少4倍,推理速度提高1.7倍,最大精度下降不超过2.0%。通过提出的STEP配置,可以自信地在到达最终编码器层之前停止多达40%的令牌。

英文摘要

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
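
To make the token-count figures concrete, the arithmetic can be sketched as follows. This only restates the numbers quoted in the abstract; the helper name is hypothetical and any extra tokens (such as a class token) are ignored.

```python
def patch_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image (no class token)."""
    return (height // patch) * (width // patch)

baseline = patch_tokens(1024, 1024)  # 64 * 64 patches
merged = baseline / 2.5              # token count after the quoted 2.5x reduction
```

For a 1024 x 1024 input this gives 4096 baseline tokens and roughly 1638 after merging.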

2509.13482 2026-05-21 cs.CV 版本更新

Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

通过场景自适应晶格向量量化改进3D高斯散射压缩

Hao Xu, Xiaolin Wu, Xi Zhang

Affiliations * Department of Electrical & Computer Engineering, McMaster University; School of Computing and Artificial Intelligence, Southwest Jiaotong University; School of Computer Science and Technology, Tongji University

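AI summary This paper proposes scene-adaptive lattice vector quantization (SALVQ) to improve 3D Gaussian Splatting (3DGS) compression. Optimizing the lattice basis per scene improves adaptability and rate-distortion (R-D) efficiency with minimal computational overhead, and scaling the basis lets a single model serve multiple bit rates, which reduces training time.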

Comments Accepted by IEEE TIP. Code available at https://github.com/hxu160/SALVQ

详情
AI中文摘要

3D高斯散射(3DGS)因其逼真渲染质量和实时性能而迅速流行,但会产生大量数据。因此,压缩3DGS数据对于其模型的成本效益至关重要。最近,一些基于锚点的神经压缩方法已被提出,实现了良好的3DGS压缩性能。然而,它们都依赖于统一标量量化(USQ)因其简单性。一个引人注目的问题是,更复杂的量化器是否能在极小的额外开销和系统最小变化的情况下改进当前的3DGS压缩方法。答案是肯定的,通过将USQ替换为晶格向量量化(LVQ)。为了更好地捕捉场景特定特性,我们为每个场景优化晶格基矢,提高LVQ的适应性和R-D效率。这种场景自适应LVQ(SALVQ)在向量量化和USQ的低复杂性之间取得了平衡。SALVQ可以无缝集成到现有的3DGS压缩架构中,通过最小的修改和计算开销提高其R-D性能。此外,通过缩放晶格基矢量,SALVQ可以动态调整晶格密度,使单个模型能够适应多种比特率目标。这种灵活性消除了为不同压缩级别训练单独模型的需要,显著减少了训练时间和内存消耗。

英文摘要

3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
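
As a rough illustration of lattice vector quantization with a scalable basis, the sketch below quantizes a 2-D vector by rounding its coordinates in the lattice basis (Babai rounding, which is a simple approximation and not necessarily the nearest lattice point). The basis values, the `scale` parameter, and the function name are assumptions for illustration and are not taken from the SALVQ code.

```python
def lattice_quantize(x, basis, scale=1.0):
    """Babai rounding onto the 2-D lattice spanned by the columns of scale * basis."""
    (a, b), (c, d) = basis  # rows of the 2x2 basis matrix
    a, b, c, d = a * scale, b * scale, c * scale, d * scale
    det = a * d - b * c
    # Coordinates of x in the lattice basis (closed-form 2x2 inverse).
    u = (d * x[0] - b * x[1]) / det
    v = (-c * x[0] + a * x[1]) / det
    ku, kv = round(u), round(v)
    # Map the integer coordinates back to the signal space.
    return (a * ku + b * kv, c * ku + d * kv)

basis = ((1.0, 0.5), (0.0, 1.0))  # hypothetical basis; identity would reduce to USQ
fine = lattice_quantize((1.3, 0.9), basis)               # denser lattice
coarse = lattice_quantize((1.3, 0.9), basis, scale=2.0)  # sparser lattice, fewer bits
```

With the identity basis the function reduces to uniform scalar quantization, and increasing `scale` coarsens the lattice, which mirrors the idea of covering several bit-rate targets with one model.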

2509.09946 2026-05-21 cs.CV 版本更新

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

通过鲁棒的2D跟踪和基于深度的后期聚合实现在线3D多摄像机感知

Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh, Hai Tran

Affiliations * Optoelectronics Center, Viettel Aerospace Institute, Viettel Group; University of Engineering and Technology, Vietnam National University; School of Electrical and Electronic Engineering, Hanoi University of Science and Technology

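AI summary This paper presents a method that uses depth information to extend existing online 2D multi-camera tracking systems into 3D space: targets are reconstructed in point-cloud space and their 3D boxes are recovered through clustering and yaw refinement. An enhanced online data association mechanism uses local ID consistency to assign global IDs across frames. Evaluated on the 2025 AI City Challenge 3D MTMC dataset, the framework placed 3rd on the leaderboard.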

Comments Accepted at ICCVW 2025

详情
AI中文摘要

多目标多摄像机跟踪(MTMC)是自动化大规模监控中的关键计算机视觉任务。借助摄像机标定和深度信息,场景中的目标可以投影到3D空间,提供对3D环境前所未有的自动感知水平。然而,在3D空间中的跟踪需要从头替换所有2D跟踪组件,这对现有的MTMC系统可能不可行。本文提出了一种方法,利用深度信息将任何在线2D多摄像机跟踪系统扩展到3D空间:先在点云空间中重建目标,再在跟踪之后通过聚类和偏航角细化恢复其3D框。我们还引入了增强的在线数据关联机制,利用目标的局部ID一致性来分配跨帧的全局ID。所提出的框架在2025年AI城市挑战赛的3D MTMC数据集上进行评估,取得了排行榜第三名的成绩。

英文摘要

Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space and recovering its 3D box through clustering and yaw refinement following tracking. We also introduce an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.
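
The depth-based lifting of 2D detections can be sketched with the standard pinhole back-projection. The intrinsics and pixel values below are hypothetical, and the paper's actual pipeline (point-cloud clustering, yaw refinement) is not reproduced here.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical pixel 100 px right of the principal point, 5 m away.
point = backproject(u=740, v=360, depth=5.0, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
```

Applying this to every pixel inside a tracked 2D box would give the per-target point cloud from which a 3D box could then be estimated.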

2508.11354 2026-05-21 cs.CV cs.AI cs.LG 版本更新

FunduSegmenter: Leveraging the RETFound Foundation Model for Joint Optic Disc and Optic Cup Segmentation in Retinal Fundus Images

FunduSegmenter:利用RETFound基础模型进行视网膜眼底图像中视盘和视杯联合分割

Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco

Affiliations * University of Dundee

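AI summary This paper proposes FunduSegmenter, a model built on the RETFound foundation model that adds a series of novel modules for joint optic disc and optic cup segmentation. In the reported experiments it generally outperforms existing baselines across several datasets.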

详情
Journal ref
Trans. Vis. Sci. Tech. 2026;15(5):14
AI中文摘要

目的:本研究首次将RETFound模型应用于视盘(OD)和视杯(OC)的联合分割。RETFound是一个为眼底相机和光学相干断层扫描图像开发的知名基础模型,已在疾病诊断中表现出色。方法:我们提出FunduSegmenter,该模型整合了一系列新颖模块与RETFound,包括预适配器、解码器、后适配器、带有卷积块注意模块的跳跃连接以及视觉Transformer块适配器。该模型在自有数据集GoDARTS以及四个公开数据集IDRiD、Drishti-GS、RIM-ONE-r3和REFUGE上进行了评估,通过内部验证、外部验证和领域泛化实验进行验证。结果:在内部验证中,平均Dice相似系数达到90.51%,优于所有基线方法,其中nnU-Net为82.91%,DUNet为89.17%,TransUNet为87.91%。在所有外部验证实验中,平均结果比最佳基线高约3%,且在领域泛化中也具有竞争力。结论:本研究探讨了RETFound通过学习潜在通用表示在眼底相机图像中进行OD和OC分割的潜力。我们的FunduSegmenter在整体上优于现有最先进基线方法。所提出的模块是通用的,可以扩展到其他基础模型的微调。临床相关性:该模型在分布内和分布外数据上均表现出强大的稳定性与泛化能力,提供了稳定的OD和OC分割。这是许多自动化任务的关键步骤,从设置准确的视网膜坐标到生物标志物发现。代码和训练权重可在:https://github.com/JusticeZzy/FunduSegmenter上获得。

英文摘要

Purpose: This study introduces the first adaptation of RETFound for joint optic disc (OD) and optic cup (OC) segmentation. RETFound is a well-known foundation model developed for fundus camera and optical coherence tomography images, which has shown promising performance in disease diagnosis. Methods: We propose FunduSegmenter, a model integrating a series of novel modules with RETFound, including a Pre-adapter, a Decoder, a Post-adapter, skip connections with Convolutional Block Attention Module and a Vision Transformer block adapter. The model is evaluated on a proprietary dataset, GoDARTS, and four public datasets, IDRiD, Drishti-GS, RIM-ONE-r3, and REFUGE, through internal verification, external verification and domain generalization experiments. Results: An average Dice similarity coefficient of 90.51% was achieved in internal verification, which outperformed all baselines, some substantially (nnU-Net: 82.91%; DUNet: 89.17%; TransUNet: 87.91%). In all external verification experiments, the average results were about 3% higher than those of the best baseline, and our model was also competitive in domain generalization. Conclusions: This study explored the potential of the latent general representations learned by RETFound for OD and OC segmentation in fundus camera images. Our FunduSegmenter generally outperformed state-of-the-art baseline methods. The proposed modules are general and can be extended to fine-tuning other foundation models. Translational Relevance: The model shows strong stability and generalization on both in-distribution and out-of-distribution data, providing stable OD and OC segmentation. This is an essential step for many automated tasks, from setting the accurate retinal coordinate to biomarker discovery. The code and trained weights are available at: https://github.com/JusticeZzy/FunduSegmenter.
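
For readers unfamiliar with the reported metric, the Dice similarity coefficient between a predicted and a reference binary mask can be computed as in this minimal sketch. The masks are toy values, not data from the study.

```python
def dice(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 sequences."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

score = dice([1, 1, 0, 0], [1, 0, 0, 0])  # one shared pixel, three foreground pixels in total
```

A score of 1.0 means perfect overlap; the 90.51% figure above corresponds to an average Dice of 0.9051.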

2508.06206 2026-05-21 cs.RO cs.CV 版本更新

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Affordance-R1:面向多模态大语言模型中可泛化可供性(affordance)推理的强化学习

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma

Affiliations * The Hong Kong University of Science and Technology (GZ); National University of Singapore; ShanghaiTech University; East China Normal University; Nanjing University of Information Science & Technology; Zhejiang University; Institute of Automation, Chinese Academy of Science; Shanghai AI Laboratory

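AI summary This paper proposes Affordance-R1, a unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO). Trained with reinforcement learning, it achieves zero-shot generalization and test-time reasoning capabilities.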

详情
AI中文摘要

Affordance grounding 旨在预测与机器人执行动作相关的物体特定区域。它在人机交互、人-物交互、具身操作和具身感知领域中起着至关重要的作用。现有模型由于缺乏链式思维(CoT)推理能力,往往忽视不同物体间的 affordance 共享,限制了其域外(OOD)泛化和显式推理能力。为了解决这些挑战,我们提出了 Affordance-R1,这是首个集成认知 CoT 引导的 Group Relative Policy Optimization(GRPO)的统一 affordance 地标框架。具体而言,我们设计了一个复杂的 affordance 函数,包含格式、感知和认知奖励,以有效引导优化方向。此外,我们构建了一个高质量的 affordance 中心推理数据集 ReasonAff,以支持训练。通过仅使用强化学习与 GRPO 进行训练,而不使用显式推理数据,Affordance-R1 实现了稳健的零样本泛化,并表现出涌现的测试时推理能力。全面的实验表明,我们的模型优于已建立的方法,并展示了开放世界泛化能力。据我们所知,Affordance-R1 是首个将基于 GRPO 的 RL 与推理结合到 affordance 推理中的方法。我们的方法和数据集已发布在 https://github.com/hq-King/Affordance-R1。

英文摘要

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset is released on https://github.com/hq-King/Affordance-R1.
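
The group-relative part of GRPO can be illustrated with a short sketch: rewards for several sampled responses to the same prompt are standardized within the group to obtain advantages. The reward values are invented, each is assumed to already combine the format, perception, and cognition terms, and the use of the population standard deviation is an assumption of this sketch.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (mean 0, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [0.2, 0.5, 0.9, 0.4]  # hypothetical total rewards for four sampled responses
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and are reinforced, which removes the need for a separate value network.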

2508.03578 2026-05-21 cs.CV 版本更新

RadProPoser: Probabilistic Radar Tensor Human Pose Estimation That Knows Its Limits

RadProPoser:知晓自身局限的概率式雷达张量人体姿态估计

Jonas Leo Mueller, Lukas Engel, Eva Dorschky, Daniel Krauss, Ingrid Ullmann, Martin Vossiek, Bjoern M. Eskofier

Affiliations * Munich Center for Machine Learning, Germany

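AI summary This paper proposes RadProPoser, an end-to-end probabilistic framework that predicts 3D body joints and per-joint uncertainties from raw radar tensor data. On a new benchmark dataset it achieves a mean per-joint position error of 6.425 cm, and isotonic recalibration is used to calibrate the total uncertainty.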

Comments Accepted at IJCNN 2026 (WCCI, Maastricht)

详情
AI中文摘要

基于雷达的人体姿态估计使环境智能中的隐私保护运动跟踪成为可能,但雷达传感的噪声特性使得不确定性量化至关重要。我们提出了RadProPoser,一种端到端的概率框架,能够从原始雷达张量数据中预测三维身体关节并为每个关节提供不确定性。该方法使用带有频谱注意力机制的变分编码器-解码器,融合跨时间帧的雷达实部和虚部分量,并通过可学习的高斯分布和拉普拉斯分布对偶然(aleatoric)不确定性建模。在一个带有光学动作捕捉真值的新基准数据集上训练后,我们的方法实现了6.425厘米的平均每关节位置误差(MPJPE)。模型输出每个关节的偶然不确定性,经保序(isotonic)重校准后得到校准的总不确定性,预期校准误差为0.027。由于频谱注意力机制作用于单个雷达张量分量,扩展到多雷达配置只需拼接额外的输入流。在采用双正交雷达的HuPR基准上,该方法实现了5.042厘米的MPJPE。该框架在NVIDIA RTX 3090上以每秒89帧的速度运行,超过了15赫兹的雷达帧率。

英文摘要

Radar-based human pose estimation enables privacy-preserving motion tracking for ambient intelligence, yet the noisy nature of radar sensing makes uncertainty quantification essential. We present RadProPoser, an end-to-end probabilistic framework that predicts three-dimensional body joints with per-joint uncertainties from raw radar tensor data. Using a variational encoder-decoder with spectral attention that fuses real and imaginary radar components across temporal frames, we model aleatoric uncertainty through learnable Gaussian and Laplace distributions. Trained on a new benchmark dataset with optical motion-capture ground truth, our method achieves 6.425 cm mean per-joint position error. The model outputs per-joint aleatoric uncertainties, and isotonic recalibration yields calibrated total uncertainty with expected calibration error of 0.027. Since spectral attention operates on individual radar tensor components, extending to multi-radar configurations requires only concatenating additional input streams. On the HuPR benchmark with dual orthogonal radars, this achieves 5.042 cm MPJPE. The framework runs at 89 frames per second (FPS) on an NVIDIA RTX 3090, exceeding the 15 Hz radar frame rate.
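
The headline metric, mean per-joint position error (MPJPE), is the average Euclidean distance between predicted and ground-truth joints. A minimal sketch with made-up coordinates in metres:

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]    # hypothetical predicted joints
gt = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]    # hypothetical ground-truth joints
error = mpjpe(pred, gt)                       # in metres
```

Here the two joints are off by 3 cm and 4 cm, giving an MPJPE of 3.5 cm, which puts the reported 6.425 cm and 5.042 cm figures in context.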

2507.23313 2026-05-21 cs.CV 版本更新

The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

伦勃朗的牛 - 分析文本到图像模型中艺术提示的解释

Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti

Affiliations * Department of Computer Science, Università degli Studi di Milano, Via Celoria, 18, 20133 Milan, Italy

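AI summary This paper studies how text-to-image diffusion models interpret the concepts of content and style when generating artworks. Cross-attention heatmaps are used to attribute pixels in generated images to specific prompt tokens, revealing that the degree of content-style separation varies with the artistic prompt and style, and offering insight into how large-scale generative models internally represent complex artistic concepts.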

Comments to be published in: Applications of AI in the Analysis of Cultural and Artistic Heritage, organized within the 35th IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025

详情
AI中文摘要

文本到图像扩散模型通过学习数十亿张图像,在生成艺术内容方面展现了显著的能力,包括流行艺术作品。然而,这些模型如何内部表示概念,如绘画中的内容和风格,这一基本问题仍未被探索。传统计算机视觉假设内容和风格是正交的,但扩散模型在训练过程中并未获得关于这一区别的显式指导。在本文中,我们研究了基于Transformer的文本到图像扩散模型在生成艺术作品时如何编码内容和风格概念。我们利用交叉注意力热图将生成图像中的像素归因于特定的提示词,使我们能够隔离受内容描述词和风格描述词影响的图像区域。我们的发现表明,扩散模型在不同艺术提示和风格请求下表现出不同程度的内容-风格分离。在许多情况下,内容词主要影响物体相关区域,而风格词影响背景和纹理区域,这表明模型对内容-风格区别的理解是涌现的。这些见解有助于理解大规模生成模型如何在没有显式监督的情况下内部表示复杂的艺术概念。我们分享了代码、数据集以及用于可视化注意力地图的探索工具,地址为https://github.com/umilISLab/artistic-prompt-interpretation。

英文摘要

Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.

2507.09180 2026-05-21 cs.CV cs.RO 版本更新

Multimodal Fusion for Sim2real Transfer in Visual Reinforcement Learning

多模态融合用于视觉强化学习中的仿真到现实迁移

Zichun Xu, Jingdong Zhao, Chenyu Guo, Qianxue Zhang, Liao Zhang, Xiao Zhang, Yiming Ren, Lian Zhang, Zengren Zhao

Affiliations * Medical Artificial Intelligence Lab, The First Hospital of Hebei Medical University, Hebei Medical University; State Key Laboratory of Robotics and Systems, Harbin Institute of Technology

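AI summary This paper proposes a vision-transformer-based multimodal fusion framework that fuses RGB and depth information to improve generalization, together with a contrastive learning scheme and a curriculum-based domain randomization scheme to improve sample efficiency and transfer. The model outperforms baselines in simulation and is validated on real-world manipulation tasks via zero-shot transfer.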

详情
AI中文摘要

深度信息对场景外观变化具有鲁棒性,并固有地包含3D空间细节。因此,本文提出基于视觉变换器的视觉主干,用于融合RGB和深度模态以增强泛化能力。不同模态首先通过单独的CNN茎部进行处理,结合的卷积特征被送入可扩展的视觉变换器以获得视觉表示。此外,设计了一种对比学习方案,通过掩码和未掩码的token来提高样本效率和泛化性能。采用基于课程的域随机化方案以灵活稳定训练过程。最后,仿真结果表明,我们的融合方案优于其他基线。通过零样本迁移验证了模型的可行性,能够执行现实世界操作任务。

英文摘要

Depth information is robust to scene appearance variations and inherently carries 3D spatial details. Thus, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization in this paper. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive learning scheme is designed with masked and unmasked tokens to enhance the sample efficiency and generalization performance. A curriculum-based domain randomization scheme is used to flexibly stabilize the training process. Finally, simulation results demonstrate that our fusion scheme outperforms the other baselines. The feasibility of our model is validated to perform real-world manipulation tasks via zero-shot transfer.

2505.14654 2026-05-21 cs.CV cs.AI cs.CL 版本更新

Beyond Words: Multimodal LLM Knows When to Speak

超越词语:多模态大语言模型何时说话

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin

Affiliations * Department of Computer Science, Stony Brook University; Atmee AI

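AI summary This paper proposes a multimodal strategy that uses synchronized video, audio, and text cues to improve awareness of response timing in conversation, improving response-type prediction for large language models.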

Comments Project page: https://github.com/lzk901372/MM-When2Speak

详情
AI中文摘要

基于大语言模型(LLMs)的聊天机器人能够生成流畅的响应,但在何时发言的问题上常常遇到困难,尤其是在对话过程中需要简短及时的听众反应时。我们提出了一种多模态策略,利用同步的视频、音频和文本线索来改进对话中的时间感知能力。该策略将响应时间重新表述为密集响应类型预测任务,使智能体能够在流式约束下决定是否保持沉默、生成简短反应或开始完整响应。因此,我们引入了一个经过精心挑选的多模态数据集,该数据集来自真实世界的双人对话视频,具有时间对齐的多模态数据和细粒度的反应类型注释。此外,我们设计了一种多模态策略MM-When2Speak,在LLM骨干网络上增加了多模态集成模块。在各种模态设置和强大的LLM基线模型上的实验表明,MM-When2Speak在响应类型预测性能上实现了高达3倍的提升,突显了多模态感知在自然和吸引人的对话交互中的重要性。

英文摘要

Chatbots via large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.

2504.19584 2026-05-21 cs.CV 版本更新

ShowMak3r: Compositional TV Show Reconstruction

ShowMak3r:组合式电视节目重建

Sangmin Kim, Seunguk Do, Daeun Lee, Jaesik Park

Affiliations * Seoul National University

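AI summary This paper proposes ShowMak3r, a comprehensive pipeline for reconstructing TV show scenes that allows scenes to be edited much as clips are cut in a production control room. It addresses challenges in dynamic radiance field reconstruction such as occlusion, cluttered stages, and sudden shot changes.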

Comments Project page : https://nstar1125.github.io/showmak3r

详情
AI中文摘要

从视频片段中重建动态光场具有挑战性,尤其是当给定的是娱乐视频如电视节目时。许多挑战使重建变得困难,原因包括(1)演员相互遮挡并具有多样的面部表情,(2)杂乱的舞台,以及(3)小基线视角或突然的镜头切换。为了解决这些问题,我们提出了ShowMak3r,一种综合的重建管道,允许像在制作控制室中剪辑视频片段一样编辑场景。在ShowMak3r中,3DLocator模块利用深度先验来定位恢复的演员并估计未见的人体姿态。所提出的ShotMatcher模块则在镜头切换下跟踪演员。此外,ShowMak3r引入了一个面部拟合网络,动态地恢复演员的表情。在Sitcoms3D数据集上的实验表明,我们的管道能够用不同时间戳的新摄像机重新组装电视节目场景。我们还展示了ShowMak3r能够实现有趣的应用,如合成镜头制作、演员重新定位、插入、删除和姿态操控。项目页面:https://nstar1125.github.io/showmak3r

英文摘要

Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. The reconstruction is difficult because of (1) actors occluding each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows scenes to be edited much as video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using a depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on the Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page: https://nstar1125.github.io/showmak3r

2504.13109 2026-05-21 cs.CV 版本更新

UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

UniEdit-Flow:在流模型时代释放反向与编辑

Guanlong Jiao, Biqing Huang, Kuan-Chieh Wang, Renjie Liao

Affiliations * The University of British Columbia; Tsinghua University; Snap Inc.; Vector Institute; Canada CIFAR AI Chair

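AI summary This paper proposes a predictor-corrector framework for inversion and editing in flow models: Uni-Inv for accurate reconstruction and Uni-Edit for region-aware image editing. The method is tuning-free, model-agnostic, and efficient, and experiments across various generative models support its effectiveness.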

Comments ICLR 2026. Project Page: https://uniedit-flow.github.io/

详情
AI中文摘要

流匹配模型已作为一种强大的替代扩散模型的选项,但现有的针对扩散模型的反向和编辑方法往往在流模型上效果不佳或不适用。流模型的直线、非交叉轨迹对基于扩散的方法构成了挑战,但也为新的解决方案提供了途径。在本文中,我们介绍了一种用于流模型反向和编辑的预测-校正框架。首先,我们提出了Uni-Inv,一种有效的反向方法,用于准确的重建。在此基础上,我们将延迟注入的概念扩展到流模型,并引入Uni-Edit,一种区域感知且稳健的图像编辑方法。我们的方法无需调优,模型无关,高效且有效,能够在多样化编辑的同时,确保对编辑无关区域的强保留。在各种生成模型上的广泛实验表明,Uni-Inv和Uni-Edit的优越性和通用性,即使在低成本设置下也是如此。项目页面:https://uniedit-flow.github.io/

英文摘要

Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: https://uniedit-flow.github.io/

2504.06925 2026-05-21 cs.CV cs.AI 版本更新

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

视觉-语言模型是否准备好进行饮食评估?探索AI驱动的食品图像识别的下一个前沿

Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales

Affiliations * Biometrics and Data Pattern Analytics Lab, Universidad Autonoma de Madrid; IMDEA Food, CEI UAM+CSIC

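AI summary This paper evaluates six state-of-the-art vision-language models on food recognition at different levels of granularity, proposes a new evaluation metric, Expert-Weighted Recall (EWR), and introduces the FoodNExTDB database for dietary assessment.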

Comments Accepted at IEEE/CVF Computer Vision and Pattern Recognition Conference workshops 2025 (CVPRw) 10 pages, 4 figures, 2 tables

详情
Journal ref
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1-10
AI中文摘要

基于食品图像的自动饮食评估仍是一个挑战,需要精确的食品检测、分割和分类。视觉-语言模型(VLMs)通过整合视觉和文本推理提供了新的可能性。在本研究中,我们评估了六种最先进的VLMs(ChatGPT、Gemini、Claude、Moondream、DeepSeek和LLaVA),分析它们在不同层次上的食品识别能力。在实验框架中,我们引入了FoodNExTDB,一个独特的食品图像数据库,包含9,263张由专家标注的图像,涵盖10个类别(例如“蛋白质来源”)、62个子类别(例如“家禽”)和9种烹饪风格(例如“烤制”)。总共,FoodNExTDB包括50,000个由七位专家生成的营养标签,这些标签由手动标注所有数据库中的图像生成。此外,我们提出了一种新的评估指标,专家加权召回率(EWR),该指标考虑了不同标注者之间的差异。结果表明,封闭源模型在识别包含单一产品的图像中的食品产品时,性能优于开源模型,达到了超过90%的EWR。尽管有潜力,当前VLMs在细粒度食品识别方面面临挑战,特别是在区分烹饪风格的细微差异和视觉相似的食品项目时,这限制了它们在自动饮食评估中的可靠性。FoodNExTDB数据库在https://github.com/AI4Food/FoodNExtDB上公开可用。

英文摘要

Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.

2501.15151 2026-05-21 cs.CV 版本更新

SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks

SpikeDet:以更优的发放模式实现基于脉冲神经网络的精确且节能的目标检测

Yimeng Fan, Changsong Liu, Mingyang Li, Dongze Liu, Yuting Su, Yanyan Liu, Wei Zhang

发表机构 * School of Microelectronics(微电子学院) School of Electrical and Information Engineering(电气与信息工程学院) Optoelectronic Thin Film Device and Technology Research Institute(光电薄膜器件与技术研究所)

AI总结 本文提出SpikeDet,一种新型的脉冲神经网络目标检测器,通过优化脉冲发放模式实现更准确且节能的目标检测。具体来说,设计了脉冲骨干网络MDSNet,有效调整每一层的膜突触输入分布,实现更优的脉冲特征提取;引入Spiking Multi-direction Fusion Module (SMFM)实现多方向融合,增强多尺度检测能力;提出Local Firing Saturation Index (LFSI)定量衡量局部发放饱和程度。实验结果验证了方法的有效性,在COCO 2017数据集上达到52.2% AP,比现有SNN方法高3.3% AP,而能耗仅为其一半。

详情
AI中文摘要

脉冲神经网络(SNNs)是第三代神经网络。由于低能耗和生物可解释性,SNNs在目标检测中获得了广泛关注。然而,现有基于SNN的目标检测方法存在局部发放饱和问题,即相邻神经元同时达到最大发放率,在以目标为中心的区域尤为明显。这种异常的神经元发放模式降低了特征判别能力和检测精度,同时抬高了发放率,使SNNs难以实现其潜在的能效。为了解决这个问题,我们提出了SpikeDet,一种新颖的脉冲目标检测器,通过优化发放模式实现准确且节能的检测。具体来说,我们设计了脉冲骨干网络MDSNet,它在每一层有效调整膜突触输入分布,从而在脉冲特征提取过程中获得更好的神经元发放模式。在颈部网络中,为了更好地利用并保留这些高质量的骨干特征,我们引入了Spiking Multi-direction Fusion Module (SMFM),实现脉冲特征的多方向融合,增强模型的多尺度检测能力。此外,我们提出了Local Firing Saturation Index (LFSI),用于定量衡量局部发放饱和程度。实验结果验证了我们方法的有效性。在COCO 2017数据集上,SpikeDet达到52.2%的AP,比先前基于SNN的方法高3.3% AP,而能耗仅为其一半。在目标检测子任务上,包括基于事件的GEN1、水下URPC 2019、低光照ExDARK和密集场景CrowdHuman数据集,SpikeDet同样取得了最佳性能。

英文摘要

Spiking Neural Networks (SNNs) are the third generation of neural networks. They have gained widespread attention in object detection due to their low energy consumption and biological interpretability. However, existing SNN-based object detection methods suffer from local firing saturation, where adjacent neurons concurrently reach maximum firing rates, especially in object-centric regions. This abnormal neuron firing pattern reduces the feature discrimination capability and detection accuracy, while also increasing the firing rates that prevent SNNs from achieving their potential energy efficiency. To address this problem, we propose SpikeDet, a novel spiking object detector that optimizes firing patterns for accurate and energy-efficient detection. Specifically, we design a spiking backbone network, MDSNet, which effectively adjusts the membrane synaptic input distribution at each layer, achieving better neuron firing patterns during spiking feature extraction. For the neck, to better utilize and preserve these high-quality backbone features, we introduce the Spiking Multi-direction Fusion Module (SMFM), which realizes multi-direction fusion of spiking features, enhancing the multi-scale detection capability of the model. Furthermore, we propose the Local Firing Saturation Index (LFSI) to quantitatively measure local firing saturation. Experimental results validate the effectiveness of our method. On the COCO 2017 dataset, it achieves 52.2% AP, outperforming previous SNN-based methods by 3.3% AP while requiring only half the energy consumption. On object detection sub-tasks, including event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense scene CrowdHuman datasets, SpikeDet also achieves the best performance.
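
The link between a layer's input distribution and firing saturation can be illustrated with a toy leaky integrate-and-fire (LIF) neuron. This is a minimal sketch under stated assumptions, not SpikeDet's neuron model or its LFSI metric: the function name, the `threshold` and `decay` values, and the hard reset to zero are all hypothetical choices made for illustration.

```python
def lif_firing_rate(inputs, threshold=1.0, decay=0.5):
    """Fraction of time steps on which a toy LIF neuron spikes.

    Assumed dynamics (illustrative only): the membrane potential leaks by
    `decay`, integrates the input, and resets to 0 after a spike.
    """
    v, spikes = 0.0, 0
    for x in inputs:
        v = decay * v + x          # leaky integration of the synaptic input
        if v >= threshold:         # spike once the threshold is reached
            spikes += 1
            v = 0.0                # hard reset
    return spikes / len(inputs)


saturated = lif_firing_rate([2.0] * 10)   # input always above threshold
silent = lif_firing_rate([0.3] * 10)      # potential converges below threshold
moderate = lif_firing_rate([0.6] * 9)     # spikes on every third step
```

With inputs that always exceed the threshold the neuron fires on every step, which is the saturated regime the abstract describes; shifting the input distribution downward yields sparser and more informative firing.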

2412.01944 2026-05-21 cs.CV eess.IV 版本更新

A Comparative Study of Transformer and Convolutional Models for Crop Segmentation from Satellite Image Time Series

变换器与卷积模型在卫星图像时间序列作物分割中的比较研究

Mattia Gatti, Ignazio Gallo, Nicola Landro, Christian Loschiavo, Anwar Ur Rehman, Mirco Boschetti, Riccardo La Grassa

发表机构 * University of Insubria(英苏布里亚大学) IREA CNR(意大利国家研究委员会IREA分部) INAF-Astronomical Observatory(意大利国家天体物理研究所天文台)

AI总结 本文比较了变换器和卷积模型在从卫星图像时间序列中进行作物分割中的应用,发现TSViT在整体表现上最佳,而VistaFormer在效率与性能之间提供了良好的权衡。

Comments This version corrects an error in the evaluation pipeline affecting previously reported metrics. Results have been recomputed, leading to updated values and a revised conclusion: the adapted Swin UNETR model does not outperform CNN baselines. Tables, figures, and comparisons have been updated, and the analysis has been extended to include additional transformer-based models

详情
AI中文摘要

从卫星图像时间序列(SITS)中进行作物分割是农业监测和土地利用分析中的基本任务。尽管卷积神经网络(CNNs)已被广泛应用,基于变换器的架构为表示多光谱数据中的空间和时间依赖性提供了另一类机制。本文对用于Sentinel-2时间序列作物制图的CNN和基于变换器的分割模型进行了比较研究,包括3D U-Net、3D FPN、3D DeepLabv3以及三种变换器架构:Swin UNETR、TSViT和VistaFormer,它们采用不同的策略来捕捉时间依赖性。在Munich和Lombardia数据集上的实验表明,TSViT的整体结果最好,略优于3D U-Net,后者仍是一个强有力的CNN基线。VistaFormer的效率最高,而Swin UNETR表现具有竞争力,但不及那些显式建模时间动态的变换器。这些结果表明时间建模对SITS至关重要:TSViT优于CNNs以及将时间视为额外空间维度的方法,而VistaFormer则提供了良好的效率-性能权衡。

英文摘要

Crop segmentation from satellite image time series (SITS) is a fundamental task for agricultural monitoring and land-use analysis. While convolutional neural networks (CNNs) have been widely used, transformer-based architectures offer alternative mechanisms for representing spatial and temporal dependencies in multispectral data. This paper presents a comparative study of CNN and transformer-based segmentation models for crop mapping from Sentinel-2 time series, including 3D U-Net, 3D FPN, 3D DeepLabv3, and three transformer architectures: Swin UNETR, TSViT, and VistaFormer, which adopt different strategies for capturing temporal dependencies. Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS: TSViT outperforms CNNs and approaches that treat time as an additional spatial dimension, while VistaFormer provides a strong efficiency-performance trade-off.

2406.14978 2026-05-21 cs.CV 版本更新

E2GS: Event Enhanced Gaussian Splatting

E2GS:事件增强的高斯泼溅

Hiroyuki Deguchi, Mana Masuda, Takuya Nakabayashi, Hideo Saito

发表机构 * Keio University(庆应大学)

AI总结 本文提出E2GS方法,将事件数据融入高斯泼溅,提升图像去模糊和新视角合成的质量。实验表明,其在合成和真实数据集上均能生成视觉效果出色的渲染结果,且训练和渲染速度更快(140 FPS)。

Comments 7 pages, Accepted at ICIP 2024

详情
AI中文摘要

事件相机以高动态范围、无运动模糊和低能耗著称,这些特性使其近来获得了广泛应用。过去几年,基于事件的3D重建领域取得了显著进展,其中基于神经辐射场(NeRF)的方法展示了照片级真实的视角合成结果。然而,NeRF的体渲染范式需要大量的训练和渲染时间。在本文中,我们提出了事件增强的高斯泼溅(E2GS),这是一种将事件数据融入高斯泼溅的新方法,而高斯泼溅最近在新视角合成领域取得了显著进展。E2GS有效利用模糊图像和事件数据,显著改善了图像去模糊效果,并实现了高质量的新视角合成。我们在合成和真实世界数据集上的全面实验表明,E2GS能够生成视觉效果出色的渲染结果,同时训练和渲染速度更快(140 FPS)。我们的代码可在https://github.com/deguchihiroyuki/E2GS上获得。

英文摘要

Event cameras, known for their high dynamic range, absence of motion blur, and low energy usage, have recently found a wide range of applications thanks to these attributes. In the past few years, the field of event-based 3D reconstruction saw remarkable progress, with the Neural Radiance Field (NeRF) based approach demonstrating photorealistic view synthesis results. However, the volume rendering paradigm of NeRF necessitates extensive training and rendering times. In this paper, we introduce Event Enhanced Gaussian Splatting (E2GS), a novel method that incorporates event data into Gaussian Splatting, which has recently made significant advances in the field of novel view synthesis. Our E2GS effectively utilizes both blurry images and event data, significantly improving image deblurring and producing high-quality novel view synthesis. Our comprehensive experiments on both synthetic and real-world datasets demonstrate our E2GS can generate visually appealing renderings while offering faster training and rendering speed (140 FPS). Our code is available at https://github.com/deguchihiroyuki/E2GS.

2205.13524 2026-05-21 cs.CV cs.GR 版本更新

PREF: Phasorial Embedding Fields for Compact Neural Representations

PREF:用于紧凑神经表示的相量嵌入场

Binbin Huang, Xinhao Yan, Anpei Chen, Shenghua Gao, Jingyi Yu

发表机构 * ShanghaiTech University(上海科技大学) ETH Zürich(苏黎世联邦理工学院)

AI总结 本文提出了一种高效的基于频率的神经表示PREF,通过引入比以往傅里叶特征映射或位置编码覆盖更宽频谱的相量体,并结合快速傅里叶变换和局部插值来加速傅里叶映射,从而减少基于频率的表示中开销高昂的MLP,提升效率和可解释性。

详情
AI中文摘要

我们提出了一种高效的基于频率的神经表示,称为PREF:一种辅以相量体的浅层MLP,其覆盖的频谱明显宽于以往的傅里叶特征映射或位置编码。其核心是我们紧凑的3D相量体,其中频率沿2D平面均匀分布,并沿1D轴扩张。为此,我们开发了一种专门且高效的傅里叶变换,结合快速傅里叶变换和局部插值来加速朴素的傅里叶映射。我们还引入了Parseval正则化器以稳定基于频率的学习。通过这些方式,PREF减少了基于频率的表示中开销高昂的MLP,从而显著缩小了它与其他混合表示之间的效率差距,并提高了其可解释性。全面的实验表明,PREF能够捕捉高频细节,同时保持紧凑和鲁棒,实验任务包括2D图像泛化、3D有符号距离函数回归和5D神经辐射场重建。

英文摘要

We present an efficient frequency-based neural representation termed PREF: a shallow MLP augmented with a phasor volume that covers significantly broader spectra than previous Fourier feature mapping or Positional Encoding. At the core is our compact 3D phasor volume where frequencies distribute uniformly along a 2D plane and dilate along a 1D axis. To this end, we develop a tailored and efficient Fourier transform that combines both Fast Fourier transform and local interpolation to accelerate naïve Fourier mapping. We also introduce a Parseval regularizer that stabilizes frequency-based learning. In these ways, our PREF reduces the costly MLP in the frequency-based representation, thereby significantly closing the efficiency gap between it and other hybrid representations, and improving its interpretability. Comprehensive experiments demonstrate that our PREF is able to capture high-frequency details while remaining compact and robust, including 2D image generalization, 3D signed distance function regression and 5D neural radiance field reconstruction.

2605.20538 2026-05-21 cs.CV 版本更新

Continual Segmentation under Joint Nonstationarity

联合非平稳性下的持续分割

Prashant Pandey, Himanshu Kumar, Devineni Sri Venkatraya Chowdary, Brejesh Lall

发表机构 * Bharti School of Telecommunications Technology & Management, Indian Institute of Technology, Delhi, India Yardi School of Artificial Intelligence, Indian Institute of Technology, Delhi, India Department of Computer Science Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, India

AI总结 本文研究了联合非平稳性条件下的持续语义分割问题,提出了一种结合梯度自适应稳定机制和半监督学习的方法,以应对分布漂移下少样本监督带来的不稳定和过拟合问题,并在多种场景下验证了方法的有效性。

详情
AI中文摘要

不断演化的数据流在持续语义分割中引入了联合非平稳性,即语义类别、输入分布和监督可用性随时间同时变化。这种设置反映了实际的结构化预测系统,但在以往的持续学习工作中基本未被探索,这些工作通常孤立地研究上述因素。我们形式化了类别、领域和标签漂移相互耦合情况下的持续分割,并研究了在标注有限、未标注数据丰富的异构密集预测环境中的学习。为了解决分布漂移下少样本监督带来的不稳定和过拟合问题,我们引入了梯度自适应稳定机制,这是一种通过梯度缩放的随机扰动实现的逐参数正则化机制,能够实现有原则的稳定性-可塑性权衡。我们进一步通过半监督学习利用未标注数据,并引入原型锚定监督,通过联合置信度和原型一致性来验证伪标签。这些机制共同使联合非平稳性下的持续分割学习成为可能。在类别增量、领域增量和少样本场景中的广泛实证评估表明,在异构结构化预测设置中,该方法相比现有方法取得了一致的改进。我们的结果揭示了现有持续分割方法的根本失败模式,并为在动态演变环境中学习鲁棒的密集预测器提供了见解。

英文摘要

Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few-shot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability-plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.
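
Validating pseudo-labels "via joint confidence and prototype consistency" can be read as a two-part gate. The following is a minimal sketch of one such gate, assuming a softmax confidence threshold and a nearest-prototype check in Euclidean feature space; the function name, the threshold `tau`, and the distance choice are hypothetical and are not taken from the paper.

```python
import math


def validate_pseudo_label(probs, feature, prototypes, tau=0.9):
    """Return the pseudo-label if it passes both checks, else None.

    probs:      class probabilities predicted for one pixel or region
    feature:    its feature vector
    prototypes: one prototype vector per class, indexed by class id
    tau:        assumed confidence threshold (illustrative value)
    """
    label = max(range(len(probs)), key=lambda k: probs[k])
    if probs[label] < tau:                       # confidence check
        return None
    nearest = min(
        range(len(prototypes)),
        key=lambda k: math.dist(feature, prototypes[k]),
    )
    return label if nearest == label else None   # prototype-consistency check


prototypes = [[0.0, 0.0], [1.0, 1.0]]
accepted = validate_pseudo_label([0.05, 0.95], [0.9, 1.1], prototypes)
rejected_by_prototype = validate_pseudo_label([0.05, 0.95], [0.1, 0.0], prototypes)
rejected_by_confidence = validate_pseudo_label([0.40, 0.60], [1.0, 1.0], prototypes)
```

Requiring both checks means a confident prediction is still discarded when its feature sits closer to another class prototype, which is one way to limit error accumulation when labels are scarce.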

2605.20536 2026-05-21 cs.CV 版本更新

HADS-Net: A Hybrid Attention-Augmented Dual-Stream Network with Physics-Informed Augmentation for Breast Ultrasound Image Classification

HADS-Net:结合物理信息增强的混合注意力增强双流网络,用于乳腺超声图像分类

Chinedu Emmanuel Mbonu, Blessing Nwamaka Iduh, Joseph Ikechukwu Odo, Doris Chinedu Asogwa

发表机构 * NedumCares

AI总结 本文提出HADS-Net,一种混合注意力增强的双流网络,通过两条并行路径利用全局纹理和局部边界线索,并结合物理信息增强,以提高乳腺超声图像分类的准确性。

Comments 7 pages, 4 figures

详情
AI中文摘要

将乳腺超声图像准确分类为良性、恶性和正常三类是一项关键的临床任务,但斑点噪声、声影和类间视觉模糊使其变得困难。现有的深度学习方法依赖单流架构和通用数据增强,忽略了超声成像的采集物理特性,并且此前没有方法为病变边界特征设置专门的分支,而病变边界被认为是最具诊断意义的视觉线索。我们提出了HADS-Net,一种混合注意力增强的双流网络,通过两条并行路径利用全局纹理和局部边界线索。流1先应用物理信息增强来模拟斑点噪声、声影和增益变化,再通过预训练的EfficientNet-B3提取特征并投影到512维空间。流2提取Sobel边缘图,经轻量级CNN处理后投影到相同的512维空间。交叉注意力融合模块允许纹理流有选择地查询边界特征,生成联合优化的表示,再由使用自适应类别加权焦点损失训练的MLP进行分类。实验采用五折分层交叉验证,以余弦退火训练50个周期,按最低验证损失选出全局最佳检查点,并在留出的测试集上评估。在BUSI数据集上,HADS-Net实现了96.58%的准确率,宏ROC-AUC为0.9978,宏F1为0.9654,良性、恶性和正常类别的F1分数分别为0.970、0.951和0.976。没有恶性病变被误分类为正常。这些结果证实,模态特定的增强与跨模态注意力融合相结合,是基于超声的乳腺癌诊断的有效策略。

英文摘要

Accurate classification of breast ultrasound images into benign, malignant, and normal categories is a critical clinical task complicated by speckle noise, acoustic shadowing, and inter-class visual ambiguity. Existing deep learning methods rely on single-stream architectures with generic augmentation that ignores ultrasound acquisition physics, and no prior method dedicates a stream to the lesion boundary features identified as the most diagnostically significant visual cue. We propose HADS-Net, a Hybrid Attention-Augmented Dual-Stream Network exploiting global texture and local boundary cues through two parallel pathways. Stream 1 applies physics-informed augmentation simulating speckle noise, acoustic shadowing, and gain variation before extracting features via pretrained EfficientNet-B3 projected to 512 dimensions. Stream 2 extracts Sobel edge maps processed by a lightweight CNN projected to the same 512-dimensional space. A cross-attention fusion module allows the texture stream to selectively query boundary features, producing a jointly optimised representation classified by an MLP trained with adaptive class-weighted focal loss. Five-fold stratified cross-validation with cosine annealing over 50 epochs is used, with the globally best checkpoint selected by lowest validation loss evaluated on a held-out test set. On the BUSI dataset, HADS-Net achieves 96.58% accuracy, macro ROC-AUC of 0.9978, macro F1 of 0.9654, and per-class F1-scores of 0.970, 0.951, and 0.976 for benign, malignant, and normal. No malignant lesion is misclassified as normal. These results confirm that modality-specific augmentation with cross-modal attention fusion is an effective strategy for ultrasound-based breast cancer diagnosis.
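
Two ingredients named in the abstract, Sobel edge maps and focal loss, have standard textbook forms that can be sketched without any framework. The code below is a plain-Python illustration under stated assumptions: zero-valued borders, gradient magnitude as the edge map, and fixed `gamma` and `alpha` defaults. The paper's adaptive class weighting and its preprocessing are not specified in the abstract and are not reproduced here.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Gradient magnitude of a 2D grayscale image; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            out[r][c] = math.hypot(gx, gy)
    return out


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for the probability assigned to the true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# A 5x5 image with a vertical step edge between columns 1 and 2.
step = [[0, 0, 1, 1, 1] for _ in range(5)]
edges = sobel_magnitude(step)
```

The edge map responds only next to the intensity step, and the focal loss down-weights well-classified samples (a probability of 0.9 contributes far less than 0.5), which is why it is commonly paired with imbalanced classes.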

2605.20525 2026-05-21 cs.CV cs.AI cs.CL cs.LG eess.IV 版本更新

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

NeuroQA:面向3D脑部MRI理解的大规模图像接地基准

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出NeuroQA,一个大规模的3D脑部MRI视觉问答基准,包含来自12977名受试者的56953个问答对,涵盖5-104岁及五个临床领域,通过3D体积评估11种临床推理技能,并提供可复现的生成脚本和在线排行榜。

Comments 30 pages, dataset and benchmark release

详情
AI中文摘要

我们提出了NeuroQA,一个面向3D脑部磁共振成像(MRI)视觉问答的大规模基准,包含来自12个数据集、12,977名受试者的56,953个问答对。它覆盖5至104岁的年龄范围和五个临床领域:阿尔茨海默病、帕金森病、肿瘤、白质疾病和神经发育。与以往基于2D切片或依赖狭窄诊断标签的医学视觉问答(VQA)工作不同,NeuroQA为每个条目配上完整的3D体数据。它以是/否、多项选择和开放式三种题型评估11种有临床依据的推理技能。在203个模板中,131个是图像接地的(可借助三平面查看器作答),72个是图像知情的(真值来自定量体积测量或临床量表)。为消除纯文本捷径,我们进行了答案分布优化,将封闭题型的纯文本准确率从>80%降至44.6%;图像必要性则通过随基准一同发布的图像接地协议单独评估。一个包含38条规则的确定性流程和两轮专家审查,对照FreeSurfer测量值、元数据或放射学报告字段核验每个问答对,跨模板的同受试者矛盾为零。我们还开展了临床医生评估,由两名临床医生在三平面查看器上独立评估100个固定的测试条目。在封闭题型(是/否+多项选择)的公开测试条目上,最好的零样本视觉语言模型和有监督的3D CNN基线的准确率分别为47.5%和43.7%,均低于49.4%的纯文本多数模板下限。NeuroQA采用两级发布方式:对开放获取的数据集公开问答对,对受数据使用协议(DUAs)限制的数据集提供可复现的生成脚本,此外还提供受试者级划分、留出的私有测试集和在线排行榜。

英文摘要

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

2605.20510 2026-05-21 cs.CV cs.AI cs.CY 版本更新

ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

ShadeBench: 一个用于可持续社会建筑阴影模拟的基准数据集

Longchao Da, Mithun Shivakoti, Xiangrui Liu, T Pranav Kutralingam, Yezhou Yang, Hua Wei

发表机构 * School of Computing and Augmented Intelligence, Arizona State University(计算与增强智能学院,亚利桑那州立大学) Global Futures Laboratory, Arizona State University(全球未来实验室,亚利桑那州立大学)

AI总结 本文提出ShadeBench,一个用于城市阴影理解的综合数据集和基准,通过多模态数据支持阴影生成、分割和3D建筑重建,并提供标准化评估协议和基线方法,为数据驱动的城市气候研究和热适应城市规划提供基础。

Comments 12 pages, 13 figures, 2 tables. Accepted by KDD 2026 AI for Sciences Track

详情
AI中文摘要

由于城市热岛效应的加剧,城市热暴露问题变得越来越严峻。细粒度的阴影模式,尤其是由建筑物引起的阴影,强烈影响行人热暴露和户外活动规划。然而,大规模准确建模和分析城市阴影仍然困难,因为缺乏大规模数据集和系统评估框架。为了解决这一挑战,我们提出了ShadeBench,一个全面的城市阴影理解数据集和基准。ShadeBench包含地理多样的城市场景,具有时间变化的模拟阴影地图和文本描述,以及对齐的卫星图像、建筑骨架表示和3D建筑网格。基于此多模态数据集,ShadeBench支持一系列下游任务,包括阴影生成、阴影分割和3D建筑重建。我们进一步建立了这些任务的标准评估协议和基线方法。通过使大规模和细粒度的阴影分析成为可能,ShadeBench为数据驱动的城市气候研究提供了基础,并支持未来在热适应城市规划和决策中的研究。代码和数据集可在https://darl-genai.github.io/shadebench/上公开获取。

英文摘要

Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at https://darl-genai.github.io/shadebench/.

2605.20502 2026-05-21 cs.LG cs.AI cs.CV stat.AP stat.ML 版本更新

Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

面向多编码器分布外检测的表示空间扩散模型Tippett最小值融合

Neelkamal Bhuyan

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出了一种多编码器融合的表示空间扩散模型方法,仅凭ID数据统计分析每个编码器对特定分布偏移类型的敏感性,并引入EncMin2L门控机制,无需OOD标签,以2.3倍更低的参数成本优于整体式多编码器基线,并在四种分布偏移类型上同时达到0.94以上的AUROC。

Comments 14 pages

详情
AI中文摘要

我们通过对各编码器的表示空间扩散模型(RDMs)进行多编码器融合,来解决覆盖完整分布偏移谱的分布外(OOD)检测问题,这些偏移包括全局域变化、语义差异、纹理差异和协变量损坏。我们仅凭ID数据就以统计方式识别每个编码器对特定偏移类型的敏感性,并引入EncMin2L:一种与编码器无关的两级min(⋅)门控,它无需OOD标签即可组合并校准各编码器基于扩散的似然检测器,以2.3倍更低的参数成本优于整体式多编码器基线。两种ID数据诊断量,η²(类条件F检验)和Δμ(合成损坏下的对数似然偏移),用于量化编码器的专长,而Tippett最小p值组合将各编码器得分聚合为单一的、校准稳定的OOD信号。EncMin2L在全部四种偏移类型上同时达到≥0.94的AUROC,在重叠的基准上优于最先进的表示空间扩散OOD检测器。

英文摘要

We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate corruptions -- through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L -- an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at 2.3× lower parameter cost. Two ID-data diagnostics -- $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions) -- quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves ≥0.94 AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.
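
Tippett's method itself is a classical rule: given k independent p-values, the combined p-value is 1 - (1 - p_min)^k. The sketch below pairs it with a simple empirical p-value computed from in-distribution scores. It illustrates the combination rule only, under an independence assumption; the helper names are hypothetical, and the paper's two-level EncMin2L gate and calibration procedure are not reproduced.

```python
def empirical_p_value(score, id_scores):
    """P-value of an anomaly score against held-out in-distribution scores.

    Assumes higher scores mean "more anomalous"; the +1 terms avoid p = 0.
    """
    exceed = sum(1 for s in id_scores if s >= score)
    return (1 + exceed) / (1 + len(id_scores))


def tippett_combine(p_values):
    """Tippett's minimum-p combination for k independent p-values."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k


# One p-value per encoder; the smallest one drives the fused decision.
per_encoder_p = [0.01, 0.4, 0.9]
fused = tippett_combine(per_encoder_p)
single = tippett_combine([0.5])
p_emp = empirical_p_value(5.0, [1.0, 2.0, 3.0])
```

Taking the minimum lets whichever encoder is most sensitive to a given shift type flag the sample, while the exponent k corrects for having looked at several detectors.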

2605.20496 2026-05-21 q-bio.NC cs.CV 版本更新

Platonic Representations in the Human Brain: Unsupervised Recovery of Universal Geometry

人类大脑中的柏拉图表示:无监督恢复通用几何

Pablo Marcos-Manchón, Rishi Jha, Lluís Fuentemilla

发表机构 * Department of Cognition, Development and Education Psychology, University of Barcelona(巴塞罗那大学认知、发展与教育心理学系) Institute of Neurosciences, University of Barcelona(巴塞罗那大学神经科学研究所) Department of Computer Science, Cornell University(康奈尔大学计算机科学系) Bellvitge Institute for Biomedical Research, Spain(西班牙贝尔维特格生物医学研究所)

AI总结 该研究探讨了人类大脑是否能无监督地恢复通用几何结构,通过自监督编码器在fMRI数据中学习个体特定的嵌入表示,并证明这些表示可以通过几何变换在不同个体间转换。

Comments Code available at https://github.com/memory-formation/platonic-representations-fmri

详情
AI中文摘要

强柏拉图表示假说提出,人工神经网络中的表征收敛可以被积极利用:嵌入可以通过一个通用潜在空间在不同模型间转换,而无需配对数据。我们探讨是否可以在人类大脑中恢复类似的几何结构。使用自然场景数据集的fMRI数据,我们提出了一种自监督编码器,通过利用重复的刺激呈现,仅依靠脑数据学习个体特定的嵌入表示。我们证明这些独立学习的空间可以通过无监督的正交旋转在不同个体间转换,而无需配对的跨个体样本或中间模型表示。将成对旋转同步到一个共享的潜在空间进一步提高了跨个体检索效果,表明个体特定的空间与一个共同的坐标系统相互兼容。这些结果为人类视觉皮层中的共享神经几何提供了证据:个体特定的fMRI表示在不同个体间近似等距,并可通过纯粹的几何变换进行转换。

英文摘要

The Strong Platonic Representation Hypothesis suggests that representational convergence in artificial neural networks can be harnessed constructively: embeddings can be translated across models through a universal latent space without paired data. We ask whether an analogous geometry can be recovered across human brains. Using fMRI data from the Natural Scenes Dataset, we propose a self-supervised encoder that learns subject-specific embeddings from brain data alone by exploiting repeated stimulus presentations. We show that these independently learned spaces can be translated across subjects using unsupervised orthogonal rotations, without paired cross-subject samples or intermediate model representations. Synchronizing pairwise rotations into a single shared latent space further improves cross-subject retrieval, indicating that subject-specific spaces are mutually compatible with a common coordinate system. These results provide evidence for a shared neural geometry in the human visual cortex: subject-specific fMRI representations are approximately isometric across individuals and can be translated through purely geometric transformations.
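
The claim that subject spaces are "approximately isometric" and related by orthogonal rotations rests on a simple geometric fact: an orthogonal map preserves all pairwise distances. The 2D sketch below only demonstrates that fact; the embeddings and the angle are made-up illustrative values, and the paper's unsupervised procedure for finding the rotation is not shown.

```python
import math


def rotate(points, theta):
    """Apply a 2D rotation (an orthogonal map) to every point."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points):
    """Full matrix of Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]


# Hypothetical embeddings of four stimuli in "subject A" space.
subject_a = [(0.0, 0.0), (1.0, 0.5), (-0.3, 2.0), (2.2, -1.1)]
# "Subject B" sees the same geometry in a rotated coordinate system.
subject_b = rotate(subject_a, theta=1.234)

dist_a = pairwise_distances(subject_a)
dist_b = pairwise_distances(subject_b)
max_gap = max(
    abs(dist_a[i][j] - dist_b[i][j])
    for i in range(len(subject_a))
    for j in range(len(subject_a))
)
```

Since the distance matrices coincide, the two spaces carry the same relational structure, and the only thing left to recover is the rotation itself.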

2605.20495 2026-05-21 cs.CV 版本更新

A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models

一种用于显微镜视觉-语言模型中高效提示选择的人机协作框架

Abhiram Kandiyana, Ankur Mali, Lawrence O. Hall, Peter R. Mouton, Dmitry Goldgof

发表机构 * University of South Florida(南佛罗里达大学) SRC Biosciences(SRC生物科学公司)

AI总结 本文提出了一种人机协作框架,通过目标驱动的主动学习方法解决显微镜视觉-语言模型中提示集构建的问题,减少专家验证图像的数量,提高分类性能。

Comments Accepted to CVPR workshops, 2026

详情
AI中文摘要

用于显微镜图像分类的深度学习流程通常需要昂贵且耗时费力的专家标注,才能获得高质量的训练真值。最近的研究表明,对视觉-语言模型(VLMs)进行提示调优可以减少手动标注:构建一个由专家验证的图像-描述示例组成的小型提示集,并在推理时将其作为少样本上下文复用,对所有剩余图像进行分类。为了进一步减少工作量,可以让VLM为候选示例起草描述,再由专家验证并稍作编辑,而不必从头撰写文本。然而,仍有两个实际问题未得到解决:(1)应优先验证哪些未标注图像?(2)需要多少经过验证的示例才能达到性能目标?在本文中,我们将提示集构建形式化为目标驱动的主动学习问题,以确定优先标注哪些图像,从而回答这两个问题。我们在严格的低资源约束和较小的未标注池条件下研究了三种互补的选择标准。实验表明,与随机选择相比,我们的方法用少得多的专家验证图像即可达到目标性能,平均最少只需20张标注图像就能达到100%的测试准确率。更广泛地说,我们的人在回路框架展示了生成式AI在生物医学图像分析中以人为本的用法:专家持续积极参与模型输出的验证和完善,同时标注成本显著降低。代码和数据将公开发布。

英文摘要

Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.
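
The abstract does not name its three selection criteria. As an illustration of what "prioritizing which images to annotate" can look like, the sketch below uses one common active-learning heuristic, predictive entropy, to pick the images the model is least sure about. The criterion, the function names, and the toy pool are assumptions, not the paper's method.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_verification(pool, budget):
    """Return the `budget` image ids whose predictions are most uncertain.

    pool maps an image id to the model's class probabilities for that image.
    """
    ranked = sorted(pool, key=lambda name: entropy(pool[name]), reverse=True)
    return ranked[:budget]


# Hypothetical unlabeled pool with three-class predictions.
pool = {
    "img_a": [0.98, 0.01, 0.01],   # confident, low priority
    "img_b": [0.34, 0.33, 0.33],   # nearly uniform, highest priority
    "img_c": [0.60, 0.30, 0.10],
}
chosen = select_for_verification(pool, budget=2)
```

Sending only the most uncertain images to the expert is one way a small verification budget can be spent where it changes the prompt set the most.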

2605.20479 2026-05-21 cs.CV cs.LG 版本更新

Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising

用于基于模型的图像去噪中超参数预测的Oracle监督转移

Jianmin Liao, Lixin Shen, Yuesheng Xu

发表机构 * Department of Mathematics, Syracuse University(雪城大学数学系) Department of Mathematics & Statistics, Old Dominion University(欧道明大学数学与统计系)

AI总结 该研究提出HyperDn,一种单配置条件预测器,通过聚合源配置的Oracle监督,预测新的去噪器-噪声配置的异质超参数,展示了在跨范式实验中,从相对便宜的TV/TGV变分源转移到更昂贵的扩散模型DiffPIR时,通过少量或无目标Oracle标签实现接近Oracle性能的成果。

详情
AI中文摘要

超参数预测是基于模型的图像去噪器面临的关键实际瓶颈,这类去噪器涵盖从经典的TV/TGV变分求解器到DiffPIR等现代基于扩散的模型。尽管现有的学习型预测器可以达到接近Oracle的性能,但这种方法的可扩展性很差:每个新配置通常都需要自己的Oracle标注训练集,而每个标签都需要针对干净真值进行分层网格搜索才能得到。因此,我们要问:在源配置上收集的Oracle监督,能否在目标Oracle标签很少甚至没有的情况下迁移到目标配置?我们提出了HyperDn,一个以配置为条件的单一预测器,它汇集各源配置的Oracle监督,为新的去噪器-噪声配置预测异构超参数。在一项跨范式实验中,HyperDn从相对廉价的TV/TGV变分源迁移到开销更高的基于扩散的DiffPIR。仅用2个目标Oracle标签,它就达到30.23 dB,与Oracle相差不到0.90 dB,并且优于从头训练、每个配置使用64个标签的预测器,所用目标标签数仅为该基线的1/32。在完全没有目标Oracle标签的情况下,HyperDn在两种由已见噪声类型构成的未见混合噪声上,以及从相对廉价的96×96源图像迁移到512×768目标图像时,也达到了接近Oracle的PSNR。这些结果共同表明,用于超参数预测的昂贵Oracle监督可以从源配置迁移到新的目标配置,从而减少为每个新去噪配置重新构建Oracle标签的需求。

英文摘要

Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser-noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only 2 target oracle labels, it reaches 30.23 dB, within 0.90 dB of the oracle, and outperforms the 64-label per-configuration predictor trained from scratch while using 1/32 as many target labels as that baseline. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap 96×96 source images to 512×768 targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.
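
To make concrete why oracle labels are expensive, the sketch below produces one oracle label by a coarse-to-fine grid search that maximizes PSNR against clean ground truth. Everything here is a toy stand-in: a 1D signal, a neighbour-averaging "denoiser" with a single strength `lam`, and arbitrary grid sizes. It is not the TV/TGV or DiffPIR setup from the paper.

```python
import math


def psnr(clean, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for a perfect estimate."""
    mse = sum((c - e) ** 2 for c, e in zip(clean, estimate)) / len(clean)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def smooth(noisy, lam):
    """Toy 1D denoiser: blend each sample with the mean of its neighbours."""
    n = len(noisy)
    return [
        (1.0 - lam) * noisy[i]
        + lam * 0.5 * (noisy[max(i - 1, 0)] + noisy[min(i + 1, n - 1)])
        for i in range(n)
    ]


def oracle_label(clean, noisy, lo=0.0, hi=1.0, points=5, levels=3):
    """Coarse-to-fine grid search for the lam with the highest PSNR."""
    lo0, hi0 = lo, hi
    best_lam, best_score = lo, -math.inf
    for _ in range(levels):
        step = (hi - lo) / (points - 1)
        for k in range(points):
            lam = lo + k * step
            score = psnr(clean, smooth(noisy, lam))   # one denoiser run per grid point
            if score > best_score:
                best_lam, best_score = lam, score
        # Zoom in around the current best value, staying inside the original range.
        lo, hi = max(best_lam - step, lo0), min(best_lam + step, hi0)
    return best_lam, best_score


clean = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
noisy = [c + d for c, d in zip(clean, [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1])]
best_lam, best_psnr = oracle_label(clean, noisy)
baseline_psnr = psnr(clean, smooth(noisy, 0.0))
```

Each label costs `points * levels` denoiser runs and needs the clean image, so for a slow diffusion-based denoiser the labelling budget quickly dominates, which is the cost that transferring supervision across configurations aims to avoid.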

2605.20476 2026-05-21 cs.CV 版本更新

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

告别漂移:用于长时视频到视频生成的锚定树采样

Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He

发表机构 * Descript, Inc.(Descript公司)

AI总结 本文提出了一种名为锚定树采样的方法,通过减少关键路径步骤来解决长时视频生成中的漂移问题,并在静态相机模式下实现了稳定且高质量的视频生成。

Comments 30 pages, 23 figures

详情
AI中文摘要

长时视频生成面临两个交织的问题。首先,漂移问题,即视频质量随时间下降。其次,连续性问题,表现为物体永久性问题或不当渲染瞬态内容(例如,出现在非连续帧中的物体颜色/风格变化)。最近的工作集中在自回归蒸馏技术上,旨在同时解决这两个问题。我们选择专注于漂移问题,并引入锚定树采样(ATS):一种无训练的推理时间调度器,用稀疏到密集、锚定范围内的填补方法替代从左到右的滚动。根调用在全时间范围内生成稀疏锚点,递归细化生成中间锚点,最终叶跨度在相邻锚点之间合成。这将关键路径从K个连续滚动步骤减少到L+1个树状步骤,并将时间累积漂移转换为锚定范围内的漂移。我们专注于静态相机模式下的V2V生成,其中稀疏锚点在时间范围内可由密集条件信号近似,且基础模型可在不重新训练的情况下生成它们。我们在Wan 2.1 + VACE上评估了ATS,针对五种条件模式(修复、扩展、边缘、姿态、深度)。我们证明ATS在整体质量和漂移防止方面均优于两个竞争对手。此外,我们还展示了在LTX-2.3上稳定生成至少40分钟的视频。最后,我们提出了一条路径,将ATS扩展到任意长的T2V生成,以及动态相机和多镜头模式。

英文摘要

Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce Anchored Tree Sampling (ATS): a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable ≥40-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.
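
The sparse-to-dense schedule described above can be sketched as a list of stages, where everything inside one stage could be generated in parallel. This is a hypothetical illustration rather than the authors' scheduler: it assumes the horizon is a multiple of the root stride, that strides halve at each refinement level, and that the function and parameter names are free choices.

```python
def tree_schedule(horizon, root_stride, leaf_span):
    """Return the frame indices produced at each sequential stage.

    Stage 0 places sparse root anchors, later stages halve the stride to add
    intermediate anchors, and the last stage fills the remaining leaf frames.
    Assumes root_stride = leaf_span * 2**m and horizon % root_stride == 0.
    """
    stride = root_stride
    stages = [list(range(0, horizon + 1, stride))]      # root anchors
    known = set(stages[0])
    while stride > leaf_span:
        stride //= 2
        new = [i for i in range(0, horizon + 1, stride) if i not in known]
        stages.append(new)                               # intermediate anchors
        known.update(new)
    stages.append([i for i in range(horizon + 1) if i not in known])  # leaf spans
    return stages


def rollout_steps(horizon, leaf_span):
    """Sequential steps needed by a left-to-right rollout of the same spans."""
    return -(-horizon // leaf_span)                      # ceiling division


stages = tree_schedule(horizon=64, root_stride=32, leaf_span=8)
tree_depth = len(stages)
sequential = rollout_steps(horizon=64, leaf_span=8)
```

In this toy setting the tree needs 4 sequential stages where a left-to-right rollout needs 8, and every leaf span is bounded by two anchors, so errors cannot compound across the whole horizon.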

2605.20470 2026-05-21 cs.CV cs.AI physics.med-ph 版本更新

EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

EPC-3D-Diff:用于CBCT到CT合成的等变物理一致条件3D潜在扩散

Alzahra Altalib, Chunhui Li, Haytham Al Ewaidat, Khaled Alawneh, Ahmad Qendel, Alessandro Perelli

Affiliations: School of Science and Engineering, University of Dundee, UK; Faculty of Applied Sciences, Jordan University of Science and Technology; Experia Healthcare, Jordan; School of Cardiovascular and Metabolic Health, University of Glasgow, UK

AI summary: This paper proposes EPC-3D-Diff, a new conditional 3D latent diffusion framework for volumetric CBCT-to-CT synthesis that improves physical consistency by introducing a projection-domain equivariance loss derived from imaging physics. During training, the method forward-projects rotated synthesized CT volumes and matches them to the correspondingly angle-shifted projections, integrating a physics-consistent equivariance constraint into the diffusion objective.

Comments 10 pages, 4 figures

Details
Abstract

Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT-to-CT synthesis that introduces a projection-domain equivariance loss derived from acquisition physics. Unlike common image-domain equivariance, we exploit the fact that an in-plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward-projecting rotated synthesized CT volumes and matching them to appropriately angle-shifted projections of the paired target CT, yielding a physics-consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in-plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and on paired clinical data using patient-wise splits, and perform single- and mixed-domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieves substantial PSNR improvements over state-of-the-art methods, +7.4 dB (phantom) and +1.8 dB (clinical data), alongside improved SSIM and HU accuracy within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU-aware synthesis for downstream radiotherapy workflows.
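
The rotation-to-shift relationship that the abstract relies on can be written out for the simplified case of a 2D parallel-beam Radon transform. This is only a sketch under that assumption: the cone-beam geometry used in practice is more involved, and the loss form and the symbols $\mathcal{R}$, $f_\phi$, $S_\phi$, $\hat{x}$ and $x$ are illustrative choices, not notation taken from the abstract.

```latex
\begin{align}
% Parallel-beam projection of a 2D slice f at angle \theta and detector offset s
(\mathcal{R}f)(\theta, s) &= \int_{-\infty}^{\infty}
  f\bigl(s\cos\theta - t\sin\theta,\; s\sin\theta + t\cos\theta\bigr)\,dt \\
% Rotate the slice by \phi: f_\phi(\mathbf{x}) = f(R_{-\phi}\mathbf{x}).
% The projections are then shifted along the angular axis:
(\mathcal{R}f_\phi)(\theta, s) &= (\mathcal{R}f)(\theta - \phi,\, s) \\
% One hypothetical way to penalise violations, with synthesized volume \hat{x},
% target CT x, and angular shift operator (S_\phi p)(\theta, s) = p(\theta - \phi, s):
\mathcal{L}_{\mathrm{eq}} &= \bigl\lVert \mathcal{R}(\hat{x}_\phi) - S_\phi\,\mathcal{R}(x) \bigr\rVert_2^2
\end{align}
```

The second line follows because rotating the unit vector $(\cos\theta, \sin\theta)$ by $-\phi$ gives $(\cos(\theta-\phi), \sin(\theta-\phi))$, so each line integral of $f_\phi$ at angle $\theta$ is a line integral of $f$ at angle $\theta - \phi$.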

2605.20469 2026-05-21 cs.CV Version update

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

Haoyu Wang, Zitong Li

Affiliations: Department of Biostatistics & Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London

AI summary: This paper introduces the HalluCXR benchmark, which evaluates six architecturally diverse vision-language models on 856 stratified MIMIC-CXR chest radiographs. It finds that 61.9% to 82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%, and proposes mitigation strategies built on a hallucination taxonomy, a detection pipeline, and model ensembling.

Details
Abstract

Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9% to 82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.
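
Two of the numbers above can be made concrete with a short sketch. The first part checks the evaluation count (6 models, 856 radiographs, 3 query types). The second shows one standard way to compute an AUC when response length is the only risk score. The helper name `auc_from_scores` and the toy lengths and labels are invented for illustration and are not data from the benchmark.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random positive example scores higher
    than a random negative one (ties count as half a win)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Benchmark size stated in the abstract.
n_models, n_radiographs, n_query_types = 6, 856, 3
total_evaluations = n_models * n_radiographs * n_query_types

# Toy example (made-up values): response length in tokens, and whether the
# response was judged to contain a hallucination.
response_lengths = [40, 55, 60, 120, 150, 200]
hallucinated = [0, 0, 1, 0, 1, 1]
length_auc = auc_from_scores(response_lengths, hallucinated)

print(total_evaluations, round(length_auc, 3))
```

An AUC near 0.5 would mean length carries no signal; values closer to 1 mean longer responses are more often the hallucinated ones.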

2605.20461 2026-05-21 cs.CV Version update

Understanding Model Behavior in Monocular Polyp Sizing

Xinqi Xiong, Andrea Dunn Beltran, Junmyeong Choi, Sarah K. McGill, Marc Niethammer, Roni Sengupta

Affiliations: University of North Carolina at Chapel Hill; University of California, San Diego

AI summary: This paper presents a diagnostic audit of binary polyp size classification (≤5 mm vs. >5 mm) across multi-center datasets and several model families. Performance is fairly consistent across architectures and input modalities, suggesting that models rely on cues correlated with examination behavior rather than true metric scale. The audit also quantifies the potential improvement from perfect scale information and the limited gains from current depth estimation and global calibration.

Details
Abstract

Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.

2605.20459 2026-05-21 cs.CV cs.AI Version update

Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

Affiliations: Department of Computer & Software Engineering, National University of Sciences & Technology, Islamabad, Pakistan; School of Computing & Information Systems, University of Melbourne, Melbourne, Australia

AI summary: By combining four deep learning architectures with six pre-trained encoders, this paper evaluates lesion prediction on COVID-19 CT images and finds that deep learning yields precise and efficient segmentation: binary segmentation reaches an F1-score of 98%, and multi-class segmentation reaches F1-scores of 75% and 77% on two separate datasets.

Comments 7 pages, 6 figures, 4 tables

Details
Abstract

In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.
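
The experimental grid and the reported metric can be sketched in a few lines. The abstract does not say that every architecture was paired with every encoder, so the full 4 x 6 grid below is an assumption, and `f1_score` is a plain pixel-wise F1 (equivalent to Dice for binary masks) written for illustration rather than the authors' evaluation code.

```python
from itertools import product

ARCHITECTURES = ["Unet", "PSPNet", "Linknet", "FPN"]
ENCODERS = [
    "VGG 19", "DenseNet 121", "Inception ResNet V2",
    "MobileNet V2", "SeresNet 101", "EfficientNet B0",
]

# Assumed: every architecture is paired with every encoder.
configurations = list(product(ARCHITECTURES, ENCODERS))


def f1_score(predicted, target):
    """Pixel-wise F1 for flattened binary masks (1 = lesion, 0 = background)."""
    tp = sum(1 for p, t in zip(predicted, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, target) if p == 0 and t == 1)
    denominator = 2 * tp + fp + fn
    # Two empty masks agree perfectly, so score them as 1.0.
    return 1.0 if denominator == 0 else 2 * tp / denominator


print(len(configurations))
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```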

2605.20458 2026-05-21 cs.CV Version update

ELEMENT: Multi-Modal Retinal Vessel Segmentation Based on a Coupled Region Growing and Machine Learning Approach

Erick O. Rodrigues, Aura Conci, Panos Liatsis

AI summary: This paper proposes ELEMENT, a multi-modal retinal vessel segmentation framework that couples region growing with machine learning for feature extraction and pixel classification, improving segmentation consistency and throughput. Experiments show that it outperforms 25 of 26 existing methods on DRIVE and all compared methods on five other datasets.

Details
Journal ref
IEEE Journal of Biomedical and Health Informatics 2020
Abstract

Vascular structures in the retina contain important information for the detection and analysis of ocular diseases, including age-related macular degeneration, diabetic retinopathy and glaucoma. Commonly used modalities in diagnosis of these diseases are fundus photography, scanning laser ophthalmoscope (SLO) and fluorescein angiography (FA). Typically, retinal vessel segmentation is carried out either manually or interactively, which makes it time consuming and prone to human errors. In this research, we propose a new multi-modal framework for vessel segmentation called ELEMENT (vEsseL sEgmentation using Machine lEarning and coNnecTivity). This framework consists of feature extraction and pixel-based classification using region growing and machine learning. The proposed features capture complementary evidence based on grey level and vessel connectivity properties. The latter information is seamlessly propagated through the pixels at the classification phase. ELEMENT reduces inconsistencies and speeds up the segmentation throughput. We analyze and compare the performance of the proposed approach against state-of-the-art vessel segmentation algorithms in three major groups of experiments, for each of the ocular modalities. Our method produced higher overall performance, with an overall accuracy of 97.40%, compared to 25 of the 26 state-of-the-art approaches, including six works based on deep learning, evaluated on the widely known DRIVE fundus image dataset. In the case of the STARE, CHASE-DB, VAMPIRE FA, IOSTAR SLO and RC-SLO datasets, the proposed framework outperformed all of the state-of-the-art methods with accuracies of 98.27%, 97.78%, 98.34%, 98.04% and 98.35%, respectively.
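
For readers unfamiliar with region growing, the following is a minimal generic version: starting from a seed pixel, it repeatedly adds 4-connected neighbours whose grey level is close to the seed's. It is not the ELEMENT algorithm, which couples region growing with learned pixel classification; the function name, the fixed tolerance rule, and the toy image are assumptions made for illustration.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Return the set of (row, col) pixels 4-connected to `seed` whose grey
    level differs from the seed's by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Toy image: a bright, connected "vessel" on a dark background, plus one
# bright pixel in the corner that touches the vessel only diagonally.
toy_image = [
    [200, 200, 10, 10],
    [10, 200, 10, 10],
    [10, 200, 200, 10],
    [10, 10, 10, 190],
]
vessel = region_grow(toy_image, seed=(0, 0), tolerance=20)
print(sorted(vessel))
```

The corner pixel is left out because connectivity, not intensity alone, decides membership, which is the kind of evidence the abstract describes as complementary to grey level.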

2605.20445 2026-05-21 cs.CV cs.AI Version update

A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

Affiliations: Department of Computer & Software Engineering, National University of Sciences & Technology, Islamabad, Pakistan; School of Computing & Information Systems, University of Melbourne, Melbourne, Australia

AI summary: By comparing multiple deep learning architectures, this paper proposes a CNN-based computer-aided diagnosis system that distinguishes COVID-19 from normal lung images, reaching average accuracies of 95 to 98% on X-ray and CT datasets.

Comments 6 pages, 2 figures, 5 tables

Details
Abstract

COVID-19 was a significant challenge that led to the loss of numerous lives daily. The outbreak was not confined to a single country; the whole world suffered because of the coronavirus. Lung imaging with computed tomography (CT) and X-rays is the most useful tool for screening COVID-19 or any other pandemic disease. Artificial intelligence is increasingly replacing manual processes with automated systems that imitate human decision-making by learning from experience. Motivated by this, our work proposes convolutional neural network (CNN) based models for a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung images. We used two different sets of lung X-ray images and two different sets of CT scans, and performed classification with a variety of pre-trained networks: VGG (16, 19), DenseNet (121), ResNet (50, 50 V2, 101 V2), MobileNet (V2), Xception, Inception (V3, ResNet V2), EfficientNet (B0), and NASNet (Large). On the X-ray and CT image datasets, the ResNet and VGG architectures properly differentiated COVID-19 from normal images, with average accuracies of 95 to 98 percent. Our results on the classification datasets are competitive with, and superior to, previously reported findings in the literature.

2605.20405 2026-05-21 eess.IV cs.AI cs.CV physics.med-ph Version update

Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation

Iason Skylitsis, Dimitrios Karkalousos, Ivana Išgum

Affiliations: Amsterdam University Medical Center; University of Amsterdam; Informatics Institute, Faculty of Science; Department of Radiology; Mayo Clinic Rochester

AI summary: This paper adopts episodic sampling from few-shot learning to address class imbalance in medical image segmentation and, by disentangling the sampling strategy from the training budget, shows improved segmentation performance in the low-data regime.

Details
Abstract

Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as an under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at https://github.com/iasonsky/episodic-sampling.
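
The idea of class-balanced batch construction can be sketched as follows: pick a few classes first, then draw images that contain each chosen class, so rare classes appear as often as frequent ones. This is a guess at the general shape of episodic sampling, not the sampler from the linked repository; the function names, the `(class, image_id)` batch format, and sampling with replacement are assumptions.

```python
import random
from collections import defaultdict


def build_class_index(image_classes):
    """Map each class label to the list of image ids that contain it."""
    index = defaultdict(list)
    for image_id, classes in image_classes.items():
        for label in classes:
            index[label].append(image_id)
    return index


def sample_episode(class_index, n_classes, k_per_class, rng):
    """Draw `n_classes` distinct classes, then `k_per_class` images for each.
    Images are drawn with replacement so that rare classes with few images
    can still fill their share of the batch."""
    chosen = rng.sample(sorted(class_index), n_classes)
    batch = []
    for label in chosen:
        for image_id in rng.choices(class_index[label], k=k_per_class):
            batch.append((label, image_id))
    return batch


# Toy annotations (made up): which tissue classes appear in which slice.
toy_image_classes = {"slice_a": {1, 2}, "slice_b": {2}, "slice_c": {1, 3}}
toy_index = build_class_index(toy_image_classes)
toy_episode = sample_episode(toy_index, n_classes=2, k_per_class=3, rng=random.Random(0))
print(toy_episode)
```

Because every episode has the same number of entries per chosen class, class 3, which occurs in a single slice, gets the same exposure as class 2 whenever it is selected.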

2605.20390 2026-05-21 cs.CV cs.AI cs.LG cs.RO Version update

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

Affiliations: Waymo; UCSD

AI summary: This paper studies large-scale training for autonomous driving perception systems. By extending the input modalities and training large models, it achieves new state-of-the-art performance on the Waymo Open Dataset challenge.

Details
Abstract

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm applies to autonomous driving perception systems, given unique challenges such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study that systematically analyzes the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model, with up to 500 million parameters, on a large-scale dataset of 50 million driving examples. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state of the art on the Waymo Open Dataset challenge, outperforming prior art by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

2605.20388 2026-05-21 cs.CV Version update

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

Sejoon Jun, Hai Nguyen-Truong, Luigi Seminara, Lorenzo Torresani

Affiliations: Khoury College of Computer Sciences, Northeastern University, Boston; Korea Advanced Institute of Science and Technology, Daejeon; Department of Mathematics and Computer Science, University of Catania, Italy

AI summary: This study shows, through trajectory-conditioned egocentric prediction, that the future camera trajectory expresses intent more precisely than language and therefore outperforms language conditioning in procedural planning. Most of the gain is retained even when the trajectory is not observed at test time.

Comments Project page: https://farsightlab.github.io/TrajPilot

Details
Abstract

Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

2605.20385 2026-05-21 cs.CV cs.AI Version update

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

Yuan Zhao, Youwei Pang, Jiaming Zuo, Wei Ji, Kailai Zhou, Bin Fan, Yunkang Cao, Lihe Zhang, Xiaofeng Liu, Huchuan Lu, Weisi Lin, Dacheng Tao, Xiaoqi Zhao

Affiliations: Dalian University of Technology; X3000 Inspection Co., Ltd; Nanyang Technological University; Yale University; Northwestern Polytechnical University; Hunan University

AI summary: This paper proposes the ConceptSeg-R1 framework, which learns transferable task rules through a meta-reinforcement learning mechanism and combines them with a lightweight concept translation module to perform concept segmentation. Benchmarks across multiple domains show strong performance across the concept hierarchy.

Details
Abstract

Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.

2605.20373 2026-05-21 cs.RO cs.AI cs.CV Version update

SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

Tianshu Wu, Xiangqi Kong, Yue Chen, Qize Yu, Hang Ye, Jia Li, Yizhou Wang, Hao Dong

Affiliations: CFCS, School of Computer Science, Peking University; School of Computer Science and Engineering, Beihang University

AI summary: This study proposes SUGAR, a framework that converts diverse human videos into deployable humanoid loco-manipulation skills without task-specific reward engineering or reference-motion conditioning. It achieves strong performance on six representative tasks in simulation and on real hardware, demonstrating scalability and zero-shot real-world transfer.

Comments Project Page: https://tianshuwu.github.io/sugar-humanoid/

Details
Abstract

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/

2605.20372 2026-05-21 cs.CV cs.AI Version update

Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities

Irem Ulku, Ö. Özgür Tanrıöver, Erdem Akagündüz

Affiliations: Department of Computer Engineering, Ankara University, Ankara, Türkiye; Department of Modeling Simulation, Graduate School of Informatics, METU, Ankara, Türkiye

AI summary: This paper proposes a new training strategy that learns a scenario sampling distribution directly from the pretrained latent space to guide fine-tuning of multimodal segmentation under missing modalities, improving performance.

Comments 14 pages, 4 figures, 9 tables

Details
Abstract

Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling
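
The scoring pipeline described above (per-scenario distortion, RBF kernel between scenarios, smoothing, conversion to probabilities) can be sketched roughly as below. The abstract does not give the exact formulas, so several pieces are assumptions: scenarios are encoded as binary availability masks, a Nadaraya-Watson style kernel average stands in for the paper's regularized kernel smoothing, a softmax turns scores into probabilities, and the distortion values are invented.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian (RBF) similarity between two scenario encodings."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def smooth_scores(scenarios, scores, sigma=1.0):
    """Replace each score by a kernel-weighted average over all scenarios
    (a stand-in for the regularized kernel smoothing in the paper)."""
    smoothed = []
    for a in scenarios:
        weights = [rbf_kernel(a, b, sigma) for b in scenarios]
        smoothed.append(sum(w * s for w, s in zip(weights, scores)) / sum(weights))
    return smoothed


def to_distribution(scores, temperature=1.0):
    """Softmax over scores: higher-scoring scenarios are sampled more often."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy setup: three modalities, 1 = available, 0 = missing (made-up values).
toy_scenarios = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0)]
toy_distortion = [0.2, 0.5, 0.9, 1.4]  # hypothetical latent distortion per scenario
toy_probs = to_distribution(smooth_scores(toy_scenarios, toy_distortion))
print([round(p, 3) for p in toy_probs])
```

During fine-tuning, a scenario could then be drawn each step with `random.choices(toy_scenarios, weights=toy_probs)` in place of uniform modality dropout.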

2605.20362 2026-05-21 cs.CV Version update

HAPS: Rethinking Image Similarity for Virtual Staining

Fedor Gubanov, Svetlana Illarionova, Vlad Kozlovskiy, Mikhail Romanov, Yersultan Akhmetov, Aida Akaeva, Vyacheslav Grinevich, Rifat Hamoudi, Maxim Sharaev

发表机构 * Skolkovo Institute of Science and Technology(斯科尔科沃科学与技术研究所) University of Sharjah(沙迦大学) National Medical Research Radiological Centre of the Ministry of Health of the Russian Federation(俄罗斯联邦卫生部国家医学研究放射中心)

AI总结 本文提出HAPS指标,通过分析组织学图像的相似性,改进了传统通用度量标准在评估虚拟染色质量时的不足,从而提升训练数据质量。

Comments 17 pages, 3 figures

详情
AI中文摘要

虚拟染色是数字病理学中的一项新兴技术,通过合成目标染色来加快和降低成本。然而,虚拟染色模型的质量仍主要依赖于通用度量标准如SSIM、PSNR和LPIPS。这些指标最初为自然图像设计,与组织学数据的领域特性不匹配,无法捕捉组织形态的保持和生物标记物的表达模式。因此,建立一个稳健的、领域特定的度量标准来量化不同组织学模态间的相似性仍然是该领域的关键缺口。本文将组织学图像相似性作为独立问题进行系统评估,并对一系列全参考度量标准进行了评估。我们进一步分析了度量标准对受控几何畸变(位移、旋转和非刚性变形)的敏感性,这些畸变模拟了连续切片之间的现实对齐误差。基于这些观察,我们提出了组织学感知感知相似性(HAPS)度量标准。HAPS在冻结的预训练于组织病理学数据的编码器的特征空间中计算距离,并添加一个线性头部来将特征层面的差异聚合为最终得分,该得分与专家评估一致。最后,我们展示了HAPS在训练数据质量控制中的实际价值。通过在MIST数据集中量化训练对的相似性并过滤低分样本,我们创建了一个更干净的训练集。在该精炼数据上训练的虚拟染色模型优于在原始未经过滤数据集上训练的模型。

英文摘要

Virtual staining of histopathology images (e.g., H&E-IHC) is an emerging tool in digital pathology, enabling faster and cheaper workflows by synthesizing target stains from routinely acquired slides. Yet, the quality of virtual staining models is still predominantly assessed with generic metrics such as SSIM, PSNR, and LPIPS. Originally developed for natural images, these metrics are inherently misaligned with the domain-specific characteristics of histological data, failing to capture tissue morphology preservation and biomarker expression patterns. Consequently, a robust, domain-specific standard for quantifying similarity across diverse histological modalities remains a critical gap in the field. In this work, we formalize histology image similarity as a standalone problem and systematically evaluate a broad set of full-reference metrics against a dataset of H&E-IHC patch pairs annotated with expert similarity scores. We further analyze the metrics' sensitivity to controlled geometric distortions (shifts, rotations, and non-rigid deformations) that mimic realistic registration errors between serial sections. Guided by these observations, we propose the Histology-Aware Perceptual Similarity (HAPS) metric. HAPS computes distances in the feature space of a frozen encoder pretrained on histopathology data, adding a linear head to aggregate feature-level differences into a final score that aligns with expert assessments. Finally, we demonstrate the practical value of HAPS for quality control of training data. By quantifying the similarity of training pairs in the MIST dataset and filtering low-scoring samples, we create a cleaner training set. Virtual staining models trained on this refined data outperform those trained on the original, unfiltered dataset.
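
The scoring step described above (distances in a frozen encoder's feature space, aggregated by a linear head) might look roughly like the sketch below. It is not the HAPS implementation: the encoder is replaced by precomputed per-layer feature vectors, and the unit normalization, squared distance, sign convention, and the names `weights` and `bias` are assumptions made for illustration.

```python
import math

def l2_normalize(v, eps=1e-8):
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def layer_distances(feats_a, feats_b):
    """One squared distance per encoder layer between normalized features."""
    dists = []
    for fa, fb in zip(feats_a, feats_b):
        na, nb = l2_normalize(fa), l2_normalize(fb)
        dists.append(sum((x - y) ** 2 for x, y in zip(na, nb)))
    return dists

def similarity_score(feats_a, feats_b, weights, bias=0.0):
    """Linear head over per-layer distances; higher means more similar."""
    d = layer_distances(feats_a, feats_b)
    return bias - sum(w * x for w, x in zip(weights, d))

# Hypothetical features from two encoder layers for three patches.
feats_ref = [[1.0, 0.0], [0.5, 0.5]]
feats_same = [[1.0, 0.0], [0.5, 0.5]]
feats_diff = [[0.0, 1.0], [0.5, 0.5]]
weights = [1.0, 1.0]  # in practice these would be fit to expert scores

score_same = similarity_score(feats_ref, feats_same, weights)
score_diff = similarity_score(feats_ref, feats_diff, weights)
```

For data filtering as in the abstract, training pairs whose score falls below a chosen threshold would simply be dropped; the threshold itself is not given in the source.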

2605.20337 2026-05-21 cs.CV 版本更新

Capability ≠ Interpretability: Human Interpretability of Vision Foundation Models

能力 ≠ 可解释性:视觉基础模型的人类可解释性

Julien Colin, Lore Goetschalckx, Nuria Oliver, Thomas Serre

发表机构 * ELLIS Alicante, Spain(阿利坎特ELLIS研究所,西班牙) Brown University, USA(布朗大学,美国) imec, Leuven, Belgium(imec,比利时鲁汶)

AI总结 本文研究了视觉基础模型的人类可解释性,提出了一种评估框架,发现基础模型比监督模型更难解释,且可解释性与下游任务性能无关,而是与特征的局部性和粗粒度语义对齐有关。

详情
AI中文摘要

领先的视觉模型的可解释性如何?随着这些模型从研究基准转向高风险部署,这个问题变得日益紧迫,但现有方法无法可靠地回答这个问题。我们通过两种互补的心理物理学协议构建了一个框架来衡量和比较视觉模型的人类可解释性:(1)局部化性——观察者能否预测特征在新图像上的位置?(2)可命名性——观察者能否准确描述特征所代表的内容?特征通过稀疏自编码器恢复,一个基于偶然锚定的评分函数将每个模型置于同一尺度上。将该框架应用于六个视觉Transformer——两个监督ViT和四个基础模型(DINOv2、DINOv3、CLIP、SigLIP)——我们收集了超过15,000个行为响应,分析了377名通过我们预设质量检查的参与者中的13,400个响应。基础模型比其监督模型更难解释,且差距不是能力取舍:可解释性不与我们在任何基准上的下游任务性能相关。相关的是特征激活的局部性和与人类的粗粒度语义对齐——具有聚焦激活和反映世界广泛类别结构的表示模型产生更可解释的特征,而细粒度感知对齐则不。两种协议产生高度相关的排名并共享相同的预测因素,确立可解释性作为表示质量的独立、可测量的维度——令人惊讶的是,我们测试的每个基础模型都低于先前的监督基线。仅靠能力无法弥合这一差距;局部性和粗粒度对齐可以。

英文摘要

How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than 15,000 behavioral responses, analyzing the 13,400 responses from the 377 participants who passed our pre-specified quality checks. Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

2605.20316 2026-05-21 cs.CV cs.AI 版本更新

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

FullFlow: 升级文本到图像流匹配模型以实现双向视觉-语言生成

Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann

发表机构 * ETH Zurich(苏黎世联邦理工学院) Google(谷歌)

AI总结 本文提出FullFlow方法,通过仅训练LoRA适配器和轻量级文本头部,将预训练的rectified-flow文本到图像模型升级为双向视觉-语言生成器,从而在保持图像连续流的同时添加文本离散插入过程,提升文本到图像和图像到文本的生成质量。

Comments project page: https://ericbill21.github.io/fullflow/

详情
AI中文摘要

现代文本到图像扩散模型编码了丰富的视觉先验,但只能通过单向文本条件生成暴露。现有统一的视觉-语言模型通过大规模联合预训练或对文本路径进行大量重训练来恢复双向能力,但丢弃了文本到图像模型本身已编码的强图像先验。我们介绍了FullFlow,一种参数高效的配方,通过仅训练LoRA适配器和轻量级文本头部,将预训练的rectified-flow文本到图像模型升级为双向视觉-语言生成器。FullFlow保持图像在原生连续流中,并添加文本的离散插入过程。分离的图像和文本时间步将推断转化为二维生成空间中的轨迹选择,使文本→图像、图像→文本、联合采样和部分文本预测能够通过单一主干模型完成。在Stable Diffusion 3 (SD3)上,FullFlow在相同可训练参数数量和匹配LoRA秩的情况下,将文本→图像的FID从62.7提升到31.6,将图像→文本的CIDEr从2.0提升到99.4,同时在两个RTX A5000 GPU上训练时间不超过24小时的情况下,将峰值VRAM从约84GB降低到约38GB,并将吞吐量提高约8倍,仅训练主干参数的约5%。同样的配方适用于FLUX.1-dev,并通过部分文本生成支持下游VQA。这些结果表明,强大的双向视觉-语言能力可以从预训练的文本到图像流模型中解锁,而无需完整的多模态预训练。

英文摘要

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision--language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4 over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ~84 GB to ~38 GB and raising throughput by ~8× on two RTX A5000 GPUs in under 24 hours, training only ~5% of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision--language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

2605.20309 2026-05-21 cs.CV cs.AI 版本更新

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

Tiny-Engram: 生成视觉中的触发索引概念表

Runyuan Cai, Yiming Wang, Yu Lin, Xiaodong Zeng

发表机构 * AutoArk-AI

AI总结 本文提出Tiny-Engram,一种紧凑的触发索引概念表,通过显式地为视觉记忆分配词汇地址和激活边界,实现对冻结图像和视频生成器中的概念的控制。该方法通过注册的n-gram匹配索引参数化每个概念,仅在匹配触发区域调节文本编码器的隐藏状态,从而在保持周围提示的组合控制的同时,将罕见触发短语绑定到目标身份。

详情
AI中文摘要

当前生成视觉模型的个性化方法通常通过连续适配器或权重更新来编码新概念,但对是否以及何时检索概念的控制有限。在本工作中,我们引入Tiny-Engram,一种紧凑的触发索引概念表,为冻结的图像和视频生成器中的视觉记忆提供显式的词汇地址和激活边界。Tiny-Engram将每个概念参数化为一组小的记忆条目,这些条目通过注册的n-gram匹配进行索引,仅在匹配的触发区域调节文本编码器的隐藏状态。在该词汇支持之外,条件路径与冻结的基础模型相同。在单编码器潜在扩散和多编码器扩散-变压器骨干结构上,这种公式将罕见触发短语绑定到目标身份,同时保持周围提示的组合控制。我们进一步在文本条件的视频生成设置中评估相同的表式记忆,其中触发路径可靠地改变生成的主题,但保持在排除的视频提示中精细的身份持续性仍然有限。综合来看,这些结果表明,小型、显式地址的概念表是实现模块化视觉个性化的一种实用途径,尤其在图像生成中证据最强。对于视频扩散,剩余的差距指向更广泛的需求:时间稳定的身份可能依赖于文本侧记忆与不断演变的视觉状态之间的更紧密耦合,这促使未来在记忆注入方面的工作超越文本条件接口。

英文摘要

Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.
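
The trigger-indexed lookup described above can be illustrated with a small sketch: hidden states change only at positions covered by a registered n-gram match and are untouched elsewhere. This is a hypothetical illustration rather than Tiny-Engram's code; the additive form of the modulation, the dictionary layout of `table`, and the function names are assumptions.

```python
def find_trigger_spans(tokens, trigger):
    """Return (start, end) index pairs where the n-gram `trigger` occurs."""
    n = len(trigger)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == trigger]

def apply_concept_table(tokens, hidden, table):
    """Add memory entries to hidden states inside matched trigger spans only.

    `table` maps a trigger n-gram (tuple of tokens) to one memory vector
    per trigger token. Positions outside any match are copied unchanged.
    """
    out = [row[:] for row in hidden]
    for trigger, memory in table.items():
        for start, end in find_trigger_spans(tokens, list(trigger)):
            for pos in range(start, end):
                mem = memory[pos - start]
                out[pos] = [h + m for h, m in zip(out[pos], mem)]
    return out

# Hypothetical prompt with a rare two-token trigger phrase.
tokens = ["a", "photo", "of", "sks", "dog"]
hidden = [[0.0, 0.0] for _ in tokens]  # stand-in text-encoder states
table = {("sks", "dog"): [[1.0, 0.0], [0.0, 1.0]]}
modulated = apply_concept_table(tokens, hidden, table)
```

A prompt without the trigger phrase passes through unchanged, which mirrors the claim that the conditioning pathway matches the frozen base model outside the lexical support.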

2605.20308 2026-05-21 cs.CV cs.AI cs.LG 版本更新

SDM: A Powerful Tool for Evaluating Model Robustness

SDM:评估模型鲁棒性的强大工具

Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li

发表机构 * Information Engineering University, Zhengzhou, China; Key Laboratory of Cyberspace Endogenous Safety & Security of Henan Province, Zhengzhou, China; Key Laboratory of Cyberspace Security, Ministry of Education of China, Zhengzhou, China; Songshan Laboratory, Zhengzhou, China

AI总结 本文提出了一种名为SDM的新型梯度攻击方法,通过重新定义对抗样本生成的目标,解决了传统方法中'高损失非对抗样本'导致的性能下降问题,并在实验中证明了其在攻击性能和成本效率上的优势。

Comments 16 pages

详情
Journal ref
Forty-third International Conference on Machine Learning (ICML 2026)
AI中文摘要

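基于梯度的攻击方法是评估模型鲁棒性的重要方法。然而,自APGD提出以来,此类方法一直难以取得显著突破。为了取得这样的突破,我们首先分析了先前方法中导致攻击性能下降的“高损失非对抗样本”问题,并证明该问题源于对抗样本生成目标的不恰当。随后,我们将目标重新定义为“最大化非真实标签概率上界与真实标签概率之间的差值”,并提出了一种新颖而强大的基于梯度的攻击方法,称为顺序差值最大化(Sequential Difference Maximization,SDM)。SDM建立了“循环-阶段-步骤”三层优化框架,在初始优化阶段和后续优化阶段分别采用负概率损失函数和方向概率差比(DPDR)损失函数,并通过分阶段的顺序优化逼近对抗样本生成的理想目标。实验表明,与先前最先进的方法相比,SDM不仅具有更强的攻击性能,而且具有更优的成本效益。代码可在 https://github.com/X-L-Liu/ICML-SDM 获取。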

英文摘要

Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To enable such a breakthrough, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reformulate the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and propose a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at https://github.com/X-L-Liu/ICML-SDM.
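
One way to read the "high-loss non-adversarial examples" issue and the reformulated objective is sketched below. It compares cross-entropy with the gap between the largest non-ground-truth probability and the ground-truth probability. This is an illustration of the stated objective only, not the SDM attack or its DPDR loss; the function names and example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Standard loss that many gradient attacks maximize."""
    return -math.log(softmax(logits)[label])

def probability_margin(logits, label):
    """Largest non-ground-truth probability minus ground-truth probability.

    A positive value means the prediction has flipped, i.e. the input
    is adversarial.
    """
    p = softmax(logits)
    top_other = max(q for j, q in enumerate(p) if j != label)
    return top_other - p[label]

label = 0
# Case A: probability mass spread thinly over many wrong classes.
logits_a = [1.0] + [0.0] * 9
# Case B: a single wrong class narrowly beats the true class.
logits_b = [2.0, 2.2] + [-5.0] * 8

ce_a, ce_b = cross_entropy(logits_a, label), cross_entropy(logits_b, label)
margin_a = probability_margin(logits_a, label)
margin_b = probability_margin(logits_b, label)
# Case A has the higher cross-entropy yet is still classified correctly
# (negative margin); case B has lower loss but is misclassified.
```

Under this reading, maximizing the margin targets the quantity that decides misclassification directly, whereas maximizing cross-entropy can keep increasing the loss without ever flipping the prediction.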

2605.20297 2026-05-21 cs.CV cs.LG 版本更新

MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery

MedCRP-CL: 通过贝叶斯非参数语义模态发现实现连续医学图像分割

Ziyuan Gao

发表机构 * University College London, London, United Kingdom(伦敦大学学院)

AI总结 该研究提出MedCRP-CL框架,通过在线任务结构发现和结构感知的连续学习方法,解决医学图像分割在持续学习中的挑战,实现了73.3%的Dice得分和仅4.1%的遗忘率。

Comments Accepted by ICML 2026

详情
AI中文摘要

医学图像分割在持续学习中面临根本性挑战:数据按顺序从异质源到来,但有效的持续学习需要发现哪些任务共享足够的结构以受益于联合学习。现有方法要么在所有任务上应用统一约束,导致任务冲突时发生灾难性遗忘,要么需要预定义的任务分组,无法预测未来任务多样性。我们引入MedCRP-CL框架,实现在线任务结构发现和结构感知的持续学习。利用中文餐厅过程(CRP),我们的方法从临床文本提示中动态推断任务分组,无需预定义聚类数量或访问未来任务。我们将发现的分组称为语义模态,因为它们通过整合解剖区域和病理背景捕捉更细粒度的结构。在发现的结构指导下,我们维护语义模态特定的LoRA适配器,通过内模态EWC正则化,确保在不同任务组之间参数隔离,同时促进相似组的知识转移。该框架也是无回放的,仅存储聚合统计信息而非原始患者数据。在16个医学分割任务和四种成像模态上的实验表明,MedCRP-CL实现了73.3%的Dice得分,仅4.1%的遗忘率,优于最佳基线8.0%,同时仅需6倍更少的参数。代码可在https://github.com/zygao930/MedCRP-CL获取。

英文摘要

Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP-CL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6× fewer parameters. Code is available at https://github.com/zygao930/MedCRP-CL.
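
The Chinese Restaurant Process assignment mentioned above can be sketched as follows. The prior is the standard CRP rule: an arriving task joins an existing group with probability proportional to the group's size, or opens a new group with probability proportional to a concentration parameter. How MedCRP-CL scores a task's text prompt against each group is not given here, so the `likelihoods` inputs are placeholders.

```python
def crp_prior(cluster_sizes, alpha=1.0):
    """CRP prior over existing clusters plus one new cluster (last entry)."""
    n = sum(cluster_sizes)
    probs = [size / (n + alpha) for size in cluster_sizes]
    probs.append(alpha / (n + alpha))
    return probs

def crp_posterior(cluster_sizes, likelihoods, new_cluster_likelihood, alpha=1.0):
    """Combine the CRP prior with per-cluster likelihoods and renormalize.

    `likelihoods[k]` stands for how well the new task's prompt embedding
    fits cluster k; the scoring function itself is assumed, not specified.
    """
    prior = crp_prior(cluster_sizes, alpha)
    scores = likelihoods + [new_cluster_likelihood]
    unnorm = [p * s for p, s in zip(prior, scores)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical state: two semantic modalities holding 3 tasks and 1 task.
prior = crp_prior([3, 1], alpha=1.0)
posterior = crp_posterior([3, 1], [0.1, 0.8], 0.2, alpha=1.0)
assignment = max(range(len(posterior)), key=posterior.__getitem__)
```

Because the last entry always reserves mass for a new group, the number of groups never has to be fixed in advance, which is the property the abstract relies on.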

2605.20290 2026-05-21 cs.GR cs.CV 版本更新

TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction

TelePhysics: 基于物理的单图像多物体场景生成与实时交互

Xin Zhang, Yabo Chen, Yijie Fang, Wanying Qu, Haibin Huang, Chi Zhang, Feng Xu, Xuelong Li

发表机构 * Fudan University(复旦大学) Institute of Artificial Intelligence, China Telecom (TeleAI)(中国电信人工智能研究院(TeleAI))

AI总结 本文提出TelePhysics,一种无需训练的框架,通过整体场景级3D重建将单张图像转换为物理一致且可控的视频。该方法通过统一的空间坐标系统表示完整场景几何,解决物体穿透和对齐模糊问题,实现准确的多物体交互和更丰富的复杂控制类型,从而在保持逼真视觉保真度的同时实现实时物理交互预览。

详情
AI中文摘要

Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a 免训练 framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scenelevel multi-object interactions and introduces richer, complex control types for advanced mechanicsbased manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews paired while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at https://github.com/xinzhang007/TelePhysics.

英文摘要

Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scene-level multi-object interactions and introduces richer, complex control types for advanced mechanics-based manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at https://github.com/xinzhang007/TelePhysics.

2605.20287 2026-05-21 cs.LG cs.AI cs.CV 版本更新

FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

FusionCell: 跨注意力融合布局几何与网络列表拓扑以实现标准单元性能预测

Haoyi Zhang, Kairong Guo, Bojie Zhang, Yibo Lin, Runsheng Wang

发表机构 * School of Integrated Circuits, Peking University, Beijing, China(集成电路学院,北京大学,北京,中国)

AI总结 本文提出FusionCell,通过跨注意力机制融合布局几何和网络列表拓扑,以提高标准单元性能预测的准确性,解决了传统方法忽略布局几何导致的耦合和布局依赖效应的问题。

详情
AI中文摘要

标准单元是数字电路的基本构建块,其延迟和功率对芯片级性能有关键影响;然而,其表征仍依赖于缓慢的仿真扫描,许多快速预测器忽略了布局几何,未能捕捉到耦合和布局依赖效应。挑战在于如何联合表示布局几何和网络列表拓扑,使模型能够同时捕捉细粒度的空间细节和结构连接,以实现准确的性能预测。我们引入FusionCell,一种双模态预测器,将路由布局几何和网络列表拓扑作为输入,并在统一模型中显式融合它们。一个DeiT编码器处理三层路由布局,而图Transformer模型异构设备/网络图。模态通过拓扑引导机制集成,其中网络列表作为结构“地图”主动查询布局中的相关物理区域,以实现联合几何和拓扑推理。我们构建了一个基于ASAP7 PDK的7nm数据集,使用自动工具生成超过19500个单元,涵盖149种类型,针对六个指标:信号上升/下降延迟、过渡和功率。实验结果表明,FusionCell减少了回归误差,平均MAPE为0.92个百分点,并在基线模型上提高了Spearman/Kendall排名,同时将表征过程的速度提高了数十倍,相比电路仿真。

英文摘要

Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural "map" to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

2605.20284 2026-05-21 cs.CV cs.AI cs.LG 版本更新

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

JUDO: 一种面向工业异常问答的多模态推理框架

Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park

发表机构 * Sungkyunkwan University(成均馆大学) Seoul National University(首尔国立大学)

AI总结 本文提出JUDO框架,通过结合领域知识和上下文提升多模态推理能力,以解决工业异常检测中模型缺乏领域知识的问题,实验表明其在MMAD基准上优于Qwen2.5-VL-7B和GPT-4o。

Comments Published at ICLR 2026

详情
AI中文摘要

工业异常检测已显著受益于大多模态模型(LMMs),使检测能力超越了单纯的检测,尤其通过视觉引导推理提升图像理解能力。然而,LMMs缺乏领域特定知识,限制了其在复杂工业场景中生成准确响应的能力。在本工作中,我们提出了JUDO,即Juxtaposed Domain-Oriented Multimodal Reasoner,一种能够高效整合领域知识和上下文的视觉和文本推理框架。通过视觉推理,我们的模型通过将查询图像与正常图像进行对比,分割缺陷区域,实现细粒度的视觉比较检查。此外,我们通过监督微调(SFT)注入领域知识,以增强上下文理解,并通过强化学习(GRPO)引导领域推理,采用领域导向的推理过程。实验结果表明,JUDO在MMAD基准上表现优异,超越了Qwen2.5-VL-7B和GPT-4o等模型。这些结果突显了增强领域知识和上下文对有效推理在异常理解中的重要性。

英文摘要

Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

2605.20277 2026-05-21 cs.CV cs.AI 版本更新

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

通过轨迹积分反馈调节解剖感知奖励用于体积计算断层扫描分析

Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang

发表机构 * Zhejiang University(浙江大学) DAMO Academy, Alibaba Group(阿里巴巴集团达摩院) Hupan Lab(湖畔实验室) University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出了一种新的框架,通过轨迹积分反馈GRPO(TIF-GRPO)来改进医疗视觉语言模型在三维CT分析中的性能,通过引入临床异常基准评估子系统(CABS)来解决优化目标与临床严谨性之间的不匹配问题,提升异常检测和临床准确性。

详情
AI中文摘要

医学视觉-语言模型(VLMs)已迅速发展为通用多模态助手,但其在三维计算机断层扫描(CT)分析中的应用仍受到优化目标与临床严谨性之间持续不匹配的限制。当前的强化学习(RL)范式仍然依赖于词汇代理信号,导致``评估幻觉'',即模型优化语言流畅性而非事实性临床正确性,从而导致诊断性关键错误。为弥合这一差距,我们引入了临床异常基准评估子系统(CABS),一个将放射学报告分解为可验证的临床语义单元的结构化系统。利用CABS,我们识别出标准RL中的``机理分歧'',即表面相似性奖励驱动策略梯度绕过医学事实。因此,我们提出了轨迹积分反馈GRPO(TIF-GRPO),一种将控制理论原理整合到策略优化中的新框架。通过将临床推理建模为伪时间轨迹以发现异常,TIF-GRPO通过积分反馈回路调节解剖感知奖励,该回路将持续遗漏视为累积状态误差,并将幻觉视为过度的控制努力。在3D CT基准测试中,我们的方法显著提高了异常检测和临床忠实度,建立了医疗VLMs中细粒度调节的新范式。我们的项目可在GitHub上获取。

英文摘要

Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce "Evaluation Hallucinations", where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a "Mechanistic Divergence" in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at https://github.com/ZJU4HealthCare/TIF-GRPO.

2605.20275 2026-05-21 cs.CV cs.AI 版本更新

You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection

你不需要注意力:基于门控卷积的基于手表的跌倒检测

Sana Alamgeer, Ronish Kumar, Awatif Yasmin, Muhammad Irshad, Anne H. H. Ngu

发表机构 * Texas State University(德克萨斯州立大学)

AI总结 本文提出了一种轻量级的双流架构Gated-CNN,用于基于手表的跌倒检测,通过门控机制提升特征提取效率,实现在不使用注意力机制的情况下达到更高的检测精度。

详情
AI中文摘要

现有的基于可穿戴设备的跌倒检测系统依赖于自注意力机制,这种机制带来了二次计算开销,将权重分布到所有时间步。这种全局权重分布会损害在短固定长度窗口中跌倒特征的精确定位。为克服这一挑战,我们提出Gated-CNN,一种轻量级双流架构,通过独立的一维卷积特征提取器处理加速度计和陀螺仪流,随后(i)一个sigmoid门控模块,选择性地抑制无信息的背景激活,同时增强跌倒区分特征;(ii)一个全局平均池化层,将每个流压缩成紧凑的固定长度描述符;(iii)一个共享的分类头,融合两个描述符进行二分类跌倒预测。对于离线评估,我们在五个腕部惯性测量单元(IMU)数据集上评估模型,分别在SmartFallMM、WEDA-Fall、FallAllD、UMAFall和UP-Fall数据集上获得平均F1分数为93%、93%、90%、91%和90%的结果,优于Transformer基线。对于实时评估,我们将模型部署在Google Pixel Watch 3上,并在12名参与者上进行测试。模型在零次遗漏的情况下实现了97%的平均F1分数和98%的准确率,表明sigmoid门控提供了一种在结构上更一致且计算更高效的替代方案,用于商用智能手表的跌倒检测。

英文摘要

Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.
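
The gating and pooling steps listed above, (i) sigmoid gating and (ii) global average pooling per stream, followed by fusion of the two descriptors, can be sketched in a few lines. This is a simplified illustration and not the Gated-CNN model: the convolutional feature maps and gate logits are supplied as plain lists, and fusing by concatenation is an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(features, gate_logits):
    """Scale each activation by a sigmoid gate in (0, 1)."""
    return [[f * sigmoid(g) for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(features, gate_logits)]

def global_average_pool(features):
    """Average over time, giving one value per channel."""
    return [sum(row) / len(row) for row in features]

def stream_descriptor(features, gate_logits):
    """Gated features pooled into a fixed-length descriptor."""
    return global_average_pool(gate(features, gate_logits))

# Hypothetical one-channel feature maps over four time steps.
accel_feats = [[1.0, 1.0, 4.0, 1.0]]         # impact spike at step 2
accel_gates = [[-10.0, -10.0, 10.0, -10.0]]  # gate keeps only the spike
gyro_feats = [[0.5, 0.5, 0.5, 0.5]]
gyro_gates = [[0.0, 0.0, 0.0, 0.0]]          # sigmoid(0) = 0.5 everywhere

accel_desc = stream_descriptor(accel_feats, accel_gates)
gyro_desc = stream_descriptor(gyro_feats, gyro_gates)
fused = accel_desc + gyro_desc  # concatenated input to a shared classifier head
```

Each gate depends only on local activations, so the cost grows linearly with window length, unlike the pairwise comparisons of self-attention that the abstract argues against.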

2605.20267 2026-05-21 cs.CV cs.AI 版本更新

Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

基于预训练域适应扩散模型从均匀器官活度图生成异质性PET图像

Suya Li, Kaushik Dutta, Debojyoti Pal, Jingqin Luo, Kooresh I. Shoghi

发表机构 * Mallinckrodt Institute of Radiology, Washington University School of Medicine, St. Louis, USA(华盛顿大学医学院马林克罗德特放射医学研究所,圣路易斯,美国) Imaging Science Program, McKelvey School of Engineering, Washington University in St Louis, St. Louis, USA(圣路易斯华盛顿大学麦凯维工程学院成像科学项目,圣路易斯,美国) Department of Surgery, Washington University School of Medicine, St. Louis, USA(华盛顿大学医学院外科系,圣路易斯,美国) Department of Biomedical Engineering, Washington University in St Louis, St. Louis, USA(圣路易斯华盛顿大学生物医学工程系,圣路易斯,美国)

AI总结 本文提出了一种预训练域适应扩散模型,用于从均匀器官活动图生成临床相关的异质性PET图像,通过两阶段训练策略提高合成图像的定量精度和肿瘤分割性能。

Comments 18 pages, 7 figures

详情
AI中文摘要

合成PET图像在定量成像工作流程开发、可扩展的虚拟成像试验和深度学习模型训练中具有重要价值,但传统基于物理的模拟方法计算成本高,解剖变化有限,且难以捕捉异质性PET摄取。本研究开发了一种预训练域适应扩散(PAD)模型,用于从均匀器官活动图生成解剖条件化的PET合成图像。PAD采用预训练的自然图像文本到图像解码器,结合上游的条件编码器和下游的PET领域适配器。采用两阶段训练策略,第一阶段学习粗略摄取分布,第二阶段细化局部图像细节。均匀器官活动图通过CT基分割生成,通过将每个器官的平均摄取值分配自配对PET图像。评估包括定量准确性、噪声评估、放射组学分析、肿瘤分割性能和人类观察者研究。PAD生成的图像在定量准确性方面表现优异,器官平均SUV与分配活动值的符合度系数超过0.92。合成图像的噪声水平和纹理特征与目标PET图像相似,并产生了可比的肿瘤分割性能。在两项选择强制选择观察者研究中,四名读者的准确率约为50%,表明合成图像与目标图像在视觉上不可区分。PAD还能从XCAT衍生的活动图生成逼真的PET图像,证明了其与基于幻影的解剖先验的兼容性。总体而言,PAD提供了一种基于扩散的框架,用于从临床分割或数字幻影中导出的均匀器官活动图生成临床相关的异质性PET图像,支持数据增强和下游成像研究。

英文摘要

Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.

2605.20254 2026-05-21 cs.IR cs.AI cs.CV cs.LG 版本更新

Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

通过表格网格导航和逐步推理提示实现高效的表格问答

Amritansh Maurya, Navjot Singh, Mohammed Javed, Omar Moured

发表机构 * Vision Intelligence Lab, IIIT Allahabad, Prayagraj, India(视觉智能实验室,印度拉贾斯坦邦阿拉哈巴德)

AI总结 本文提出了一种无需训练的表格问答方法,通过TableGrid导航和Progressive Inference Prompting框架,提升了表格问答的精度和效率,并在多个数据集上验证了其有效性。

Comments Accepted for Presentation in ICDAR 2026, Vienna, Austria

详情
AI中文摘要

大型语言模型(LLMs)在自然语言处理任务中表现出色,但在表格数据上的表现仍需进一步研究,因为表格问答(TQA)需要精确的单元格检索和多步结构化推理。现有工作通过微调或在任务特定的表格数据上训练LLMs来改进TQA,但通常缺乏对模型如何导航表格和推导答案的可验证控制。在本文中,我们提出了一种无需训练的TQA方法,包含两个结构化提示框架:TableGrid导航(TGN),通过三模块循环迭代导航行和列以定位证据并细化答案;Progressive Inference Prompting(PIP),通过根据查询强制识别列,以明确的逐步行选择约束进行推理。我们在TableBench和FeTaQa数据集上评估了17个LLMs和6个基线模型。在TableBench上,TGN比最强基线提高了3.8分,而在FeTaQa上,PIP在ReAct和Chain-of-Thought上实现了SOTA性能。除了推理时间的提升外,PIP和TGN还可以作为监督模板来微调小型模型,在资源受限的设置中缩小与更大架构之间的性能差距,为TQA提供了多功能且成本效益高的解决方案。

英文摘要

Large Language Models (LLMs) have shown promising results on NLP tasks; however, their performance on tabular data still needs research attention, because Table Question-Answering (TQA) requires precise cell retrieval and multi-step structured reasoning. Existing work improves TQA by fine-tuning or training LLMs on task-specific tabular data, but often lacks verifiable control over how the model navigates tables and derives answers. In this work, we propose a training-free TQA approach with two structured prompting frameworks: TableGrid Navigation (TGN), which iteratively navigates rows and columns via a three-module loop to locate evidence and refine answers, and Progressive Inference Prompting (PIP), which enforces column identification to impose explicit, progressive row-selection constraints according to the query. We evaluate 17 LLMs against 6 baselines on the TableBench and FeTaQa datasets. On TableBench, TGN improves over the strongest baseline by 3.8 points, and on FeTaQa, PIP achieves SOTA performance over ReAct and Chain-of-Thought. Beyond inference-time gains, PIP and TGN can also serve as supervision templates to fine-tune small models, narrowing the performance gap to much larger architectures in resource-constrained settings and offering a versatile and cost-efficient solution for TQA.

2605.20247 2026-05-21 cs.LG cs.AI cs.CL cs.CV 版本更新

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

CP-MoE:一致性保留的混合专家用于持续学习

Yang Liu, Toan Nguyen, Flora D. Salim

发表机构 * School of Computer Science and Engineering University of New South Wales(计算机科学与工程学院 新南威尔士大学)

AI总结 本文提出CP-MoE,一种基于瞬时专家的持续学习框架,通过一致性保留的路由偏置和瞬时专家引导的正则化机制,减少参数干扰和遗忘,同时保留跨任务知识转移。

详情
AI中文摘要

持续学习在大语言模型(LLMs)和视觉-语言模型(VLMs)中仍面临灾难性遗忘的严重障碍。尽管混合专家(MoE)架构提供了扩展的有效途径,但现有的基于LoRA的MoE持续学习方法仍面临根本性的权衡:要么过于激进地隔离专家,限制任务间的知识转移,要么允许任务特定的更新覆盖重要的现有参数,导致严重的遗忘。为此,我们提出了CP-MoE,一种持续学习框架,围绕瞬时专家构建,该专家捕捉早期任务特定的更新并引导其整合到稳定的专家中。CP-MoE引入了一种一致性保留的路由偏置,利用瞬时专家估计与稳定专家的表示相似性,并引导路由向更兼容的专家选择方向;还引入了一种瞬时专家引导的正则化机制,该机制在合并过程中选择性地保护重要历史参数。这些组件共同减少了参数干扰和遗忘,同时保留了跨任务的知识转移。我们在基于LLM和VLM的MoE模型上验证了CP-MoE,既在单模态又在多模态持续学习基准上进行了测试。在SuperNI基准上,涵盖多样化的序列语言任务,CP-MoE实现了最先进的性能,并在未见任务上表现出更强的零样本迁移能力。在VQA v2数据集上,它能有效扩展到多模态视觉推理,一致地减少遗忘,并优于强大的MoE基线。

英文摘要

Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On the SuperNI benchmark, spanning diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On the VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

2605.20237 2026-05-21 cs.CV 版本更新

AnimeAdapter: Fine-grained and Consistent Zero-shot Anime Character Generation

AnimeAdapter: 细粒度且一致的零样本动漫人物生成

Yixuan Han

发表机构 * National University of Singapore, Singapore(新加坡国立大学)

AI总结 本文提出了一种轻量级的外观适配器,用于在多样编辑条件下实现可控且一致的动漫人物生成,通过注入单张参考图像的细粒度视觉特征到扩散过程中,并结合CLIP的局部空间化特性,开发出语义选择性局部注意力机制,进一步解耦人物外观与空间布局,从而实现高效的动漫人物生成。

详情
AI中文摘要

我们提出了一种轻量级的外观适配器,用于在多样编辑条件下实现可控且一致的动漫人物生成。与依赖大规模视觉-语言模型或针对特定主体的微调不同,我们的方法将单张参考图像的细粒度视觉特征注入扩散过程。基于CLIP的局部空间化特性,我们开发了语义选择性局部注意力机制。为了进一步解耦人物外观与空间布局,我们在适配器训练过程中引入姿态感知的条件。所得到的预训练适配器保持紧凑、模块化,并且完全兼容Stable Diffusion社区工作流程,同时在部署时不需要额外的微调。此外,我们还提出了一个基于精选和重构的Danbooru提示的高质量动漫人物数据集,并在多个实际的人物编辑场景中评估了我们的方法。我们的代码、模型权重和数据集将在接受后公开发布。

英文摘要

We present a lightweight appearance adapter for Stable Diffusion that enables controllable and consistent anime character generation under diverse editing conditions. Instead of relying on large-scale vision-language models or per-subject fine-tuning, our method injects fine-grained visual features from a single reference image into the diffusion process. Based on CLIP emergent local spatialization, we develop semantic-selective local attention. To further disentangle character appearance from spatial layout, we incorporate pose-aware conditioning during adapter training. The resulting pretrained adapter remains compact, modular, and fully compatible with Stable Diffusion community workflows, while requiring no additional fine-tuning at deployment time. Furthermore, we present a high-quality anime character dataset based on curated and restructured Danbooru prompts, and evaluate our method across several practical character editing scenarios. Our code, model weights, and dataset will be publicly released upon acceptance.

2605.20233 2026-05-21 cs.CV cs.AI 版本更新

AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

基于第一人称视角视频的仿真护理教育AI辅助能力评估

Hanchen David Wang, Yilin Liu, Madison J. Lee, Surya Chand Rayala, Gautam Biswas, Daniel T. Levin, Meiyi Ma

发表机构 * Vanderbilt University(范德比大学)

AI总结 本文提出了一种基于第一人称视频的AI辅助评估框架,通过提取动作时间线、序列特征和识别指标,发现识别准确率与能力之间存在负相关关系,表明识别准确率可以作为自动化评估中的教学信息信号。

Comments Accepted at CVPR Workshop

详情
AI中文摘要

在临床仿真中评估学习者的能力需要专家观察,这种观察过程耗时、难以扩展且受评分者变异影响。视觉-语言模型已成为理解复杂视觉行为的有希望的工具。在本工作中,我们探讨了视觉观察是否能通过一个三阶段框架提供教育意义的信号,该框架(1)使用冻结的视觉编码器和少样本学习从第一人称护理仿真视频中提取动作时间线,(2)推导序列级特征和每会话识别指标,(3)将这些与指导教师评分的能力相关联。在22个密集标注的会话(3.8小时,493个动作)中,使用冻结的DINOv2主干和HMM Viterbi解码器,在留一法1次样本识别中实现了57.4%的MOF。令人惊讶的是,我们观察到识别准确率与能力之间存在负相关关系(rho = -0.524,p = 0.012 for mIoU),这种关系在六种混杂控制下仍然稳健:更熟练的学生产生多样、更难分类的工作流程,而简单的序列特征没有这种关系。逐项分析表明,患者安全协议和团队沟通是这种模式中预期的行为,过程模型比较显示,能力更高的学生表现出更一致的协议行动转换。这些发现表明,识别准确率可能可以补充预测的动作时间线作为自动化能力评估中的教学信息信号。

英文摘要

Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.
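The recognition stage above pairs per-frame scores with HMM Viterbi decoding and reports MOF, read here as mean-over-frames accuracy. The sketch below is a generic Viterbi decoder with a frame-accuracy helper; the transition matrix, emission probabilities, and labels are toy values chosen for illustration, not parameters from the paper.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """Most likely state path.
    log_emit: (T, S) per-frame log scores; log_trans: (S, S) from-state x to-state;
    log_prior: (S,) initial log probabilities."""
    T, S = log_emit.shape
    score = log_prior + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)          # best predecessor for each state j
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # follow back-pointers from the last frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def mean_over_frames(pred, truth):
    """Fraction of frames whose predicted label matches the ground truth."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))

# Toy example: two actions, "sticky" transitions, one noisy frame at t = 1.
p_state0 = np.array([0.9, 0.4, 0.9, 0.1, 0.1])
emit = np.stack([p_state0, 1.0 - p_state0], axis=1)
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
prior = np.array([0.5, 0.5])
truth = [0, 0, 0, 1, 1]

framewise = emit.argmax(axis=1).tolist()
path = viterbi(np.log(emit), np.log(trans), np.log(prior))
print(framewise, mean_over_frames(framewise, truth))
print(path, mean_over_frames(path, truth))
```

Here the frame-wise prediction flips at the noisy frame, while the decoded path stays in the first action until the real transition, which is the usual reason to add a temporal model on top of frozen features.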

2605.20223 2026-05-21 cs.CV 版本更新

Why Latent Actions Fail, and How to Prevent It

为何潜在动作失效,以及如何防止它

Jung Min Lee, Taehyun Cho, Li Zhao, Jungwoo Lee

发表机构 * Seoul National University(首尔国立大学) Microsoft Research(微软研究院)

AI总结 本文研究了潜在动作模型中外部状态对动作学习的干扰问题,并提出通过聚焦内生成分来缓解噪声干扰的方法。

详情
AI中文摘要

潜在动作模型(LAMs)旨在通过压缩帧间变化来从未标记视频中学习动作样表示。然而,现实视频中的帧不仅包含主体自身状态,还包含如背景杂乱等外源状态。由于外源状态引入与动作无关的变化,这阻碍了可靠的潜在动作学习。本文通过扩展线性LAM框架,明确建模外源状态来分析这一问题。我们的分析揭示了两个见解:(1)最小化标准重建目标会产生编码未来观察中外源信息的潜在动作;(2)在专注于内生成分的表示空间中学习是缓解噪声干扰的关键。我们进一步表明,之前提出的辅助目标,如动作监督,确实促使潜在动作在不同外源状态下保持一致。这些发现通过线性和非线性LAMs的实验得到验证,提供了统一的理论分析,说明外源状态如何阻碍潜在动作学习以及为何常见的缓解方法有效。

英文摘要

Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but also exogenous state such as background clutter. Since the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates this problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observations; and (2) learning in a representation space that focuses on endogenous components is key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action supervision, provably encourage latent actions to be consistent across exogenous states. These findings are validated through experiments on both linear and nonlinear LAMs, providing a unified theoretical analysis of how exogenous state hinders latent action learning and why common remedies work.

2605.20211 2026-05-21 cs.CV cs.AI 版本更新

Leveraging Vision-Language Models to Detect Attention in Educational Videos

利用视觉-语言模型检测教育视频中的注意力

Gabriel Becquet, Sébastien Lallé, Vanda Luengo, Ali Abou-Hassan

发表机构 * Sorbonne University, CNRS, LIP6 & PHENIX(索邦大学、国家科学研究中心、LIP6与PHENIX)

AI总结 本文研究利用视觉-语言模型直接分析教育视频内容,结合眼动数据以提高注意力检测的准确性,但发现其在实时教育诊断中的局限性。

详情
AI中文摘要

教育视频是远程和混合学习的核心组成部分。然而,学习者注意力的波动仍然是有效信息保留的重要障碍。先前的研究尝试通过在运行时检测和响应注意力丧失来缓解这一问题,使用眼动追踪数据。这些检测方法目前基于经典机器学习分类器,训练于工程化特征,如学习者注视和跳跃的汇总统计。这些方法难以捕捉学习者参与的复杂和时间特性,因此表现出中等的预测性能。在本研究中,我们旨在通过从标准工程化特征转向多模态基础模型来提高注意力检测。使用一个教育眼动追踪数据集(N = 70),我们研究了一种新的方法,利用视觉-语言模型(VLM)直接分析视频内容,结合叠加的注视数据。该方法旨在利用基础模型的语义推理能力,将学习者的注意力置于视频流中进行上下文化。我们通过几种提示策略使用Gemini 3评估了这种VLM方法的性能,但最终发现这些策略都无法超越统计基准。我们的结果为使用VLM进行实时教育诊断的局限性提供了新的见解。

英文摘要

Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately found that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

2605.17630 2026-05-21 cs.CV 版本更新

SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

SegRAG: 无需训练的检索增强语义分割

Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed

发表机构 * Khalifa University(卡利法大学)

AI总结 本文提出SegRAG,一种无需训练的检索增强语义分割框架,通过结合类特定点提示和文本信息,提升分割性能,在多个基准测试中取得显著提升。

详情
AI中文摘要

开放词汇分割模型如SAM3通过文本提示在广泛类别上表现良好,但当目标类别在预训练中视觉表示不足或偏离标准描述时性能下降,而这些局限无法通过文本提示在空间上解决。我们提出了SegRAG,一种无需训练的检索增强分割框架,通过从精心整理的DINOv3特征库中提取类特定点提示来增强SAM3。离线时,从带标注的参考图像中提取密集的块级描述符,并通过类内凝聚力蒸馏(ICCD)过滤,仅保留能可靠检索同类前景的原型。在推理时,地形相似性接地(TSG)计算与检索原型的余弦相似性景观,通过连通组件分析识别连贯的高置信度区域,并通过非极大值抑制提取峰值位置。所得点提示与类名文本一起,在单次SAM3前向传递中输入。在四个标准基准测试中,SegRAG持续优于纯文本基线,在LVIS上最多提升+3.92 mIoU。在零样本领域迁移的AgML农业基准测试中,其平均IoU从25.27提升至59.24(+33.97),并使个别类别的mIoU从零恢复到95以上。消融实验证实ICCD、TSG和联合提示各自独立贡献,组合时效果更佳。代码可在(https://github.com/boudiafA/SegRAG)获取。

英文摘要

Open-vocabulary segmentation models such as SAM3 perform well across broad categories via text prompting, yet degrade when target classes are visually underrepresented in pretraining or depart from canonical depictions, limitations that text prompts cannot resolve spatially. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with class-specific point prompts derived from a curated DINOv3 feature bank. Offline, dense patch-level descriptors are extracted from annotated references and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape against retrieved prototypes, identifies coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. The resulting point prompts are delivered jointly with class-name text in a single SAM3 forward pass. On four standard benchmarks, SegRAG consistently outperforms the text-only baseline, gaining up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks under zero-shot domain transfer, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablations confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at (https://github.com/boudiafA/SegRAG).
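The inference-time grounding step above (cosine-similarity landscape, then peak extraction by non-maximum suppression) can be sketched in plain NumPy. This is a simplified illustration under stated assumptions: it skips the connected-component analysis, uses random-free toy features instead of DINOv3 descriptors, and the names `similarity_map` and `peak_points`, the threshold, and the suppression radius are all hypothetical choices rather than the released SegRAG code.

```python
import numpy as np

def similarity_map(patch_feats, prototypes):
    """patch_feats: (H, W, D); prototypes: (K, D).
    Returns an (H, W) map of the maximum cosine similarity to any prototype."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    return (f @ p.T).max(axis=-1)

def peak_points(sim, threshold=0.5, radius=1, max_points=5):
    """Greedy non-maximum suppression over a 2-D score map (assumed variant)."""
    sim = sim.copy()
    points = []
    while len(points) < max_points:
        r, c = np.unravel_index(sim.argmax(), sim.shape)
        if sim[r, c] < threshold:
            break  # no remaining location is confident enough
        points.append((int(r), int(c)))
        # Suppress the neighbourhood around the selected peak.
        sim[max(r - radius, 0): r + radius + 1,
            max(c - radius, 0): c + radius + 1] = -np.inf
    return points

# Toy 4x4 grid of 2-D patch features: background points along [0, 1],
# two foreground patches match the prototype [1, 0], one neighbour is close to it.
feats = np.tile(np.array([0.0, 1.0]), (4, 4, 1))
feats[0, 0] = [1.0, 0.0]
feats[0, 1] = [1.0, 0.5]
feats[3, 3] = [1.0, 0.0]
prototypes = np.array([[1.0, 0.0]])

sim = similarity_map(feats, prototypes)
points = peak_points(sim)
print(points)
```

The returned coordinates play the role of point prompts; in the described pipeline they would be passed to the segmentation model together with the class-name text.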

2605.17140 2026-05-21 cs.CV cs.AI cs.CL 版本更新

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

UCSF-PDGM-VQA: 用于脑肿瘤MRI解读的视觉问答数据集

Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil

发表机构 * Fung Institute for Engineering Leadership(工程领导力基金会) University of California, Berkeley(加州大学伯克利分校) Department of Radiology(放射科) University of California, San Francisco(加州大学旧金山分校) Division of Clinical Informatics and Digital Transformation(临床信息学与数字转型部) Department of Neurological Surgery(神经外科部)

AI总结 本文提出一个临床相关的视觉问答基准数据集UCSF-PDGM-VQA,包含2387个问题-答案对,用于评估视觉语言模型在处理多序列3D MRI扫描中的能力,发现现有模型在多模态处理上存在缺陷。

Comments 10 pages, 2 figures, 6 tables

详情
AI中文摘要

脑肿瘤诊断很大程度上依赖于磁共振成像(MRI)评估,这需要放射科医生综合分析成千上万张来自多种3D序列和纵向研究的图像。这一过程需要高级的神经放射学培训,具有显著的认知负荷,并且非常耗时。尽管放射学需求不断增长,但这种专业知识难以扩展,给当前的医疗系统带来压力。视觉-语言模型(VLMs)提供了一种通过半自动化、互动解释复杂脑MRI来减轻这种负担的机会。然而,由于缺乏专门的评估基准,它们在神经肿瘤学中目前使用有限。我们介绍了一个临床相关的视觉问答(VQA)基准——UCSF-PDGM-VQA数据集,包含来自公共UCSF-PDGM数据集中473个胶质瘤相关MRI研究的2387个QA对。我们进一步在该数据集上建立了六种最先进的视觉语言模型(VLMs)和一个大型语言模型的性能基线。我们发现,当前模型无法有效处理多序列、三维MRI扫描,导致视觉特征的抑制和对语言先验的过度依赖,从而造成模态崩溃。这些发现突显了当前模型在临床环境中的可靠性和安全性方面的关键缺陷,需要开发稳健的、领域特定的VLMs。

英文摘要

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

2604.15038 2026-05-21 cs.LG cs.AI cs.CV 版本更新

When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

当公平性指标产生分歧:评估机器学习中人口公平性评估的可靠性

Khalid Adnan Alsayed

发表机构 * Founder, Ducaltus(Ducaltus创始人) BSc (Hons) Artificial Intelligence(人工智能学士(荣誉)) School of Computing, Engineering & Digital Technologies(计算、工程与数字技术学院) Teesside University, UK(英国泰赛德大学)

AI总结 本文研究了公平性评估的一致性问题,通过多指标分析评估机器学习模型中的人口偏见,发现不同公平性指标可能导致矛盾的评估结果,引入了公平性分歧指数(FDI)来量化指标间的不一致程度。

Comments 15 pages, 4 figures, 5 tables

详情
AI中文摘要

在高风险应用中,机器学习系统的公平性评估已成为核心问题,包括生物识别、医疗决策和自动风险评估。现有方法通常依赖少量公平性指标来评估模型行为,隐含假设这些指标能提供一致和可靠的结论。然而,不同公平性指标捕捉模型性能的不同统计属性,可能在相同系统上产生冲突的评估。本文通过系统性的多指标分析,评估机器学习模型中的人口偏见,使用面部识别作为受控实验环境,评估模型在多个群体分区下的性能,包括误差率差异和基于性能的指标。结果表明,公平性评估可能因指标选择而显著变化,导致关于模型偏见的矛盾结论。为量化此现象,我们引入公平性分歧指数(FDI),以捕捉公平性指标间的不一致程度。进一步表明,分歧在阈值和模型配置下仍保持高位。这些发现突显了当前公平性评估实践的关键限制,并表明单一指标报告不足以可靠地评估偏见。

英文摘要

The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.

2603.28675 2026-05-21 cs.CV cs.AI cs.LG 版本更新

Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

为何聚合准确率不足以评估执法面部识别系统的公平性

Khalid Adnan Alsayed

发表机构 * Ducaltus School of Computing, Engineering & Digital Technologies(计算、工程与数字技术学院) Teesside University(泰赛德大学)

AI总结 本文探讨了在执法场景中,面部识别系统的聚合准确率作为公平性评估指标的不足,通过分析子群体误差分布,指出聚合指标可能掩盖不同群体间的显著差异,并强调需要更全面的评估框架来确保负责任的AI部署。

Comments 9 pages, 2 tables, 1 figure. Position paper with empirical subgroup analysis highlighting limitations of aggregate accuracy in fairness evaluation

详情
AI中文摘要

面部识别系统正在越来越多地应用于执法和安全领域,在这些领域中算法决策可能带来重大社会影响。尽管报告的准确率较高,但越来越多的证据表明,这些系统在不同群体中的表现往往不均衡,导致不公正的误差率和潜在危害。本文认为,聚合准确率是评估执法中面部识别系统公平性和可靠性不足的指标。通过分析子群体层面的误差分布,包括假阳性率(FPR)和假阴性率(FNR),本文展示了聚合性能指标如何掩盖不同群体间的关键差异。实证观察表明,具有相似总体准确率的系统可以表现出显著不同的公平性特征,子群体误差率在单一聚合指标下可能有显著差异。本文进一步探讨了在执法应用中以准确率为中心的评估实践所带来的操作风险,其中误分类可能导致错误怀疑或遗漏识别。它强调了公平性意识评估方法和模型无关审计策略的重要性,这些方法能够实现部署后的现实系统评估。研究结果强调了需要超越准确率作为主要指标,并采用更全面的评估框架来确保负责任的AI部署。

英文摘要

Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.
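The central claim above, that identical aggregate accuracy can hide very different subgroup error profiles, is easy to reproduce on synthetic labels. The two groups and their predictions below are made-up toy data, and `error_rates` is a hypothetical helper; nothing here comes from the paper's experiments.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Accuracy, false positive rate, and false negative rate for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    fp = np.sum(~y_true & y_pred)   # predicted match, actually non-match
    fn = np.sum(y_true & ~y_pred)   # predicted non-match, actually match
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "fpr": float(fp / np.sum(~y_true)),
        "fnr": float(fn / np.sum(y_true)),
    }

# Synthetic subgroups with the same ground truth but different error types.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred_a = [1, 1, 1, 1, 1, 1, 0, 0]   # group A: only false positives
pred_b = [1, 1, 0, 0, 0, 0, 0, 0]   # group B: only false negatives

group_a = error_rates(truth, pred_a)
group_b = error_rates(truth, pred_b)
overall = error_rates(truth + truth, pred_a + pred_b)
print(group_a, group_b, overall)
```

Both groups, and the pooled data, report 75% accuracy, yet group A carries all of the false positives (wrongful matches) and group B all of the false negatives (missed identifications).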

2601.05639 2026-05-21 cs.CV cs.LG 版本更新

Efficient training for compact compression models via sequential distillation

通过序列知识蒸馏实现紧凑压缩模型的高效训练

Caroline Mazini Rodrigues, Nicolas Keriven, Thomas Maugey

发表机构 * Univ. Rennes, Inria, CNRS, IRISA, Rennes, France(雷恩大学、法国国家信息与自动化研究所、法国国家科学研究中心、IRISA,法国雷恩)

AI总结 本文提出了一种通过序列知识蒸馏减少自动编码器压缩网络的方法,通过简化早期优化目标和逐步引入复杂性,提高了轻量级模型的重建质量与统计保真度,适用于资源受限环境。

详情
AI中文摘要

深度学习图像压缩模型在硬件受限的应用中常面临实际限制。尽管这些模型能够实现高质量的重建,但它们通常复杂、重量大且需要大量的训练数据和计算资源。我们提出了一种方法,通过更稳定的知识蒸馏过程显著减少基于自动编码器的压缩网络。其核心思想是高度减少的架构可以从早期训练中的简化优化目标中受益,随后逐步引入复杂性。因此,我们的方法首先通过序列编码器-解码器知识蒸馏阶段为轻量模型提供稳健的初始化,随后通过标准训练并可使用潜在蒸馏进行正则化。我们在两个不同的架构上评估了所得到的轻量级自动编码器在图像压缩任务中的表现。实验表明,与使用原始损失训练的轻量级自动编码器相比,我们的方法在早期epoch中更好地保持了重建质量和统计保真度,使其在资源受限环境中更具实用性。

英文摘要

Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to significantly reduce the size of autoencoder-based compression networks through a more stable knowledge distillation process. The intuition is that highly reduced architectures benefit from simplified optimization objectives in early training, with complexity gradually introduced later. Therefore, our approach begins with a sequential encoder-decoder distillation stage that provides a robust initialization for the lightweight model. This is followed by standard training that can be regularized with latent distillation. We evaluate the resulting lightweight autoencoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity in early epochs better than training lightweight autoencoders with the original loss, making it practical for resource-limited environments.

2510.09060 2026-05-21 cs.AI cs.CV 版本更新

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

让轨迹扩散:用于多样化流匹配的质量保持控制

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, Yang You

发表机构 * The University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Nanyang Technological University(南洋理工大学) CFAR, Agency for Science, Technology and Research, Singapore(新加坡科技研究局CFAR) University of California, Santa Barbara(加州大学圣巴巴拉分校) National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种无需训练的推理时控制机制,使流本身具备多样性意识,通过几何上与模式质量寻求方向解耦的引导来鼓励轨迹横向扩散,同时通过时间调度的随机扰动重新引入不确定性,从而在不降低图像细节和提示忠实度的情况下提升多样性。

详情
AI中文摘要

基于流的文本到图像模型遵循确定性轨迹,这使得在有限的采样预算下探索多样模式成本较高。现有方法提高多样性通常依赖于重新训练或降低图像保真度。为了解决这一限制,我们提出了一种无需训练的推理时控制机制,使流本身具备多样性意识。我们的核心见解是通过几何上与模式质量寻求方向解耦的引导来鼓励多样性。我们的方法通过特征空间目标同时鼓励轨迹横向扩散,并通过时间调度的随机扰动重新引入不确定性。关键在于这种扰动被投影为与生成流正交,这是一个几何约束,允许其在不降低图像细节或提示保真度的情况下提升多样性。理论上,我们证明了这种设计单调地增加了一个体积代理,同时近似地保持边际分布,为生成质量的鲁棒性提供了原理性解释。经验上,在多个文本到图像设置下,固定采样预算下,我们的方法在Vendi分数和Brisque等多样性指标上一致优于强基线,同时保持图像质量和对齐。

英文摘要

Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geometrically decoupled from the mode's quality-seeking direction. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Theoretically, we show that this design monotonically increases a volume surrogate while approximately preserving the marginal distribution, providing a principled explanation for the robustness of generation quality. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and BRISQUE over strong baselines, while upholding image quality and alignment.
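The geometric constraint above (a stochastic perturbation projected to be orthogonal to the generation flow) reduces to removing the component of the noise that is parallel to the velocity. The sketch below treats the whole latent as a single vector, which is an assumption: the abstract does not say whether the projection is applied per sample, per channel, or per token, and it does not give the time schedule, so none is modelled here.

```python
import numpy as np

def project_orthogonal(noise, velocity, eps=1e-12):
    """Return `noise` with its component along `velocity` removed (hypothetical helper)."""
    v = velocity.ravel()
    n = noise.ravel()
    coeff = (n @ v) / (v @ v + eps)   # scalar projection coefficient
    return (n - coeff * v).reshape(noise.shape)

rng = np.random.default_rng(0)
velocity = rng.normal(size=(4, 8, 8))   # stand-in for the model's predicted flow
noise = rng.normal(size=(4, 8, 8))      # raw stochastic perturbation
perp = project_orthogonal(noise, velocity)

# The projected perturbation has (numerically) zero inner product with the flow.
print(float(np.sum(perp * velocity)))
```

Because the perturbation has no component along the velocity, it moves samples sideways relative to the trajectory rather than forwards or backwards along it, which matches the intuition given for preserving image detail.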

2506.16950 2026-05-21 cs.CV cs.LG 版本更新

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

LAION-C: 一个用于网络级视觉模型的分布外基准

Fanfei Li, Thomas Klein, Wieland Brendel, Robert Geirhos, Roland S. Zimmermann

发表机构 * Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所,图宾根,德国) ELLIS Institute Tübingen(图宾根ELLIS研究所) Tübingen AI Center(图宾根人工智能中心) Google DeepMind(谷歌DeepMind)

AI总结 本文提出LAION-C作为ImageNet-C的替代基准,旨在评估网络级数据集时代的分布外鲁棒性,通过引入六种新的分布外失真类型,发现LAION-C对当代模型构成显著挑战,同时最佳模型已达到或超过最佳人类观察者的水平。

Comments ICML 2025 camera ready version

详情
AI中文摘要

分布外(OOD)鲁棒性是计算机视觉模型的期望属性。提高模型鲁棒性需要来自鲁棒性基准的高质量信号来量化进展。尽管在ImageNet时代提出了多种基准数据集,如ImageNet-C,但大多数ImageNet-C的损坏类型相对于当今的大型网络爬取数据集已不再是分布外的,因为这些数据集已经包含模糊或JPEG压缩伪影等常见损坏。因此,这些基准不再适合评估网络级数据集时代的分布外鲁棒性。事实上,最近的模型在ImageNet时代的分布外基准上显示出饱和分数,这使得人们无法确定在网络级数据集上训练的模型是否真的在分布外泛化上更好,还是仅仅在训练过程中见过测试中的失真。为此,我们引入LAION-C作为ImageNet-C的替代基准。LAION-C包含六种新的失真类型,专门设计为即使对于LAION这样的网络级数据集也是分布外的。在对最新模型的全面评估中,我们发现LAION-C数据集对当代模型提出了重大挑战,包括Gemini和GPT-4o等多模态大语言模型(MLLMs)。我们还进行了心理物理实验来评估这些损坏对人类观察者的难度,从而能够将模型与实验室质量的人类鲁棒性数据进行比较。我们观察到分布外泛化的一个范式转变:从人类优于模型,到最佳模型现在匹配或优于最佳人类观察者。

英文摘要

Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

2506.08277 2026-05-21 q-bio.NC cs.AI cs.CL cs.CV cs.LG 版本更新

Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

基于任务的指令调制多模态大语言模型探测:在自然主义刺激下的区域特定大脑对齐模式

Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

发表机构 * Technische Universität Berlin(柏林工业大学) Rice University(莱斯大学) AWS AI Labs, Amazon(亚马逊AWS人工智能实验室) IIT Delhi(印度理工学院德里分校) University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Spector Inc(Spector 公司) IIIT-Hyderabad(海得拉巴国际信息技术学院) Microsoft(微软)

AI总结 本研究探讨了指令调制多模态大语言模型在自然主义刺激下的大脑对齐模式,通过比较不同模型在视频和音频任务中的表现,揭示了指令调制对模型表示能力的影响。

Comments 57 pages, 39 figures

详情
AI中文摘要

近期的体素级多模态脑编码研究显示,多模态大语言模型(MLLMs)在大脑对齐程度上高于单模态模型。更近期的研究表明,指令调制多模态(IT)模型能够生成与大脑活动强相关的任务特定表示,但大多数先前评估集中在单模态刺激或非指令调制模型上。我们仍然缺乏对指令调制是否使IT-MLLMs围绕功能任务需求组织其表示,还是仅反映表面语义的清晰理解。为此,我们通过预测自然主义电影观看(带音频的视频)期间记录的fMRI响应,来估计大脑对齐情况。使用来自六个视频和两个音频IT-MLLMs的指令特定嵌入,跨13个视频任务指令,我们发现指令调制视频MLLMs的大脑对齐程度高于上下文学习(ICL)多模态模型(~9%)、非指令调制多模态模型(~15%)和单模态基线(~20%)。我们对视频和音频任务以及语言引导的探测评估,产生了不同任务特定的MLLM表示,这些表示在不同大脑区域中变化。我们还发现,ICL模型表现出强语义组织(r=0.78),而IT模型与指令文本语义的耦合较弱(r=0.14),这与与更高大脑对齐相关的任务条件子空间一致。这些发现支持了任务特定指令与更强的大脑-MLLM对齐之间的关联,并为映射两个系统中的联合信息处理开辟了新途径。我们公开了代码 [https://github.com/subbareddy248/mllm_videos]。

英文摘要

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs show higher brain alignment than in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, and language-guided probing produces distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].
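Brain alignment in the sense used above is estimated by predicting fMRI responses from model representations and scoring the predictions per voxel. The sketch below uses closed-form ridge regression and Pearson correlation, a common choice for voxel-wise encoding models; whether the authors use exactly this estimator, the regularisation strength `alpha`, and the train/test split are assumptions, and all data are synthetic.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    return (yt * yp).sum(axis=0) / (np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                      # stand-in for stimulus embeddings (time x features)
W_true = rng.normal(size=(16, 5))
Y = X @ W_true + 0.1 * rng.normal(size=(200, 5))    # synthetic responses for 5 "voxels"

W = fit_ridge(X[:150], Y[:150], alpha=1.0)          # fit on the first 150 time points
scores = voxelwise_pearson(Y[150:], X[150:] @ W)    # evaluate on the held-out 50
print(scores)
```

Comparing such per-voxel scores across embedding sources (for example, different instructions or model families) is what yields region-specific alignment differences like those reported in the abstract.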