2605.28780 2026-05-28 cs.CV cs.LG (updated version)

Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

Thomas Vitry, Kieran Edgeworth, Stefan Wermter, Jae Hee Lee

Affiliations: University of Hamburg; Ecole Normale Superieure de Rennes

AI summary: Proposes a post-hoc method that needs no bias labels. It extracts concept vectors with non-negative matrix factorization and uses gradient signals from misclassified samples to identify spurious correlations in vision models, improving worst-group accuracy without retraining.

Comments: Accepted to the 49th German Conference on Artificial Intelligence (KI2026)

Abstract

Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models, relying only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to get activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute; suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Our method identifies decision-relevant spurious directions that need not coincide with annotated ones, providing both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at https://github.com/vitryt/label-free-bias-identification.
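The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the concept vectors have already been extracted (for example by NMF) and that each misclassified example contributes a "correction direction", taken here to be the negative loss gradient with respect to the intermediate activation. The function names `bias_scores` and `rank_concepts` and the exact scoring formula are hypothetical; the paper's estimator may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean(values):
    """Arithmetic mean, returning 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def bias_scores(concepts, fn_corrections, fp_corrections):
    """Score concepts by how error corrections move along them.

    concepts:        list of concept vectors (assumed given, e.g. from NMF).
    fn_corrections:  correction directions (negative gradients) of false negatives.
    fp_corrections:  correction directions (negative gradients) of false positives.

    Assumption: a bias concept is pushed up when fixing false negatives
    and pushed down when fixing false positives, so the score is the
    difference of the two mean projections.
    """
    scores = []
    for concept in concepts:
        fn_push = mean(dot(g, concept) for g in fn_corrections)
        fp_push = mean(dot(g, concept) for g in fp_corrections)
        scores.append(fn_push - fp_push)
    return scores


def rank_concepts(scores):
    """Return concept indices sorted from most to least bias-like."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    concepts = [[1.0, 0.0], [0.0, 1.0]]
    fn = [[0.5, 0.1], [0.7, -0.1]]   # corrections activate concept 0
    fp = [[-0.6, 0.0]]               # corrections suppress concept 0
    scores = bias_scores(concepts, fn, fp)
    print(rank_concepts(scores))
```

In this toy setting concept 0 ranks first because corrections raise it for false negatives and lower it for false positives, which is the signature the abstract attributes to bias concepts.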

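2605.28779 2026-05-28 cs.CL cs.CV (updated version)

The Abstraction Gap in Vision-Language Causal Reasoning

Chinh Hoang, Mohammad Rashedul Hasan

Affiliations: Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA

AI summary: Addresses the conflation of linguistic fluency with faithful causal reasoning in the causal explanations generated by vision-language models (VLMs). Proposes a dual-probe methodology and the Abstraction Gap (AG) metric; evaluation on the CAGE benchmark finds that most models show a substantial AG, though pretraining and architectural choices can narrow it.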

Abstract

Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6 to 8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
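The abstract describes AG only as a "normalized performance difference" and does not give the formula. The sketch below assumes one plausible normalization, dividing the gap between the two probe scores by the text score; the function name `abstraction_gap` and this choice of denominator are assumptions rather than the paper's definition.

```python
def abstraction_gap(text_score, chain_score):
    """Hypothetical AG: gap between probe scores, normalized by the text score.

    Assumes both scores share one scale and that text_score is positive.
    A value near 0 means the chain probe keeps up with the text probe;
    a value near 1 means fluent text with almost no faithful chain.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return (text_score - chain_score) / text_score


if __name__ == "__main__":
    # Scores in the ranges the abstract reports (text 6 to 8, chain below 2.5).
    print(abstraction_gap(8.0, 2.0))
    print(abstraction_gap(6.0, 6.0))
```

Under this assumed normalization, any text score in the reported range paired with a chain score below 2.5 gives a gap above 0.5, which is at least consistent with the figures quoted in the abstract.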

2605.28741 2026-05-28 cs.CV (updated version)

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; ShanghaiTech University

AI summary: Proposes SeProD, a self-prophetic decoding framework that uses the pre-trained model's intrinsic single-step capabilities to strengthen coherent multi-step visual-search reasoning in LVLMs in a training-free, plug-and-play manner, with consistent gains across 12 splits of 4 benchmarks.

Comments: Accepted at ICML 2026

Abstract

Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.
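To make the idea of "selectively accepting prophetic tokens under the output distribution" concrete, here is a minimal sketch that borrows the acceptance rule from standard speculative sampling: a proposed token is accepted with probability min(1, q/p), where p is the proposer's probability and q is the verifier's. Whether SeProD uses this exact rule is not stated in the abstract, so the rule, the function name `accept_prophetic_tokens`, and the stop-at-first-rejection behaviour are all assumptions.

```python
import random


def accept_prophetic_tokens(proposed, prophet_probs, post_probs, rng=None):
    """Accept a prefix of proposed tokens with a speculative-sampling style test.

    proposed:       tokens suggested by the prophet (pre-training) model.
    prophet_probs:  probability the prophet assigned to each token (must be > 0).
    post_probs:     probability the post-training model assigns to the same token.
    Stops at the first rejected token, so the accepted tokens form a prefix.
    """
    rng = rng or random.Random(0)
    accepted = []
    for token, p, q in zip(proposed, prophet_probs, post_probs):
        if rng.random() < min(1.0, q / p):
            accepted.append(token)
        else:
            break
    return accepted


if __name__ == "__main__":
    tokens = ["zoom", "left", "crop"]
    # The second token has zero probability under the post-training model,
    # so it is always rejected and decoding falls back after the first token.
    print(accept_prophetic_tokens(tokens, [0.5, 0.5, 0.5], [0.5, 0.0, 0.5]))
```

Because all proposed positions can be scored by the verifier in one forward pass, a rule of this kind is compatible with the parallel acceptance the abstract mentions.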

2605.28735 2026-05-28 cs.CV (updated version)

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Hongyu Wen, Jia Deng

Affiliations: Department of Computer Science, Princeton University

AI summary: Proposes SeeGroup, which models multi-layer depth as a point process and uses a permutation-invariant loss to enable adaptive grouping, substantially improving multi-layer depth estimation accuracy for transparent surfaces.

Abstract

Transparent objects are common in daily life, and it is important to understand their multi-layer depth, including the transparent surface and the objects behind it. Existing methods for multi-layer depth typically extend single-layer prediction. They define layers by the front-to-back ordering of 3D points and predict the layers sequentially. However, as layered geometry can admit multiple valid groupings of 3D points into layers, a predefined grouping strategy is inherently restrictive. In this work, we propose SeeGroup, a multi-layer depth estimation method that avoids imposing a predefined grouping and allows the model itself to adaptively assign surfaces to depth maps. We formulate per-pixel multi-layer depth as a point process, treating depth layers as unordered events along each camera ray. This induces a permutation-invariant likelihood over the observed depth layers, yielding a loss that naturally supports arbitrary layer groupings. Experiments demonstrate that our method significantly advances the state of the art in multi-layer depth estimation, improving quadruplet relative depth accuracy on the LayeredDepth benchmark from 61.34% to 70.09%. Code is available at https://github.com/princeton-vl/SeeGroup.
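The permutation-invariance property can be illustrated with a much simpler stand-in than the point-process likelihood the abstract describes. The sketch below uses a brute-force minimum over permutations of an L1 error for a single pixel, so the loss does not depend on which output slot holds which surface. This is a swapped-in illustration, not the SeeGroup loss, and the function name is hypothetical.

```python
from itertools import permutations


def permutation_invariant_l1(pred_layers, gt_layers):
    """Smallest total L1 error over all assignments of predictions to targets.

    pred_layers, gt_layers: depth values along one camera ray (same count).
    Brute force over permutations is fine for the handful of layers per pixel.
    """
    if len(pred_layers) != len(gt_layers):
        raise ValueError("this sketch assumes equal numbers of layers")
    return min(
        sum(abs(p - g) for p, g in zip(perm, gt_layers))
        for perm in permutations(pred_layers)
    )


if __name__ == "__main__":
    # Swapping the order of the predicted layers leaves the loss unchanged.
    print(permutation_invariant_l1([3.0, 1.0], [1.0, 3.0]))
    print(permutation_invariant_l1([1.0, 3.0], [1.0, 3.0]))
```

A likelihood-based formulation such as the one in the abstract achieves the same invariance without enumerating assignments, which matters once the loss is evaluated at every pixel.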

2605.28697 2026-05-28 eess.IV cs.AI cs.CV (updated version)

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

Thierry Judge, Nicolas Duchateau, Andreas Østvik, Khuram Faraz, Anders Austlid Taskén, Sigve Karlsen, Thor Edvardsen, Harald Brunvand, Md Abulkalam Azad, Havard Dalen, Bjørnar Grenne, Gabriel Kiss, Pierre-Yves Courand, Lasse Lovstakken, Pierre-Marc Jodoin, Olivier Bernard

Affiliations: Dept. of Computer Science, University of Sherbrooke; INSA, Université Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS; Institut Universitaire de France (IUF); Cardiology Dept., Hôpital Croix-Rousse, Hospices Civils de Lyon; Cardiology Dept., Hôpital Lyon Sud, Hospices Civils de Lyon; Dept. of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology (NTNU); Dept. of Circulation and Medical Imaging, NTNU; Department of Medicine, Hospital of Southern Norway, Arendal, Norway; Dept. of Cardiology and Cardiothoracic Surgery, St. Olavs Hospital, Trondheim, Norway; Dept. of Health Research, SINTEF Digital, Trondheim, Norway; Dept. of Medicine, Levanger Hospital, Nord-Trøndelag Hospital Trust, Levanger, Norway; Dept. of Cardiology, Oslo University Hospital, Rikshospitalet and the Faculty of Medicine, University of Oslo, Norway

AI summary: To address the lack of reliable motion references for strain estimation in echocardiography, proposes a simulation strategy that combines speckle decorrelation measures from real videos with an iterative refinement process, producing a realistic dataset for training a motion estimation algorithm and achieving global and regional strain performance that surpasses the clinical reference.

Comments: 10 pages

Abstract

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities from clinical data. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.
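For readers unfamiliar with the quantity being estimated, the sketch below computes strain from tracked contour points using the standard Lagrangian definition, the relative change in length with respect to a reference frame, expressed in percent. The abstract does not specify how the paper derives strain from its motion fields, so the contour representation and function names here are illustrative assumptions.

```python
import math


def contour_length(points):
    """Length of a polyline given as a list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def lagrangian_strain_percent(length, reference_length):
    """Lagrangian strain in percent: 100 * (L - L0) / L0."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return 100.0 * (length - reference_length) / reference_length


if __name__ == "__main__":
    # Hypothetical myocardial contour at the reference frame and after shortening.
    reference = [(0.0, 0.0), (3.0, 4.0), (3.0, 9.0)]
    deformed = [(0.0, 0.0), (2.4, 3.2), (2.4, 7.2)]
    l0 = contour_length(reference)
    l1 = contour_length(deformed)
    print(lagrangian_strain_percent(l1, l0))
```

Negative values indicate shortening. Applying the same formula to the whole contour gives a global value, while applying it per segment gives regional values, which is where tracking errors weigh most heavily.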

2605.28691 2026-05-28 cs.CV (updated version)

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan

Affiliations: Peking University; Nanyang Technological University, Singapore

AI summary: Proposes OSP-Next, a text-to-video generation model that combines a hybrid full-sparse attention architecture, Sparse Sequence Parallelism (SSP), HiF8 quantization, and Mix-GRPO post-training to substantially improve efficiency while preserving quality, with speedups of more than 1.5× on NVIDIA H200 and Ascend 950PR.

Abstract

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

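2605.28630 2026-05-28 cs.CV cs.MM (updated version)

EntroAD: Structural Entropy-Guided Prompt Adaptation for Zero-Shot Anomaly Detection

Xinyu Zhao, Qingyun Sun, Jiayi Luo, Jianxin Li

Affiliations: Beihang University

AI summary: Proposes EntroAD, a framework that uses a structural entropy-guided dynamic routing mechanism and confidence-aware dual-branch prompt adaptation for zero-shot anomaly detection, reaching state-of-the-art performance in cross-dataset settings.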

Abstract

Zero-Shot Anomaly Detection (ZSAD) aims to detect anomalies in unseen domains without target-domain adaptation. Recent CLIP-based methods have shown promising performance by leveraging prompt learning and visual-text alignment. However, most existing approaches rely on a single adaptation pathway, which may be insufficient for heterogeneous anomaly patterns across domains. In practice, anomalies exhibit vastly different characteristics, ranging from salient, localized structural disruptions to subtle, diffuse, and irregular variations. To address this challenge, we propose EntroAD, a structural entropy-guided zero-shot anomaly detection framework. Unlike previous methods, EntroAD introduces a dynamic routing mechanism to process different types of anomalies with specialized adaptation strategies. Specifically, we estimate patch-level structural entropy from self-attention-induced patch relations and use it as a proxy for relational uncertainty to guide anomaly-aware token routing. Based on this routing signal, we construct anomaly-aware routed tokens to better capture anomaly cues with different structural characteristics. We further introduce a confidence-aware dual-branch prompt adaptation module to stabilize visual-text alignment while preserving CLIP's transferable prior. Extensive experiments on 10 industrial and medical benchmarks show that EntroAD achieves state-of-the-art performance in challenging cross-dataset ZSAD settings.
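One way to picture "patch-level structural entropy from self-attention-induced patch relations" is to treat patches as nodes of a weighted graph whose edge weights come from a symmetrized attention matrix, then take each node's term in the one-dimensional structural entropy, -(d_i / vol) * log2(d_i / vol), where d_i is the weighted degree and vol is the total degree. The graph construction and the use of the one-dimensional form are assumptions made for this sketch; EntroAD's exact estimator is not given in the abstract.

```python
import math


def patch_structural_entropy(attention):
    """Per-patch terms of one-dimensional structural entropy.

    attention: square list-of-lists, attention[i][j] from patch i to patch j.
    Edge weights are the symmetrized off-diagonal attention values
    (an assumption of this sketch). Returns one non-negative term per patch;
    their sum is the graph's one-dimensional structural entropy.
    """
    n = len(attention)
    weights = [
        [0.5 * (attention[i][j] + attention[j][i]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degrees = [sum(row) for row in weights]
    volume = sum(degrees)
    terms = []
    for degree in degrees:
        share = degree / volume if volume > 0 else 0.0
        terms.append(-share * math.log2(share) if share > 0 else 0.0)
    return terms


if __name__ == "__main__":
    uniform = [[0.25] * 4 for _ in range(4)]
    print(patch_structural_entropy(uniform))
```

A routing rule could then threshold these per-patch values to decide which adaptation branch handles a token, although the threshold and the branches themselves are outside what this sketch covers.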

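2605.28619 2026-05-28 cs.CV nlin.AO (updated version)

A Multiscale Kinetic Framework for Image Segmentation: From Particle Systems to Continuum Models

Horacio Tettamanti, Giulia Guicciardi, Mattia Zanella

AI summary: Proposes a consensus-based multiscale kinetic framework that treats an image as a particle system, derives kinetic equations and a macroscopic model, and performs image segmentation with particle-based optimisation.

Comments: 26 pages, 34 figures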

Abstract

In this work, we present a multiscale kinetic framework for consensus-based image segmentation. By interpreting an image as a system of interacting particles, each pixel is characterised by its spatial position and an internal feature encoding color information. We introduce a coupled interaction scheme governing the evolution of particles in both position and feature spaces, from which we derive a kinetic formulation for the particle density in the space-feature domain combining transport, aggregation, and diffusion effects. Furthermore, through a suitable scaling, we obtain a first-order macroscopic model describing the evolution of the fraction of pixels carrying information on the fraction of pixels having a certain feature. Based on this reduced-complexity model, we present a data-oriented approach where we make use of particle-based optimisation techniques for the accurate segmentation of images. Numerical tests show the effectiveness of the proposed framework and its robustness under different noise conditions.

2605.28615 2026-05-28 cs.CV (updated version)

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu

Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing

AI summary: Proposes BiDPO, a framework that builds the large-scale preference dataset BiComp, extends Diffusion DPO to jointly optimize image and text preferences, and adds region-level guidance to improve the fidelity of text-to-image models on complex compositional prompts.

Abstract

Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, and counting) remains challenging. To address this, we propose BiDPO, a framework that enhances the compositional generation capability of T2I models. We begin by introducing a carefully designed pipeline to construct a large-scale preference dataset, BiComp, with strict quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which proves highly effective at improving the models' ability to follow complex text prompts during generation. To further enhance fine-grained alignment, we employ a region-level guidance method that focuses on regions relevant to compositional concepts. Experimental results demonstrate that BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.

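2605.28609 2026-05-28 cs.CV (updated version)

JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models

Jiachen Qian

Affiliations: City University of Hong Kong

AI summary: Proposes JECA^2, a white-box adversarial attack on forensic vision-language models that uses Grad-CAM-guided visual perturbations and text-embedding optimization under a token-proximity constraint to keep judgment and explanation consistent; experiments show higher attack success and consistency than the baselines.

Comments: 37 pages, 6 figures. Includes supplementary material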

Abstract

Forensic vision-language models (VLMs) have recently been developed to detect image tampering and provide natural-language explanations. However, their robustness against adversarial manipulation remains underexplored. Existing adversarial attacks typically aim to flip the model's binary judgment, while the accompanying explanation may still reveal forensic cues and contradict the attacked judgment. In this paper, we study judgment-explanation consistent adversarial attacks against forensic VLMs and propose JECA^2, a controlled white-box red-team diagnostic that jointly redirects visual attribution and aligns textual explanations with the target judgment. On the visual side, JECA^2 uses Grad-CAM-guided perturbations to divert attribution from tampered regions toward benign regions. On the textual side, it optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint. Experiments on forensic VLM benchmarks show that JECA^2 achieves higher attack success and automated judgment-explanation consistency than implemented baselines under white-box threat settings, while transfer to closed-source VLMs remains measurable but limited. Our results highlight a consistency failure mode in explanation-based forensic VLMs and motivate future robustness evaluation beyond binary detection accuracy.

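2605.28605 2026-05-28 cs.CV (updated version)

Internally Referenced Low-Light Enhancement

Peiyuan He, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

Affiliations: College of Intelligence and Computing, Tianjin University, Tianjin 300350, China

AI summary: Proposes an internally referenced low-light enhancement framework that extracts physical and structural references from the degraded input and combines local exposure simulation, a dual-domain preservation strategy, and gain-adaptive feature modulation for self-supervised low-light image enhancement, reaching state-of-the-art noise suppression and texture fidelity.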

Abstract

Self-supervised low-light image enhancement (LLIE) is highly appealing as it eliminates the reliance on external paired data. However, the lack of external references causes networks to struggle with decoupling entangled illumination, delicate textures, and amplified noise. To resolve this challenge, we propose an Internally Referenced LLIE framework that extracts reliable physical and structural references from the degraded input image itself. First, we introduce a local exposure-simulated scheme to extract a low-frequency pseudo ground-truth. This serves as an internal physical reference to guide global illumination estimation and correct color casts. Second, we propose a dual-domain preservation strategy with spatial and spectral constraints to construct internal structural references. Specifically, an Illumination-Aligned Perceptual loss preserves global structures under illumination shifts, while a Shift-Invariant Spectral Correlation loss captures fine-grained local structures and suppresses high-frequency noise. Finally, we propose a Gain-Adaptive Feature Modulation (GAFM) mechanism to address highly spatially-variant residual noise. By transforming the self-estimated illumination map into an internal spatial gain prior, GAFM dynamically guides a blind-spot network for spatially-aware denoising. Extensive experiments demonstrate that our method achieves state-of-the-art performance, delivering superior noise suppression and textural fidelity. Code will be publicly released at https://visonj.github.io/IRLE/.

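2605.28604 2026-05-28 cs.CV cs.AI (updated version)

DriveWAM: Video Generation Priors Enable Scalable World-Action Modeling for Autonomous Driving

Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang

Affiliations: The Chinese University of Hong Kong, Shenzhen; Voyager Research, Didi Chuxing

AI summary: Proposes DriveWAM, which adapts a pretrained video diffusion transformer into an autoregressive video-action policy and introduces scene-evolving driving guidance and selective KV memory for scalable world-action modeling, achieving strong planning performance on the NAVSIM and PhysicalAI benchmarks.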

Abstract

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.
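The "relevance-redundancy cache selection" behind the selective KV memory can be pictured as a greedy pick under a fixed budget: prefer cached entries that are similar to the current query while penalizing entries that resemble ones already kept. The sketch below uses a maximal-marginal-relevance style score with cosine similarity; the scoring rule, the `trade_off` parameter, and the function name `select_memory` are assumptions, not DriveWAM's actual procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors, 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def select_memory(keys, query, budget, trade_off=0.5):
    """Greedily choose up to `budget` cache entries.

    Each step picks the entry maximizing
    trade_off * relevance(query) - (1 - trade_off) * max similarity to kept entries.
    Returns the chosen indices in selection order.
    """
    selected = []
    candidates = list(range(len(keys)))
    while candidates and len(selected) < budget:
        def score(index):
            relevance = cosine(keys[index], query)
            redundancy = max(
                (cosine(keys[index], keys[kept]) for kept in selected),
                default=0.0,
            )
            return trade_off * relevance - (1.0 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
    # Entries 0 and 1 are near-duplicates, so only one of them is kept.
    print(select_memory(keys, [1.0, 0.5], budget=2))
```

Keeping the budget fixed is what bounds memory over a long rollout; a modality-aware variant would presumably run one such selection per pool (video and action), which this sketch does not model.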

2605.28495 2026-05-28 cs.CV (updated version)

Janus-LoRA: A Balanced Low-Rank Adaptation for Continual Learning

Cheng Chen, Pengpeng Zeng, Yuyu Guo, Lianli Gao, Hengtao Shen, Jingkuan Song

Affiliations: School of Computer Science and Technology, Tongji University, Shanghai, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Shanghai Innovation Institute, Shanghai, China; Independent Researcher

AI summary: Proposes Janus-LoRA, a framework that uses Gradient Rectification to enforce parameter-level orthogonality against catastrophic forgetting and a Decoupled Margin Loss to strengthen feature-level separation, balancing stability and plasticity in continual learning.

Comments: 9 pages, International Conference on Machine Learning

Abstract

Low-Rank Adaptation (LoRA) has emerged as a promising paradigm for Continual Learning. It independently updates its low-rank factors ($A$ and $B$), creating a composite update to the full weight matrix through their interaction. To prevent catastrophic forgetting, this update should remain orthogonal to the task-specific subspace that contains previously learned knowledge. However, we identify that this composite update systematically violates this orthogonality, reintroducing interference and undermining stability. Furthermore, naively enforcing this orthogonality compromises plasticity, disrupting the delicate stability-plasticity trade-off. To resolve these issues, we propose Janus-LoRA, a framework that restores this balance through two novel components. Specifically, we first introduce Gradient Rectification, a closed-form solution that mathematically decouples LoRA's factor updates, enforcing orthogonality against the historical knowledge subspace identified by an efficient Online Estimation. Next, to enhance plasticity, we introduce a Decoupled Margin Loss that promotes feature-level separation by pushing new feature representations away from old ones, thus creating distinct, low-interference regions for new learning. Comprehensive experiments on challenging benchmarks demonstrate that by harmonizing parameter-level orthogonality with feature-level separation, Janus-LoRA achieves a superior balance and establishes new state-of-the-art performance.
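The orthogonality requirement can be made concrete with the generic building block behind many continual-learning methods: removing from a gradient its component inside a protected subspace, g' = g - sum_k <g, u_k> u_k, for an orthonormal basis {u_k}. The sketch below applies this to a flattened gradient vector. It is the textbook projection, not the paper's closed-form Gradient Rectification for the LoRA factors, and the function name is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(gradient, orthonormal_basis):
    """Remove the component of `gradient` lying in the span of the basis.

    orthonormal_basis: list of mutually orthogonal unit vectors spanning the
    subspace to protect (assumed here to encode earlier tasks' knowledge).
    The result is orthogonal to every basis vector.
    """
    rectified = list(gradient)
    for basis_vector in orthonormal_basis:
        coefficient = dot(gradient, basis_vector)
        rectified = [r - coefficient * b for r, b in zip(rectified, basis_vector)]
    return rectified


if __name__ == "__main__":
    old_task_basis = [[1.0, 0.0, 0.0]]
    gradient = [2.0, 3.0, 4.0]
    print(project_out(gradient, old_task_basis))
```

The difficulty the abstract points to is that projecting the gradients of $A$ and $B$ separately does not guarantee that the composite update to the full weight matrix is orthogonal to the protected subspace, which is why a rule that accounts for both factors jointly is needed.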

2605.28491 2026-05-28 cs.CV (updated version)

DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

Kaiyang Ji, Bingsheng Qian, Binghuan Wu, Kangyi Chen, Ye Shi, Jingya Wang

Affiliations: ShanghaiTech University

AI summary: For real-time audio-responsive character control, proposes DiscoForcing, a framework that combines a causal music encoder with a diffusion-forcing sequence model to keep audio and full-body motion stably aligned under strictly causal, bounded-latency streaming generation.

Comments: Accepted by ICML 2026

Abstract

We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly, including tempo shifts, drops, or user edits. Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.


2605.28490 2026-05-28 cs.CV cs.AI version update

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

Affiliations: The Hong Kong University of Science and Technology; Soochow University

AI summary: To address the brittleness of unified 3D-LLMs on fine-grained queries, proposes SSR3D-LLM, which refines candidate rankings step by step through latent spatial reasoning steps and a geometry-aware scorer, achieving the strongest results among unified 3D-LLM baselines on several benchmarks.

Abstract

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.


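2605.28459 2026-05-28 cs.CV version update

VITAL: Visual-Semantic Dual Supervision Enhances Interpretable Latent Reasoning in Medical Multimodal Large Language Models

Qiaoru Li, Shaotian Liang, Jintao Chen, Haoran Sun, Yuxiang Cai, Jianwei Yin, Yankai Jiang

Affiliations: Zhejiang University; Shanghai AI Laboratory; Tencent; Ningbo Global Innovation Center, Zhejiang University; Zhejiang Key Laboratory of Digital-Intelligence Service Technology

AI summary: Proposes the VITAL framework, which makes latent reasoning in medical MLLMs interpretable through visual-semantic dual supervision (a text decoder reconstructs reasoning chains and a visual projector regresses ROI features), reaching state-of-the-art results on 7 benchmarks.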

Abstract

Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.


2605.28401 2026-05-28 cs.CV version update

EgoRelight: Egocentric Human Capture and Illumination Recovery for Relightable and Photoreal Avatar Rendering

Jianchun Chen, Yinda Zhang, Rohit Pandey, Thabo Beeler, Marc Habermann, Christian Theobalt

Affiliations: MPI for Informatics, SIC & VIA Research Center, Saarbrücken, Germany; Google, Mountain View, USA; Google, Zürich, Switzerland; MPI for Informatics & VIA Research Center, Saarbrücken, Germany

AI summary: Proposes the EgoRelight framework, which extracts depth maps from stereo down-facing cameras on a head-mounted display to drive a mesh-based avatar, uses a neural appearance model to synthesize view-dependent specular and view-independent diffuse shading separately, and recovers an HDR environment map through test-time inverse rendering, enabling full-body performance capture, photorealistic relightable appearance synthesis, and environment lighting estimation from a single HMD.

DOI: 10.1145/3811346

Abstract

Mixed Reality (MR) headsets promise a future of immersive telepresence where virtual humans blend indistinguishably into real or virtual surroundings. Achieving this vision requires a method for capturing a user's motion, estimating appearance under novel lighting, and understanding the environment - all from the constrained viewpoint of a head-mounted display (HMD). Existing approaches treat these as isolated problems: they either focus on driving avatars with baked-in lighting or rely on studio setups for relighting. In this paper, we present EgoRelight, a holistic framework for egocentric telepresence that simultaneously captures full-body human performance, synthesizes photorealistic and relightable appearance, and estimates high dynamic range (HDR) environment maps from a single HMD. First, to ensure motion and surface reconstruction, we propose an egocentric perception module that leverages stereo down-facing cameras to extract dense depth maps, which serve as geometric control signals to drive a mesh-based avatar. Second, we introduce a novel neural appearance model that learns to synthesize view-dependent specular and view-independent diffuse shading separately. By employing a specialized ray-sampling strategy, our model generalizes to unseen illumination without relying on restrictive analytical BRDF priors. Third, we enable seamless avatar integration into the physical world via a test-time inverse rendering process, which recovers an HDR environment map by matching the pre-trained avatar's appearance to live egocentric camera observations. We demonstrate our system through a social telepresence application, where remote users are coherently relit according to their physical environment. Extensive experiments show that our components and the integrated system significantly outperform state-of-the-art baselines in geometric accuracy and rendering as well as relighting fidelity.


2605.28397 2026-05-28 cs.CV version update

Adaptive Temporal Gating of Longitudinal Magnetic Resonance Imaging for Alzheimer's Prediction

Alireza Moayedikia, Sara Fin, Alicia Troncoso Lora, Uffe Kock Wiil

Affiliations: School of Business, Law and Entrepreneurship, Swinburne University of Technology, Melbourne, VIC, Australia; Australian Regenerative Medicine Institute, Monash University, Melbourne, VIC, Australia; Data Science & Big Data Lab, Universidad Pablo de Olavide, Seville, Spain; The Maersk Mc-Kinney Møller Institute, University of Southern Denmark, Odense, Denmark

AI summary: Proposes TAF-Net, a hybrid CNN-Transformer architecture that fuses spatiotemporal representations of longitudinal 3D MRI through adaptive temporal gating; using only structural MRI, it achieves the best performance among the evaluated methods on MCI-to-AD conversion prediction and approaches methods that require multimodal data.

Abstract

Predicting conversion from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is critical for early intervention. Current deep learning paradigms predominantly rely on cross-sectional structural MRI, neglecting prognostic value in patient-specific anatomical trajectories. We introduce the Temporal Adaptive Fusion Network (TAF-Net), a hybrid CNN-Transformer architecture that models paired longitudinal 3D MRI scans. Central to TAF-Net is a Temporal Fusion Module governed by an Adaptive Temporal Gate, which learns patient-specific weightings to synthesize three spatiotemporal representations: explicit structural change, region-to-region temporal cross-attention, and bilateral feature concatenation. Evaluated on the Alzheimer's Disease Neuroimaging Initiative cohort for three-year MCI-to-AD conversion prediction, TAF-Net achieved the highest discriminative performance among all evaluated methods using only structural MRI, significantly outperforming the strongest baseline and approaching multimodal methods requiring PET, CSF, or genetic data. The architecture exhibited exceptional data efficiency, matching baseline performance with a fraction of training data. Ablation studies demonstrate that longitudinal fusion improves discrimination while reducing predictive variance by 48% compared to single-timepoint evaluation. Interpretability analyses reveal spatial attention aligned with established AD pathology in the medial temporal lobe and ventricles, while the gating mechanism prioritizes explicit volumetric change with strong positive correlation to conversion risk.
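
The gating idea in the abstract, patient-specific weights that combine three representations, can be sketched as a softmax-weighted sum. This is an illustrative reading only: the abstract does not state that the gate is a softmax, and the function names, toy feature vectors, and gate logits below are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(branches, gate_logits):
    """Weighted sum of equally sized feature vectors, one weight per branch."""
    weights = softmax(gate_logits)
    dim = len(branches[0])
    fused = [sum(w * b[i] for w, b in zip(weights, branches)) for i in range(dim)]
    return fused, weights


# Toy features for the three representations named in the abstract.
structural_change = [0.9, 0.1]
cross_attention = [0.2, 0.4]
bilateral_concat = [0.5, 0.5]

# Hypothetical per-patient gate logits; a real model would predict them from the scans.
fused, weights = gated_fusion(
    [structural_change, cross_attention, bilateral_concat], [2.0, 0.0, 0.0]
)
print(weights)  # the first branch receives the largest weight
```

Because the weights are produced per patient and sum to one, inspecting them gives a direct view of which representation the model relied on, which is the kind of interpretability analysis the abstract reports for the gating mechanism.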


2605.28394 2026-05-28 cs.CV cs.GR version update

Sketch2Motion: Text-driven 2D Sketch to 3D Animation via Diffusion-guided Skeleton Optimization

Gaurav Rai, Ojaswa Sharma

Affiliations: Graphics Research Group, IIIT Delhi

AI summary: Proposes the Sketch2Motion framework, which combines a diffusion model with skeleton optimization to turn 2D sketches into 3D animations without paired motion data and supports multiple character types.

Abstract

Animation of 2D hand-drawn sketches provides an effective medium for visual communication. However, these sketches pose challenges, particularly in handling occlusions and accurately mapping motion. While 3D animation naturally addresses these challenges, estimating 3D motion remains a very complex task. Recent approaches to converting 2D sketches to 3D animations have mainly focused on specific types of motion, such as bipedal movements and facial expressions. We propose Sketch2Motion, a diffusion-guided framework for skeleton-based motion synthesis that combines classical character animation pipelines with deep generative priors. Our method represents motion using skeletal transformations, which are propagated to mesh deformations via linear blend skinning. To guide the resulting animation toward realistic and semantically meaningful motion, we integrate a text-to-video diffusion model via motion-aware score-distillation sampling (MoSDS), enabling optimization without paired motion data. Additionally, we apply physics-inspired smoothness, topological, and contact constraints to stabilize optimization and preserve motion plausibility. Further, we integrate a spring-mass simulator to introduce secondary motion effects. The proposed framework is generalized, fully differentiable, modular, and compatible with biped, quadruped, and non-living articulated characters. Experiments demonstrate that our approach produces temporally coherent, text-aligned animations that outperform baseline motion transfer methods that lack generative priors or explicit physical constraints. We will make our code and dataset publicly available.
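
Linear blend skinning, which the abstract names as the step that propagates skeletal transformations to the mesh, deforms each vertex as a weighted sum of that vertex transformed by every bone. The sketch below is a 2-D simplification for readability (the paper works in 3-D), and the helper names `rigid_2d`, `apply_transform`, and `linear_blend_skinning` are illustrative.

```python
import math


def rigid_2d(angle, tx, ty):
    """2-D rigid bone transform stored as two rows of (rotation | translation)."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, tx), (s, c, ty))


def apply_transform(transform, point):
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def linear_blend_skinning(vertices, weights, bone_transforms):
    """Deform vertices; weights[i][j] is the influence of bone j on vertex i."""
    skinned = []
    for vertex, w_row in zip(vertices, weights):
        x = y = 0.0
        for w, transform in zip(w_row, bone_transforms):
            px, py = apply_transform(transform, vertex)
            x += w * px
            y += w * py
        skinned.append((x, y))
    return skinned


bones = [rigid_2d(0.0, 0.0, 0.0), rigid_2d(math.pi / 6, 0.5, 0.0)]
mesh = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
influence = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]  # each row sums to 1
print(linear_blend_skinning(mesh, influence, bones))
```

Every operation here is a sum or product of the bone parameters, so the deformed mesh is differentiable with respect to the skeleton, which is what lets a diffusion-based loss on rendered frames update the skeletal transformations.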


2605.28392 2026-05-28 cs.CV version update

Bound-Constrained Sparse Representation for Electrical Impedance Tomography

Chun Zhang, Dong Liu

Affiliations: School of Biomedical Engineering; Suzhou Institute for Advanced Research; University of Science and Technology of China; Laboratory of Spin Magnetic Resonance; Anhui Province Key Laboratory of Scientific Instrument Development and Application; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology; Institute of Quantum Sensing of WuXi

AI summary: Proposes a bound-constrained sparse representation framework that generates conductivity from low-dimensional latent variables via an implicit composite parameterization, improving conductivity estimation in electrical impedance tomography without explicit regularization.

Abstract

This study proposes a bound-constrained sparse representation (BC-SR) framework for electrical impedance tomography (EIT), aimed at improving conductivity estimation without explicit regularization. BC-SR adopts a representation-driven strategy, generating conductivity from low-dimensional latent variables via an implicit composite parameterization. Structural priors are embedded using a truncated graph-Laplacian basis, while a bound-preserving nonlinear mapping enforces admissible conductivity ranges and improves conditioning through implicit gradient modulation. The approach ensures robust convergence, even under noisy or incomplete data. Extensive validation on 2D/3D simulations, tank experiments, and in-vivo lung data shows that BC-SR improves physical consistency and structural fidelity, offering enhanced robustness compared to traditional methods. Additionally, BC-SR enables 3D time-difference EIT reconstruction, offering improved spatial resolution and a more coherent representation of 3D conductivity distributions, particularly for in-vivo lung data. This suggests potential for improved performance in EIT, particularly in clinical applications for respiratory monitoring.
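
One way to read the composite parameterization is: latent coefficients are expanded in a fixed basis, then squashed into the admissible conductivity range. The sketch below uses a scaled logistic function as the bound-preserving map. That choice, the function names, and the toy basis are assumptions for illustration; the abstract does not specify the exact nonlinearity.

```python
import math


def sigmoid(z):
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def conductivity(latent, basis, lower, upper):
    """Map latent coefficients to bounded conductivities.

    basis[i][k] is the value of the k-th basis vector at mesh element i
    (for example a truncated graph-Laplacian eigenvector).
    """
    values = []
    for row in basis:
        z = sum(b * c for b, c in zip(row, latent))  # unconstrained field
        values.append(lower + (upper - lower) * sigmoid(z))
    return values


def mapping_slope(z, lower, upper):
    """Derivative of the bounded map; it shrinks toward 0 near either bound."""
    s = sigmoid(z)
    return (upper - lower) * s * (1.0 - s)


toy_basis = [[1.0, 0.5], [1.0, 0.0], [1.0, -0.5]]  # 3 elements, 2 basis vectors
print(conductivity([0.2, 3.0], toy_basis, 0.5, 2.0))
print(mapping_slope(0.0, 0.5, 2.0), mapping_slope(6.0, 0.5, 2.0))
```

By the chain rule, a gradient with respect to the latent coefficients is scaled by `mapping_slope`, so updates are damped automatically as a value approaches a bound. This is one plausible interpretation of the "implicit gradient modulation" the abstract mentions.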


2605.28348 2026-05-28 cs.CV version update

Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud

Affiliations: Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, France; Univ. Bordeaux, CNRS, Bordeaux INP, IMS, UMR 5218, France

AI summary: Proposes the Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation paradigm, which finetunes models on non-semantic textual descriptions and yields up to a 20% mIoU improvement on the new task while maintaining performance on semantic prompts.

Comments: Accepted at the 2026 IEEE International Conference on Image Processing (ICIP 2026)

Abstract

Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.
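
For readers unfamiliar with the metric behind the reported gain, mean intersection over union (mIoU) averages the overlap ratio between predicted and reference masks. The sketch below computes it for binary masks stored as nested 0/1 lists. The helper names and the convention of scoring two empty masks as 1.0 are assumptions, and the abstract does not say whether its 20% figure is absolute or relative.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return 1.0 if union == 0 else inter / union  # assumed convention for two empty masks


def mean_iou(preds, targets):
    """Average IoU over paired predictions and references."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


reference = [[1, 0], [0, 0]]
too_wide = [[1, 1], [0, 0]]   # covers the object plus one extra pixel
exact = [[1, 0], [0, 0]]
print(mean_iou([too_wide, exact], [reference, reference]))
```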


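2605.28331 2026-05-28 cs.CV version update

Transfer learning RGB models to hyperspectral images with trainable tensor decompositions

Mariette Schönfeld, Laurens Devos, Wannes Meert, Hendrik Blockeel

Affiliations: KU Leuven, Dept. of Computer Science; Leuven.AI - KU Leuven Institute for AI

AI summary: Proposes decomposing the convolutional filters of a pretrained RGB model into spatial and spectral components with trainable tensor decompositions, then replacing the spectral components to match the channel count of hyperspectral images, which enables transfer learning to hyperspectral data; experiments show the method is more accurate and robust than other methods.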

Abstract

Transfer learning makes it possible to use large vision networks on a variety of domains, by specializing their models' general filters to new tasks. However, these networks assume the input images to have 3 input channels, making them incompatible with multi- or hyperspectral images. Current approaches that mitigate this incompatibility sacrifice information in either the image, or the model. This work proposes a novel approach that preserves the image and spatial information present in the model by using partially trainable tensor decompositions. We create such decompositions of pretrained convolutional filters, separating the filters into spatial and spectral components. The spectral components are then replaced with trainable components of higher channel dimensionality. This creates hyperspectral filters that can specialize to new datasets, while retaining the spatial patterns of the original filter. Experiments on a variety of hyperspectral datasets show that our approach is more accurate and robust than other hyperspectral transfer learning methods.
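
The separate-then-replace idea can be illustrated with the simplest possible case, a rank-1 split of one filter into a spectral vector and a spatial map. This is a sketch under strong assumptions: the paper's decomposition may use a different form or higher rank, the function names are hypothetical, and the new spectral values below merely stand in for parameters that would be trained.

```python
import math


def rank1_decompose(filt, iters=20):
    """Split filt[c][i][j] into a spectral vector u[c] and a spatial map v[i][j].

    Uses power iteration on the channels x pixels matrix, so filt[c][i][j] is
    approximated by u[c] * v[i][j]; exact when the filter is truly rank 1.
    Assumes the filter is not all zeros.
    """
    rows = [[x for line in ch for x in line] for ch in filt]  # flatten each channel
    width = len(filt[0][0])
    # Start from the most energetic channel so the first product cannot vanish.
    v = list(max(rows, key=lambda r: sum(x * x for x in r)))
    for _ in range(iters):
        u = [sum(r * s for r, s in zip(row, v)) for row in rows]
        norm = math.sqrt(sum(a * a for a in u))
        u = [a / norm for a in u]
        v = [sum(u[c] * rows[c][p] for c in range(len(rows))) for p in range(len(v))]
    spatial = [v[i * width:(i + 1) * width] for i in range(len(v) // width)]
    return u, spatial


def expand_filter(spatial, new_spectral):
    """Rebuild a filter with len(new_spectral) input channels on the same spatial pattern."""
    return [[[s * x for x in line] for line in spatial] for s in new_spectral]


edge = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
rgb_filter = [[[w * x for x in line] for line in edge] for w in (0.3, 0.6, 0.1)]
spectral, spatial = rank1_decompose(rgb_filter)
hyper_filter = expand_filter(spatial, [0.1] * 8)  # 8 bands, placeholder initial values
print(len(hyper_filter), len(hyper_filter[0]), len(hyper_filter[0][0]))
```

In this reading only the new spectral vector would be trained on the hyperspectral dataset, while the spatial map recovered from the RGB model stays as the retained pattern.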


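2605.28324 2026-05-28 cs.CV version update

LV-OSD: Language-Visual-Complementary Open-Set Object Detection

Yupeng Zhang, Ruize Han, Wei Feng, Song Wang, Liang Wan

Affiliations: College of Intelligence and Computing, Tianjin University; Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology

AI summary: Formulates the problem of language-visual-complementary open-set object detection and designs the dual-branch detection framework LVDor, which uses a Target-guided Prompt Dynamic Weighting module and a Prompt Random Masking mechanism to flexibly combine and semantically align text and image prompts.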

Abstract

Object detection is an important task in computer vision that aims to detect objects of interest specified by a given category list or query images. In this work, we propose a new problem of language-visual-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combination of text and/or image prompts at test time. Extensive experimental results support the soundness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.
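
The training-time masking of prompts can be sketched as a random choice among "text only", "image only", and "both", never dropping both at once. The probabilities, the function name `mask_prompts`, and the use of `None` for a masked prompt are assumptions for illustration; the abstract does not give the actual masking scheme.

```python
import random


def mask_prompts(text_prompt, image_prompt, rng, p_text_only=1 / 3, p_image_only=1 / 3):
    """Randomly hide one modality so training sees every prompt combination.

    Returns (text, image); a masked prompt is None. At least one prompt
    is always kept, so the detector never trains without a category cue.
    """
    r = rng.random()
    if r < p_text_only:
        return text_prompt, None
    if r < p_text_only + p_image_only:
        return None, image_prompt
    return text_prompt, image_prompt


rng = random.Random(0)
for _ in range(5):
    print(mask_prompts("a red mug", "mug_example.png", rng))
```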


2605.28270 2026-05-28 cs.CV version update

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

Leonhard Sommer, Emil Akopyan, Adam Kortylewski

Affiliations: University of Freiburg; CISPA Helmholtz Center for Information Security

AI summary: To address the lack of real-world 9D pose data, presents the Every9D-21M dataset of 21.8 million images across 700 object categories, annotated at scale by reconstructing point clouds via multi-view geometry and aligning them across instances, and validates the resulting performance gains on several benchmarks.

Abstract

Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories - two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at https://github.com/GenIntel/Every9D.


2605.28261 2026-05-28 cs.CV version update

MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations

Leiyue Zhao, Tianyu Shi, Daniel Reisenbuchler, Xinzi He, Junchao Zhu, Tianyuan Yao, Yuechen Yang, Yanfan Zhu, Junlin Guo, Gelei Xu, Haichun Yang, Yuankai Huo, Mert R. Sabuncu, Yihe Yang, Ruining Deng

Affiliations: Southern University of Science and Technology; Sichuan University; University of Regensburg; Cornell University; Vanderbilt University; University of Notre Dame; Vanderbilt University Medical Center; Cornell Tech; Weill Medical College of Cornell University

AI summary: Proposes the MORI-Seg framework, which learns morphology-aware geometric representations (object-centric distance fields and boundary-band representations) from semantic masks together with a class-conditioned feature disentanglement module, achieving end-to-end instance segmentation under semantic-only supervision and improving instance separation accuracy in crowded and adherent regions.

Abstract

Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at https://github.com/ddrrnn123/MORI-Seg.


2605.28258 2026-05-28 cs.SE cs.AI cs.CV cs.HC version update

GUI Agents for Continual Game Generation

Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

Affiliations: Fudan University; Xiaohongshu Inc.; Tongji University; University of California, Santa Barbara

AI summary: Proposes using GUI agents as objective evaluators and subjective playtesters, enabling continual game generation through PlaytestArena and the Play2Code framework and substantially improving playability.

Abstract

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.


2605.28257 2026-05-28 cs.CV version update

Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

Leonhard Sommer, Artur Jesslen, Basavaraj Sunagad, Adam Kortylewski

Affiliations: University of Freiburg, Germany; CISPA Helmholtz Center for Information Security, Germany

AI summary: By learning a shared morphable object prior, predicts from a single image 3D locations that are consistent across instances within a category, without explicit correspondence supervision, and sets a new state of the art on the new HouseCorr3D benchmark.

Comments: 14 pages, 4 figures. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D

Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space -- predicting, from a single image, 3D locations that remain consistent across instances within a category -- and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.

2605.28241 2026-05-28 cs.CV 版本更新

Bridging Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm

桥接无线电地图估计中的采样分布偏移：一种轨迹感知范式

Feng Qiu, Zheng Fang, Shuhang Zhang, Kangjun Liu, Longkun Zou, Jing Liu, Ke Chen

发表机构 * School of Artificial Intelligence, Xidian University（西安电子科技大学人工智能学院）； Pengcheng Laboratory（鹏城实验室）； Department of Computer Science and Engineering, Southern University of Science and Technology（南方科技大学计算机科学与工程系）； Department of Electronics, Peking University（北京大学电子系）； Guangzhou Institute of Technology, Xidian University（西安电子科技大学广州研究院）

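AI summary To counter the performance drop caused by the mismatch between UAV trajectory sampling and random sampling, the paper proposes a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which markedly reduces estimation error.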

AI中文摘要

基于学习的无线电地图估计（RME）在无人机辅助无线感知中扮演关键角色，支持覆盖预测和网络优化等任务。当前大多数方法假设基于随机采样的独立同分布（i.i.d.）训练和测试设置。然而，实际无人机测量是沿着可行轨迹顺序收集的，导致高度结构化和空间相关的模式。这种不匹配引入了采样分布偏移，增加了空间场恢复的内在难度，并损害了在i.i.d.假设下训练的模型的泛化能力。为缓解这一问题，我们提出了一种基于随机触发轨迹采样（ST-TBS）的轨迹感知训练范式，该范式在保持轨迹连续性的同时引入采样变异性。此外，从统计角度来看，我们表明与随机采样相比，基于轨迹的采样降低了空间多样性并增加了信息冗余。在RadioMapSeer和SpectrumNet数据集上的大量实验表明，在基于轨迹的观测下，使用随机采样训练的模型性能显著下降，在SpectrumNet上RMSE从0.0391增加到0.2632。相反，我们提出的ST-TBS方法有效将RMSE降低到0.0571。这些结果强调了对齐训练和部署采样分布对于可靠RME的必要性。

英文摘要

Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independent and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. In contrast, training with the proposed ST-TBS reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.

2605.28230 2026-05-28 cs.CV 版本更新

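From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment

从Kellgren-Lawrence到焦磷酸钙晶体沉积：一种用于膝骨关节炎评估的软标签框架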

Francisco Bérchez-Moreno, Riccardo Rosati, Maria Chiara Fiorentino, Víctor M. Vargas, Edoardo Cipolletta, Emilio Filippucci, Luca Romeo, Pedro A. Gutiérrez, César Hervás-Martínez

发表机构 * Department of Political Science, Communication and International Relations, University of Macerata, Macerata, Italy； Department of Economics and Law, University of Macerata, Macerata, Italy； Department of Innovative Technologies in Medicine & Dentistry, Università degli Studi "G. D'Annunzio" Chieti-Pescara, Chieti, Italy； Department of Internal Medicine, Azienda Ospedaliero Universitaria delle Marche, Ancona, Italy； Academic Rheumatology, University of Nottingham, Nottingham, UK； Department of Rheumatology, Polytechnic University of Marche, Ancona, Italy

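AI summary Proposes a soft-label ordinal deep learning framework that replaces one-hot targets with unimodal probability distributions, addressing both the ordinal uncertainty of KL and CPPD grading and the asymmetric relationship between the two scales, and significantly improving grading performance on knee X-ray images.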

AI中文摘要

背景与目标。传统的膝骨关节炎（KOA）分级深度学习方法依赖于独热标签，未能捕捉Kellgren-Lawrence（KL）和焦磷酸钙沉积病（CPPD）严重程度评分的序数不确定性，以及临床实践中观察到的两个量表之间的不对称关系。方法。我们回顾性收集了2172张膝关节X光图像，包括968张同时标注了KL和CPPD严重程度的X光片。开发了一个基于软标签的序数深度学习框架用于两项任务，用以标注等级为中心的单峰概率分布替代独热目标。研究了四种分布形式：二项分布、贝塔分布、三角分布和指数分布。结果。所有软标签策略均持续优于名义基线。对于CPPD分级，三角分布实现了最高的二次加权卡帕（QWK）和最低的平均绝对误差（MAE）（QWK = 0.796；MAE = 0.438），而贝塔分布在考虑各类别的平均MAE（AMAE）和最大MAE（MMAE）时产生了最平衡的类别性能（AMAE = 0.458；MMAE = 0.573）。对于KL分级，基于贝塔的方法提供了最佳整体性能，实现了最高的QWK以及最低的MAE和类别误差（QWK = 0.777；MAE = 0.529；AMAE = 0.523；MMAE = 0.775）。统计分析表明，与传统的独热监督相比有显著改进（p < 0.001）。

英文摘要

Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren--Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
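To make the soft-labelling idea in the abstract above concrete, the following sketch builds unimodal targets centred on an annotated grade. It is a minimal illustration, not the authors' implementation: the function names, the choice p = (grade + 0.5) / K for the binomial variant, and the `width` parameter of the triangular variant are assumptions introduced here, since the abstract does not give the exact parameterizations.

```python
import numpy as np
from math import comb


def binomial_soft_label(grade: int, num_classes: int) -> np.ndarray:
    """Binomial(n = K - 1, p) target whose mode is the annotated grade.

    Assumption: p = (grade + 0.5) / K, which places the mode
    floor((n + 1) * p) exactly at `grade`.
    """
    n = num_classes - 1
    p = (grade + 0.5) / num_classes
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(num_classes)]
    return np.array(probs)


def triangular_soft_label(grade: int, num_classes: int, width: float = 2.0) -> np.ndarray:
    """Triangular target: weight decays linearly to zero `width` grades away."""
    k = np.arange(num_classes)
    weights = np.maximum(0.0, 1.0 - np.abs(k - grade) / width)
    return weights / weights.sum()


def soft_cross_entropy(pred_probs: np.ndarray, target: np.ndarray) -> float:
    """Cross-entropy between predicted probabilities and a soft target."""
    return float(-(target * np.log(pred_probs + 1e-12)).sum())


# Example: grade 2 on a 5-level scale (hypothetical values).
binomial_target = binomial_soft_label(2, 5)
triangular_target = triangular_soft_label(2, 5)
```

Both targets keep most of the probability mass on the annotated grade while giving adjacent grades more weight than distant ones, which is what lets the loss penalise a two-grade error more than a one-grade error.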

2605.28174 2026-05-28 cs.CV cs.AI 版本更新

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

FLORO：面向跨传感器与尺度的生态遥感多模态地理空间基础模型

Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker, Kasper Johansen, Bernard Ghanem, Fernando T. Maestre, Matthew F. McCabe

发表机构 * Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology（阿卜杜拉国王科技大学生物与环境科学与工程学部）； Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology（阿卜杜拉国王科技大学计算机、电气与数学科学与工程学部）

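AI summary Introduces FLORO, a multimodal geospatial foundation model pretrained with masked autoencoding on heterogeneous remote sensing data; availability-aware inputs unify heterogeneous sensor configurations, and the model transfers strongly on the PANGAEA benchmark.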

Comments 29 pages, 9 figures

AI中文摘要

基础模型为可迁移的遥感表示提供了有前景的途径，但许多当前方法依赖于非常大的预训练数据集和固定的传感器配置，限制了它们在生态和环境应用中的适用性，这些应用中的观测通常跨平台、空间和光谱分辨率以及可用模态而变化。我们提出了FLORO，一个多模态地理空间基础模型，旨在从一个小型但高度多样化的遥感语料库中学习可迁移表示。FLORO使用掩码自编码在Sentinel-1、Sentinel-2、SkySAT影像、高程和无人机数据的异构组合上进行预训练。为了适应传感器变异性，FLORO结合了可用性感知输入，指示每个样本中存在哪些光谱波段和辅助模态，从而在异构传感器配置上实现统一的输入空间。我们在PANGAEA基准上，在冻结编码器协议下，评估了FLORO的场景分类、分割和回归任务。尽管在比竞争基础模型更小的语料库上预训练，FLORO在跨光学、光学-SAR和光学-高程基准（涵盖中分辨率卫星、航空和超高分辨率无人机影像）上实现了强大且稳定的迁移。FLORO在六个PANGAEA基准上取得了第二好的平均分割性能，仅次于最近引入的预训练图像数量超过两个数量级的基础模型，在场景分类上保持竞争力，在回归任务中表现稳健，而定性结果显示在洪水、城市、生物量和冠层高度预测设置中空间结构的保存有所改善。在EuroSAT-MS上的单独对照实验中，相对于绝对位置编码，地理位置编码进一步提高了分类性能。

英文摘要

Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.

2605.28173 2026-05-28 cs.CV 版本更新

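Intra-YOLO: A Small-Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photographs Based on Transfer Learning and Reinforcement Learning

Intra-YOLO：基于迁移学习与强化学习的口内摄影龋齿与磨牙-切牙矿化不良小目标检测模型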

Po-Lun Chwang, Po-Yu Chang, Wen-Liang Lin, Tung-Sheng Wu, Min-Ching Wang, Yun-Chien Cheng

发表机构 * Department of Mechanical Engineering, College of Engineering, National Yang Ming Chiao Tung University（国立阳明交通大学机械工程系）； Taipei Medical University Hospital（台北医学大学医院）； Wan Fang Hospital, Taipei Medical University（台北医学大学万芳医院）

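AI summary Proposes Intra-YOLO, a model that combines transfer learning and reinforcement learning to detect small caries and molar-incisor hypomineralization (MIH) targets in intraoral photographs.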

2605.28151 2026-05-28 cs.CV 版本更新

A novel ordinal multi-view aggregation scheme for oak defoliation

一种用于橡树落叶的新型有序多视图聚合方案

Francisco Bérchez-Moreno, Ricardo Enrique Hernández-Lambraño, David Guijo-Rubio, Víctor Manuel Vargas, Francisco José Ruiz-Gómez, Juan Carlos Fernández, Pablo González-Moreno

发表机构 * Department of Forest Engineering, Laboratory of Dendrochronology, Silviculture and Global Change – DendrodatLab, Universidad de Córdoba（森林工程系、树轮学实验室、林学与全球变化——DendrodatLab，科尔多瓦大学）； ERSAF. Andalusian Institute for Earth System Research (IISTA), Universidad de Córdoba（安达卢西亚地球系统研究所（IISTA）、科尔多瓦大学）

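AI summary Proposes an ordinal-classification multi-view ensemble framework that aggregates predictions from CNNs trained on different views (north, south, and crown) for more robust and accurate oak defoliation estimation.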

AI中文摘要

由气候和生物胁迫驱动的森林衰退威胁着生态系统功能，使得准确监测树木健康至关重要。在这项工作中，我们将树木落叶估计视为一个有序分类问题，使用地面图像。我们提出了一种新颖的多视图集成框架，该框架聚合了从不同视角（北、南和树冠）训练的卷积神经网络（CNN）的预测。该方法通过同质集成设计利用互补的视觉信息，同时保持建模一致性。通过比较多种有序分类方法并分析每个视图及其组合的贡献，进行了全面评估。结果表明，对落叶水平的有序结构进行建模比名义方法提高了性能，而所提出的多视图集成始终优于单视图和成对配置。特别是，三视图集成在所有评估指标上实现了最稳健和准确的预测。这些发现凸显了结合深度学习（DL）、有序分类（OC）和多视图聚合在地中海牧场等复杂生态系统中进行可扩展、一致和客观的森林健康评估的潜力。

英文摘要

Forest decline driven by climate and biotic stressors threatens ecosystem functioning, making accurate monitoring of tree health essential. In this work, we address tree defoliation estimation as an ordinal classification problem using ground-level imagery. We propose a novel multi-view ensemble framework that aggregates predictions from Convolutional Neural Networks (CNNs) trained on different perspectives of individual trees (north, south, and crown). This approach leverages complementary visual information while preserving modelling consistency through a homogeneous ensemble design. A comprehensive evaluation is conducted by comparing multiple ordinal classification methods and analysing the contribution of each view and their combinations. Results show that modelling the ordinal structure of defoliation levels improves performance over nominal approaches, while the proposed multi-view ensemble consistently outperforms single-view and pairwise configurations. In particular, the three-view ensemble achieves the most robust and accurate predictions across all evaluation metrics. These findings highlight the potential of combining Deep Learning (DL), Ordinal Classification (OC), and multi-view aggregation for scalable, consistent, and objective forest health assessment in complex ecosystems such as Mediterranean dehesas.
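The multi-view aggregation described in the abstract above can be pictured with a small sketch. The abstract does not state the aggregation rule, so plain averaging of per-view class probabilities is an assumption made here, and the function name, view keys, and probability values are hypothetical.

```python
import numpy as np


def aggregate_views(view_probs: dict[str, np.ndarray]) -> tuple[int, float]:
    """Average class probabilities across views of one tree.

    Returns the most probable defoliation level and the expected level,
    the latter using the ordinal structure of the classes.
    Assumption: simple unweighted averaging; the paper's scheme may differ.
    """
    stacked = np.stack(list(view_probs.values()))  # shape: (views, levels)
    mean_probs = stacked.mean(axis=0)
    levels = np.arange(mean_probs.shape[0])
    return int(mean_probs.argmax()), float((levels * mean_probs).sum())


# Hypothetical softmax outputs of three single-view CNNs for one tree.
views = {
    "north": np.array([0.1, 0.6, 0.3]),
    "south": np.array([0.2, 0.3, 0.5]),
    "crown": np.array([0.1, 0.3, 0.6]),
}
predicted_level, expected_level = aggregate_views(views)
```

In this example the north view alone would vote for level 1, while the averaged ensemble settles on level 2, illustrating how complementary views can overrule a single perspective.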

2605.27351 2026-05-28 cs.CV 版本更新

Feedforward 3D Editing Learns from Semantic-Part Transformation

前馈3D编辑从语义部分变换中学习

Jiawei Weng, Saining Zhang, Zhenxin Diao, Peishuo Li, Henghaofan Zhang, Junhao Chen, Hao Zhao

发表机构 * Nanyang Technological University（南洋理工大学）； Tsinghua University（清华大学）

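AI summary Introduces the Pxform dataset and the PartFlow network, which learn high-quality feedforward 3D editing from semantic-part transformations and achieve state-of-the-art performance on geometric and appearance editing benchmarks.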

Comments 31 pages, 22 figures. Project Page: https://dennis-jwweng.github.io/pxform/

AI中文摘要

3D编辑是可扩展3D内容创作的基本能力。虽然图像编辑已迅速向大规模前馈生成范式发展，但3D AI生成仍以无需训练的编辑流程为主。前馈3D编辑的核心挑战在于缺乏高质量配对监督。可编辑的3D资产需要同时保持几何、多视图一致性、结构连贯性和局部编辑可控性。现有的3D编辑数据集通常依赖于独立生成的资产、图像介导的重建或狭窄的编辑分类，导致定位不准确、保持性弱、编辑边界模糊和语义一致性有限。在这项工作中，我们引入了一个新视角：可扩展的前馈3D编辑应从语义部分变换中学习。基于这一见解，我们提出了Pxform，一个高质量的3D编辑数据集，包含超过10万对七种编辑类型的一致前后编辑对。我们的流程不是将对象视为无结构形状，而是直接将编辑锚定在语义3D部分。基于Pxform，我们进一步提出了PartFlow，一个前馈3D编辑网络，它将源感知潜在控制注入预训练的3D生成先验中。PartFlow引入了掩码感知速度保持和渲染空间一致性监督，以共同提高编辑保真度和源保持，同时在推理时不需要3D编辑掩码。大量实验表明，高质量的语义部分监督显著改进了可扩展的3D编辑，使PartFlow在几何和外观编辑基准上均达到了最先进的性能。

英文摘要

3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative paradigms, 3D AI generation remains dominated by training-free editing pipelines. A central challenge of feedforward 3D editing lies in the lack of high-quality paired supervision. Editable 3D assets require simultaneous preservation of geometry, multi-view consistency, structural coherence, and localized edit controllability. Existing 3D editing datasets often rely on independently generated assets, image-mediated reconstruction or narrow edit taxonomies, leading to inaccurate localization, weak preservation, blurred edit boundaries, and limited semantic consistency. In this work, we introduce a new perspective: scalable feedforward 3D editing should be learned from semantic-part transformations. Based on this insight, we propose Pxform, a high-quality 3D editing dataset with over 100K consistent before/after editing pairs across seven edit types. Instead of treating objects as unstructured shapes, our pipeline grounds edits directly in semantic 3D parts. Built upon Pxform, we further propose PartFlow, a feedforward 3D editing network that injects source-aware latent control into pretrained 3D generative priors. PartFlow introduces mask-aware velocity preservation and render-space consistency supervision to jointly improve edit fidelity and source preservation, while requiring no 3D edit mask during inference. Extensive experiments demonstrate that high-quality semantic-part supervision substantially improves scalable 3D editing, enabling PartFlow to achieve state-of-the-art performance on both geometric and appearance editing benchmarks.

2605.27102 2026-05-28 cs.CV cs.LG 版本更新

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

JLT: 潜在扩散Transformer中的干净潜在预测

Funing Fu, Tenghui Wang, Guanyu Zhou, Junyong Cen, Qichao Zhu

发表机构 * Independent Researcher（独立研究者）； Wuhan University of Technology（武汉理工大学）； Hangzhou Jiyi Artificial Intelligence Co., Ltd.（杭州智益人工智能有限公司）

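AI summary Introduces JLT, a 130M latent diffusion Transformer trained on frozen FLUX.2 VAE codes; clean-latent prediction yields a better FID than velocity prediction on ImageNet 256×256, suggesting that the prediction target in latent diffusion is a representation-dependent geometric choice.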

AI中文摘要

使用干净数据预测的流匹配表明，回归干净点比预测环境噪声量更能有效利用低维结构。我们询问在图像被映射到学习到的潜在空间后，这一原则是否仍然有用，因为压缩已经去除了原始像素的大部分变异性。我们引入了JLT，一个在冻结的FLUX.2 VAE编码上的130M潜在扩散Transformer，并在相同的表示、主干和训练设置下，将干净潜在预测与匹配的速度预测DiT进行比较。尽管三个变量x、epsilon和v在固定损坏时间下是线性可转换的，但局部高斯分析表明，速度回归继承了各向同性的目标协方差下限，并放大了低方差潜在方向，而干净预测则抑制了它们。在ImageNet 256×256上，JLT-B/1在无分类器引导下获得了FID-50K 2.50，与速度预测相比有较大的匹配目标差距。这些结果表明，潜在扩散中的预测目标是依赖于表示的几何选择，而不是可互换的代数参数化。

英文摘要

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
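The statement that x, epsilon, and v are linearly convertible at a fixed corruption time can be written out under one common flow-matching convention. The interpolation path and the sign of the velocity below are assumptions (the abstract does not state them), and the symbol z_t for the noised latent is introduced here for illustration.

```latex
% Assumed convention: z_t = (1 - t)\,x + t\,\epsilon, \quad v = \epsilon - x, \quad 0 < t < 1.
\begin{align}
  x        &= z_t - t\,v, \\
  \epsilon &= z_t + (1 - t)\,v, \\
  v        &= \frac{z_t - x}{t} = \frac{\epsilon - z_t}{1 - t}.
\end{align}
```

Given z_t and t, any one of the three quantities determines the other two, so the parameterizations are algebraically equivalent; the abstract's point is that they are nevertheless not equivalent as regression targets.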

2605.26368 2026-05-28 cs.CV cs.AI 版本更新

Xiaomi Autonomous Driving World Model: A Joint World Model Unifying Reconstruction and Generation

小米自动驾驶世界模型：一个融合重建与生成的联合世界模型

Lijun Zhou, Hongcheng Luo, Zhenxin Zhu, Cheng Chi, Mingfei Tu, Kaixin Xiong, Lei Gong, Zhanqian Wu, Zehan Zhang, Fangzhen Li, Hao Li, Yingying Shen, Jiale He, Haohui Zhu, Shan Zhao, Kai Wang, Zhiwei Zhan, Yuechuan Pu, Kaiyuan Tan, Ruiling Yang, Xianqi Wang, Tianyi Yan, Jiawei Zhou, Lei Zhang, Jingyang Zhao, Xi Zhou, Chitian Sun, Chenming Wu, Jiong Deng, Hongwei Xie, Ming Lu, Kun Ma, Long Chen, Guang Chen, Hangjun Ye, Bing Wang, Haiyang Sun

发表机构 * Xiaomi（小米）

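AI summary Presents a unified technical system that pairs WorldRec, a reconstruction module driven by sparse scene queries, with WorldGen, a two-stage training framework, to obtain high-fidelity 3D scene representations and high-quality causal video generation; integrating the two improves generation stability, cross-frame consistency, and visual fidelity.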

AI中文摘要

本报告提出了一个统一的技术系统，解决自动驾驶世界模型的两个核心能力：世界表示和世界生成。对于世界表示，我们提出了WorldRec，一种由稀疏场景查询驱动的前馈重建架构。WorldRec在3D空间中初始化结构化查询，利用它们聚合跨视图、跨时间特征，从而自然地强制帧间空间一致性，并产生紧凑且高保真的3D高斯场景表示。对于世界生成，我们提出了WorldGen，一个两阶段训练框架，包括双向预训练和随后通过三个渐进阶段（教师强制、ODE蒸馏和DMD）的因果微调，使得在仅4个去噪步骤中实现高质量的在线因果视频生成。基于这两个模块，我们进一步引入了JWM，它深度融合了WorldRec和WorldGen，在生成稳定性、跨帧一致性和视觉保真度方面实现协同增益，为自动驾驶中的闭环仿真、数据合成和端到端训练提供了坚实基础。

英文摘要

This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.

2605.02417 2026-05-28 cs.CV 版本更新

DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

DirectEdit: 基于流的图像编辑的逐步骤精确反演

Desong Yang, Mang Ye

发表机构 * National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China（多媒体软件国家工程研究中心，计算机科学学院，武汉大学，中国武汉）

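AI summary Proposes DirectEdit, which removes the accumulated drift of the inversion process by directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing without additional neural function evaluations, and outperforms existing methods across diverse scenarios.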

Comments ICML 2026. Project page: https://desongyang.github.io/Directedit/

AI中文摘要

随着大规模预训练文本到图像（T2I）模型的最新进展，免训练的图像编辑方法已展现出显著成功。通常，这些方法通过反演过程向干净图像添加噪声，随后在前向过程中分别对重建路径和编辑路径进行去噪步骤。然而，由于重建路径使用来自不匹配时间步的噪声潜变量进行近似，现有方法不可避免地遭受累积漂移，这从根本上限制了重建保真度。为了解决这一挑战，我们系统地分析了流变换器中的反演过程，并提出了DirectEdit，一种简单而有效的编辑方法，无需引入额外的神经函数评估（NFE）即可消除固有的重建误差。与大多数试图纠正反演路径的先前工作不同，DirectEdit专注于直接对齐前向路径，从而实现精确重建和可靠的特征共享。此外，我们引入了一种基于注意力特征注入和多分支掩码引导噪声混合的保留机制，有效平衡了保真度和可编辑性。跨多种场景的大量实验表明，DirectEdit实现了高效准确的图像编辑，其优越性能优于最先进的方法。代码和示例可在 https://desongyang.github.io/Directedit 获取。

英文摘要

With recent advancements in large-scale pre-trained text-to-image (T2I) models, training-free image editing methods have demonstrated remarkable success. Typically, these methods add noise to a clean image via an inversion process, followed by separate denoising steps for the reconstruction and editing paths during the forward process. However, since the reconstruction path is approximated using noisy latents from mismatched timesteps, existing methods inevitably suffer from accumulated drift, which fundamentally limits reconstruction fidelity. To address this challenge, we systematically analyze the inversion process within the flow transformer and propose DirectEdit, a simple yet effective editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior works that attempt to rectify the inversion path, DirectEdit focuses on directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing. Furthermore, we introduce a preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending, which effectively balances fidelity and editability. Extensive experiments across diverse scenarios demonstrate that DirectEdit achieves efficient and accurate image editing, outperforming state-of-the-art methods. Code and examples are available at https://desongyang.github.io/Directedit.

2604.21668 2026-05-28 cs.CV 版本更新

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

通过结构化运动描述实现无编码器的人体运动理解

Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao

发表机构 * Aalto University（阿尔托大学）； Georgia Institute of Technology（佐治亚理工学院）

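AI summary Proposes Structured Motion Description (SMD), which converts joint position sequences into structured natural language descriptions so that large language models can reason about motion directly without a dedicated encoder, surpassing prior methods on motion question answering and captioning.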

AI中文摘要

基于文本的大语言模型（LLM）的世界知识和推理能力正在快速发展，但目前的人体运动理解方法（包括运动问答和字幕生成）尚未充分利用这些能力。现有的基于LLM的方法通常通过专用编码器学习运动-语言对齐，将运动特征投影到LLM的嵌入空间中，但仍受限于跨模态表示和对齐。受生物力学分析的启发（其中关节角度和身体部位运动学长期以来一直作为人体运动的精确描述语言），我们提出了结构化运动描述（SMD），一种基于规则的确定性方法，将关节位置序列转换为关节角度、身体部位运动和全局轨迹的结构化自然语言描述。通过将运动表示为文本，SMD使LLM能够直接将其关于身体部位、空间方向和运动语义的预训练知识应用于运动推理，无需学习编码器或对齐模块。我们表明，该方法在运动问答（BABEL-QA上66.7%，HuMMan-QA上90.1%）和运动字幕生成（HumanML3D上R@1为0.584，CIDEr为53.16）上均超越了所有先前方法，达到了最先进的结果。SMD还提供了实际优势：相同的文本输入可适用于不同的LLM，仅需轻量级的LoRA适配（在6个模型家族的8个LLM上验证），并且其人类可读的表示能够对运动描述进行可解释的注意力分析。代码、数据和预训练的LoRA适配器可在https://yaozhang182.github.io/motion-smd/获取。

英文摘要

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, and so remain constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach achieves state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
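A rule-based conversion from joint positions to text, as described in the abstract above, might look roughly like the sketch below. This is a hypothetical illustration and not the SMD implementation: the function names, the angle thresholds, and the output wording are all assumptions introduced here.

```python
import numpy as np


def joint_angle_deg(parent, joint, child) -> float:
    """Angle at `joint` (in degrees) between the segments to `parent` and `child`."""
    a = np.asarray(parent, dtype=float) - np.asarray(joint, dtype=float)
    b = np.asarray(child, dtype=float) - np.asarray(joint, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def describe_joint(name: str, angle_deg: float) -> str:
    """Deterministic text description; the thresholds are illustrative assumptions."""
    if angle_deg > 150:
        state = "nearly straight"
    elif angle_deg > 90:
        state = "slightly bent"
    else:
        state = "strongly bent"
    return f"{name}: {angle_deg:.0f} degrees ({state})"


# Hypothetical 3D positions (metres) for one frame of a straight right arm.
shoulder, elbow, wrist = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (0.0, -1.0, 0.0)
line = describe_joint("right elbow", joint_angle_deg(shoulder, elbow, wrist))
```

Because every step is deterministic, the same motion always yields the same text, which is the property that lets the description be fed to different LLMs unchanged.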

2605.28137 2026-05-28 cs.CV cs.LG 版本更新

No Safe Dose: How Training Data Drives Unsafe Image Generation

无安全剂量：训练数据如何驱动不安全图像生成

Felix Friedrich, Lukas Helff, Niharika Hegde, Patrick Schramowski, Kristian Kersting

发表机构 * Black Forest Labs（黑森林实验室）； TU Darmstadt & hessian.AI（达姆施塔特工业大学 & hessian.AI）； DFKI（德国人工智能研究中心）； Lab1141（Lab1141实验室）

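AI summary By controlling the fraction of unsafe images in the training data (0% to 9.6%), the study finds that output unsafety rises monotonically with that fraction and that the proportion, not the absolute count, is the operative variable; a safer text encoder such as SafeCLIP lowers the baseline risk, but the dose-response effect persists.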

AI中文摘要

基于大规模数据训练的文本到图像模型往往不可避免地包含不安全内容。虽然有人观察到输入输出放大效应，但训练数据组成是否以及如何直接驱动模型输出安全性，还是由其他因素决定，仍不清楚。我们通过隔离这一变量来阐明问题：在多个数据集规模（10万到800万）下，我们在仅在不安全图像比例（0%到9.6%）上不同的数据集上训练相同的文本到图像模型。然后使用生成的模型生成图像，并用四个独立的安全分类器进行评估。输出不安全率从0%污染时的16.6%单调上升到5%污染时的25.5%。析因设计揭示，不安全训练图像的比例而非绝对数量是操作变量。零污染时16.6%的不可降低基线表明其他组件（如冻结的文本编码器）是残余安全风险——通过文本编码器消融实验证实，SafeCLIP将这一底线降至9.6%，而剂量效应在所有测试的三个编码器中持续存在。关键的是，在FID、CLIPscore和ImageReward方面，安全过滤并未伴随质量下降。这些结果表明，数据整理和文本编码器安全是互补且独立有效的干预措施。同时，剩余的不安全水平为未来关于新兴能力和组合性的研究提出了问题。

英文摘要

Text-to-image models trained on large-scale data often cannot avoid ingesting unsafe content. While input-output amplification has been observed, it remains unclear whether and how training data composition directly drives the safety of model outputs, or whether other factors are responsible. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). We then generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% irreducible baseline at zero contamination implicates other components, e.g. the frozen text encoder, as a residual safety risk. A text encoder ablation confirms this: SafeCLIP reduces the floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, safety filtering causes no quality degradation in terms of FID, CLIPscore, and ImageReward. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety poses questions for future research about emerging capabilities and compositionality.

URL PDF HTML ☆

赞 0 踩 0

2605.28136 2026-05-28 cs.CV cs.RO 版本更新

SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

SAM增强的道路数据集分割：自动驾驶中关键类别的平衡

Toomas Tahves, Mauro Bellone, Junyi Gu, Raivo Sell

发表机构 * Department of Mechanical and Industrial Engineering, Tallinn University of Technology（塔林理工大学机械与工业工程系）； FinEst Centre for Smart Cities, Tallinn University of Technology（塔林理工大学智能城市研究中心）； Department of Computer Science and Engineering, Universitas Mercatorum（默卡托姆大学计算机科学与工程系）； Department of Computer Science and Engineering, Chalmers University of Technology（查尔姆斯理工大学计算机科学与工程系）； University of Gothenburg（哥德堡大学）

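AI summary Proposes a SAM-based annotation pipeline that converts ZOD bounding boxes into dense pixel-level semantic masks, evaluates different architectures under class imbalance, and shows effective transfer across sensor configurations via bidirectional transfer learning.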

AI中文摘要

密集语义分割对于自动驾驶至关重要，然而许多多模态数据集缺乏像素级标注。Zenseact开放数据集（ZOD）提供丰富的多传感器数据，但仅有边界框标签，限制了其在分割研究中的应用。我们的主要贡献是一个基于Segment Anything Model（SAM）的标注流水线，通过将边界框转换为语义掩码，为ZOD生成密集的像素级标注。在这项初步研究中，我们处理了超过10万帧，并手动筛选出一个2300帧的子集（接受率36%），以建立可靠的基线。利用这些标注，我们评估了基于Transformer的CLFT和基于CNN的DeepLabV3+架构在不同天气条件下的性能，其中CLFT-Hybrid达到了48.1%的mIoU。为了解决极端类别不平衡问题（行人、骑行者、标志牌像素占比不足1%），我们探索了针对稀有类别的专门模型。我们还在Iseauto自动驾驶平台上验证了该流水线，达到了77.5%的mIoU，并展示了通过双向迁移学习，SAM导出的表示能够有效地跨传感器配置迁移。所有代码和标注均已发布，以支持可重复研究。

英文摘要

Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.
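The mIoU figures quoted in the abstract above follow the standard mean intersection-over-union definition, sketched below for reference. This is a generic implementation rather than the authors' evaluation code; the function name and the choice to skip classes absent from both prediction and ground truth are assumptions.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes for integer label maps.

    Classes that appear in neither map are skipped (an assumption; some
    evaluation protocols count them differently).
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))


# Tiny 2x2 example with two classes (hypothetical labels).
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

Because every class contributes equally to the mean regardless of how many pixels it covers, rare classes such as pedestrians and cyclists weigh as heavily as road or sky, which is why class imbalance matters so much for this metric.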

URL PDF HTML ☆

赞 0 踩 0

2605.28132 2026-05-28 cs.CV 版本更新

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

哪种预训练范式更有利于空间智能？视觉-语言模型与视频生成模型的实证比较

Haozhan Shen, Tiancheng Zhao, Kangjia Zhao, Jianwei Yin

发表机构 * Zhejiang University（浙江大学）； Om AI Research（Om人工智能研究）； Binjiang Institute of Zhejiang University（浙江大学滨江研究院）

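AI summary Through a frozen-feature probing study, the paper systematically compares vision-language models (VLMs) and video generation models (VGMs) on three axes of spatial intelligence (semantic tagging, instance grouping, and 3D geometry prediction), finding that the two are complementary and that a naive fusion improves overall performance.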

Comments Code: https://github.com/om-ai-lab/Probing-VLM-VGM

AI中文摘要

空间智能需要能够捕捉物理世界中语义对象和几何结构的视觉表示。为此，两种主要的预训练方案被广泛用作基础骨干：视觉-语言模型（VLM），它使用语言监督将视觉观察与语义概念对齐；以及视频生成模型（VGM），它从时间演变的视觉世界中学习。然而，目前尚不清楚哪种预训练方案为空间智能提供了更好的表示基础。在本文中，我们首次对VLM和VGM在空间智能的三个代表性维度上进行了系统的冻结特征探测研究：语义标注、实例分组和3D几何预测。通过轻量级探测，我们的框架能够控制性地比较两个模型家族的冻结表示中已经编码的信息。实验结果显示明显的互补性：VLM在语义标注和实例分组方面更强，而VGM为密集几何和相机运动提供了更易获取的信号。此外，两者的简单融合已经产生了在几何和语义方面都表现出色的表示，这表明通过有效整合两个模型家族的特征来构建更强的空间智能骨干是一个有前景的方向。我们的代码可在https://github.com/om-ai-lab/Probing-VLM-VGM获取。

英文摘要

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using lightweight probes, our framework enables a controlled comparison of what information is already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.

URL PDF HTML ☆

赞 0 踩 0

2605.28125 2026-05-28 cs.CV cs.GR 版本更新

CLEAR-NeRF: Collinearity and Local-region Enhanced Accurate 3D Reconstruction in Unbounded Scenes

CLEAR-NeRF: 共线性和局部区域增强的无界场景精确三维重建

Vladislav Polianskii, Elijs Dima, Isabel Salmerón Marazuela, Gergő László Nagy, Sigurdur Sverrisson, Volodya Grancharov

发表机构 * Ericsson Research（爱立信研究）

AI总结提出CLEAR-NeRF方法，通过自动局部区域定位、共线性射线采样、深度局部邻域点提取和几何相关颜色聚合，在无界复杂场景中实现高保真度和度量精度的三维重建。

详情

AI中文摘要

许多真实世界的三维重建应用要求在无界、复杂场景中实现照片级真实感和度量精度，这些场景具有挑战性的光照和不完美的捕获，而当前的神经辐射场（NeRF）流程仅部分满足这些需求。本研究将基于NeRF的三维重建适应于多兴趣区域的无界场景，以提高对光照和姿态变化的鲁棒性，同时确保适用于数字孪生应用的度量精度。我们的方法引入了（i）自动局部区域定位/检测和重建，以无缝优先考虑感兴趣区域而不增加子模块；（ii）共线性强制射线采样，以学习平滑的平面和曲面；（iii）深度局部邻域点提取，以抑制表面伪影；以及（iv）几何相关颜色聚合，以减轻光照和姿态引起的变化。结果表明，所提出的流程在基线NeRF模型以及成熟的结构从运动（SfM）-多视图立体（MVS）解决方案上均表现出优越的性能。

英文摘要

Many real-world 3D reconstruction applications demand photorealism and metric accuracy across unbounded, complex scenes with challenging lighting and imperfect captures, requirements that current Neural Radiance Field (NeRF) pipelines only partly satisfy. This study adapts NeRF-based 3D reconstruction to unbounded scenes with multiple regions of interest, improving robustness to lighting and pose variation while enforcing metric accuracy suitable for digital-twin applications. Our approach introduces (i) automated local region localization/detection and reconstruction to seamlessly prioritize areas of interest without proliferating submodules, (ii) collinearity-enforcing ray sampling to learn smooth planar and curved surfaces, (iii) depth-localized neighborhood point extraction to suppress surface artifacts, and (iv) geometry-relevant color aggregation to mitigate lighting- and pose-induced variations. Results indicate superior performance of the proposed pipeline over baseline NeRF models and established Structure-from-Motion/Multi-View-Stereo (SfM-MVS) solutions.


2605.28100 2026-05-28 cs.CV cs.AI 版本更新

Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

重新审视变化检测方法在冰塔崩塌延时监测中的应用

Arthur Dérédel, Carlos Crispim-Junior, Pierre Lemaire, Johan Berthet, Laure Tougne Rodet

发表机构 * Université Lumière Lyon 2, CNRS, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, LIRIS, UMR5205（里昂第二大学，法国国家科学研究中心，里昂中央理工学院，里昂国立应用科学学院，里昂第一大学，LIRIS，UMR5205）； Styx4D, 19 rue lac Saint André, Le Bourget-du-Lac, 73370, France（Styx4D，19 rue lac Saint André，Le Bourget-du-Lac，73370，法国）

AI总结针对延时相机在监测冰塔崩塌时面临的形状和光照变化挑战，本文提出体积变化检测子任务，通过新数据集SeracFallDet评估现有方法，发现密集和半密集特征匹配表现稳健，而监督方法受限于数据稀缺。

Comments Preprint, 19 pages, 8 figures

详情

AI中文摘要

在气候变化加剧环境不确定性的时代，识别和检测事件前兆对于减轻灾难性自然灾害的影响变得至关重要。虽然干涉激光或地震仪等经典传感器可靠，但其广泛部署常受后勤和经济障碍阻碍，留下众多盲点。延时相机已为这类传感器提供经济高效的高分辨率视觉背景，是一种有前景的替代方案。然而，自动处理其输出面临重大挑战，尤其与极端形状和光照变化相关。克服这些问题对于将其大规模部署为监测工具至关重要。本文引入变化检测的一个新颖子任务，即体积变化检测，应用于延时相机和斜坡不稳定性。我们对现有最先进的变化检测方法及相关任务进行全面回顾，分析其核心组件，并评估其在此场景中的适用性。为此，我们引入新数据集SeracFallDet，其中包含冰塔崩塌注释，并已彻底注释以满足后者需求。通过泛化实验，我们证明密集和半密集特征匹配虽未专门针对此任务训练，但表现出稳健性能。相反，监督方法在数据稀缺和注释不平衡方面存在困难。这表明混合方法可能通过利用两种任务的优势提供前进路径。这些发现凸显了特征匹配技术的潜力，以及需要进一步创新以克服环境监测中实际部署的挑战。

英文摘要

In an era where climate change aggravates environmental uncertainties, identifying and detecting event precursors is becoming crucial to mitigating the impact of natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably extreme shape and lighting variations. Overcoming these issues is essential to deploying them at large scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components, and assess their applicability to this context. To that end, we introduce a new dataset, SeracFallDet, which has been thoroughly annotated with serac falls for this purpose. Through generalization experiments, we demonstrate that dense and semi-dense feature matching methods, although not trained specifically for this task, exhibit robust performance. In contrast, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both approaches. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.


2605.28091 2026-05-28 cs.CV 版本更新

Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

Qwen-Image-Bench：从生成到创造——文本到图像评估

Niantong Li, Guangzheng Hu, Weixu Qiao, Ying Ba, Qichen Hong, Shijun Shen, Jinlin Wang, Fan Zhou, Jianye Kang, Xin Shang, Ziyi He, Wei Wang, Dalin Li, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yuxiang Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, Hongzhu Shi, Yi Wang, Bing Zhao, Hu Wei, Lin Qu, Chenfei Wu

发表机构 * Alibaba（阿里巴巴）

AI总结针对现有文本到图像评估基准缺乏对真实世界保真度和创造性生成能力的考量，本文提出Qwen-Image-Bench，一个与专业艺术家共同设计的创作者中心基准，通过分层分类体系、1000个分层提示和基于Qwen3.6-27B的统一评判模型Q-Judger，实现细粒度、可归因的诊断，有效区分领先的T2I模型。

详情

AI中文摘要

文本到图像生成已从基础图像合成演变为专业创意工作流程中频繁使用的核心能力，简单的文本-图像对齐已无法满足用户对忠实真实世界重建和真正创意表达的迫切需求。然而，现有基准仍停留在这些基础标准上，未能捕捉真实艺术实践中重要的细微能力，使得可靠区分最先进的T2I模型变得困难。为弥补这一差距，我们引入了Qwen-Image-Bench，一个与专业艺术家共同设计、基于真实创作场景的创作者中心基准。Qwen-Image-Bench通过两个应用驱动维度丰富了传统评估：真实世界保真度和创意生成。借鉴专业艺术工作流程中固有的分阶段推理，我们将这五个支柱组织成一个自上而下的分层分类体系，进一步分解为23个二级子能力和56个三级可验证准则。为确保广泛覆盖，我们策划了1000个分层提示，每个提示联合锻炼多个支柱中的四个以上细粒度方面。我们训练了一个基于Qwen3.6-27B的统一评判模型Q-Judger，由来自全球艺术学院的80名专业标注员在盲标和三重审核协议下监督，对每张图像在所有56个可验证方面进行评分，产生细粒度、基于准则且完全可归因的诊断，而非单一不透明分数。实验表明，Qwen-Image-Bench可靠地区分领先的T2I模型，在现有基准几乎无法提供洞察的两个应用驱动维度（真实世界保真度和创意生成）上实现了最大分离，同时为生产级T2I开发提供了可信的优化信号。

英文摘要

Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users' pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address this gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize the resulting five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts, each jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model, Q-Judger, based on Qwen3.6-27B and supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols. It scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation, where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.


2605.28083 2026-05-28 cs.CV 版本更新

VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

VLA-Hijack: 通过视觉本体感觉劫持实现针对视觉-语言-动作模型的可迁移补丁攻击

Jiyuan Fu, Kaixun Jiang, Jingkai Jia, Zhaoyu Chen, Xueyao Chen, Lingyi Hong, Shuyong Gao, Chenzhi Tan, Dingkang Yang, Wenqiang Zhang

发表机构 * Fudan University（复旦大学）； The Hong Kong Polytechnic University（香港理工大学）

AI总结提出VLA-Hijack框架，通过注意力引导的本体感觉抑制和多模态本体感觉注入攻击视觉自定位过程，实现跨架构黑盒迁移攻击。

详情

AI中文摘要

虽然视觉-语言-动作（VLA）模型已成为强大的通用策略，但它们对对抗性补丁的严重脆弱性显著阻碍了其在安全关键领域的部署。此外，现有的补丁攻击主要关注白盒设置，严重过拟合目标模型的特定动作输出空间，导致跨架构迁移性差。为了克服这一限制，我们提出了VLA-Hijack，一个统一的对抗框架，通过利用本工作中发现的基本漏洞来突破迁移性瓶颈：在规划任何运动之前，VLA模型必须首先使用视觉信息在环境中定位自己的机械臂。针对这一共享的视觉自定位过程，我们的方法同时优化注意力引导的本体感觉抑制以抑制真实机械臂的特征，以及多模态本体感觉注入以将补丁建立为替代的“幻影实体”。通过在语义概念锚定和视觉原型投影之间交替，VLA-Hijack有效地切断了智能体真实实体与其控制策略之间的语义关系。跨多种架构（OpenVLA、UniVLA和CronusVLA）的大量实验表明，VLA-Hijack在白盒设置中实现了卓越的优化效率，并为跨架构和跨域黑盒迁移性设立了新的SOTA。

英文摘要

While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm's features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment". By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent's true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.


2605.28056 2026-05-28 cs.CV 版本更新

CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

CogPortrait: 通过分层智能体规划实现肖像动画中的细粒度眼部区域控制

He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； University of New South Wales（新南威尔士大学）

AI总结提出CogPortrait两阶段框架，利用多模态大语言模型智能体从高层标签生成关键点，再通过DiT视频生成骨干合成动画，实现细粒度眼部控制，并引入EMH基准评估。

详情

AI中文摘要

肖像动画方法已实现显著的视觉质量和唇形同步，但眼部区域的细粒度操控仍面临输入粒度与运动精度之间的权衡。现有方法使用情感标签或粗略文本提示不足以描述细微的眼部动态，而基于动作单元或驱动视频的方法以更高的输入负担为代价提供更高的保真度。这些限制对于超越情感状态（例如思考）和困倦状态仍然具有局限性。鉴于此，我们提出CogPortrait，一个从高层标签生成肖像动画的两阶段框架。在第一阶段，三个思维链多模态大语言模型（MLLMs）智能体通过时间事件规划、原型检索和从真实行为库中组合以及语义-生理约束执行，将高层标签编译为面部关键点。在第二阶段，基于DiT的视频生成骨干以关键点、参考肖像、音频和文本提示为条件合成最终动画，并通过动态无分类器引导策略（具有眼部区域感知重新加权和基于KTO的边界情况细化）增强。我们进一步引入了EMH基准，涵盖多样化的情感和超越情感类别，并带有两个AU级指标用于评估细粒度眼部区域和头部运动控制。在HDTF和EMH基准上的大量实验表明，CogPortrait在保持优越视觉质量和身份一致性的同时，实现了比现有方法更精确的眼部区域控制。

英文摘要

Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods that use emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations remain restrictive for beyond-emotion states (e.g., thinking) and drowsiness. In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark, which covers diverse emotions and beyond-emotion categories and provides two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.
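The abstract mentions classifier-free guidance with eye-region-aware reweighting but does not give the rule. As a rough illustration only, the sketch below applies the standard classifier-free guidance combination per element and raises the guidance scale inside an eye mask. The function name `guided_noise`, the `eye_boost` parameter, and the multiplicative reweighting formula are all assumptions made for this example, not the paper's formulation.

```python
def guided_noise(eps_uncond, eps_cond, scale, eye_mask, eye_boost):
    """Classifier-free guidance with a per-element scale.

    eps_uncond / eps_cond: unconditional and conditional noise predictions
    (flat lists here for simplicity). eye_mask holds values in [0, 1].
    The reweighting rule below is a hypothetical choice for illustration.
    """
    out = []
    for eu, ec, m in zip(eps_uncond, eps_cond, eye_mask):
        s = scale * (1.0 + eye_boost * m)   # stronger guidance inside the eye region
        out.append(eu + s * (ec - eu))      # standard CFG combination
    return out


# Two "pixels": the first outside the eye region, the second inside it.
eps = guided_noise([0.0, 0.0], [1.0, 1.0], scale=2.0, eye_mask=[0.0, 1.0], eye_boost=0.5)
print(eps)
```

With a mask of zero the expression reduces to ordinary classifier-free guidance, so the reweighting only changes behavior where the mask is active.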


2605.28051 2026-05-28 cs.CV 版本更新

利用分割引导的对抗学习增强超低场MRI

James Grover, Andrew Phair, Michael Ferraro, David E. J. Waddington

发表机构 * Image X Institute, Sydney School of Health Sciences, Faculty of Medicine and Health（Image X研究院，悉尼健康科学学院，医学与健康学院）

AI总结提出结合解剖条件分割先验和模型集成的方法，通过Swin UNETR生成组织分割先验，并利用CycleGAN和T-REX两个增强网络合成3T级MRI，有效提升64 mT超低场MRI的图像质量。

2605.28011 2026-05-28 cs.CV 版本更新

Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras

使用事件相机自动估计羽毛球扣杀中的撞击时间、撞击位置和球速

Yudai Washida, Yuto Kase, Kai Ishibe, Ryoma Yasuda, Sakiko Hashimoto

发表机构 * MIZUNO Corporation（MIZUNO公司）； Suminoe-ku, Osaka-shi, Osaka（大阪府大阪市住之江区）

AI总结提出一种使用两台同步事件相机的方法，在同一试验中自动估计羽毛球扣杀的撞击时间、球拍面撞击位置和球速，并通过Bland-Altman分析验证其与高速相机参考方法的一致性。

Comments 24 pages, 5 figures

详情

AI中文摘要

量化羽毛球扣杀中的撞击现象对于评估运动表现和装备性能都很重要；然而，传统测量系统在时间分辨率、数据效率和准备工作之间存在权衡。本研究提出了一种使用两台同步事件相机的测量方法，在同一试验中自动估计撞击时间、球拍面上的撞击位置以及撞击后的球速。通过事件率统计检测挥拍区间，从侧视事件数据中的羽毛球轨迹拐点估计撞击时间，通过椭圆拟合后视事件图像中的球拍面确定撞击位置，并在矢状面计算球速。为了验证所提出的方法，使用来自五名运动员的125次扣杀试验，与基于高速相机的参考方法进行了Bland-Altman分析。在所有124次可分析试验中估计了撞击时间和球速，在93.5%（116/124）的试验中估计了撞击位置。撞击时间、内侧-外侧撞击位置、纵向撞击位置和球速的偏差（95%置信区间）分别为1.84毫秒（1.45至2.23）、3.45毫米（2.18至4.72）、-1.92毫米（-2.97至-0.88）和-1.00米/秒（-2.46至0.46）。所有指标均未观察到比例偏差。这些结果表明，所提出的方法可以作为在实际环境中综合评估羽毛球扣杀性能和装备的有用工具。

英文摘要

Quantifying impact phenomena in badminton smashes is important for evaluating both athletic performance and equipment; however, conventional measurement systems involve trade-offs between temporal resolution, data efficiency, and preparation effort. This study proposes a measurement method that uses two synchronized event cameras to automatically estimate impact time, impact location on the racket face, and post-impact shuttlecock speed within the same trial. The swing interval was detected from event rate statistics, impact time was estimated from the shuttlecock trajectory inflection in the lateral-view event data, impact location was determined by ellipse fitting to the racket face in the rear-view event image, and shuttlecock speed was calculated in the sagittal plane. To validate the proposed method, Bland-Altman analysis was performed against a high-speed camera-based reference method using 125 smash trials from five players. Impact time and shuttlecock speed were estimated in all 124 analyzable trials, and impact location was estimated in 93.5% (116/124). The biases (95% CI) for impact time, medio-lateral impact location, longitudinal impact location, and shuttlecock speed were 1.84 ms (1.45 to 2.23), 3.45 mm (2.18 to 4.72), -1.92 mm (-2.97 to -0.88), and -1.00 m/s (-2.46 to 0.46), respectively. No proportional bias was observed for any metric. These results suggest that the proposed method can serve as a useful tool for integrated assessment of badminton smash performance and equipment in practical settings.
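For readers unfamiliar with the validation statistic, the following minimal sketch shows how a Bland-Altman bias, its confidence interval, and the limits of agreement can be computed from paired measurements. The numbers are made up for illustration and are not the study's data; the function name `bland_altman` is hypothetical, and the interval uses a normal approximation (z = 1.96), which may differ from the authors' exact procedure.

```python
import math
import statistics


def bland_altman(method, reference, z=1.96):
    """Bias, approximate 95% CI of the bias, and limits of agreement."""
    diffs = [m - r for m, r in zip(method, reference)]
    n = len(diffs)
    bias = statistics.mean(diffs)          # mean difference between methods
    sd = statistics.stdev(diffs)           # sample SD of the differences
    se = sd / math.sqrt(n)                 # standard error of the bias
    return {
        "bias": bias,
        "bias_ci": (bias - z * se, bias + z * se),   # normal approximation
        "loa": (bias - z * sd, bias + z * sd),       # limits of agreement
    }


# Made-up impact times in milliseconds (not from the paper).
event_camera = [10.0, 12.0, 9.5, 12.5, 11.0]
high_speed = [8.0, 10.5, 7.0, 10.5, 9.0]
result = bland_altman(event_camera, high_speed)
print(result)
```

A confidence interval for the bias that excludes zero indicates a systematic offset between the two methods, while the wider limits of agreement describe how far individual paired measurements are expected to differ.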


2605.27990 2026-05-28 cs.LG cs.AI cs.CV 版本更新

Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

几何校正扩散后验采样：基于去噪器回拉曲率引导与流形对齐阻尼

Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

发表机构 * Department of Electrical and Computer Engineering, Inha University, Incheon, 22212, South Korea（仁荷大学电气与计算机工程系，仁川，22212，韩国）

AI总结提出一种基于去噪器回拉曲率引导和流形对齐阻尼的几何校正扩散后验采样方法，通过每噪声水平的阻尼高斯-牛顿校正替代标量引导，实现稳定高效的后验采样。

Comments Code: https://github.com/Seunghyeok0715/CLAMP

详情

Journal ref: International Conference on Machine Learning 2026

AI中文摘要

扩散后验采样将扩散先验条件于测量值，但数据一致性更新通常由手动调整的引导权重缩放，并且在刚性、算子依赖的曲率下可能破坏采样稳定性。我们使用在扩散状态坐标中计算的每噪声水平阻尼高斯-牛顿校正替代标量引导。该校正通过去噪器回拉似然梯度，使用避免前向去噪器雅可比矩阵的单侧曲率模型，并应用与去噪器残差对齐的扩散校准秩一阻尼。每个校正通过自动微分的无矩阵GMRES求解，采样通过具有闭式漂移/噪声分离的方差保持朗之万转移进行。在FFHQ和ImageNet上的逆问题中，该方法在PSNR/SSIM/LPIPS上达到竞争性能，同时运行速度显著快于大多数对比基线；在加速MRI重建中，它在对比基线中取得了最佳的PSNR/SSIM。

英文摘要

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss-Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. Across inverse problems on FFHQ and ImageNet, the method achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.
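To make the idea of a damped Gauss-Newton correction concrete, here is a toy sketch on a two-variable nonlinear measurement model. It is not the paper's algorithm: it uses plain isotropic damping and a direct 2x2 solve in place of the rank-one damping and matrix-free GMRES described in the abstract, and the operator `forward`, its Jacobian, and all values are hypothetical.

```python
def forward(x):
    """Toy nonlinear measurement operator (hypothetical)."""
    return [x[0] ** 2 + x[1], x[0] * x[1]]


def jacobian(x):
    """Analytic Jacobian of `forward` at x."""
    return [[2 * x[0], 1.0],
            [x[1], x[0]]]


def damped_gn_step(x, y, damping):
    """One step of x <- x - (J^T J + damping * I)^{-1} J^T (f(x) - y)."""
    r = [f - t for f, t in zip(forward(x), y)]            # residual f(x) - y
    J = jacobian(x)
    # Damped Gauss-Newton curvature H = J^T J + damping * I and gradient g = J^T r
    H = [[sum(J[k][i] * J[k][j] for k in range(2)) + (damping if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    g = [sum(J[k][i] * r[k] for k in range(2)) for i in range(2)]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    # Solve H * delta = -g with Cramer's rule (stand-in for an iterative solver)
    d0 = (-g[0] * H[1][1] + g[1] * H[0][1]) / det
    d1 = (-g[1] * H[0][0] + g[0] * H[1][0]) / det
    return [x[0] + d0, x[1] + d1]


y = forward([1.5, 0.5])        # synthetic, noise-free measurements
x = [1.0, 1.0]                 # initial guess
for _ in range(20):
    x = damped_gn_step(x, y, damping=1e-3)
residual = [f - t for f, t in zip(forward(x), y)]
print(x, residual)
```

The damping term keeps the linear system well conditioned when the curvature is small in some direction, which is the role that the abstract assigns to its damping scheme; in high dimensions the explicit matrix would be replaced by Jacobian-vector products inside an iterative solver.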


2605.27978 2026-05-28 cs.CV 版本更新

ABot-OCR Technical Report

ABot-OCR 技术报告

Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu

发表机构 * AMAP CV Lab（AMAP视觉实验室）

AI总结提出端到端视觉语言模型ABot-OCR，通过单次前向传播将页面图像直接转录为干净Markdown，并采用解耦异构文档优化的强化学习方法提升文本准确性和标记格式正确性，在OmniDocBench基准上达到最先进水平。

Comments 21 pages, 11 figures, technical report

详情

AI中文摘要

我们介绍了ABot-OCR，一个端到端的视觉语言模型，它通过单次前向传播将页面图像直接转录为干净的Markdown。通过这样做，我们的方法完全消除了脆弱的模块化编排。为了最大化解析保真度，我们开发了一个专用数据引擎，以提供大规模、结构一致的监督。此外，我们提出了解耦异构文档优化，一种结构约束的强化学习方法，它在监督微调之外进一步提高了文本准确性并严格强制执行标记格式的正确性。大量评估证明了我们框架的优越性能。在OmniDocBench v1.5和v1.6基准测试中，ABot-OCR在所有端到端系统中达到了最先进的分数92.81和93.30，显著缩小了与强流水线基线之间的性能差距。最后，跨十种不同语言的全面多语言文本识别进一步证实了ABot-OCR的鲁棒泛化能力。

英文摘要

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.


2605.27962 2026-05-28 cs.CV 版本更新

Bridging the Generalization Gap in Adverse Weather Segmentation: A Training Recipe Perspective

缩小恶劣天气分割中的泛化差距：训练方案视角

Cong Xu, Pu Luo, Yumei Li, Boyou Xue

发表机构 * Xidian University（西安电子科技大学）

AI总结本文从训练方案角度出发，通过域自适应微调、多源数据混合、场景平衡采样和合成退化增强等方法，显著缩小了恶劣天气语义分割中的验证-测试泛化差距。

详情

AI中文摘要

本文描述了我们在第8届UG2+研讨会（CVPR 2026）Track 2中的方法，该赛道针对五种天气条件（模糊、黑暗、雪、雾和眩光）退化的户外场景进行语义分割。我们观察到一个核心挑战是严重的泛化差距——在验证集上表现良好的模型在测试集上往往崩溃。例如，SegFormer-B5从验证到测试下降了16.1 mIoU点，表明仅靠模型容量不足以实现鲁棒性。我们研究精心设计的训练方案（而非架构复杂性）是否可以解决这一差距。从预训练的SegMAN-S骨干开始，我们系统地研究了域自适应微调、多源数据混合、场景平衡采样和合成退化增强的效果。我们的最终系统在官方测试集上达到了59.9%的mIoU，同时验证-测试差距仅为6.5个点——不到更大模型的一半。我们分析了架构修改、损失函数变体和模型缩放的负面结果，为有限数据下天气鲁棒分割提供实用见解。

英文摘要

This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap: models that perform well on the validation set often collapse on the test set. For instance, SegFormer-B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre-trained SegMAN-S backbone, we systematically study the effects of domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation-test gap of only 6.5 points, less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather-robust segmentation under limited data.
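The validation-test gap discussed above is a difference between two mIoU values. The sketch below shows one common way to compute mIoU from a confusion matrix and then the gap in points; the confusion matrices are made-up two-class examples, and the function name `mean_iou` is an assumption for this illustration rather than anything from the challenge toolkit.

```python
def mean_iou(confusion):
    """Mean IoU from a confusion matrix.

    confusion[i][j] = number of pixels with true class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp     # pixels wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:              # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Made-up two-class confusion matrices (not from the paper).
val_confusion = [[8, 2], [1, 9]]
test_confusion = [[6, 4], [3, 7]]
val_miou = mean_iou(val_confusion)
test_miou = mean_iou(test_confusion)
gap_points = 100 * (val_miou - test_miou)
print(val_miou, test_miou, gap_points)
```

Reporting the gap alongside the test score separates a model that is simply weaker from one that looks strong on validation but does not transfer.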


2605.27960 2026-05-28 cs.CV 版本更新

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Mags-RL: 通过智能体强化学习为多模态大语言模型戴上放大镜以进行复杂场景推理

Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang

发表机构 * Arizona State University（亚利桑那州立大学）； Clemson University（克莱姆森大学）； Washington University in St. Louis（圣路易斯华盛顿大学）； Halmstad University（哈姆斯塔德大学）； Florida State University（佛罗里达州立大学）； Rice University（莱斯大学）

AI总结提出Mags-RL框架，通过智能体强化学习让多模态大语言模型调用超分辨率代理进行高分辨率细粒度检查，实现两轮推理以提升复杂场景下的视觉推理能力。

详情

AI中文摘要

尽管多模态大语言模型（MLLMs）广受欢迎且成功，但它们在准确解释图像方面常常遇到困难，这限制了它们在复杂场景（如高物体密度和复杂背景杂乱）中的推理能力。先前的工作主要通过引入额外的显式视觉线索（如需要额外标注的边界框）来解决这一限制。此外，由此产生的低分辨率裁剪往往丢失了MLLMs进行准确推理所需的细粒度细节。因此，我们提出了Mags-RL，一个智能体强化学习（RL）框架，它为MLLMs配备了一个外部超分辨率“放大镜”代理，用于高分辨率细粒度检查。具体来说，该模型执行两轮推理：第一轮，它生成初始推理并自主识别感兴趣区域，无需依赖额外标注；第二轮，它调用超分辨率代理裁剪并放大这些区域，然后重新审视并验证其先前的推理以产生最终答案。我们还引入了一种新颖的课程学习策略，实现了数据高效的RL训练，仅需少至40个训练样本即可达到合理的性能。在VSR、TallyQA和GQA子集上的实验表明，与近期强竞争方法相比，它表现出优越的性能，展示了具有精确视觉基础的高质量推理。代码和权重将很快发布。

英文摘要

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues, such as bounding boxes, that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, requiring as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show superior performance over recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.


2605.27952 2026-05-28 cs.CV cs.RO 版本更新

Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry

Con-DSO：学习RGB-D直接稀疏里程计的短时一致性先验

Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang, Nak Young Chong

发表机构 * School of Information Science, Japan Advanced Institute of Science and Technology（北陆先端科学技术大学院大学信息科学学院）； College of Information Engineering, Shenyang University of Chemical Technology（沈阳化工大学信息工程学院）

AI总结提出Con-DSO框架，通过预测光度与深度几何一致性不确定性，实现质量感知的像素选择和加权，提升RGB-D直接稀疏里程计在动态、遮挡等挑战环境下的鲁棒性。

Comments Submitted

详情

AI中文摘要

视觉里程计（VO）是机器人和增强现实中的基础组件。RGB-D直接VO受益于度量深度测量，但在动态物体、遮挡、光照变化和不可靠深度违反直接对齐所使用的短时光度和深度几何一致性假设的挑战环境中，性能会下降。现有方法通过语义过滤、显式遮挡推理、光照适应或手工几何准则来缓解这些问题，但通常依赖外部模块或针对个别故障模式的固定假设，限制了其灵活性和以统一方式处理多样挑战的能力。本文提出Con-DSO，一种一致性感知的RGB-D直接稀疏里程计框架，从时间相邻的RGB-D帧对预测密集的光度和深度几何一致性不确定性。一致性网络通过流引导的光度误差和投影深度一致性误差进行训练，使得一致性违规可表示为像素级不确定性。这些成对不确定性预测被转换为关键帧跟踪的主机侧质量先验。该先验随后通过质量感知的支持像素选择和位姿估计中的解耦光度-几何加权应用于VO，使得不可靠观测持续衰减，而非硬拒绝或基于阈值的门控。在五个公开RGB-D基准上的实验表明，与直接RGB-D VO基线相比，在ICL-NUIM上绝对轨迹误差降低超过20%，在RGB-D Scenes V2、TUM/Bonn Dynamic和OpenLORIS序列上降低50%-80%。

英文摘要

Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50%-80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.
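The contrast between continuous attenuation and hard rejection can be illustrated with a one-dimensional weighted least-squares estimate. This is a deliberately simplified sketch: the inverse-variance weighting, the gating threshold, and all numbers are assumptions for illustration and are not taken from Con-DSO, which weights photometric and geometric residuals inside a pose optimizer.

```python
def soft_weighted_estimate(obs, sigma):
    """Weighted least squares for a scalar: minimizes sum w_i * (z_i - t)^2.

    Weights are inverse variances (an assumed choice), so uncertain
    observations are attenuated continuously instead of being dropped.
    """
    weights = [1.0 / (s * s) for s in sigma]
    return sum(w * z for w, z in zip(weights, obs)) / sum(weights)


def hard_gated_estimate(obs, sigma, max_sigma):
    """Threshold-based gating: discard observations above max_sigma, then average."""
    kept = [z for z, s in zip(obs, sigma) if s <= max_sigma]
    return sum(kept) / len(kept)


# Made-up residual observations; the last one mimics a dynamic-object outlier
# that the uncertainty predictor has flagged with a large sigma.
obs = [1.0, 1.1, 0.9, 5.0]
sigma = [0.1, 0.1, 0.1, 2.0]
soft = soft_weighted_estimate(obs, sigma)
hard = hard_gated_estimate(obs, sigma, max_sigma=1.0)
unweighted = sum(obs) / len(obs)
print(soft, hard, unweighted)
```

The soft estimate stays close to the inlier consensus without needing a threshold, whereas the gated estimate depends on choosing `max_sigma` well and the unweighted mean is pulled toward the outlier.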


2605.27950 2026-05-28 cs.CV 版本更新

Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

从以自我为中心的饮食环境图像推断饮食行为改变接受度的可行性评估

Long Li, Yuning Huang, Heather A. Eicher-Miller, J. Graham Thomas, Fengqing Zhu, Edward Sazonov

发表机构 * The University of Alabama（阿拉巴马大学）； Purdue University（普渡大学）； Brown University（布朗大学）

AI总结本研究利用可穿戴相机收集的以自我为中心的饮食图像，通过预训练CLIP视觉编码器和轻量级Transformer分类器，初步验证了被动感知推断饮食行为改变接受度的可行性。

详情

AI中文摘要

准确评估饮食行为改变接受度对于设计有效的即时自适应干预措施（JITAIs）以促进更健康的饮食习惯至关重要。然而，基于自我报告的行为改变接受度评估稀疏且延迟，限制了其在持续监测中的实际应用。为探索被动感知是否有助于解决这一挑战，本研究进行了一项初步调查，从可穿戴相机收集的以自我为中心的饮食图像中推断参与者自我报告的行为改变接受度。我们使用自动摄入监测器v2（AIM-2）在自由生活饮食事件中获取的初步数据。数据包括饮食期间捕获的以自我为中心的图像序列，并配以评估行为改变接受度特定维度（意识、互动能力和动机）的问题的回答。为了检查视觉信息是否与这些回答相关，我们评估了一个结合预训练对比语言-图像预训练（CLIP）视觉编码器和轻量级Transformer分类器的迁移学习辅助框架。该模型处理饮食事件图像序列，以提取与行为改变接受度相关的潜在语义和时间线索。初步实验结果显示，在行为改变接受度指标上，相比于简单基线模型有显著改进。这些早期发现表明，以自我为中心的饮食事件图像可能包含与饮食行为改变接受度相关的线索，并需要在更大、更全面的数据集上进行进一步研究。

英文摘要

Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data include egocentric image sequences captured during eating, paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether the visual information is relevant to these responses, we evaluate a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity and warrant further investigation with larger and more comprehensive datasets.


2605.27938 2026-05-28 cs.CV 版本更新

SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images

SEMAGIC: 从野外图像中学习语义一致的可变形3D表示

Sky Cen, Wufei Ma, Guofeng Zhang, Alan Yuille, Adam Kortylewski

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； CISPA Helmholtz Center for Information Security（CISPA亥姆霍兹信息安全中心）

AI总结针对现有可变形3D重建方法语义对应不稳定的问题，提出SEMAGIC框架，通过特征级一致性损失和顶点索引条件变形，在重建过程中强制语义一致性，从而提升类别级语义对应性能。

详情

AI中文摘要

从单视图野外图像中学习可变形3D物体模型已实现了无需监督的令人印象深刻的3D形状重建。然而，这些模型是否捕捉到下游任务所需的语义结构仍不清楚。我们发现，现有的可变形重建方法尽管生成了视觉上合理的几何形状，但在实例间产生了不稳定的对应关系，并在语义对应基准上表现不佳。我们引入了SEMAGIC，一个从单视图野外图像中学习语义一致的可变形3D表示的框架。SEMAGIC不将重建视为最终目标，而是将可变形建模作为发现类别级对应关系的机制。每个类别由一个规范模板网格和一个学习到的变形场表示，其功能类似于一个从图像特征重建实例几何的自编码器，使得顶点能够在实例间保持一致的语义含义。训练过程中通过(i)对齐规范网格和变形网格之间语义特征的特征级一致性损失，以及(ii)保持实例间语义对应的顶点索引条件变形，来强制语义一致性。通过将几何变形与语义对齐显式耦合，SEMAGIC生成了在类别内变化中保持稳定部件对应的表示。实验表明，SEMAGIC在SPair-71k上将可变形模型的语义对应提高了+14.7 PCK@0.1，确立了可变形模型作为有效语义3D表示的地位。

英文摘要

Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 PCK@0.1 on SPair-71k, establishing deformable models as effective semantic 3D representations.
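The reported gain is measured in PCK@0.1, the percentage of correct keypoints at a tolerance of 0.1. As a minimal sketch, the function below counts a transferred keypoint as correct when it lies within `alpha` times the larger side of the object bounding box from its ground-truth location. That normalization is the convention commonly used on SPair-71k, but treat it as an assumption here; the coordinates are made up and `pck` is a hypothetical helper name.

```python
import math


def pck(pred, gt, bbox_wh, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(bbox side) of ground truth."""
    thresh = alpha * max(bbox_wh)
    hits = sum(math.dist(p, g) <= thresh for p, g in zip(pred, gt))
    return hits / len(gt)


# Made-up keypoints in pixels; the bounding box is 100 x 80, so the threshold is 10 px.
gt_points = [(10, 10), (50, 50), (90, 20)]
pred_points = [(12, 11), (70, 50), (90, 29)]
score = pck(pred_points, gt_points, bbox_wh=(100, 80))
print(score)
```

Because the threshold scales with object size, the metric is comparable across instances photographed at different scales.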

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG 版本更新

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

当图文推理遇上安全：什么决定了多模态越狱鲁棒性？

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

发表机构 * Independent Researcher（独立研究者）； Stanford University（斯坦福大学）； Harvard University（哈佛大学）； Purdue University（普渡大学）； Duke University（杜克大学）

AI总结本文研究多模态大语言模型中不同图文推理范式对越狱鲁棒性的影响，发现显式图像工具交互能显著降低攻击成功率，并通过引入图像工具安全向量框架从表征层面解释其机制。

Comments 17 pages, 6 figures, 7 tables

AI中文摘要

图文推理正成为大型视觉-语言模型的一种新推理范式，但其安全性影响尚不明确。现有系统已涵盖多种流程设计，包括直接响应生成、纯文本前轮、视觉状态操作以及显式外部图像工具调用。本文探究这些评估范式中哪一种能提升多模态越狱鲁棒性及其原因。在多个视觉-语言模型上，我们的实验表明显式图像工具交互的攻击成功率最低，平均相对降低约30%。这一发现起初令人惊讶：即使返回的图像工具输出被人为覆盖或本身不安全，攻击成功率仍保持较低，但在纯文本前轮控制下又恢复到接近直接回答的水平。这些结果表明，较低的攻击成功率并非由良性返回图像语义或仅文本图像工具轨迹解释。为解释这一模式，我们引入了一个图像工具安全向量框架，将图像工具调用建模为隐藏表示向安全相关方向的残差偏移。表征层面的分析和激活干预支持了这一解释。总体而言，我们的结果表明，显式图像工具交互是提升越狱鲁棒性的一种有前景的设计模式，同时也推动了针对特定流程的安全性评估。

英文摘要

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.
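The idea of modeling a tool call as a residual shift along a safety-relevant direction can be sketched as follows. This is only an illustration under stated assumptions: the direction is estimated with a difference of means over two groups of hidden states, the vectors are toy 3-d values rather than real activations, and the helper names (`mean`, `unit`, `shift`) are hypothetical. The paper's actual estimator and intervention may differ.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


# Hypothetical hidden states (toy values, not real model activations).
refusing = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
complying = [[0.0, 0.1, 0.0], [0.2, 0.3, 0.1]]

# Difference-of-means estimate of a safety-relevant direction (an assumption).
direction = unit([r - c for r, c in zip(mean(refusing), mean(complying))])


def shift(hidden, direction, strength):
    """Add a residual shift of the given strength along the direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


h = [0.1, 0.2, 0.05]
h_shifted = shift(h, direction, strength=0.5)
before = dot(h, direction)
after = dot(h_shifted, direction)
```

Because the direction has unit norm, the projection onto it grows by exactly the chosen strength, which is the quantity an activation-intervention experiment would vary.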

2605.27927 2026-05-28 cs.CV cs.LG 版本更新

Structure-Guided Visual Perturbation Neutralization for LVLMs

结构引导的视觉扰动中和用于大型视觉语言模型

Yuanhe Zhang, Xueting Wang, YanBin Ren, Haoran Gao, Xinhan Zheng, Zhenhong Zhou, Fanyu Meng, Li Sun, Sen Su

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； University of Science and Technology of China（中国科学技术大学）； JIUTIAN Research（JIUTIAN研究所）； Nanyang Technological University（南洋理工大学）； Chongqing University of Posts and Telecommunications（重庆邮电大学）

AI总结提出结构诱导引导中和（SIGN）框架，通过先验结构提取和动态引导中和实现轻量级、即插即用的对抗性防御，在仅0.5%像素修改和0.16秒每图下达到87%以上防御成功率。

AI中文摘要

图像输入使大型视觉语言模型（LVLMs）能够感知细粒度的视觉信息，但也引入了一个像素级攻击面，通过该攻击面，对抗性扰动可以引发不安全的模型行为。然而，大多数现有防御是为传统计算机视觉场景设计的，因此常常忽略LVLMs所需的跨模态对齐，导致性能下降。同时，针对LVLMs的有限防御通常需要大量的图像修改并引入可观的计算开销，从而损害推理质量和效率。为解决这些限制，我们提出了结构诱导引导中和（SIGN），一个轻量级、即插即用的防御框架，通过先验结构提取提高LVLM兼容性，并通过动态引导中和实现高效的扰动抑制。大量实验表明，SIGN在仅0.5%像素修改和每张图像0.16秒的情况下实现了超过87%的防御成功率，同时几乎保留了原始视觉表示和良性任务性能。我们的工作为需要昂贵模型训练的防御提供了一种轻量级替代方案，并突显了利用视觉编码器进行高效对抗性保护的潜力。我们的代码已在 https://anonymous.4open.science/r/SIGN-BCB1 开源。

英文摘要

Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is open source on https://anonymous.4open.science/r/SIGN-BCB1.

2605.27924 2026-05-28 cs.CV 版本更新

SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

SIGMA: 基于语义差异的指令引导掩码标注器用于文本驱动图像操作定位

Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie, Jishen Zeng, Baoying Chen, Jiwu Huang, Xiaochun Cao

发表机构 * Shenzhen Campus of Sun Yat-sen University（中山大学深圳校区）； Guangdong Provincial Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security（广东省智能信息处理重点实验室和深圳媒体安全重点实验室）； Shenzhen University of Advanced Technology and Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences（深圳先进技术大学和深圳先进技术研究所，中国科学院）； Alibaba Group（阿里巴巴集团）； Shenzhen MSU-BIT University（深圳MSU-BIT大学）

AI总结提出SIGMA方法，通过视觉基础模型中的语义特征差异和指令引导的空间先验，自动从公开编辑数据集中生成像素级掩码，用于训练图像操作定位模型，在五个基准上F1提升12.20%，并生成约110万训练集使六个检测器平均F1提升18.34%。

AI中文摘要

文本驱动的图像编辑发展迅速，但可靠地定位这些操作需要在大规模像素标注数据集上训练的图像操作定位（IML）模型，目前尚无低成本获取此类训练数据的方法。我们观察到这些数据实际上已经以伪装形式存在：公开编辑数据集包含数百万个与IML训练样本结构相同的（原始、编辑）图像对，仅缺少像素级掩码。自动恢复这些掩码并非易事：像素差异被扩散引起的所有像素扰动淹没，而仅基于指令的定位只能定位提示描述的内容，遗漏了意外的编辑副作用。我们提出SIGMA（语义差异指令引导掩码标注器），它在视觉基础骨干网络中进行语义特征差异计算，并通过双向跨模态精炼将指令导出的空间先验注入视觉流，在编辑器忠实实现用户意图时放大预期编辑区域的差异信号。SIGMA通过两个互补阶段训练：第一阶段在修复掩码上进行监督；第二阶段通过VAE往返噪声校准、EMA自训练和编辑噪声解耦损失来弥合扩散域偏移。SIGMA在五个基准上优于现有自动掩码生成器（F1提升12.20%，IoU提升11.16%）。当应用于公开编辑语料库时，它生成了约110万IML训练集，使六个不同检测器在五个数据集上平均F1提升18.34%，将以前未使用的编辑数据转化为IML的模型无关监督资源。论文被接收后我们将立即发布完整代码库。

英文摘要

Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of (original, edited) pairs that are structurally identical to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We will release the full codebase as soon as the paper is accepted.
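The F1 and IoU gains reported above are pixel-level mask metrics. A minimal sketch of how they are typically computed for one binary mask pair follows; the function name `mask_scores` and the tiny 2x3 masks are hypothetical, and returning 0.0 for an empty denominator is one convention among several.

```python
def mask_scores(pred, truth):
    """Return (F1, IoU) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # manipulated pixel correctly flagged
            elif p and not t:
                fp += 1  # clean pixel wrongly flagged
            elif t and not p:
                fn += 1  # manipulated pixel missed
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    iou = tp / iou_den if iou_den else 0.0
    return f1, iou


pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
f1, iou = mask_scores(pred_mask, true_mask)
```

Here there are two true positives, one false positive and one false negative, so IoU is 2/4 and F1 is 4/6. F1 is never lower than IoU on the same mask pair, which is why the two numbers are usually reported together.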

2605.27923 2026-05-28 cs.CV cs.AI cs.LG quant-ph 版本更新

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

我们真的需要量子机器学习吗？：一项多维实证研究

Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

发表机构 * Department of Computer Science, University of Alabama, AL 35487（阿拉巴马大学计算机科学系，AL 35487）

AI总结通过在MNIST手写数字数据集上对经典和量子机器学习模型进行多维基准测试，发现量子模型在准确率、参数和内存效率上优于经典模型，但计算成本更高。

AI中文摘要

计算机视觉的快速发展和日益复杂的图像识别任务暴露了经典机器学习模型的基本计算限制，推动了量子计算作为一种新兴范式的探索。本文对MNIST手写数字数据集上的经典和量子机器学习模型进行了全面的基准测试，评估了传统模型（经典支持向量机CSVM和量子支持向量机QSVM）以及深度神经网络模型（经典卷积神经网络CCNN和量子卷积神经网络QCNN）在四个性能维度上的表现：分类准确率、计算运行时间、参数数量和内存需求。实验作为特征维度和样本量的函数进行，并在CPU和GPU执行环境下进行，提供了受控的多维比较，以解决先前工作中的空白。对于基于SVM的模型，QSVM在准确率上始终优于CSVM，在1000个样本时达到约0.90对比约0.85，但计算成本更高。10个量子比特的特征数和200-500的样本量成为平衡准确率和运行时间的实际工作点。对于神经网络模型，CCNN和QCNN实现了可比的分类准确率，在64个特征和60000个样本时均超过0.96，但QCNN在参数和内存效率上显著更优，在较高特征数下比CCNN少约94%的参数和约75%的内存，但运行时间更长。在两个模型家族中，随着特征维度或样本量的增加，量子模型在准确率上始终以更大优势超越经典模型。

英文摘要

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging new paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200-500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

2605.27920 2026-05-28 cs.CV 版本更新

Rethinking Video-Language Model from the Language Input Perspective

从语言输入角度重新思考视频-语言模型

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

发表机构 * School of Software Engineering, Huazhong University of Science and Technology（华中科技大学软件学院）； Nanyang Technological University, Singapore（新加坡南洋理工大学）； University College London（伦敦大学学院）； Huazhong University of Science and Technology（华中科技大学）； Wuhan University（武汉大学）

AI总结本文从语言输入角度出发，提出一种即插即用的框架，通过生成正负文本、属性文本推理和自加权损失，提升视频-语言模型的性能。

Comments Published in AAAI 2026

AI中文摘要

受大语言模型浪潮的驱动，视频-语言模型（VLM）已成为弥合视频与文本之间差距的重要且具有挑战性的技术。尽管先前的VLM工作取得了显著进展，但几乎所有工作都隐含地假设所有文本都是由特定模板预定义的。在实际应用中，这种严格的假设无法满足，因为1）预定义所有文本极其耗时费力；2）这些预定义的文本输入过于限制且不友好，限制了其应用。观察到，给定视频输入，语义相似但模板不同的文本会导致不同的性能。为此，本文提出了一种新颖的即插即用框架，用于各种基于VLM的方法，以充分弥合视频和文本。具体来说，我们首先从原始文本中生成正负文本，以针对特定的文本组件。然后，我们提出了一种基于属性的文本推理策略，以挖掘生成文本的细粒度语义。最后，我们利用视频作为指导，通过设计自加权损失来进行跨模态桥接。大量实验表明，所提出的方法可以作为即插即用模块，有效提升最先进VLM的性能。

英文摘要

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy since 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. We observe that, given a video input, texts with similar semantics but different templates lead to different performance. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module to effectively improve the performance of state-of-the-art VLMs.

2605.27916 2026-05-28 cs.CV cs.CL 版本更新

SIGMA：弥合视觉基础模型适应的结构与分布差距

Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi, Ying Huang

发表机构 * Xiaomi Corporation（小米公司）

AI总结提出SIGMA方法，通过尺度自适应融合和语义调制模块，以1.72%可训练参数实现视觉基础模型在密集预测任务上的高效微调，性能优于现有PEFT方法。

AI中文摘要

视觉基础模型（VFM）展示了令人印象深刻的表示能力。然而，通过全微调将它们适应到下游任务会带来高昂的计算和存储开销。参数高效微调（PEFT）作为一种有吸引力的替代方案应运而生，旨在以最小的训练成本实现与全微调相当的性能。尽管如此，将PEFT应用于VFM进行密集预测任务仍然具有挑战性，因为存在结构和分布差距。为了弥合这些差距，我们提出了尺度集成全局调制适配器（SIGMA），一种新颖的轻量级PEFT方法，它由两个模块组成：尺度自适应融合和语义调制。具体来说，尺度自适应融合模块用于通过增强多粒度视觉信息的提取来弥合结构差距。此外，SIGMA在融合特征上引入语义调制以执行全局特征对齐，进一步消除分布差距。这种设计促进了统一的空间和分布适应，相对于VFM骨干网络仅需1.72%的可训练参数。在各种下游密集任务和多个VFM骨干网络上的全面实验表明，SIGMA在性能上一致且优于最先进的PEFT方法。

英文摘要

Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fusion features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

2605.27891 2026-05-28 cs.CV cs.AI 版本更新

一种用于纹理识别的深度滤波器组的自监督学习方法

Joao B. Florindo, Lucas O. Lyra, Antonio E. Fabris

发表机构 * Institute of Mathematics and Statistics of the University of Sao Paulo（圣保罗大学数学与统计学研究所）； Institute of Mathematics, Statistics and Scientific Computing of the University of Campinas（坎皮纳斯大学数学、统计与科学计算研究所）

AI总结针对纹理识别中训练数据有限的问题，提出一种基于卷积自编码器的自监督预训练框架，结合深度滤波器和Fisher向量池化，在不显著增加计算负担的情况下提升识别性能。

AI中文摘要

纹理识别中的一个重要挑战是实际应用中经常遇到的训练数据有限。在计算机视觉中，缓解这一问题的一个成功策略是使用预训练阶段，其中神经网络以自监督方式学习识别数据各部分之间的关系。在这方面，一个成熟的框架是掩码自编码器。然而，这些模型通常依赖于计算密集型的架构，如视觉变换器。在纹理图像的特定情况下，大多数相关信息被压缩在每个像素周围的有限区域内，这表明通过注意力机制捕获长距离依赖可能是不必要的。基于这一假设，本文提出了一种预训练模型为卷积自编码器的框架。为了利用纹理模式传递的丰富信息，我们采用了深度滤波器与Fisher向量池化相结合的方法。通过这种方式，我们在不增加显著计算负担的情况下提高了纹理识别的性能。我们的方法与多个纹理数据库中的几种最先进方法进行了比较，证实了其在分类精度和计算复杂度方面的潜力。

英文摘要

An important challenge in texture recognition is the limited amount of training data frequently found in real-world applications. In computer vision in general, a successful strategy to mitigate this issue is the use of a pretraining stage where the neural network learns to identify relations between parts of the data in a self-supervised manner. A well-established framework in this direction is the masked autoencoder. Nevertheless, these models usually rely on computationally intensive architectures, such as vision transformers. In the particular case of texture images, most of the relevant information is concentrated within a limited area around each pixel, which suggests that capturing long-range dependence via the attention mechanism may be unnecessary. Based on that assumption, here we propose a framework where the pretraining model is a convolutional autoencoder. To leverage the rich information conveyed by texture patterns, we employ deep filters coupled with Fisher vector pooling. In this way, we improve the performance of texture recognition without adding significant computational burden. Our approach is compared with several state-of-the-art methods in different texture databases, confirming its potential both in terms of classification accuracy and computational complexity.

2605.27823 2026-05-28 cs.CR cs.AI cs.CV 版本更新

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

解耦对抗性提示：基于语义图的鲁棒大语言模型安全防御

Xiang Fang, Wanlong Fang

发表机构 * Xiang Fang（1. 方翔）； Wanlong Fang（2. 方万龙）

AI总结提出对抗性提示解耦（APD）框架，通过互信息语义分解、图谱分析和轻量级分类器，在输入处理前识别并中和恶意组件，将有害输出减少85%以上。

Comments Published in AAAI 2026

AI中文摘要

大语言模型（LLMs）越来越容易受到利用语义歧义绕过安全机制的对抗性提示的攻击，导致有害或不适当的输出。此类攻击，包括越狱和提示注入，对安全关键应用中LLMs的完整性和可用性构成重大风险。本文提出对抗性提示解耦（APD）框架，一种新颖的防御机制，在输入提示被LLM处理之前主动识别并中和其中的恶意组件。APD框架集成了三项关键创新：（1）基于互信息的语义分解方法，用于分离对抗性和良性提示组件，确保统计独立性；（2）基于图的意图分类方法，利用谱分析检测提示语义中的恶意模式；（3）轻量级基于Transformer的分类器，在真实世界的毒性和越狱提示数据集上训练，实现高效准确的对抗性意图检测。在包含对抗性提示的多样化数据集上评估，APD展现出卓越的鲁棒性，将有害输出生成减少超过85%，同时对模型性能的影响可忽略不计。该框架的计算效率支持实时部署，使其成为保护LLMs的实用解决方案。我们的工作解决了机器学习安全中关于新型攻击和ML系统完整性方法的关键挑战，并提供了一种可扩展、符合伦理的防御手段来对抗基于提示的对抗性威胁。

英文摘要

Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security-critical applications. This paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in input prompts before they are processed by the LLM. The APD framework integrates three key innovations: (1) a mutual information-based semantic decomposition method to isolate adversarial and benign prompt components, ensuring statistical independence; (2) a graph-based intent classification approach that leverages spectral analysis to detect malicious patterns in prompt semantics; and (3) a lightweight transformer-based classifier trained on real-world datasets of toxic and jailbreaking prompts, enabling efficient and accurate adversarial intent detection. Evaluated on diverse datasets containing adversarial prompts, APD demonstrates superior robustness, reducing harmful output generation by over 85% while maintaining negligible impact on model performance. The framework's computational efficiency supports real-time deployment, making it a practical solution for securing LLMs. Our work addresses critical challenges in machine learning security on novel attacks and integrity methods for ML systems, and offers a scalable, ethically grounded defense against prompt-based adversarial threats.

2605.27817 2026-05-28 cs.RO cs.AI cs.CV cs.LG 版本更新

Turning Video Models into Generalist Robot Policies

将视频模型转化为通用机器人策略

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

发表机构 * MIT（麻省理工学院）； CMU（卡内基梅隆大学）； Amazon FAR（亚马逊公司）

AI总结提出一种解耦的视频到动作策略VERA，利用无动作视频世界模型和基于机器人雅可比矩阵的逆动力学模型，实现跨本体的零样本机器人控制。

Comments project page: https://vera.csail.mit.edu

AI中文摘要

视频生成模型已成为一种有前景的机器人骨干网络，能够生成描绘跨本体和环境完成复杂任务的视频。最近的工作提出了机器人基础模型，通过使用带有动作标签的数据微调视频模型，联合预测未来观测和动作。在本文中，我们测试了一种替代方法的极限：保持视频规划器不变，同时训练一个特定本体的逆动力学模型（IDM）。这种解耦带来了几个自然的好处：视频规划器保持本体无关，不同的视频模型可以轻松互换而无需重新训练IDM，并且IDM可以独立地使用现成的自对弈数据进行训练。我们提出了一种闭环的视频到动作策略，该策略将无动作视频世界模型与基于机器人本体雅可比矩阵的精心设计的IDM相结合。我们证明了我们的IDM设计既数据高效又可扩展到高维动作空间。我们将该策略命名为视频到具身机器人动作模型（VERA），在模拟和真实世界基准测试中取得了强劲的性能，包括零样本的Panda机械臂操作和16自由度Allegro灵巧手立方体重新定向。通过将相同的视频规划器与不同的本体特定IDM配对，可以在多个本体上使用。我们的结果表明，解耦的视频规划加上忠实的视频到动作翻译是实现零样本、跨本体和可泛化机器人控制的可行替代途径。更多结果请访问我们的项目网站：https://vera.csail.mit.edu。

英文摘要

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.
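To make the role of an embodiment Jacobian concrete, here is a toy sketch: for a planar two-link arm, a desired small displacement of the end effector (as a video plan might imply between two frames) is converted into joint increments by solving J(q) Δq = Δx. This is ordinary differential inverse kinematics used as a stand-in, not the learned IDM from the paper; the link lengths, function names, and target displacement are all hypothetical.

```python
import math

LINK_1, LINK_2 = 1.0, 0.8  # hypothetical link lengths


def forward(q):
    """End-effector position of a planar two-link arm with joint angles q."""
    x = LINK_1 * math.cos(q[0]) + LINK_2 * math.cos(q[0] + q[1])
    y = LINK_1 * math.sin(q[0]) + LINK_2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q):
    """2x2 Jacobian of the end-effector position with respect to q."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [
        [-LINK_1 * s1 - LINK_2 * s12, -LINK_2 * s12],
        [LINK_1 * c1 + LINK_2 * c12, LINK_2 * c12],
    ]


def joint_step(q, target_delta):
    """Solve J(q) dq = target_delta with the closed-form 2x2 inverse."""
    (a, b), (c, d) = jacobian(q)
    det = a * d - b * c  # zero only at singular (fully stretched) poses
    dx, dy = target_delta
    return [(d * dx - b * dy) / det, (-c * dx + a * dy) / det]


q = [0.3, 0.9]
target_delta = (0.01, -0.02)  # desired end-effector motion between frames
dq = joint_step(q, target_delta)
start = forward(q)
end = forward([q[0] + dq[0], q[1] + dq[1]])
achieved = (end[0] - start[0], end[1] - start[1])
```

The achieved displacement matches the target up to second-order terms in the step size, which is one reason a closed-loop policy that re-plans at every step suits this kind of linearization.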

2605.27816 2026-05-28 cs.CV 版本更新

Pattern Recognition Tasks with Personalized Federated Learning

个性化联邦学习的模式识别任务

Md. Arifur Rahman, Isha Das, Mushfiqur Rahman Abir, B. M. Taslimul Haque, Abdullah Al Noman, Abir Ahmed, Md. Jakir Hossen

发表机构 * College of Graduate and Professional Studies, Trine University（特灵大学研究生与专业研究学院）； Network Communication and IoT Lab, Chittagong University of Engineering and Technology（吉大港工程技术大学网络通信与物联网实验室）； Department of Computer Science and Engineering, American International University-Bangladesh（美国国际大学-孟加拉国计算机科学与工程系）； Information Systems, Central Michigan University（中央密歇根大学信息系统系）； Wilmington University（威尔明顿大学）； Department of Information Technology, Washington University of Science & Technology（华盛顿科学与技术大学信息科技系）； Center for Advanced Analytics (CAA), COE for Artificial Intelligence, Faculty of Engineering & Technology (FET), Multimedia University, Melaka（多媒体大学马六甲工程与技术学院（FET）、人工智能学院（COE）高级分析中心（CAA））

AI总结本文通过比较七种个性化联邦学习算法在MNIST、SignMNIST和Digit5数据集上的性能，发现APPLE、FedGC和FedProto在准确率、精确率、召回率和F1分数上表现优异。

Comments Comprehensive comparative analysis of 7 Personalized Federated Learning algorithms across MNIST, SignMNIST, and Digit5 datasets. The paper presents detailed methodology, workflow architecture, experimental evaluation, and privacy-preserving AI analysis for distributed intelligent systems, secure collaborative learning, and critical infrastructure applications

DOI: 10.28991/ESJ-2026-010-02-020
Journal ref: Emerging Science Journal 10(2):974-990 (2026)

AI中文摘要

个性化联邦学习（PFL）构成了一种新颖的范式，它为每个客户端定制机器学习（ML）模型，从而在维护严格数据隐私原则的同时提供个性化的模型更新。与传统的标准联邦学习（FL）方法不同，PFL使模型适应不同的客户端数据分布，从而在最小化通信开销的同时，实现更高水平的准确性、定制化和数据安全性。这种方法在依赖于异构数据源且以隐私问题为关键的模式识别任务背景下尤为突出。在本研究工作中，本文对七种不同的PFL算法进行了全面的比较分析，这些算法在三个不同的数据集（即MNIST、SignMNIST和Digit5）上部署。总体目标是通过基于准确率、精确率、召回率和F1分数等指标的严格评估，确定在模式识别任务框架内最优秀的PFL算法。同时，对这些PFL算法进行了深入审查，阐明了它们的工作流程、优点和局限性。通过实证研究，结果表明APPLE、FedGC和FedProto是强有力的竞争者，在评估的数据集范围内始终提供优越的性能，同时承认其他算法的上下文特异性以及通过迭代改进实现最优结果的潜力。

英文摘要

Personalized Federated Learning (PFL) constitutes a novel paradigm that tailors Machine Learning (ML) models to individual clients, thereby furnishing personalized model updates whilst upholding stringent data privacy principles. Diverging from conventional standard Federated Learning (FL) approaches, PFL adapts models to distinct client data distributions, engendering heightened levels of accuracy, customization, and data security, all while minimizing communication overhead. This methodology proves particularly salient in contexts marked by pattern recognition tasks reliant upon heterogeneous data sources and underpinned by paramount privacy apprehensions. In the present research endeavor, this article undertakes a comprehensive comparative analysis of seven distinct PFL algorithms deployed across three diverse datasets, namely MNIST, SignMNIST, and Digit5. The overarching objective entails ascertaining the preeminent PFL algorithm, within the framework of pattern recognition tasks, through a rigorous evaluation anchored in metrics encompassing Accuracy, Precision, Recall, and F1 Score. Concurrently, an in-depth scrutiny of these PFL algorithms is conducted, elucidating their operative workflows, advantages, and limitations. Through empirical investigation, the findings evince that APPLE, FedGC, and FedProto emerge as stalwart contenders, consistently furnishing superior performance across the spectrum of assessed datasets, while acknowledging the contextual specificity of alternative algorithms and the potential for iterative refinement to realize optimal outcomes.

2605.27813 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

残差化时间稀疏自编码器用于解释扩散模型

Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou, Mohsen Imani

发表机构 * University of California, Irvine（加州大学尔湾分校）

AI总结提出残差化时间稀疏自编码器，通过去噪时间步间的线性预测残差学习扩散激活轨迹中的可解释特征，并在Stable Diffusion 1.5上验证其有效性。

AI中文摘要

文本到图像扩散模型通过迭代去噪过程生成图像，因此内部神经层产生激活轨迹而非单一静态表示。稀疏自编码器（SAE）最近被用于将扩散激活分解为可解释的特征方向，但大多数方法在单个时间步分析激活或基于时间条件，而非直接从完整激活轨迹中学习。在这项工作中，我们引入了用于扩散激活轨迹的残差化时间SAE。我们收集去噪时间上的激活，拟合相邻时间步之间的线性预测器，并使用初始激活以及这些线性动力学未解释的残差分量来表示每个轨迹。在这种残差化表示上训练SAE鼓励稀疏潜在变量捕捉超出线性可预测范围的结构。残差化解码器方向可以映射回激活空间，使得每个潜在变量可以作为去噪时间上的特征轨迹进行分析。通过在Stable Diffusion 1.5上的重建与消融研究、时空特征分析和定性引导实验，我们表明残差化时间SAE为研究时间结构化的扩散激活提供了一个有用的框架。
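The residualization step described in this abstract (fit a linear predictor between adjacent denoising timesteps, then represent each trajectory by its initial activation plus the unexplained residuals) can be sketched on toy data. The sketch below makes a strong simplifying assumption: each activation is a single scalar and the predictor is one least-squares coefficient per step, whereas the paper presumably works with high-dimensional activations. All function names and numbers are hypothetical.

```python
def fit_step_coefficients(trajectories):
    """Least-squares scalar a_t minimizing sum (x[t+1] - a_t * x[t])^2 over trajectories."""
    steps = len(trajectories[0]) - 1
    coeffs = []
    for t in range(steps):
        num = sum(x[t] * x[t + 1] for x in trajectories)
        den = sum(x[t] * x[t] for x in trajectories)
        coeffs.append(num / den if den else 0.0)
    return coeffs


def residualize(x, coeffs):
    """Represent a trajectory as [x_0, r_1, ..., r_T] with r_{t+1} = x_{t+1} - a_t * x_t."""
    return [x[0]] + [x[t + 1] - a * x[t] for t, a in enumerate(coeffs)]


def reconstruct(rep, coeffs):
    """Invert residualize: roll the linear dynamics forward and add the residuals back."""
    x = [rep[0]]
    for t, a in enumerate(coeffs):
        x.append(a * x[t] + rep[t + 1])
    return x


# Toy activation trajectories over four denoising timesteps.
trajectories = [
    [1.0, 0.8, 0.7, 0.5],
    [2.0, 1.7, 1.2, 1.0],
    [0.5, 0.45, 0.3, 0.28],
]
coeffs = fit_step_coefficients(trajectories)
rep = residualize(trajectories[0], coeffs)
restored = reconstruct(rep, coeffs)
```

The residualized representation is lossless given the fitted coefficients, so a sparse autoencoder trained on it can still be decoded back into activation space, while the linearly predictable part of the dynamics no longer has to be explained by the sparse latents.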

AndroidDaily: 面向真实世界闭源应用的可验证移动GUI智能体基准

Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； StepFun ； Waseda University（早稻田大学）

AI总结针对闭源应用无法获取内部状态导致自动验证困难的问题，提出AndroidDaily基准（350个日常任务）和GRADE评估器（基于可观察外部指南的三层系统），实现无需内部状态的可验证评估，最强模型成功率为62.0%。

Comments 11 pages, 6 figures. Preprint

AI中文摘要

GUI基础模型和移动GUI智能体的快速发展催生了众多评估基准，但大多数依赖于模拟环境或开源应用，真实世界的闭源应用在很大程度上未得到评估。核心困难在于闭源应用不暴露内部状态，使得传统的自动验证不适用。为弥合这一差距，我们引入了AndroidDaily，一个大规模基准，包含跨94个高频Android应用的350个现实日常任务，涵盖交通、购物、本地服务、娱乐、内容创作、社交媒体和日常实用工具。为了在这些不透明环境中实现自动且可验证的评估，我们提出了基于指南的自动诊断评估评审器（GRADE），这是一个基于三层可观察外部指南系统构建的过程感知评估器：操作义务、输出质量和负面约束。GRADE根据这些标准跟踪智能体的视觉轨迹，并产生步骤级诊断判断，将长期、开放式的移动交互转化为可验证的评估，而无需依赖隐藏的内部状态。实验表明，GRADE与人类评估者的一致性达到87.37%。最强模型在AndroidDaily上的成功率为62.0%，凸显了当前推理能力与现实移动工作流实际执行之间的巨大差距。

英文摘要

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL 版本更新

显式评论家引导的对齐扩散模型

Zhengyang Liang, Qihang Zhang, Ceyuan Yang

发表机构 * University of Toronto（多伦多大学）； Vector Institute（向量研究所）； The Chinese University of Hong Kong（香港中文大学）

AI总结提出一种状态对齐的潜在演员-评论家框架，通过将扩散模型自身作为时间步条件价值函数，实现轨迹级PPO训练和推理时引导，在单/多奖励基准上优于先前方法。

AI中文摘要

在线强化学习对于将扩散模型与不可微目标对齐变得越来越重要。然而，现有方法在沿去噪轨迹分配细粒度信用和实现稳定的基于价值的优化方面仍面临限制。我们提出了一种用于扩散后训练的状态对齐潜在演员-评论家框架，其中扩散模型自身作为时间步条件价值函数，并直接在噪声潜在状态上预测价值。这使得轨迹级PPO训练成为可能，通过简单的条件和价值预训练策略支持稳定的演员-评论家优化，并自然地允许学习到的评论家用于推理时引导。我们进一步将框架扩展到多奖励优化，其中与互补奖励的联合训练有助于减轻奖励破解。在基于UNet和DiT的骨干网络上，我们的方法在单奖励和多奖励基准上始终优于先前的组相对RL和演员-评论家基线，同时测试时引导在生成质量上提供了额外提升。

英文摘要

Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.
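A timestep-conditioned value function is what makes per-step credit assignment along a denoising trajectory possible. One standard way to use such values in PPO is generalized advantage estimation (GAE); the sketch below applies it to a toy trajectory where the reward arrives only after the final denoising step. Whether the paper uses GAE exactly is an assumption, and the reward and value numbers are made up for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation.

    `values` holds one more entry than `rewards`; the last entry is the value
    after the final step (0.0 when the trajectory terminates there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at denoising step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy denoising trajectory: reward only once the final image is scored.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.9, 0.0]  # hypothetical critic outputs on noisy latents
adv = gae(rewards, values)
adv_monte_carlo = gae(rewards, values, lam=1.0)
```

With `lam=1.0` and `gamma=1.0` each advantage reduces to the final reward minus the critic's value at that step, while smaller `lam` values lean more on the critic and give lower-variance, per-step credit.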

2605.27726 2026-05-28 cs.CV 版本更新

Asynchronous Remote Sensing Time-Series Fusion for Cloud Removal and Anytime Reconstruction

异步遥感时间序列融合用于云去除与任意时间重建

Forouzan Fallah, Chia Yu Hsu, Wenwen Li, Anna Liljedahl, Yezhou Yang

发表机构 * School of Computing and Augmented Intelligence, Arizona State University（计算与增强智能学院，亚利桑那州立大学）； School of Geographical Sciences and Urban Planning, Arizona State University（地理科学与城市规划学院，亚利桑那州立大学）； Woodwell Climate Research Center（伍德沃德气候研究中心）

AI总结提出AGFlow模型，通过时间对齐生成流匹配融合异步S1/S2数据，实现云去除、缺失帧重建及任意时间查询。

Comments CVPR 2026 MORSE Workshop

AI中文摘要

频繁的云层覆盖严重限制了Sentinel-2 (S2) 光学时间序列在地球表面监测中的可用性。Sentinel-1 (S1) SAR提供了全天候的互补观测，但由于采集不规则且异步，实际的S1/S2融合仍然困难。许多现有方法假设时间对齐的输入（或需要外部最近日期匹配），并且通常仅恢复观测时间戳，限制了长间隙下的重建并阻止了按需合成。我们提出AGFlow（时间对齐生成流匹配），一种用于S1/S2云去除和时间序列重建的时空流匹配模型，具有三种能力：(1) 时间戳条件内部对齐，融合异步S1和含云S2观测，无需基于预处理的配对；(2) 时空上下文感知去噪，联合建模空间结构与时间动态（而非独立的逐像素时间序列）；(3) 任意时间查询，能够在监测窗口内的观测时间戳和用户指定时间戳生成无云S2帧。我们在RESTORE-DiT基准协议上进行了评估，包括定量指标、定性比较和组件消融。AGFlow显著改善了完全缺失帧的重建（MAE和RMSE相比RESTORE-DiT降低16-19%），并在持续间隙下提供可靠重建，同时具有竞争力的云去除性能和灵活的时间查询能力，适用于密集植被监测等下游任务。

英文摘要

Frequent cloud cover severely limits the usability of Sentinel-2 (S2) optical time series for Earth surface monitoring. Sentinel-1 (S1) SAR provides all-weather complementary observations, but practical S1/S2 fusion remains difficult because acquisitions are irregular and asynchronous. Many existing approaches assume temporally aligned inputs (or require external nearest-date matching) and typically restore only observed timestamps, limiting reconstruction under long gaps and preventing on-demand synthesis. We propose AGFlow (Time Aligned Generative Flow Matching), a spatiotemporal flow-matching model for S1/S2 cloud removal and time-series reconstruction with three capabilities: (1) timestamp-conditioned internal alignment that fuses asynchronous S1 and cloudy S2 observations without preprocessing-based pairing; (2) spatiotemporal, context-aware denoising that models spatial structure jointly with temporal dynamics (rather than independent per-pixel time series); and (3) anytime querying, enabling generation of cloud-free S2 frames at both observed and user-specified timestamps within the monitoring window. We evaluate on the RESTORE-DiT benchmark protocol with quantitative metrics, qualitative comparisons, and component ablations. AGFlow notably improves fully missing-frame reconstruction (MAE and RMSE are reduced by 16-19% relative to RESTORE-DiT) and provides reliable reconstructions under persistent gaps, while also yielding competitive cloud removal performance and flexible temporal querying for downstream tasks such as dense vegetation monitoring.

2605.27686 2026-05-28 cs.CV cs.AI 版本更新

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

张量记忆：用于长程Transformer的固定大小循环状态

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

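Affiliations: Massachusetts Institute of Technology, Cambridge, MA, USA; IBM Research, Cambridge, MA, USA; University of Toronto, Toronto, Canada

AI summary: Proposes Tensor Memory, a module that augments Transformers with a fixed-size recurrent 3D tensor state, decoupling state capacity from input length while preserving a spatial inductive bias; aimed at long-horizon video understanding.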

AI中文摘要

Transformer通过将空间和时间展平为长令牌序列来处理图像和视频。虽然注意力和KV缓存保留了过去的特征，但其内存随序列长度增长，并且缺乏显式的、持久化的空间状态，这使得长程视频理解和遮挡敏感推理变得困难。我们提出张量记忆，一种轻量级模块，通过固定大小的循环3D记忆张量增强Transformer块：令牌通过可微的软写入将内容沉积为围绕预测连续3D位置的高斯加权体积到体素网格中，记忆通过高效的局部交互算子和门控循环动态更新，令牌通过连续采样和门控残差融合读取上下文。由于记忆张量大小固定，张量记忆将状态容量与输入长度解耦，同时保持空间归纳偏置。我们在标准语言、图像和视频基准测试以及一个旨在隔离持久状态何时有益的受控玩具诊断套件上评估该模块；它与标准Transformer训练流程集成，可以附加到现有块或从中移除，而无需其他架构更改。

英文摘要

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
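
The "differentiable soft write" described above can be illustrated with a small NumPy sketch. This is only a sketch under assumptions, not the authors' implementation: the function name `soft_write`, the normalized `[0, 1]^3` coordinate convention, the fixed `sigma`, and the additive update are all hypothetical choices made for illustration (the abstract also mentions gated recurrent dynamics, which are omitted here).

```python
import numpy as np

def soft_write(memory, position, content, sigma=0.1):
    """Deposit `content` into a voxel grid with Gaussian weights around `position`.

    memory:   (D, H, W, C) fixed-size voxel grid
    position: (3,) continuous location, assumed to lie in [0, 1]^3
    content:  (C,) feature vector to write
    """
    D, H, W, _ = memory.shape
    # Voxel-center coordinates, normalized to [0, 1] along each axis.
    zs = (np.arange(D) + 0.5) / D
    ys = (np.arange(H) + 0.5) / H
    xs = (np.arange(W) + 0.5) / W
    grid = np.stack(np.meshgrid(zs, ys, xs, indexing="ij"), axis=-1)  # (D, H, W, 3)

    # Gaussian weight of every voxel relative to the predicted location.
    sq_dist = np.sum((grid - np.asarray(position)) ** 2, axis=-1)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    weights /= weights.sum()  # total deposited mass equals `content`

    # Additive write; the grid size never depends on how many tokens are written.
    return memory + weights[..., None] * np.asarray(content)

memory = np.zeros((4, 4, 4, 2))
memory = soft_write(memory, position=(0.5, 0.5, 0.5), content=(1.0, 2.0))
```

Because the grid shape is fixed, writing any number of tokens leaves the state size unchanged, which is the decoupling of state capacity from input length that the abstract refers to.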

2605.27679 2026-05-28 cond-mat.soft cs.CV cs.LG 版本更新

On the Equivariant Learning of the $Q$-tensor Order Parameter

$Q$ 张量序参数的等变学习

Julia Navarro, Mark Wilkinson

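Affiliations: Nottingham Trent University, UK; Berea College, USA

AI summary: Constructs and evaluates group-equivariant neural networks that predict the two-dimensional $Q$-tensor order parameter of nematic liquid crystals from synthetically generated microscopic textures, finding that equivariant models achieve lower errors and generalize better than non-equivariant baselines.

Comments 15 pages (excluding 7-page appendix); 6 figures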

AI中文摘要

我们构建并评估了群等变神经网络，用于从合成生成的微观纹理预测向列液晶的二维 $Q$ 张量序参数。使用权重共享约束、等变激活和正则化技术的组合，构建了七个等变于 $k=4,8,16,32,64,128,256$ 阶循环群 $C_k$ 的架构。为此，我们构造了旋转类置换矩阵群，其元素 $\varrho_{C_k}(g)$ 作用于按行向量化的图像，从而近似方形图像上圆形子域的 $\frac{2\pi}{k}$ 旋转。我们展示了所有七个等变模型在单精度浮点精度内满足 $Q$ 张量等变性约束。与近似参数匹配的非等变基准（有或没有数据增强）相比，我们发现等变模型始终实现更低的误差，并且对未见过的缺陷配置具有更强的泛化能力。性能随群阶增加而提高，表明纳入更精细的旋转对称性会导致更低的误差。

英文摘要

We construct and evaluate group-equivariant neural networks for the prediction of the two-dimensional $Q$-tensor order parameter of nematic liquid crystals from synthetically generated microscopic textures. Seven architectures, equivariant to cyclic groups $C_k$ of order $k$ for $k=4,\,8,\,16,\,32,\,64,\,128,\,256$, are built using a combination of weight-sharing constraints, equivariant activations and regularization techniques. To do this, we construct rotation-like permutation matrix groups with elements $\varrho_{C_k}(g)$ that act on row-wise vectorized images, thereby approximating a $\frac{2\pi}{k}$ rotation of the circular subdomain on square images. We show that all seven equivariant models satisfy the $Q$-tensor equivariance constraint to within single-precision floating point accuracy. Comparing against approximately parameter-matched non-equivariant benchmarks, with and without data augmentation, we find that the equivariant models consistently achieve lower errors and generalize more robustly to unseen defect configurations. Performance increases with group order, suggesting that the incorporation of finer rotational symmetry leads to lower errors.
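
For orientation, the equivariance constraint mentioned above can be written in a standard form. This is a sketch under assumptions rather than the paper's own notation: $f$ is taken to be a network mapping a texture $x$ to a single $2\times 2$ $Q$-tensor, $R_g$ is the rotation matrix associated with the generator $g$ of $C_k$, and the normalization of $Q$ (scalar order parameter $S$, unit director $n$) follows the common two-dimensional convention, which may differ from the one used by the authors.

```latex
Q = S\left(n \otimes n - \tfrac{1}{2} I\right), \qquad
f\big(\varrho_{C_k}(g)\,x\big) = R_g\, f(x)\, R_g^{\top}, \qquad
R_g = \begin{pmatrix}
\cos\frac{2\pi}{k} & -\sin\frac{2\pi}{k} \\
\sin\frac{2\pi}{k} & \cos\frac{2\pi}{k}
\end{pmatrix}
```

Under this reading, rotating the input texture by $\frac{2\pi}{k}$ should rotate the predicted tensor by conjugation with $R_g$, and the residual of the middle identity is the quantity one would check against single-precision tolerance.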

2605.27616 2026-05-28 cs.CV cs.AI 版本更新

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

并非所有 NVFP4 QAT 配方都相同：架构和规模如何影响异常分割的模型质量

Zijian Du, Oleg Rybakov

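Affiliations: NVIDIA

AI summary: Evaluates, under a unified protocol, how architecture, scale, and FP4 quantization-aware training (QAT) recipe interact on a brain tumor anomaly segmentation task. Architecture choice has the largest impact on quantization robustness, attention-based architectures are notably resilient to recipe choice, and CNNs degrade under gradient-quantizing recipes at larger scales.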

Journal ref: CVPR2026

AI中文摘要

实时异常分割要求高召回率和高效的低精度推理。我们研究了模型架构、模型规模和 FP4 量化感知训练 (QAT) 配方在召回关键的脑肿瘤分割任务中的三方交互，在统一协议下评估了多种架构、规模和 QAT 配方。我们发现架构选择对量化鲁棒性影响最大，基于注意力的架构对配方选择表现出显著的韧性，而 CNN 在大规模下在梯度量化配方下性能下降。在低容量下，FP4 可能离散化 softmax 注意力，但高级 QAT 配方可防止这种崩溃。在更大规模下，高级配方减轻了降低 CNN 质量的梯度量化噪声。五折患者级交叉验证证实这些发现对数据划分具有鲁棒性。我们的结果表明，Swin Transformer 在所有规模下对 QAT 配方选择都具有鲁棒性，使其成为 FP4 量化异常分割的推荐架构。

英文摘要

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNNs degrade under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to the data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.

2605.27595 2026-05-28 cs.CV cs.AI 版本更新

Representation-Conditioned Diffusion Models for Guided Training Data Generation

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

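Affiliations: Linköping University

AI summary: Proposes representation-conditioned diffusion models that generate synthetic images conditioned on DINOv2, DINOv3, and CLIP representations. On ImageNet100, classification accuracy is 10.76 percentage points higher than with class-conditioned generation and even exceeds a model trained on real data by 2.0 percentage points.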

AI中文摘要

数据可用性仍然是许多深度学习应用中的关键瓶颈。大规模数据集通常收集、整理和标注成本高昂，这可能限制监督学习方法的可扩展性和适用性。在这项工作中，我们评估了在由生成式深度学习产生的合成图像数据集上训练的模型的分类性能。具体而言，我们使用基于DINOv2、DINOv3和CLIP学习表示的潜在扩散模型。我们的结果表明，这种表示条件公式通过提高样本质量和模式覆盖，显著优于类条件生成（在ImageNet100上top-1准确率提高10.76个百分点）。此外，通过扩大合成数据集的规模，我们能够超越在真实数据上训练的分类器（top-1准确率提高2.0个百分点）。我们还展示了生成的图像如何用于增强目的，优于经典增强方法，以及如何利用条件空间进行样本过滤以进一步提高训练价值。总的来说，这些发现表明，表示条件扩散模型为在大规模视觉学习任务中增强、补充或潜在替代真实世界数据集提供了一种有前景的方法。

英文摘要

Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100) by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy). We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.

2605.27487 2026-05-28 cs.CV cs.AI 版本更新

Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

基于扩散的乌克兰手写文本生成与跨域风格迁移

Andrii Ahitoliev, Pavlo Berezin

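Affiliations: Ukrainian Catholic University, Lviv, Ukraine; National University of "Kyiv-Mohyla Academy", Kyiv, Ukraine

AI summary: To address the lack of data and of generalization studies for handwritten text generation in non-Latin scripts such as Ukrainian, builds a Ukrainian handwritten word dataset and retrains DiffusionPen, validating latent diffusion models for cross-domain style transfer through cross-lingual, zero-shot, and few-shot transfer experiments.

Comments 16 pages, 7 figures. Submitted to ICTERI 2026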

基于书写者风格的手写文本生成（HTG）在拉丁文字中已被广泛研究，但在低资源和非拉丁书写系统中仍探索不足，现有模型在拉丁域之外的泛化能力尚不明确。西里尔字母，尤其是乌克兰语，缺乏大规模书写者标注数据集和此类泛化的经验证据。为填补这一空白，我们使用连通分量分割、质量过滤和对代表性不足的乌克兰字符进行针对性过采样，构建了一个包含308位书写者、126,177张图像的乌克兰手写单词数据集。我们在不修改架构的情况下，在该数据集上重新训练了DiffusionPen——一种带有MobileNetV2三元组损失风格编码器和CANINE条件潜在扩散U-Net的模型，测试了从拉丁到西里尔字母的直接迁移。我们在三种设置下评估跨域风格迁移：从IAM英文样本的跨语言迁移、对20世纪早期乌克兰手稿的零样本迁移，以及对当代书写者的少样本模仿。该模型生成可读且风格一致的单词图像，表明少样本潜在扩散模型能够泛化到拉丁文字域之外。我们发布了数据集、训练模型和评估协议，作为书写者感知的西里尔HTG的可复现基准，为将风格化HTG扩展到其他代表性不足的书写系统奠定了基础。

英文摘要

Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. We retrain DiffusionPen, a MobileNetV2 triplet-loss style encoder with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalize beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.

2605.27467 2026-05-28 cs.LG cs.AI cs.CV 版本更新

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

液态神经网络与LSTM在序列模式识别中的比较分析：鲁棒性、效率与临床实用性

Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

Affiliations: National Electronics and Computer Technology Center (NECTEC); Language Understanding Lab.

AI summary: Compares Liquid Neural Networks (LNNs) with LSTMs on four types of sequential data, finding that LNNs offer better parameter efficiency and robustness, particularly in data-sparse clinical settings.

Comments 9 pages, 7 figures, 6 tables, The conference paper will appear in Proceedings of JCSSE 2026

AI中文摘要

传统的循环神经网络（RNN）和长短期记忆网络（LSTM）在离散时间步上运行，往往无法捕捉现实世界物理过程的流体时间动态。液态神经网络（LNN），特别是闭式连续时间（CfC）网络，通过将隐藏状态演化建模为连续微分方程来解决这一问题。在本文中，我们在四种不同的序列模态上进行了全面的基准测试研究：神经形态事件数据（N-MNIST）、基于笔画的绘图（QuickDraw）、视觉手写（IAM）和生理时间序列（PhysioNet Sepsis-3）。此外，我们使用时间丢弃法进行了严格的压力测试，以评估模型对缺失数据的鲁棒性。我们的研究结果表明，LNN在原生时间域和数据稀疏普遍的临床环境中，始终提供优越的参数效率和显著更高的鲁棒性。本扩展预印本提供了关于相关数据集和LNN理论谱系的额外背景，并附有详细附录，记录了我们的完整实现和实验设置。

英文摘要

Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.
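
The temporal-dropout stress test mentioned above can be pictured with a short sketch. It is an illustration under assumptions, not the paper's protocol: the helper name `temporal_dropout`, the independent per-step drop probability, and the choice to delete steps (rather than mask or impute them) are hypothetical.

```python
import numpy as np

def temporal_dropout(sequence, drop_prob, rng):
    """Randomly remove time steps from a (T, F) sequence.

    Returns the surviving rows and their original time indices, so a
    continuous-time model can still see the (now irregular) time gaps.
    """
    keep = rng.random(sequence.shape[0]) >= drop_prob
    keep[0] = True  # always keep the first step so the result is never empty
    return sequence[keep], np.flatnonzero(keep)

rng = np.random.default_rng(0)
sequence = np.arange(20, dtype=float).reshape(10, 2)  # 10 steps, 2 features
kept, kept_times = temporal_dropout(sequence, drop_prob=0.5, rng=rng)
```

Returning the original indices alongside the data matters for the comparison in the abstract: a discrete-time LSTM sees only a shorter sequence, whereas a continuous-time model can be given the elapsed time between surviving samples.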

2605.27465 2026-05-28 cs.CV cs.AI 版本更新

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge: 面向视觉Transformer无训练加速的显著性感知自适应令牌合并

Semi Lee, Hyejin Go, Hyesong Choi

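Affiliations: Electronic Engineering; Soongsil University

AI summary: Proposes AdaMerge, a framework that improves the accuracy-compute Pareto frontier of training-free token merging through two complementary mechanisms: salience-weighted similarity and adaptive merging intensity.

Comments 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026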

AI中文摘要

视觉Transformer（ViT）中自注意力的二次计算成本构成了实际部署的基本瓶颈，激发了令牌缩减方面的活跃研究。在现有方法中，令牌合并（ToMe）已成为一种优雅的无训练解决方案；然而，其设计基于令牌平等的隐含前提，这与自注意力已充分证明的非均匀性相悖，并在激进压缩下导致高显著性令牌的信息丢失。我们通过AdaMerge解决了这一局限，该框架基于两个互补机制。首先，显著性加权相似度利用列式特征亲和度中心性作为令牌重要性代理，并将所得显著性分数纳入二分匹配分数，确保关键令牌对合并表示贡献更大。其次，自适应合并强度使用预先计算的逐层相似度统计量，根据输入特定的冗余性动态调整每层缩减数量。在ImageNet-1k上使用ViT-B/16，AdaMerge在所有FLOPs匹配条件下均持续优于ToMe、PiToMe和DSM。精度差距随压缩单调增大：在13.4G FLOPs操作点，AdaMerge的Top-1下降仅为-1.06%，而PiToMe为-1.45%，DSM为-4.62%。据我们所知，AdaMerge是首个将显著性加权相似度和自适应逐层缩减结合到单一无训练令牌合并框架中的方法，推动了ViT加速的精度-FLOPs帕累托前沿。

英文摘要

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.
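
The idea of letting salient tokens contribute more to a merged token can be sketched as follows. This is a simplified illustration, not the AdaMerge algorithm: it uses a ToMe-style alternating split into two token sets, and the function name `salience_weighted_merge`, the softmax-like salience built from column sums of the cosine affinity, and the fixed merge count `r` are assumptions. Unlike the method described above, salience here only affects the weighted average, not the matching score, and the per-layer adaptive reduction is omitted.

```python
import numpy as np

def salience_weighted_merge(x, r):
    """Merge r tokens of x (N, C) into their most similar partners.

    Tokens are split alternately into sets A and B; the r A-tokens with the
    highest similarity to a B-token are folded into that B-token using a
    salience-weighted average. Returns (N - r, C) tokens.
    """
    n = x.shape[0]
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    affinity = xn @ xn.T                       # cosine similarity, (N, N)

    # Assumed salience proxy: column-wise centrality, mapped to positive weights.
    centrality = affinity.sum(axis=0)
    salience = np.exp(centrality - centrality.max())

    a_idx = np.arange(0, n, 2)
    b_idx = np.arange(1, n, 2)
    scores = affinity[np.ix_(a_idx, b_idx)]    # (|A|, |B|)
    best_b = scores.argmax(axis=1)             # best partner in B for each A token
    best_score = scores.max(axis=1)

    order = np.argsort(-best_score)            # most similar pairs first
    merged, kept = order[:r], order[r:]

    # Salience-weighted accumulation into the B tokens.
    num = salience[b_idx][:, None] * x[b_idx]
    den = salience[b_idx].copy()
    src = a_idx[merged]
    np.add.at(num, best_b[merged], salience[src][:, None] * x[src])
    np.add.at(den, best_b[merged], salience[src])

    return np.concatenate([x[a_idx[kept]], num / den[:, None]], axis=0)

tokens = np.random.default_rng(0).normal(size=(8, 4))
reduced = salience_weighted_merge(tokens, r=2)  # 8 tokens reduced to 6
```

`np.add.at` is used instead of fancy-index assignment so that several A-tokens mapping to the same B-token are all accumulated.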

2605.27464 2026-05-28 cs.CV cs.AI 版本更新

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

超越运动基元：基于头戴式IMU的行为活动识别

Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

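Affiliations: Harvard AI and Robotics Lab, Harvard University

AI summary: Proposes HiT-HAR, a hierarchical model that uses head-mounted IMU data for behavior-level activity recognition beyond conventional motion primitives, outperforming existing models on five-class action and eight-class scenario recognition.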

AI中文摘要

AR智能眼镜需要连续的行为上下文来提供主动辅助，但其最实用的常开传感器——头戴式惯性测量单元（IMU）仅能检测行走或站立等运动基元。我们突破运动基元，实现行为级识别，定义了五个类别以平衡AR应用需求与传感器可观测性。为此，我们构建了一个包含16万样本的Ego4D数据集，采用四层质量保证框架覆盖8个活动场景，并提出了HiT-HAR，一个70.3万参数的层次模型，在五类动作和八类场景识别中优于先前的头戴式IMU模型。我们通过每类可分离性分析进一步绘制了头戴式IMU的可观测性边界，识别出哪些行为类别可靠可观测（移动），哪些受益于时间上下文（物体传递、任务操作），以及哪些场景依赖的信号重叠仍构成挑战。我们的结果表明，利用时间上下文和场景结构的架构选择优于单纯扩大模型规模。代码和数据集公开于https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR。

英文摘要

AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR.

2605.27460 2026-05-28 cs.CV 版本更新

D$^2$Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

D$^2$Turb: 深度感知仿真与解耦学习用于单帧大气湍流抑制

Zixiao Hu, Tianyu Li, Guoqing Wang, Wei Li, Guoguo Xin, Xun Liu, Peng Wang

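Affiliations: University of Electronic Science and Technology of China; Beijing Institute of Space Mechanics and Electricity; School of Physics, Northwest University Xi'an

AI summary: Proposes D$^2$Turb, a framework that combines physics-based simulation with decoupled restoration via a depth-aware turbulence synthesis protocol and an adaptive structural prior injection mechanism, achieving texture deblurring and geometric rectification under single-frame atmospheric turbulence.

Comments 14 pages, 7 figures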

AI中文摘要

单帧大气湍流抑制由于空间变化模糊与非刚性几何畸变并存而本质上是病态的。现有的基于平面场仿真的端到端方法通常难以平衡纹理恢复与几何校正。为克服这一限制，我们提出D$^2$Turb，一个将物理仿真与显式解耦恢复相结合的统一框架。首先，我们引入深度感知湍流合成协议，将场景深度纳入相位到空间公式中，生成物理一致、深度相关的退化，并为解耦学习提供关键的中间倾斜监督信号。基于该仿真引擎，D$^2$Turb将恢复分解为两个交互阶段：纹理去模糊和几何校正。纹理去模糊阶段采用去模糊骨干网络恢复细节，同时保留几何畸变以供后续校正阶段使用。为缓解级联设计中常见的信息碎片化问题，我们进一步提出自适应结构先验注入（ASPI）机制，动态传递去模糊模块的深层结构表示以指导密集流预测进行空间去扭曲。大量实验表明，D$^2$Turb在合成和真实数据集上均达到最先进性能，在纹理恢复和几何保真度方面均有持续改进。我们的代码和预训练模型已在 https://github.com/HertzDot222/D2Turb 公开。

英文摘要

Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D$^2$Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D$^2$Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D$^2$Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at https://github.com/HertzDot222/D2Turb.

2605.27452 2026-05-28 cs.CV 版本更新

Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

微调视觉语言模型以理解当前损伤并使用质量卫士代理进行优先级评分

Takato Yasuno

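AI summary: Fine-tunes the LLaVA-1.5-7B vision-language model and combines it with a rule engine and a Quality Guard agent to automate bridge damage understanding and repair priority scoring, reducing scoring variability and improving efficiency.

Comments 23 pages, 11 figures, 13 tables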

AI中文摘要

日本的桥梁检查要求每五年进行一次强制性目视评估，然而不同工程师分配的定性损伤等级（a-e级）存在显著的评分者间变异性——这是实现一致基础设施管理的关键障碍。资深工程师的老龄化进一步威胁检查能力。本文提出了一种使用微调视觉语言模型（VLM）自动化桥梁损伤理解和修复优先级评分的方法。我们使用QLoRA在多达4000对桥梁损伤图像和检查文本记录上微调LLaVA-1.5-7B，然后在固定的800张图像测试集上进行评估。模型输出识别结构构件和损伤模式的自然语言描述，基于此，一个基于规则的评分引擎计算五级修复优先级指数。一项渐进式训练研究（1k/2k/3k/4k样本）表明，2000个训练样本在仅2.9小时的训练中即可达到接近最优的验证损失；超过2000后，训练样本每翻倍，验证损失改善不超过0.2%，表现出明显的收益递减。此外，在保留测试集上的语义相似度在3000样本时达到峰值（0.6909），在4000样本时下降（0.6739），表明质量策划的中等规模数据优于更大但噪声更多的语料库。结合torch.compile()和批处理（batch_size=8）的推理优化实现了每张图像10.06秒——比未优化基线降低了70.2%。我们的方法有助于桥梁检查中的数据治理，减少评分者间变异性，并提供AI辅助分诊以增强检查流程中的专家工程师。此外，我们引入了一个两阶段质量卫士，使用微调的Swallow-8B SLM在优先级评分前拒绝低质量的VLM输出，防止来自损坏或无法识别图像的虚假评分。

英文摘要

Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability -- a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). We fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining torch.compile() and batch processing (batch_size=8) achieves 10.06 seconds per image -- a 70.2% reduction over the unoptimized baseline. Our approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images.
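
The figures quoted above imply a few quantities that the abstract does not state directly. The short calculation below derives them; the variable names are illustrative, and the derived values (an unoptimized baseline of roughly 33.8 seconds per image and about 2.2 hours for the 800-image test set) are inferences from the reported numbers, not results reported by the author.

```python
# Reported in the abstract.
optimized_s_per_image = 10.06   # after torch.compile() + batch_size=8
reduction = 0.702               # 70.2% reduction over the unoptimized baseline
test_images = 800
similarity_3k, similarity_4k = 0.6909, 0.6739

# Derived quantities (inferred, not reported).
baseline_s_per_image = optimized_s_per_image / (1.0 - reduction)
test_set_hours = test_images * optimized_s_per_image / 3600.0
similarity_drop = similarity_3k - similarity_4k

print(f"implied baseline: {baseline_s_per_image:.2f} s/image")
print(f"test set at optimized speed: {test_set_hours:.2f} h")
print(f"similarity drop from 3k to 4k samples: {similarity_drop:.4f}")
```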

2605.27451 2026-05-28 cs.CV 版本更新

From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop & Competition

从情感到复杂行为：第十届ABAW研讨会与竞赛推进多模态以人为中心的人工智能

Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia, Eric Granger, Marco Pedersoli, Simon Bacon, Jens Madsen, Soufiane Belharbi, Muhammad Haseeb Aslam, Chunchang Shao, Guanyu Hu

Affiliations: Queen Mary University of London; Hume AI; Google DeepMind; Imperial College London; Cogitat; LIVIA; ILLS; ETS Montreal; Concordia University; Xi’an Jiaotong University

AI summary: Presents the 10th ABAW Workshop and Competition, which advances the modelling, analysis, and understanding of human affect and behavior in real-world settings through multimodal challenges and papers.

Comments accepted at CVPR 2026

AI中文摘要

第十届真实世界情感与行为分析（ABAW）研讨会与竞赛，与CVPR 2026同期举办，持续推动在真实、无约束环境中对人类情感与行为的建模、分析和理解研究。研讨会保持双重结构，包括竞赛和论文轨道。ABAW竞赛引入了一系列多样化的挑战，针对情感与行为理解的关键方面，包括连续情感（效价-唤醒度）估计、离散情感（表情和动作单元）识别，以及更复杂的行为分析任务，如情感模仿强度估计、矛盾/犹豫识别和细粒度暴力检测。这些挑战基于大规模真实世界数据集，为最先进方法提供了全面的基准。与此同时，论文轨道展示了广泛的贡献，涵盖姿态、运动与行为估计、情感建模与多模态学习、基准、数据集与评估协议、公平性、鲁棒性与部署。总体而言，第十届ABAW研讨会与竞赛继续作为基准测试、合作与创新的关键平台，塑造下一代多模态、以人为中心的人工智能系统的发展。

英文摘要

The 10th Affective & Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on the modelling, analysis, and understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, as well as more complex behavior analysis tasks, such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion & behavior estimation; affect modelling & multimodal learning; benchmarks, datasets & evaluation protocols; and fairness, robustness & deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration and innovation, shaping the development of next-generation multimodal, human-centered AI systems.

2605.27436 2026-05-28 cs.IR cs.AI cs.CV 版本更新

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock

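Affiliations: Federal Institute for Occupational Safety and Health (BAuA); Fraunhofer Institute for Manufacturing Engineering and Automation IPA

AI summary: Proposes SemProbe, a tool that generates semantic probes through controllable diffusion-model inpainting, supports user-defined masks and factors, and automatically evaluates and records changes in object detection model robustness.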

2605.26624 2026-05-28 cs.CV 版本更新

MSCGC-KAN: Multi-scale Causal Graph Convolution and Kolmogorov-Arnold Feature Mapping for EEG Emotion Recognition

MSCGC-KAN: 用于脑电情感识别的多尺度因果图卷积与Kolmogorov-Arnold特征映射

Haoliang Gong, Qingshan She, Jiale Xu, Yunyan Gao, Xugang Xi

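Affiliations: School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China; Zhejiang Provincial Key Laboratory of Brain Computer Collaborative Intelligence Technology

AI summary: Proposes MSCGC-KAN, which builds a structured task head from multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping on top of a pre-trained CBraMod backbone, strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping to significantly improve EEG emotion recognition performance.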

AI中文摘要

基于脑电图的情感识别是一项重要的情感计算任务，最近的脑电图基础模型为下游适应提供了有用的通用表示。然而，在微调设置下，三个局限性仍然突出：多尺度情感动态建模不足、通道间功能连接利用不充分以及简单线性分类头的表达能力有限。为了解决这些问题，本文提出了一种新的脑电情感识别方法，称为MSCGC-KAN，它引入了一个由多尺度因果图卷积和Kolmogorov-Arnold特征映射组成的结构化任务头。基于预训练的CBraMod骨干，MSCGC-KAN通过在紧凑的任务特定头中联合加强多尺度时间建模、可学习通道间连接建模和非线性判别映射来增强下游适应。这种设计保留了基础模型的表示优势，同时使分类器对情感相关的时空模式更加敏感。在公开的FACED和SEED-VII数据集上进行了大量实验。所提方法在FACED上实现了60.66%的平衡准确率、0.5525的Cohen's Kappa和60.40%的加权F1分数，在SEED-VII上分别获得了33.27%、0.2223和33.64%。与CBraMod+Linear基线相比，在两个数据集上平衡准确率分别提高了5.91和2.03个百分点。这些结果表明，在微调预训练脑电模型时，结构化任务头设计是改进脑电情感识别的有效方法。

英文摘要

Electroencephalogram (EEG)-based emotion recognition is an important affective computing task, and recent EEG foundation models provide useful generic representations for downstream adaptation. However, under the fine-tuning setting, three limitations remain prominent: insufficient modeling of multi-scale emotional dynamics, inadequate exploitation of inter-channel functional connectivity, and the limited expressive power of simple linear classification heads. To address these issues, this paper proposes a new EEG emotion recognition method, termed MSCGC-KAN, which introduces a structured task head composed of multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping. Built on a pre-trained CBraMod backbone, MSCGC-KAN enhances downstream adaptation by jointly strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping within a compact task-specific head. This design preserves the representation advantage of the foundation model while making the classifier more sensitive to emotion-related spatiotemporal patterns. Extensive experiments are conducted on the public FACED and SEED-VII datasets. The proposed method achieves a balanced accuracy of 60.66%, a Cohen's Kappa of 0.5525, and a weighted F1-score of 60.40% on FACED, and obtains 33.27%, 0.2223, and 33.64%, respectively, on SEED-VII. Compared with the CBraMod+Linear baseline, the balanced accuracy is improved by 5.91 and 2.03 percentage points on the two datasets, respectively. These results indicate that structured task-head design is an effective way to improve EEG emotion recognition when fine-tuning pre-trained EEG models.
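
The reported gains imply the balanced accuracy of the CBraMod+Linear baseline, which the abstract does not state. The values below are inferred by subtraction, assuming the improvements are absolute differences in percentage points as stated; they may differ from the paper's tables by rounding.

```latex
\begin{aligned}
\text{FACED baseline:}\quad & 60.66\% - 5.91\ \text{p.p.} = 54.75\% \\
\text{SEED-VII baseline:}\quad & 33.27\% - 2.03\ \text{p.p.} = 31.24\%
\end{aligned}
```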

2605.26391 2026-05-28 cs.GR cs.CV 版本更新

Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing

Garment Particles: 一种用于生成和编辑的2D-3D对称服装表示

Kiyohiro Nakayama, I-Chao Shen, Ruofan Liu, Yiming Wang, Gordon Wetzstein, Takeo Igarashi

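Affiliations: Stanford University, USA; The University of Tokyo, Japan; Institute of Science Tokyo, Japan; ETH Zurich, Switzerland

AI summary: Proposes Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry. The Garment Particles Flow framework supports generation from high-level inputs and a variety of editing operations, achieving state-of-the-art garment generation results.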

AI中文摘要

实际服装设计跨越两种模式：从高级意图（如参考图像或文本描述）进行直观创建，以及在2D裁剪图和3D悬垂几何之间进行复杂的低级编辑，这需要专业培训才能驾驭其复杂的相互依赖性。然而，现有框架仅解决了这一挑战的一部分，提供了从随意输入生成服装或直接在裁剪图上编辑的功能。为了支持这两种需求，我们提出了Garment Particles，一种5D点云表示，联合编码2D裁剪图和3D几何。这种表示使得Garment Particles Flow（GPF）成为可能，这是一个整流流框架，支持从高级输入（文本、图像、草图）进行直观生成，并通过扩散后验采样对2D裁剪图和3D几何进行各种编辑操作。最后，我们引入了Particles-to-Pattern Flow，将生成的服装粒子转换为基于曲线的裁剪图以进行模拟。我们在多个数据集上验证了模型的生成能力，与竞争基线相比实现了最先进的服装生成结果。我们的模型还支持许多服装编辑场景，包括服装插值、裁剪图编辑、点云和轮廓条件服装生成。我们的项目网站位于 https://garment-particles.github.io。

英文摘要

Practical garment design spans two modes: intuitive creation from high-level intent, such as a reference image or text description, and complex low-level editing across 2D sewing patterns and 3D draped geometry, which requires professional training to navigate their interdependencies. Yet existing frameworks address only part of this challenge, offering either garment generation from casual inputs or direct editing on sewing patterns. To support both ends of the spectrum, we propose Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry. This representation enables Garment Particles Flow (GPF), a rectified flow framework that supports intuitive generation from high-level inputs (text, images, sketches) and various editing operations on 2D sewing patterns and 3D geometries via diffusion posterior sampling. Finally, we introduce Particles-to-Pattern Flow, which converts generated garment particles into curve-based patterns for simulation. We validate our model's generation ability on multiple datasets, achieving state-of-the-art garment generation results against competitive baselines. Our model also enables many garment editing scenarios, including garment interpolation, sewing pattern editing, and point-cloud- and silhouette-conditioned garment generation. Our project website is at https://garment-particles.github.io.

2605.25767 2026-05-28 cs.CV 版本更新

SAFE-Diff: Scale-Aware Attention and Feature-Dispersive Diffusion with Uncertainty Estimation for Contrast-Enhanced Breast MRI Synthesis

SAFE-Diff: 用于对比增强乳腺MRI合成的尺度感知注意力与特征分散扩散及不确定性估计

Tianyu Zhang, Xinglong Liang, Jarek van Dijk, Luyi Han, Chunyao Lu, Antonio Portaluri, Xinghe Xie, Yaofei Duan, Nika Rasoolzadeh, Xin Wang, Yuan Gao, Muzhen He, Yue Sun, Jonas Teuwen, Tao Tan, Ritse Mann

发表机构 * Department of Medical Imaging, Radboud University Medical Center（拉德堡德大学医学中心医学影像科）； Department of Radiology, Netherlands Cancer Institute（荷兰癌症研究所放射科）； Maastro Clinic（马斯垂克诊所）； Faculty of Applied Science, Macao Polytechnic University（澳门理工大学应用科学学院）； Department of Radiation Oncology, Netherlands Cancer Institute（荷兰癌症研究所放射肿瘤科）

AI总结提出SAFE-Diff模型，通过尺度感知注意力、特征分散扩散和不确定性估计，解决对比增强乳腺MRI合成中复杂病灶纹理和异质性增强模式的挑战。

Comments Early accepted by MICCAI 2026

2605.25378 2026-05-28 cs.CV cs.AI 版本更新

遮挡感知的物理-语义关键帧选择用于鲁棒视频编辑

Lin Liu, Zhihan Xiao, Haohang Xu, Rong Cong, Zhibo Zhang, Xiaopeng Zhang, Qi Tian

发表机构 * Huawei（华为）； Tsinghua University（清华大学）； East China Normal University（华东师范大学）

AI总结提出一种遮挡感知的物理-语义关键帧选择框架，通过从结构完整性、跟踪稳定性和属性可见性三个角度评估候选帧，自动选择最优锚定帧，并利用双向跟踪生成时空掩码，实现鲁棒且时序一致的视频编辑。

详情

AI中文摘要

近年来，基于扩散的生成模型在视频编辑领域取得了显著进展，能够根据自然语言指令实现多样化的对象级操作。然而，现有方法在遮挡、视角变化和快速物体运动场景下常常表现不佳，不可靠的视觉观测导致定位不准确、时间闪烁和编辑不一致。在本工作中，我们识别出缺乏可靠视觉锚点是遮挡鲁棒视频编辑的一个根本瓶颈。为解决此问题，我们提出了一种遮挡感知的物理-语义关键帧选择框架，该框架自动为下游编辑识别最优锚定帧。具体而言，我们的方法从三个互补角度评估候选帧：避免截断观测的结构完整性、衡量物理可靠性的循环一致跟踪稳定性、以及确保语义清晰性的基于视觉语言的属性可见性。选定的关键帧随后通过双向跟踪传播，生成密集的时空掩码，这些掩码作为扩散视频编辑骨干的辅助监督。通过将遮挡处理从显式重建转变为可靠锚点选择，我们的框架无需手动标注即可实现精确且时序一致的编辑。在具有挑战性的视频编辑基准上的大量实验证明了我们方法的有效性和高质量性能。

英文摘要

Video editing has recently achieved remarkable progress with diffusion-based generative models, enabling diverse object-level manipulations from natural language instructions. However, existing methods often struggle under occlusion, viewpoint changes, and fast object motion, where unreliable visual observations lead to inaccurate localization, temporal flickering, and inconsistent edits. In this work, we identify the absence of reliable visual anchors as a fundamental bottleneck in occlusion-robust video editing. To address this issue, we propose an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame for downstream editing. Specifically, our method evaluates candidate frames from three complementary perspectives: structural completeness for avoiding truncated observations, cycle-consistent tracking stability for measuring physical reliability, and vision-language-based attribute visibility for ensuring semantic clarity. The selected keyframe is then propagated through bidirectional tracking to generate dense spatiotemporal masks, which are used as auxiliary supervision for a diffusion-based video editing backbone. By transforming occlusion handling from explicit reconstruction into reliable anchor selection, our framework enables precise and temporally consistent editing without requiring manual annotations. Extensive experiments on challenging video editing benchmarks demonstrate the effectiveness and high-quality performance of our method.

2605.23137 2026-05-28 eess.IV cs.CV 版本更新

STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding

STAMBRIDGE：用于脑电视觉解码的谱时幅度感知中间特征桥

Jiahe Meng, Weiming Zeng, Yueyang Li, Bo Chai, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University（上海海事大学数字图像与智能计算实验室）； Department of Language Science and Technology, The Hong Kong Polytechnic University（香港理工大学语言科学与技术系）； Department of Neurology, Affiliated Lianyungang Hospital of Xuzhou Medical University（徐州医科大学附属连云港医院神经内科）； Institute of Computing and Intelligence, Harbin Institute of Technology Shenzhen（哈尔滨工业大学（深圳）计算与智能研究所）

AI总结提出STAMBRIDGE两阶段框架，通过谱时幅度感知调制（STAM）提取稳健脑电特征，并利用中间特征语义桥（MFSB）实现稳定的跨模态对齐，在THINGS-EEG基准上取得34.50% Top-1和65.95% Top-5的200路零样本检索准确率。

详情

AI中文摘要

脑电图（EEG）视觉解码由于低信噪比神经信号与高度结构化的视觉-语言空间之间的模态差距而仍然具有挑战性，使得直接的跨模态对齐不稳定。为了解决这个问题，我们提出了STAMBRIDGE，一个通用的两阶段框架，依次处理特征条件和跨模态对齐。首先，我们引入谱时幅度感知调制（STAM）来提取良好条件的EEG表示。通过用幅度导出的软通道权重和多尺度时间卷积替代硬频率掩蔽，STAM明确保留了频率感知的瞬态，同时降低了时域振铃伪影的风险。在这些稳健的神经特征基础上，我们进一步引入了一个模型无关的中间特征语义桥（MFSB），通过定向的跨模态交互构建一个正则化的中间空间，实现分阶段蒸馏和更稳定的语义对齐。在THINGS-EEG基准上的实验显示了具有竞争力的200路零样本检索性能，Top-1准确率为34.50%，Top-5准确率为65.95%。此外，STAMBRIDGE学习的嵌入使用扩散模型产生了语义连贯的图像重建，展示了稳健的EEG到视觉语义对齐。代码可在https://github.com/thabeatmjh/STAMBRIDGE获取。

英文摘要

Electroencephalography (EEG) visual decoding remains challenging due to the modality gap between low-SNR neural signals and highly structured vision-language spaces, making direct cross-modal alignment unstable. To address this, we propose STAMBRIDGE, a versatile two-stage framework that sequentially tackles feature conditioning and cross-modal alignment. First, we introduce a Spectral-Temporal Amplitude-aware Modulation (STAM) to extract well-conditioned EEG representations. By replacing hard frequency masking with amplitude-derived soft channel weighting and multi-scale temporal convolutions, STAM explicitly preserves frequency-aware transients while reducing the risk of time-domain ringing artifacts. Building upon these robust neural features, we further introduce a model-agnostic Mid-Feature Semantic Bridge (MFSB) that constructs a regularized intermediate space through directed cross-modal interactions, enabling staged distillation and more stable semantic alignment. Experiments on the THINGS-EEG benchmark show competitive 200-way zero-shot retrieval performance, with 34.50% Top-1 and 65.95% Top-5 accuracy. In addition, embeddings learned by STAMBRIDGE produce semantically coherent image reconstructions with a diffusion model, demonstrating robust EEG-to-vision semantic alignment. The code is available at: https://github.com/thabeatmjh/STAMBRIDGE.
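The Top-1 and Top-5 figures above are retrieval accuracies: each EEG query is scored against every candidate image, and a hit is counted when the true match ranks among the best k. A minimal sketch of that metric follows, assuming a precomputed similarity matrix in which candidate i is the correct match for query i; the function name and the toy scores are hypothetical and are not taken from the STAMBRIDGE code.

```python
from typing import Sequence


def topk_retrieval_accuracy(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose true match ranks within the top k candidates.

    similarity[i][j] is the score between query i (e.g. an EEG embedding) and
    candidate j (e.g. an image embedding). Assumption: candidate i is the
    correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3-way example (hypothetical scores, not results from the paper).
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.1],
]
top1 = topk_retrieval_accuracy(toy_similarity, k=1)
top2 = topk_retrieval_accuracy(toy_similarity, k=2)
```

In a 200-way setting the matrix would have 200 columns per query, so chance level is 0.5% for Top-1 and 2.5% for Top-5.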

2605.22547 2026-05-28 cs.CV cs.AI 版本更新

语义增强的潜在视觉推理

Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu

发表机构 * Department of Automation, Tsinghua University, Beijing, China（清华大学自动化系）； Department of Electronic Engineering, Tsinghua University, Beijing, China（清华大学电子工程系）； Zhongguancun Academy, Beijing, China（中关村学院）； China Agricultural University, Beijing, China（中国农业大学）； Peking University, Beijing, China（北京大学）； Beijing Institute of Technology, Beijing, China（北京理工大学）； Institute of Automation, Chinese Academy of Sciences, Beijing, China（中国科学院自动化研究所）； School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China（中国科学院大学人工智能学院）； WeChat Vision, Tencent Inc, Beijing, China（微信视觉，腾讯公司）

AI总结提出两阶段学习框架SLVR，通过属性级语义监督和多查询组相对策略优化增强潜在表示的语义丰富性，提升潜在视觉推理的鲁棒性和语义一致性。

详情

AI中文摘要

多模态潜在空间推理旨在通过在紧凑的潜在空间中直接进行视觉推理，来替代使用图像的显式思考。然而，现有方法主要依赖视觉监督，产生的潜在表示缺乏足够的语义丰富性，限制了它们支持多样化区域级推理任务的能力。在这项工作中，我们引入了语义增强的潜在视觉推理（SLVR），这是一个两阶段学习框架，用属性级视觉语义丰富潜在表示，并将其与多样化的推理目标对齐。在第一阶段，SLVR在细粒度属性监督下学习语义增强的区域中心潜在表示。在第二阶段，我们设计了多查询组相对策略优化（M-GRPO），以对齐基于同一区域的多个查询的潜在表示。为了支持这一框架，我们构建了SLV-Set，包含约40万条区域级属性标注和80万个多查询问答样本，并引入了SV-QA，一个评估语义变化下潜在推理的基准。实验表明，与现有基线相比，SLVR提高了潜在视觉推理的鲁棒性和语义一致性。

英文摘要

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

2605.15864 2026-05-28 cs.CV cs.CL 版本更新

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

VLMs 是在看还是只是在说？揭示视觉重新检查的幻觉

Chufan Shi, Cheng Yang, Yaokang Wu, Linghao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

发表机构 * University of Southern California（南加州大学）； University of California San Diego（加州大学圣地亚哥分校）； Carnegie Mellon University（卡内基梅隆大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结通过图像交换探测框架 VisualSwap 和 800 对图像基准 VS-Bench，发现视觉语言模型在推理时声称的“重新检查图像”多为文本模式，而非真正的视觉重新检查，且思考模型更易受影响，用户指令可恢复视觉基础但自我反思无效。

Comments ICML 2026 Oral

详情

AI中文摘要

视觉语言模型（VLM）在推理过程中经常产生自我反思的语句，如“让我再检查一下图片”。这样的语句是否触发了真正的视觉重新检查，还是仅仅是习得的文本模式？我们通过 VisualSwap（一种图像交换探测框架）对此进行研究：在模型对一张图像进行推理后，我们将其替换为视觉上相似但语义不同的图像，并测试模型是否注意到这一变化。我们引入了 VS-Bench，包含从 MathVista、MathVerse、MathVision 和 MMMU-Pro 中精选的 800 对图像。在 Qwen3-VL、Kimi-VL 和 ERNIE-VL 上的实验揭示了一个惊人的失败：模型绝大多数情况下忽略了图像交换，准确率下降高达 60%。与直觉相反，思考模型比其指令对应模型脆弱近 3 倍，且扩展规模无法缓解。多轮用户指令可以恢复视觉基础，但连续生成过程中自我生成的反思语句则不能。注意力分析解释了原因：用户指令显著提高了对视觉标记的注意力，而自我反思则没有。当前的 VLM 在声称执行视觉重新检查时倾向于“说”而非真正“看”。我们的代码和数据集可在项目页面获取：https://visualswap.github.io

英文摘要

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

2605.15523 2026-05-28 cs.CV 版本更新

Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

基于上下文学习的自提示扩散Transformer用于开放词汇场景文本编辑

Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao, Xiaochao Qu, Xinxiao Wu, Luoqi Liu, Ting Liu

发表机构 * MT Lab, Meitu Inc., Beijing, China； School of Computer Science & Technology, Beijing Institute of Technology, Beijing, China

AI总结提出一种自提示场景文本编辑方法，通过构建风格和字形提示，利用多模态扩散Transformer的上下文学习能力，实现开放词汇和风格一致的文本编辑。

Comments ICML 2026

详情

AI中文摘要

场景文本编辑旨在修改图像目标区域中的文本，同时保留周围的背景风格和纹理。现有方法仅依赖图像背景信息，而忽略了目标区域的视觉细节，这丢弃了原始文本中的风格特征，本质上将任务降级为文本渲染。此外，预训练的字形编码器施加的条件限制了可编辑文本的范围。为了解决这些问题，本文提出了一种自提示场景文本编辑方法，直接从原始图像构建风格和字形提示，无需引入额外的风格或字形编码器。我们采用两阶段训练策略：扩散Transformer首先在大规模自监督数据上训练，然后使用少量配对图像进行微调。通过利用多模态扩散Transformer（MM-DiT）的上下文学习能力，它实现了开放词汇和风格一致的文本编辑。在各种语言上的实验结果表明，我们的方法在文本准确性和风格一致性方面均达到了最先进的性能。我们的项目页面：hongxiii.github.io/mstedit。

英文摘要

Scene text editing aims to modify text in a target region of an image while preserving the surrounding background style and texture. Existing methods rely solely on image background information while neglecting the visual details of target regions, which discards stylistic features in the original text and essentially degrades the task to text rendering. Moreover, the conditions imposed by a pre-trained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), it achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Our project page: hongxiii.github.io/mstedit.

2605.13517 2026-05-28 cs.CV cs.AI cs.LG 版本更新

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

ArcVQ-VAE：一种带有反余弦加性边界的球面向量量化框架

Jaeyung Kim, YoungJoon Yoo

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea（韩国首尔中央大学人工智能系）； SNUAILAB, Seoul, Republic of Korea（韩国首尔 SNUAILAB）

AI总结针对VQ-VAE有限码本容量限制表示能力的问题，提出ArcVQ-VAE框架，通过引入球面角边先验（包括球界范数正则化和反余弦加性边界损失）增强潜在表示的判别性和均匀分散性，提升码本利用率，在图像重建和生成任务上取得竞争性能。

Comments To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

向量量化变分自编码器（VQ-VAE）已成为图像建模中学习离散表示的基本框架。然而，VQ-VAE模型必须使用有限的码本向量集对整张图像进行分词，这种容量限制限制了其捕获丰富多样表示的能力。在本文中，我们提出反余弦加性边界VQ-VAE（ArcVQ-VAE），一种新颖的向量量化框架，该框架为传统VQ-VAE的码本引入了球面角边先验（SAMP）。所提出的SAMP由球界范数正则化（将所有码本向量约束在时间相关的欧几里得球内）和反余弦加性边界损失（鼓励潜在向量之间更大的角度可分性）组成。这种公式在受限空间内促进了更具判别性和均匀分散的潜在表示，从而提高了有效的潜在空间覆盖范围，并导致码本利用率提升。在标准图像重建和生成任务上的实验结果表明，ArcVQ-VAE在重建精度、表示多样性和样本质量方面与基线模型相比取得了竞争性能。代码可在 https://github.com/goals4292/ArcVQ-VAE 获取。

英文摘要

Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ-VAE

2506.22726 2026-05-28 cs.CV cs.LG 版本更新

XTransfer: Modality-Agnostic Few-Shot Model Transfer for Human Sensing at the Edge

XTransfer: 面向边缘人体感知的模态无关小样本模型迁移

Yu Zhang, Xi Zhang, Hualin Zhou, Xinyuan Chen, Shang Gao, Hong Jia, Jianfei Yang, Yuankai Qi, Tao Gu

发表机构 * Macquarie University, Sydney, NSW, Australia（麦考瑞大学，悉尼，新南威尔士州，澳大利亚）； Nanyang Technological University, Singapore（南洋理工大学，新加坡）； The University of Auckland, Auckland, New Zealand（奥克兰大学，奥克兰，新西兰）

AI总结提出XTransfer方法，通过模型修复和层重组实现模态无关的小样本模型迁移，降低传感器数据收集、模型训练和边缘部署成本。

Comments Accepted at ICML 2026

详情

Journal ref: Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, 6-11 July 2026

AI中文摘要

边缘系统上用于人体感知的深度学习具有巨大的智能应用潜力。然而，其训练和开发受到传感器数据有限和边缘系统资源约束的限制。虽然将预训练模型迁移到不同的感知应用很有前景，但现有方法通常需要大量的传感器数据和计算资源，导致成本高且可迁移性有限。在本文中，我们提出了XTransfer，这是一种首创的方法，实现了模态无关、小样本模型迁移，并具有资源高效的设计。XTransfer通过以下方式灵活地使用预训练模型并在不同模态间迁移知识：(i) 模型修复，通过仅使用少量传感器数据适配预训练层来安全地缓解模态偏移；(ii) 层重组，以逐层方式高效地搜索和重组源模型中的感兴趣层以重构模型。我们在跨不同模态的多种人体感知数据集上对各种基线进行了基准测试。结果表明，XTransfer实现了最先进的性能，同时显著降低了传感器数据收集、模型训练和边缘部署的成本。

英文摘要

Deep learning for human sensing on edge systems presents significant potential for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. While transferring pre-trained models to different sensing applications is promising, existing methods often require extensive sensor data and computational resources, resulting in high costs and limited transferability. In this paper, we propose XTransfer, a first-of-its-kind method enabling modality-agnostic, few-shot model transfer with resource-efficient design. XTransfer flexibly uses pre-trained models and transfers knowledge across different modalities by (i) model repairing, which safely mitigates modality shift by adapting pre-trained layers with only a small amount of sensor data, and (ii) layer recombining, which efficiently searches and recombines layers of interest from source models in a layer-wise manner to restructure models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. The results show that XTransfer achieves state-of-the-art performance while significantly reducing the costs of sensor data collection, model training, and edge deployment.

2605.12929 2026-05-28 cs.CV cs.AI 版本更新

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

Anatomy-Slot: 用于视网膜诊断中同源双侧推理的无监督解剖分解

Yingzhe Ma, Xiao Yang, Yuguo Yin, Zheyu Wang

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； Peking University（北京大学）

AI总结提出Anatomy-Slot方法，通过无监督解剖瓶颈分解斑块令牌为结构一致的解剖区域槽，并利用双向交叉注意力对齐双眼槽，在ODIR-5K上相比ViT-L基线提升AUC 4.2点，验证了显式结构对应改善诊断的假设。

Comments 15 pages, 3 figures

详情

AI中文摘要

视网膜诊断本质上是双侧的：临床医生比较双眼的同源结构（例如，视盘不对称），然而大多数深度模型基于单眼表示。我们研究显式结构对应是否改善诊断，并提出Anatomy-Slot来操作化这一假设。Anatomy-Slot通过将斑块令牌分解为一组涌现的、结构一致的槽（对应于解剖区域）来引入无监督解剖瓶颈，然后通过双向交叉注意力对齐双眼的槽。在ODIR-5K上使用$n=10$个种子，该方法相比匹配的ViT-L基线在AUC上提升$4.2$个点（95%置信区间；Wilcoxon符号秩检验，$W=0$，$p=0.002$）。配对破坏和高斯噪声下的压力测试提供了对应依赖性和鲁棒性的受控测试。我们进一步在REFUGE上报告了定量视盘定位和交叉注意力定位分析。除了报告的性能提升外，这些结果表明，以对象为中心的解剖对应为与临床双侧比较一致的可解释诊断系统提供了一条原则性路径。

英文摘要

Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into a set of emergent, structurally-coherent slots that correspond to anatomical regions, then aligning these slots across eyes via bidirectional cross-attention. On ODIR-5K with $n=10$ seeds, the method improves AUC by $4.2$ points over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, $W=0$, $p=0.002$). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis. Beyond the reported gains, these results indicate that object-centric anatomical correspondence offers a principled path toward interpretable diagnostic systems aligned with clinical bilateral comparison.
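The reported statistics ($n=10$, $W=0$, $p=0.002$) are consistent with every one of the 10 seeds favoring the same method. As a worked check, the sketch below enumerates all sign assignments of the ranks to compute an exact two-sided Wilcoxon signed-rank p-value. It assumes an exact two-sided test with no ties and no zero differences; the abstract does not state which variant was used, and the function name is hypothetical.

```python
from itertools import product


def wilcoxon_exact_two_sided_p(n: int, w_observed: int) -> float:
    """Exact two-sided p-value for W = min(W+, W-) under the null hypothesis.

    Assumes n paired differences with distinct nonzero magnitudes, so the
    ranks are 1..n and each sign pattern is equally likely under H0.
    """
    rank_total = n * (n + 1) // 2
    extreme = 0
    for signs in product((0, 1), repeat=n):
        # W+ is the sum of ranks attached to positive differences.
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w_plus, rank_total - w_plus) <= w_observed:
            extreme += 1
    return extreme / 2 ** n


# With 10 seeds and W = 0, only the two all-same-sign patterns are as extreme.
p_value = wilcoxon_exact_two_sided_p(10, 0)  # 2 / 1024 = 0.001953125
```

This is the smallest two-sided p-value attainable with 10 paired runs, which rounds to the reported 0.002.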

2605.11755 2026-05-28 cs.LG cs.CV stat.ML 版本更新

One-Step Generative Modeling via Wasserstein Gradient Flows

通过Wasserstein梯度流的一步生成建模

Jiaqi Han, Puheng Li, Qiushan Guo, Renyuan Xu, Stefano Ermon, Emmanuel J. Candès

发表机构 * Stanford University（斯坦福大学）； ByteDance（字节跳动）

AI总结提出W-Flow框架，通过Wasserstein梯度流将参考分布到目标分布的演化压缩为一步生成，结合Sinkhorn散度实现高效最优传输，在ImageNet 256×256上达到1.29 FID且采样速度提升约100倍。

Comments 40 pages, 14 figures

详情

AI中文摘要

扩散模型和基于流的方法展现了令人印象深刻的生成能力，尤其对于图像，但其采样成本高昂，因为需要多次迭代更新。我们引入了W-Flow，一个训练生成器的框架，该生成器在单步中将来自简单参考分布的样本转换为来自目标数据分布的样本。这通过两步实现：首先，通过最小化能量泛函的Wasserstein梯度流，定义从参考分布到目标分布的演化；其次，训练一个静态神经生成器将此演化压缩为一步生成。我们用Sinkhorn散度实例化能量泛函，该散度产生一种高效的基于最优传输的更新规则，捕获全局分布差异并改善目标分布的覆盖。我们进一步证明了在适当假设下，有限样本训练动力学收敛到连续时间分布动力学。实验上，W-Flow为一步ImageNet 256×256生成设立了新的最先进水平，实现了1.29 FID，并改善了模式覆盖和域迁移。与具有相似FID分数的多步扩散模型相比，我们的方法实现了约100倍的采样加速。这些结果表明，Wasserstein梯度流为快速且高保真的生成建模提供了原则性和有效的基础。

英文摘要

Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256×256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100× faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
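For readers unfamiliar with the energy functional named above, the standard definition of the Sinkhorn divergence between two distributions $\mu$ and $\nu$ is reproduced here as background. The cost function $c$, the regularization strength $\varepsilon$, and any normalization used in W-Flow are not given in the abstract, so the symbols below are assumptions drawn from the general optimal-transport literature rather than from the paper.

```latex
\[
\mathrm{OT}_\varepsilon(\mu,\nu)
  = \min_{\pi \in \Pi(\mu,\nu)} \int c(x,y)\,\mathrm{d}\pi(x,y)
  + \varepsilon\,\mathrm{KL}\bigl(\pi \,\|\, \mu \otimes \nu\bigr)
\]
\[
S_\varepsilon(\mu,\nu)
  = \mathrm{OT}_\varepsilon(\mu,\nu)
  - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\mu,\mu)
  - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\nu,\nu)
\]
```

Here $\Pi(\mu,\nu)$ is the set of couplings with marginals $\mu$ and $\nu$. The two self-transport terms remove the entropic bias, so that $S_\varepsilon(\mu,\mu) = 0$.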

2605.10583 2026-05-28 cs.CV 版本更新

FrequencyCT: Frequency Domain Self-supervised Low-dose CT Denoising

FrequencyCT：频域自监督低剂量CT去噪

Guoquan Wei, Liu Shi, Chong Chen, Qiegen Liu

发表机构 * School of Information Engineering, Nanchang University（南昌大学信息工程学院）； SKLMS, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences（中国科学院数学与系统科学研究院SKLMS、ICMSEC）

AI总结提出FrequencyCT，一种在频域中利用噪声与真实信号分布差异生成伪样本的零样本自监督方法，用于低剂量CT去噪，并通过数据截断稳定优化，实验验证了其临床潜力。

详情

AI中文摘要

尽管对计算机断层扫描（CT）去噪进行了广泛研究，但很少有研究利用投影域数据特性来减轻噪声相关性。为填补这一空白，本文提出FrequencyCT，这是第一种在频域中为低剂量CT去噪生成伪样本的零样本自监督方法。具体而言，通过利用噪声和真实信号在频域分布上的差异，提出了一种区域低频锚定技术。对高频区域应用相位保持噪声和掩膜扰动，生成用于自监督的伪样本。基于含噪投影的噪声方差与底层真实信号之间的指数相关性，对生成的样本进行一致的数据截断，以稳定优化梯度。在多个公开和真实数据集上的评估结果证实了本研究的临床应用潜力，为去噪领域提供了创新视角。代码可在 https://github.com/yqx7150/FrequencyCT 获取。

英文摘要

Despite extensive research on computed tomography (CT) denoising, few studies exploit projection-domain data characteristics to mitigate noise correlation. To bridge this gap, this work proposes FrequencyCT, the first zero-shot self-supervised method for pseudo-sample generation in the frequency domain for low-dose CT denoising. Specifically, by exploiting the distinct frequency-domain distributions of noise and true signal, a regional low-frequency anchoring technique is proposed. Applying phase-preserving noise and mask perturbations to the high-frequency region generates pseudo-samples for self-supervision. Driven by the exponential correlation between noise variance of noisy projections and the underlying true signal, consistent data truncation is applied to the generated samples to stabilize optimization gradients. Evaluation results on multiple public and real datasets confirm the clinical application potential of this research, which provides an innovative perspective for the field of denoising. The code is available at: https://github.com/yqx7150/FrequencyCT.

2605.10581 2026-05-28 cs.CV 版本更新

Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention

Polygon-mamba: 使用多边形扫描Mamba和空间-频率协同注意力进行视网膜血管分割

Yuanyuan Peng, Wen Li

发表机构 * School of Electrical and Automation Engineering, East China Jiaotong University（华东交通大学电气与自动化工程学院）

AI总结针对视网膜小血管分割难题，提出一种混合CNN-Mamba融合网络，通过多边形扫描Mamba和空间-频率协同注意力机制，有效保留像素连通性并增强关键特征，在三个公开数据集上取得优异性能。

详情

AI中文摘要

视网膜血管分割对于眼部疾病的诊断和评估至关重要。值得注意的是，小血管的分割一直被认为是一项具有挑战性和复杂性的任务。为了应对这一挑战，我们设计了一种混合CNN-Mamba融合网络，该网络集成了多边形扫描Mamba和空间-频率协同注意力机制，用于检测小血管。考虑到传统的水平-垂直扫描Mamba架构可能会破坏目标结构的拓扑完整性，并导致视网膜小血管的局部不连续性，我们提出了一种多边形扫描视觉状态空间模型（PS-VSS），通过多层反向扫描方式识别小血管结构特征。该方法有效保留了像素连通性，从而显著减轻了小血管信息的丢失。此外，众所周知，空间域优先考虑位置和结构信息，而频率域强调全局感知和局部细节成分，我们在跳跃连接中引入了空间-频率协同注意力机制（SFCAM），以从空间域和频率域提取高效特征。该策略使模型能够动态增强关键特征，同时有效抑制杂乱信息。为了评估模型的有效性，我们在三个公开数据集：DRIVE、STARE和CHASE_DB1上进行了测试。与手动标注相比，我们的模型在三个数据集上的F1分数分别为0.8283、0.8282和0.8251，曲线下面积（AUC）值分别为0.9806、0.9840和0.9866，灵敏度（SE）值分别为0.8268、0.8314和0.8484。通过视觉检查和定量分析验证了模型的有效性。

英文摘要

Retinal vessel segmentation is crucial for the diagnosis and assessment of ocular diseases. Segmenting small retinal vessels, in particular, has long been recognized as a challenging task. To tackle this challenge, we design a hybrid CNN-Mamba fusion network that integrates a polygon-scanning Mamba and a space-frequency collaborative attention mechanism for detecting small vessels. Because the traditional Mamba architecture with horizontal-vertical scanning may compromise the topological integrity of target structures and cause local discontinuities in small retinal vessels, we present a polygon-scanning visual state space model (PS-VSS) that identifies small-vessel structural features through multi-layer reverse scanning. This design effectively preserves pixel connectivity, thereby substantially mitigating the loss of small-vessel information. Furthermore, since the spatial domain prioritizes positional and structural information while the frequency domain emphasizes global perception and local detail components, a space-frequency collaborative attention mechanism (SFCAM) is introduced within the skip connection to extract efficient features from both domains. This strategy enables the model to dynamically enhance key features while effectively suppressing clutter. To assess its efficacy, the model was tested on three publicly available datasets: DRIVE, STARE, and CHASE_DB1. Evaluated against manual annotations, our model achieved F1 scores of 0.8283, 0.8282, and 0.8251, Area Under the Curve (AUC) values of 0.9806, 0.9840, and 0.9866, and Sensitivity (SE) values of 0.8268, 0.8314, and 0.8484 on the three datasets, respectively. The effectiveness of our model was validated through both visual inspection and quantitative analysis.
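The F1 and Sensitivity values quoted above are pixel-level scores computed against manual annotations. A minimal sketch of how such scores are typically derived from a binary prediction mask follows, assuming both masks are flattened to lists of 0/1 labels; the function name and toy masks are hypothetical, and the paper's own evaluation protocol (thresholding, field-of-view masking) is not described in the abstract.

```python
from typing import Dict, Sequence


def segmentation_scores(pred: Sequence[int], target: Sequence[int]) -> Dict[str, float]:
    """Pixel-level F1 and sensitivity for binary masks (1 = vessel, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    # Sensitivity (recall): share of annotated vessel pixels that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall, written in terms of counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"f1": f1, "sensitivity": sensitivity}


# Toy flattened masks (hypothetical, five pixels).
scores = segmentation_scores(pred=[1, 1, 0, 0, 1], target=[1, 0, 0, 1, 1])
```

Because thin vessels contribute few pixels, sensitivity is the score most affected by missed small vessels, which is why it is reported alongside F1 and AUC.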

2508.11011 2026-05-28 cs.CV 版本更新

Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

大型预训练视觉语言模型能否成为有效的施工安全检查员？

Xuezheng Chen, Zhengbo Zou

发表机构 * Mechanical Engineering（机械工程）

AI总结本文提出ConstructionSite 10k数据集，包含1万张施工图像及三项任务标注，评估大型预训练视觉语言模型在零样本和小样本下的泛化能力，为施工安全检查提供基准。

详情

DOI: 10.1017/dce.2026.10044

AI中文摘要

施工安全检查通常涉及人类检查员在现场识别安全问题。随着强大的视觉语言模型（VLM）的兴起，研究人员正在探索将其用于从现场图像中检测安全违规等任务。然而，目前缺乏公开数据集来全面评估和进一步微调VLM在施工安全检查中的应用。当前VLM的应用使用小型监督数据集，限制了它们在未直接训练的任务中的适用性。在本文中，我们提出了ConstructionSite 10k数据集，包含10,000张施工场地图像，并为三个相互关联的任务提供标注，包括图像描述、安全违规视觉问答（VQA）和施工元素视觉定位。随后我们对当前最先进的大型预训练VLM的评估显示，它们在零样本和小样本设置下具有显著的泛化能力，但需要额外训练才能应用于实际施工场地。该数据集允许研究人员使用新的架构和技术训练和评估自己的VLM，为施工安全检查提供了有价值的基准。

英文摘要

Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.

2508.05417 2026-05-28 cs.CV 版本更新

Smoothing Slot Attention Iterations and Recurrences

平滑槽注意力迭代与循环

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

发表机构 * Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland（芬兰埃斯波阿尔托大学电气工程与自动化系）； Department of Computer Science, Aalto University, Espoo, Finland（芬兰埃斯波阿尔托大学计算机科学系）； Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland（芬兰奥卢大学机器视觉与信号分析中心）

AI总结针对槽注意力在图像首帧冷启动查询缺乏样本特异性及视频帧间聚合变换同质化的问题，提出SmoothSA方法，通过预热冷启动查询和差异化迭代次数来平滑迭代与循环，提升目标发现、识别与推理性能。

Comments Accepted to ICML 2026

详情

AI中文摘要

槽注意力（SA）是主流面向对象学习（OCL）的核心。图像特征可以通过SA迭代地细化冷启动查询槽来聚合成对象级表示。对于视频，这种聚合通过SA在帧间共享的循环进行，查询在第一帧冷启动，之后从上一帧的槽过渡。然而，冷启动查询缺乏样本特异性，从而阻碍了图像或视频第一帧的精确聚合；非第一帧的查询已经具有样本特异性，因此需要与第一帧不同的聚合变换。我们通过SmoothSA解决这些问题：（1）为了平滑图像或视频第一帧上的SA迭代，我们通过OCL内部自蒸馏的微型模块预热冷启动查询，使其具有丰富的输入特征信息；（2）为了平滑视频第一帧和非第一帧之间的SA循环，我们分别使用完整迭代和单次迭代来区分同质的聚合变换。在目标发现、识别和视觉推理上的综合实验验证了我们方法的有效性。进一步的视觉分析阐明了其潜在机制。我们的源代码、模型检查点和训练日志可在https://github.com/Genera1Z/SmoothSA获取。

英文摘要

Slot Attention (SA) lies at the heart of mainstream Object-Centric Learning (OCL). Image features can be aggregated into object-level representations by SA iteratively refining cold-start query slots. For video, such aggregation proceeds by SA recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots thereafter. However, cold-start queries lack sample-specific cues, which hinders precise aggregation on an image or a video's first frame; queries on non-first frames are already sample-specific and therefore require aggregation transforms different from those of the first frame. We address these issues with SmoothSA: (1) to smooth SA iterations on an image or a video's first frame, we preheat cold-start queries with rich input-feature information, using a tiny module self-distilled inside OCL; (2) to smooth SA recurrences across a video's first and non-first frames, we differentiate the homogeneous aggregation transforms by using full and single iterations, respectively. Comprehensive experiments on object discovery, recognition, and visual reasoning validate our method's effectiveness. Further visual analyses illuminate the underlying mechanisms. Our source code, model checkpoints, and training logs are provided at https://github.com/Genera1Z/SmoothSA.

2604.25491 2026-05-28 cs.CV cs.AI 版本更新

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

水印移除的法医成本：从专用攻击到图像编辑

Gautier Evennou, Ewa Kijak

发表机构 * IMATAG； IRISA, Univ. Rennes, INRIA, CNRS（IRISA，雷恩大学、INRIA、CNRS）

AI总结本文提出水印移除检测（WRD）作为新评估维度，通过训练分类器检测移除痕迹，在10^{-3}假阳性率下实现最优检测，证明法医隐蔽性是水印移除的必要条件。

Comments v1: The Forensic Cost of Watermark Removal, accepted at IH&MMSEC 2026, Special Session "Watermarking Across the Lifecycle of Generative Models". v2: extended version, under review

详情

AI中文摘要

当前水印移除方法在两个轴上进行评估：攻击成功率和感知质量。我们证明这是不够的。虽然最先进的攻击成功地在没有可见失真的情况下降低了水印信号，但它们留下了明显的统计伪影，暴露了移除尝试。我们将这个被忽视的轴命名为水印移除检测（WRD），并证明基于这些伪影训练的现代分类器在10^{-3}假阳性率下，对每种测试的移除方法都达到了最先进的检测率。没有现有的攻击考虑到这种法医泄漏。我们在扩展的评估三元组（攻击成功率、感知质量和法医可检测性）下，对领先的水印方案与标准移除流水线进行了基准测试，发现当前没有方法能平衡所有三个。我们的结果确立了法医隐蔽性作为水印移除的必要要求。

英文摘要

Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.
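
A detection rate "at $10^{-3}$ FPR" means the decision threshold is calibrated on untouched images so that at most one in a thousand is falsely flagged, and the detection rate is then read off on images that went through a removal attack. A minimal sketch of that computation, assuming a detector that outputs a scalar score per image (the function and variable names are illustrative, not from the paper):

```python
def tpr_at_fpr(clean_scores, attacked_scores, target_fpr=1e-3):
    """Detection rate on attacked images at a threshold set on clean ones."""
    ranked = sorted(clean_scores, reverse=True)
    # Number of clean images that may be flagged within the FPR budget
    # (small epsilon guards against floating-point round-off).
    allowed = int(target_fpr * len(ranked) + 1e-9)
    # Flag only scores strictly above the (allowed + 1)-th largest clean score.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(score > threshold for score in attacked_scores)
    return flagged / len(attacked_scores), threshold
```

With a budget this small, the clean calibration set needs at least a thousand images for the threshold to allow any false positive at all.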

2604.23282 2026-05-28 cs.CV cs.MM 版本更新

Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

弥合姿态-语义鸿沟：基于文本的人物异常搜索的级联框架

Zequn Xie, Guijin Luo, Chuxin Wang, Sihang Cai, Tao Jin, Zhou Zhao, Yixuan Tang

发表机构 * Zhejiang University（浙江大学）； National University of Singapore（新加坡国立大学）

AI总结提出结构-语义解耦级联（SSDC）框架，通过两阶段检索（结构感知粗检索和多智能体语义验证）平衡效率与语义推理，在PAB基准上达到最优性能。

Comments Accepted to ACL 2026. 10 pages, 5 figures

AI中文摘要

基于文本的人物异常搜索利用自然语言查询从监控档案中检索特定行为事件。尽管最近的姿态感知方法能够很好地对齐几何结构，但它们面临一个根本性的姿态-语义鸿沟：语义不同的动作可能共享相似的骨骼几何结构。虽然多模态大语言模型（MLLMs）可以减少这种歧义，但将其用于大规模检索在计算上代价高昂。我们提出了结构-语义解耦级联（SSDC）框架，将检索解耦为两个阶段：（1）结构感知粗检索，其中轻量级模型通过骨骼相似性快速筛选候选；（2）侦探小组交互，一个多智能体语义验证模块。该小组包括一个用于快速二元过滤的侦探、一个用于证据提取的分析师和一个用于语义合成的写手。最后，通过将合成描述与结构先验融合，对候选进行重新排序。在PAB基准上的实验表明，SSDC通过平衡效率和语义推理实现了最先进的性能。

英文摘要

Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
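
The two-stage cascade in this abstract follows a common retrieval pattern: score everything cheaply, then spend the expensive model only on a shortlist and fuse both scores for the final ranking. The sketch below illustrates that pattern only; `coarse_score`, `verify`, `semantic_score`, `top_k` and the linear fusion weight `alpha` are all hypothetical placeholders rather than components of SSDC.

```python
def cascade_search(query, gallery, coarse_score, verify, semantic_score,
                   top_k=3, alpha=0.5):
    """Coarse-to-fine retrieval with score fusion on the shortlist."""
    # Stage 1: cheap structural score over the whole gallery, keep top_k.
    scored = [(coarse_score(query, item), item) for item in gallery]
    shortlist = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

    # Stage 2: expensive checks run only on the shortlist.
    reranked = []
    for structural, item in shortlist:
        if not verify(query, item):  # fast binary filter
            continue
        semantic = semantic_score(query, item)  # e.g. caption-to-query match
        fused = alpha * structural + (1 - alpha) * semantic
        reranked.append((fused, item))

    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in reranked]
```

The cost of the second stage scales with `top_k` rather than with the gallery size, which is what makes an expensive verifier affordable.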

2601.11632 2026-05-28 cs.CV 版本更新

KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering

KG-ViP：在多模态大语言模型中桥接知识基础与视觉感知以进行视觉问答

Zhiyang Li, Ao Ke, Yukun Cao, Xike Xie

发表机构 * University of Science and Technology of China（中国科学技术大学）； Data Darkness Lab, MIRACLE Center, USTC（数据黑暗实验室，MIRACLE中心，中国科学技术大学）； School of Computer Science and Technology, Xidian University（西安电子科技大学计算机科学与技术学院）

AI总结提出KG-ViP框架，通过检索与融合场景图和常识图，统一外部知识与细粒度视觉细节，缓解多模态大语言模型在视觉问答中的知识幻觉和视觉感知不足问题。

AI中文摘要

用于视觉问答（VQA）的多模态大语言模型（MLLMs）通常面临双重限制：知识幻觉和细粒度视觉感知不足。关键的是，我们发现常识图和场景图通过提供丰富的外部知识和捕捉细粒度视觉细节，恰好为这些缺陷提供了互补的解决方案。然而，先前的工作通常孤立地处理它们，忽视了它们的协同潜力。为了弥合这一差距，我们提出了KG-ViP，一个统一的框架，通过融合场景图和常识图来增强MLLMs。KG-ViP框架的核心是一个新颖的检索与融合流程，利用查询作为语义桥逐步整合两种图，合成统一的结构化上下文，促进可靠的多模态推理。在FVQA 2.0+和MVQA基准上的大量实验表明，KG-ViP显著优于现有的VQA方法。

英文摘要

Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.

2604.17110 2026-05-28 cs.CV 版本更新

From Clinical Intent to Clinical Model: Autonomous Coding-Agents for Clinician-driven AI Development

从临床意图到临床模型：面向临床医生驱动AI开发的自主编码代理

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Mathis Bode, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

AI总结提出一种自主编码代理系统，允许临床医生用自然语言描述任务，系统自动生成并迭代优化模型，在五项临床任务中达到竞争性能，并显著减少胸部X光片模型对胸腔引流管的依赖。

Comments Code is available at https://github.com/zhaozh10/clinical-automata/

AI中文摘要

开发在临床实践中有用的AI模型需要临床医生和AI开发者之间的高效协作。这带来了一个实际挑战：临床医生必须反复与AI开发者沟通并完善其需求，然后这些需求才能转化为可执行的模型开发。这种迭代过程耗时，即使经过反复讨论，由于双方未能完全共享专业知识，仍可能存在不一致。编码代理可能有助于弥合这一差距。它们可以自主编写和优化代码，并具备医学和AI的工作知识，以理解医学专家和开发者制定的命令。我们提出了一个原型，让临床医生直接驱动AI开发。临床医生用自然语言描述任务，系统将描述转化为可工作的流程，通过与临床医生一起反复实验进行优化，并返回满足既定临床目标的模型。在五项临床任务中，该系统可靠地生成了与临床医生请求匹配且达到竞争性能的模型。最值得注意的是，在胸部X光片上，该系统显著减少了模型对胸腔引流管的依赖（气胸分类的已知捷径），在一个数据集上从60%降至31%，在另一个数据集上从50%降至18%。我们的结果表明，编码代理可以将临床AI开发转向更以临床医生驱动的模式，使领域专家能够直接塑造模型，而不是通过专门的AI团队传递需求。

英文摘要

Developing AI models that are useful in clinical practice requires efficient collaboration between clinicians and AI developers. This poses a practical challenge: clinicians must repeatedly communicate and refine their requirements with AI developers before those requirements can be translated into executable model development. This iterative process is time-consuming, and even after repeated discussion, misalignment may still exist because the two sides do not fully share each other's expertise. Coding agents may help close this gap. They can write and refine code on their own, and they carry working knowledge of both medicine and AI to understand commands formulated by both medical experts and developers. We present a prototype that lets clinicians drive AI development directly. A clinician describes the task in plain language, and the system turns the description into a working pipeline, refines it through repeated experiments together with the clinician, and returns a model that meets the stated clinical objective. Across five clinical tasks, the system reliably produced models that matched the clinician's request and reached competitive performance. Most notably, on chest radiographs the system sharply reduced the model's reliance on chest drains, a well-known shortcut for pneumothorax classification, from 60% to 31% on one dataset and from 50% to 18% on another. Our results suggest that coding agents can shift clinical AI development toward a more clinician-driven mode, allowing domain experts to shape models directly instead of relaying requirements through specialized AI teams.

2506.01247 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

超越可解释性：稀疏自编码器何时、为何以及如何实现无标签视觉引导

Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

发表机构 * Department of Computer Science, Rutgers University（罗格斯大学计算机科学系）； Department of Statistics, Rutgers University（罗格斯大学统计系）

AI总结本文提出无标签视觉稀疏引导方法VS2，通过训练稀疏自编码器并利用其重构误差和稀疏特征放大来引导冻结的视觉语言模型，在九个图像分类数据集上提升零样本准确率。

AI中文摘要

稀疏自编码器（SAE）越来越多地被用于解释基础模型，但它们作为可操作干预空间的作用仍不太被理解，尤其是在视觉领域。我们研究稀疏视觉特征是否不仅可用于事后分析，还可用于引导冻结的视觉语言模型。我们引入视觉稀疏引导（VS2），一种无标签方法，它在冻结的CLIP图像编码器的无标签激活上训练一个top-$k$ SAE，并在测试时通过放大输入的活跃稀疏特征并解码诱导的变化来构建一个可解释的引导向量。我们证明该过程可分解为质心偏差引导：每个输入沿着其与SAE学习到的质心的偏差移动。残差项由SAE的每样本重构误差（通过FVU测量）精确控制，从而产生基于FVU的残差界限，并促使在SAE重构不可靠时回退到零样本CLIP的可靠性门控。通过使用在无标签CLIP图像编码器激活上训练的目标域SAE，VS2在九个图像分类数据集上提高了零样本准确率，在推理计算量增加不到0.1%的情况下实现了高达+4.12%的提升。最后，一项受控的上界研究VS2++表明，选择性放大稀疏特征可带来高达+21.44%的提升，揭示了一个重构与任务显著性的差距：对重构显著的稀疏特征不一定与对下游预测有用的特征一致。

英文摘要

Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains less understood, especially in vision. We study whether sparse visual features can be used not only for post-hoc analysis, but also to steer frozen vision-language models. We introduce Visual Sparse Steering (VS2), a label-free method that trains a top-$k$ SAE on unlabeled activations from a frozen CLIP image encoder and, at test time, constructs an interpretable steering vector by amplifying the input's active sparse features and decoding the induced change. We show that this procedure admits a closed-form decomposition as centroid-deviation steering: each input is moved along its deviation from the SAE-learned centroid. The residual term is controlled exactly by the SAE's per-sample reconstruction error, measured by FVU, yielding an FVU-based residual bound and motivating a reliability gate that falls back to zero-shot CLIP when SAE reconstruction is unreliable. With target-domain SAEs trained on unlabeled CLIP image-encoder activations, VS2 improves zero-shot accuracy across nine image-classification datasets, achieving gains up to $+4.12\%$ with less than $0.1\%$ additional inference compute. Finally, a controlled upper-bound study, VS2++, shows that selective amplification of sparse features can yield gains up to $+21.44\%$, exposing a reconstruction-vs-task saliency gap: features salient for reconstruction need not align with features useful for downstream prediction.
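
One way to see why amplifying active features amounts to centroid-deviation steering is the short derivation below. It is a sketch under stated assumptions, not the paper's exact statement: the decoder is taken to be linear, $D(z) = Wz + b$, the bias $b$ plays the role of the learned centroid, $z$ is the top-$k$ code of the input $x$, $\hat{x} = D(z)$ is its reconstruction, and $\gamma > 1$ is a hypothetical amplification factor.

```latex
% Assumptions (not taken from the paper): linear decoder D(z) = W z + b,
% top-k code z, reconstruction \hat{x} = D(z), amplification \gamma > 1.
\begin{align*}
\Delta(x) &= D(\gamma z) - D(z) = (\gamma - 1)\, W z \\
          &= (\gamma - 1)\,(\hat{x} - b) \\
          &= \underbrace{(\gamma - 1)\,(x - b)}_{\text{deviation from centroid}}
             \;-\; \underbrace{(\gamma - 1)\,(x - \hat{x})}_{\text{residual}} .
\end{align*}
```

Under these assumptions the residual term has norm $(\gamma - 1)\,\lVert x - \hat{x} \rVert$, so it shrinks with the per-sample reconstruction error. That is consistent with gating on a reconstruction-quality measure such as FVU and falling back to the unsteered model when the error is large.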

2604.09367 2026-05-28 cs.CV 版本更新

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

EpiAgent: 一种以智能体为中心的古铭文修复系统

Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang, Hui Xue

发表机构 * School of Computer Science and Engineering, Southeast University, China（东南大学计算机科学与工程学院）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China（新一代人工智能技术及其跨学科应用关键实验室（东南大学），教育部，中国）； Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China（计算机网络与信息集成关键实验室（东南大学），教育部，中国）； Nanjing University Museum, Nanjing University, China（南京大学博物馆，南京大学，中国）； The China Centre for Linguistic and Strategic Studies, Nanjing University, China（中国语言战略研究中心，南京大学，中国）

AI总结提出基于智能体的EpiAgent系统，通过分层规划与LLM协调多模态分析、历史经验和专用工具，实现灵活自适应的古铭文修复，在真实退化铭文上取得更优修复质量和泛化能力。

Comments Accepted by CVPR 2026

AI中文摘要

古铭文作为文化记忆的载体，历经数世纪的环境和人为退化。恢复其交织的视觉和文本完整性是数字遗产保护中最具挑战性的任务之一。然而，现有基于AI的方法通常依赖刚性流水线，难以泛化到如此复杂和异质的真实退化场景。受人类金石学家技能协调工作流程的启发，我们提出EpiAgent，一个以智能体为中心的系统，将铭文修复形式化为分层规划问题。遵循观察-构思-执行-重新评估范式，基于LLM的中央规划器协调多模态分析、历史经验、专用修复工具和迭代自我精炼之间的协作。这种以智能体为中心的协调使得修复过程比传统的单次通过方法更加灵活和自适应。在真实退化的铭文上，EpiAgent相比现有方法实现了更优的修复质量和更强的泛化能力。我们的工作标志着向专家级智能体驱动的文化遗产修复迈出了重要一步。代码可在 https://github.com/blackprotoss/EpiAgent 获取。

英文摘要

Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.

2604.05378 2026-05-28 cs.CL cs.CV 版本更新

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

ICR-Drive：面向端到端语言驱动自动驾驶的指令反事实鲁棒性

Kaiser Hamid, Can Cui, Nade Liang

发表机构 * Texas Tech University（德克萨斯科技大学）； Bosch Center for Artificial Intelligence (BCAI)（博世人工智能中心（BCAI））

AI总结提出ICR-Drive框架，通过生成四类扰动指令（改写、歧义、噪声、误导）并基于CARLA仿真评估，揭示语言条件驾驶模型对指令变化的脆弱性。

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 872-880

AI中文摘要

视觉-语言-动作（VLA）模型的最新进展使得语言条件驾驶代理能够在闭环仿真中执行自然语言导航命令，但标准评估大多假设指令精确且格式良好。在实际部署中，指令的措辞和具体性各不相同，可能省略关键限定词，偶尔还包含误导性的权威框架文本，导致指令级鲁棒性未被充分衡量。我们提出了ICR-Drive，一个用于端到端语言条件自动驾驶中指令反事实鲁棒性的诊断框架。ICR-Drive生成受控的指令变体，涵盖四类扰动：改写、歧义、噪声和误导，其中误导变体与导航目标冲突并试图覆盖意图。我们在匹配的仿真器配置和种子下重放相同的CARLA路线，以隔离由指令语言引起的性能变化。鲁棒性通过标准CARLA排行榜指标和相对于基线指令的每族性能下降来量化。在LMDrive和BEVDriver上的实验表明，微小的指令变化可能导致显著的性能下降和不同的故障模式，揭示了在安全关键驾驶中部署具身基础模型的可靠性差距。

英文摘要

Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.
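
Per-family degradation relative to the baseline instruction can be computed as a relative drop for each metric. The sketch below assumes higher-is-better metrics and a relative (percentage) definition of degradation; both the definition and the numbers used in any call are illustrative assumptions, not results from the paper.

```python
def per_family_degradation(baseline, perturbed):
    """Relative drop (in percent) of each metric for each perturbation family.

    baseline : {metric_name: value} for the unperturbed instruction
    perturbed: {family_name: {metric_name: value}}
    """
    report = {}
    for family, metrics in perturbed.items():
        report[family] = {
            name: 100.0 * (baseline[name] - value) / baseline[name]
            for name, value in metrics.items()
        }
    return report
```

A positive entry means the family hurt that metric. Replaying the same routes and seeds, as the abstract describes, is what lets such a drop be attributed to the instruction text rather than to simulator variance.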

2604.03799 2026-05-28 cs.CV 版本更新

Next-Scale Autoregressive Models for Text-to-Motion Generation

Next-Scale 自回归模型用于文本到运动生成

Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结提出 MoScale 框架，通过从粗到细的时间分辨率分层生成运动，结合跨尺度和尺度内细化，实现高效、可扩展的文本到运动生成。

Comments Accepted to CVPR 2026

2604.00913 2026-05-28 cs.CV cs.CL 版本更新

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

跨描绘装配指令对齐的视觉-语言模型基准测试与机制分析

Zhuchenyang Liu, Yao Zhang, Yu Xiao

发表机构 * Aalto University（阿尔托大学）

AI总结构建IKEA-Bench基准，评估19个视觉-语言模型在装配图与视频帧对齐任务上的表现，发现视觉编码是提升跨描绘鲁棒性的关键瓶颈。

AI中文摘要

二维装配图通常是抽象的且难以遵循，因此需要智能助手来监控进度、检测错误并提供逐步指导。在混合现实环境中，此类系统必须从摄像头画面中识别已完成和正在进行的步骤，并将其与图示指令对齐。视觉语言模型（VLM）在此任务上展现出潜力，但由于装配图和视频帧共享的视觉特征极少，面临描绘鸿沟。为系统评估这一鸿沟，我们构建了IKEA-Bench基准，包含29个宜家家具产品的6种任务类型共1623个问题，并在三种对齐策略下评估了19个VLM（2B-38B）。主要发现：（1）装配指令理解可通过文本恢复，但文本同时降低了图到视频的对齐性能；（2）架构族比参数数量更能预测对齐精度；（3）视频理解是难以通过策略影响的硬瓶颈。三级机制分析进一步揭示，图和视频占据不相交的ViT子空间，且添加文本会使模型从视觉驱动转向文本驱动的推理。这些结果表明，视觉编码是提升跨描绘鲁棒性的主要目标。项目页面：https://ryenhails.github.io/IKEA-Bench/

英文摘要

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

2604.00402 2026-05-28 cs.CV cs.AI 版本更新

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

COTTA: 面向自动驾驶轨迹预测的上下文感知迁移适应

Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

发表机构 * Ewha Womans University（梨花女子大学）； Seoul National University（首尔国立大学）； Sangmyung University（祥明大学）； NVIDIA

AI总结本文研究将基于美国数据训练的轨迹预测模型QCNet迁移到韩国道路环境，通过对比四种训练策略，发现冻结编码器并微调解码器可在精度和效率间取得最佳平衡，预测误差降低66%以上。

Comments 4 pages, 2 figures. Accepted at ICEIC 2026

AI中文摘要

开发鲁棒模型以准确预测周围代理的轨迹是自动驾驶安全的基础。然而，大多数公开数据集（如Waymo Open Motion Dataset和Argoverse）是在西方道路环境中收集的，并未反映其他地区（包括韩国）独特的交通模式、基础设施和驾驶行为。当在西方数据上训练的最先进模型部署到不同地理环境时，这种领域差异会导致性能下降。在本工作中，我们研究了查询中心轨迹预测（QCNet）从美国数据迁移到韩国道路环境时的适应性。使用韩国自动驾驶数据集，我们比较了四种训练策略：零样本迁移、从头训练、全微调和编码器冻结。实验结果表明，利用预训练知识显著提高了预测性能。具体而言，在冻结编码器的同时选择性微调解码器，在精度和训练效率之间取得了最佳平衡，与从头训练相比，预测误差降低了66%以上。本研究为在新地理领域部署轨迹预测模型提供了有效的迁移学习策略的实用见解。

英文摘要

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.

2512.11524 2026-05-28 cs.CV cs.LG 版本更新

Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using Airborne LiDAR HD Reference Data across Metropolitan France

利用覆盖法国本土的机载LiDAR HD参考数据从Sentinel-2时间序列进行超分辨率冠层高度制图

Ekaterina Kalinicheva, Florian Helen, Stéphane Mermoz, Florian Mouret, Milena Planells

发表机构 * CESBIO ； GlobEO

AI总结提出THREASURE-Net端到端框架，利用Sentinel-2时间序列和LiDAR HD数据生成2.5m、5m和10m分辨率的年度冠层高度图，无需预训练模型或高分辨率光学图像，在法国本土实现优于现有基于Sentinel数据的方法的精度。

AI中文摘要

精细尺度的森林监测对于理解冠层结构及其动态至关重要，这些是碳储量、生物多样性和森林健康的关键指标。深度学习特别有效，因为它整合了共同反映冠层结构的光谱、时间和空间信号。为满足这一需求，我们提出了THREASURE-Net，一种新颖的端到端树高回归与超分辨率框架。该模型使用来自法国本土多个空间分辨率的LiDAR HD数据导出的参考高度指标，在Sentinel-2时间序列上训练，以生成年度高度图。我们评估了三种模型变体，分别产生2.5米、5米和10米分辨率的树高预测。THREASURE-Net不依赖任何预训练模型或参考甚高分辨率光学图像来训练其超分辨率模块；相反，它仅从LiDAR导出的高度信息中学习。我们的方法优于现有基于Sentinel数据的最先进方法，并与基于甚高分辨率图像的方法具有竞争力。它可以部署生成高精度年度冠层高度图，在2.5米、5米和10米分辨率下分别实现2.63米、2.70米和2.88米的平均绝对误差。这些结果凸显了THREASURE-Net仅使用免费卫星数据对温带森林进行可扩展且经济高效的结构监测的潜力。THREASURE-Net的源代码可在以下网址获取：https://github.com/Global-Earth-Observation/threasure-net。

英文摘要

Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.63 m, 2.70 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: https://github.com/Global-Earth-Observation/threasure-net.

2601.17354 2026-05-28 cs.CV cs.GR 版本更新

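NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

NanoVDR：将20亿参数的视觉语言检索器蒸馏为7000万参数的纯文本编码器用于视觉文档检索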

Zhuchenyang Liu, Yao Zhang, Yu Xiao

发表机构 * Aalto University（阿尔托大学）

AI总结利用查询-文档不对称性，通过蒸馏将20亿参数的视觉语言模型教师蒸馏为7000万参数的纯文本学生编码器，采用点态余弦对齐目标，实现视觉文档检索的高效推理。

AI中文摘要

基于视觉语言模型（VLM）的检索器已将视觉文档检索（VDR）提升到令人印象深刻的水平。它们需要相同的数十亿参数编码器用于文档索引和查询编码，即使对于纯文本查询也会导致高延迟和GPU依赖。我们观察到这种设计是不必要对称的：文档在视觉上复杂且需要强大的视觉理解，而查询只是短文本字符串。NanoVDR利用这种查询-文档不对称性，解耦两个编码路径：冻结的20亿VLM教师离线索引文档，而蒸馏后的纯文本学生（小至6900万参数）在推理时编码查询。关键设计选择是蒸馏目标。通过对三个骨干网络和22个ViDoRe基准数据集的六个目标进行系统比较，我们发现查询文本上的点态余弦对齐始终优于基于排序和对比的替代方案，同时在训练期间仅需要预缓存的教师查询嵌入，无需处理文档。此外，我们识别出跨语言迁移是主要性能瓶颈，并通过使用机器翻译的查询增强训练数据廉价地解决它。最终的NanoVDR-S-Multi（DistilBERT，6900万）保留了教师质量的95.1%，在v2和v3上以32倍更少的参数和50倍更低的CPU查询延迟优于DSE-Qwen2（20亿），总训练成本低于13 GPU小时。

英文摘要

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query--document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1\% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.
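
A pointwise cosine alignment objective compares each student query embedding only with the cached teacher embedding of the same query, so no documents or negatives are needed during training. The sketch below uses the form `1 - cos(student, teacher)` averaged over a batch; that exact form is an assumption made for illustration, since the abstract does not spell out the formula.

```python
import math


def cosine_alignment_loss(student_embs, teacher_embs):
    """Mean of 1 - cos(student, teacher) over paired query embeddings."""
    total = 0.0
    for s, t in zip(student_embs, teacher_embs):
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / (norm_s * norm_t)
    return total / len(student_embs)
```

The loss is zero when every student embedding points in the same direction as its teacher embedding, regardless of scale. That matches retrieval by cosine similarity against a document index built by the teacher.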

2603.08264 2026-05-28 cs.CV 版本更新

Event-based Motion & Appearance Fusion for 6D Object Pose Tracking

基于事件的运动与外观融合的6D物体姿态跟踪

Zhichao Li, Chiara Bartolozzi, Lorenzo Natale, Arren Glover

发表机构 * Event-driven Perception for Robotics, Istituto Italiano di Tecnologia, Italy（事件驱动感知机器人实验室，意大利理工学院）； Humanoid Sensing and Perception, Istituto Italiano di Tecnologia, Italy（人形传感与感知，意大利理工学院）； University of Genoa, Genoa, Italy（热那亚大学，意大利）

AI总结提出一种结合事件相机高时间分辨率优势的无学习方法，通过事件光流传播姿态并利用模板匹配校正，在高速运动物体上达到或超越现有算法性能。

AI中文摘要

物体姿态跟踪是机器人在家庭和工业环境中执行任务的基本且必要的任务。最常用的传感器是RGB-D相机，但在高动态环境中，由于运动模糊和帧率限制，它们可能达到极限。事件相机具有高时间分辨率和低延迟等显著特性，使其成为高速物体姿态跟踪的理想视觉传感器。尽管如此，目前仅有少数工作涉及事件相机的6D姿态跟踪。在这项工作中，我们利用高时间分辨率的优势，提出了一种结合传播步骤与姿态校正策略的方法。具体而言，我们使用从事件光流中获得的6D物体速度进行姿态传播，然后利用基于模板的局部姿态校正模块进行姿态校正。我们的无学习方法与最先进的算法性能相当，并且在某些情况下对快速移动物体的表现更优。结果表明，在深度网络方法受限于低更新速率的高动态场景中，事件相机具有应用潜力。

英文摘要

Object pose tracking is a fundamental and essential capability for robots performing tasks in home and industrial settings. The most commonly used sensors for this purpose are RGB-D cameras, which can hit limitations in highly dynamic environments due to motion blur and frame-rate constraints. Event cameras have remarkable features such as high temporal resolution and low latency, which make them potentially ideal vision sensors for object pose tracking at high speed. Even so, there are still only a few works on 6D pose tracking with event cameras. In this work, we take advantage of the high temporal resolution and propose a method that fuses a propagation step with a pose correction strategy. Specifically, we use the 6D object velocity obtained from event-based optical flow for pose propagation, after which a template-based local pose correction module is used for pose correction. Our learning-free method has comparable performance to state-of-the-art algorithms, and in some cases outperforms them for fast-moving objects. The results indicate the potential for using event cameras in highly dynamic scenarios where the use of deep network approaches is limited by low update rates.
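
The propagation step, advancing a 6D pose from an estimated 6D velocity over a short time interval, can be sketched as a constant-velocity update. The example assumes a unit quaternion in `(w, x, y, z)` order and velocities expressed in the world frame. These conventions and the function names are illustrative assumptions; the velocity estimation from event-based optical flow and the template-based correction are not shown.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def propagate_pose(position, orientation, linear_vel, angular_vel, dt):
    """Constant-velocity pose update over a time step dt (seconds)."""
    new_position = tuple(p + v * dt for p, v in zip(position, linear_vel))
    angle = math.sqrt(sum(w * w for w in angular_vel)) * dt
    if angle < 1e-12:
        return new_position, orientation  # no measurable rotation
    axis = [w * dt / angle for w in angular_vel]
    half = angle / 2.0
    delta = (math.cos(half), *(math.sin(half) * a for a in axis))
    # World-frame angular velocity: the increment multiplies on the left.
    return new_position, quat_mul(delta, orientation)
```

With a high-rate velocity estimate, `dt` stays small and the constant-velocity assumption holds well between corrections.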

2603.05425 2026-05-28 cs.CV cs.AI 版本更新

RelaxFlow: Text-Driven Amodal 3D Generation

RelaxFlow: 文本驱动的非模态3D生成

Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

发表机构 * National University of Singapore（新加坡国立大学）； Zhejiang University（浙江大学）； University of Science and Technology of China（中国科学技术大学）

AI总结针对遮挡下图像到3D生成的语义歧义问题，提出无训练的双分支框架RelaxFlow，通过多先验共识模块和松弛机制解耦控制粒度，实现文本提示引导下对未观察区域的补全，同时严格保留输入观测。

Comments Accepted as a spotlight presentation at ICML 2026. Code: https://github.com/viridityzhu/RelaxFlow

AI中文摘要

图像到3D生成在遮挡下面临固有的语义歧义，仅凭部分观测通常不足以确定物体类别。在这项工作中，我们形式化了文本驱动的非模态3D生成，其中文本提示引导对未观察区域的补全，同时严格保留输入观测。关键的是，我们识别出这些目标需要不同的控制粒度：对观测的刚性控制与对提示的松弛结构控制。为此，我们提出RelaxFlow，一个无训练的双分支框架，通过多先验共识模块和松弛机制解耦控制粒度。理论上，我们证明我们的松弛等价于在生成向量场上应用低通滤波器，抑制高频实例细节以隔离适应观测的几何结构。为便于评估，我们引入了两个诊断基准：ExtremeOcc-3D和AmbiSem-3D。大量实验表明，RelaxFlow成功引导未观察区域的生成以匹配提示意图，同时不损害视觉保真度。

英文摘要

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

2602.23754 2026-05-28 cs.GR cs.CV 版本更新

Neural Image Space Tessellation effect

神经图像空间镶嵌效应

Youyang Du, Junqiu Zhu, Zheng Zeng, Lu Wang, Lingqi Yan

发表机构 * Shandong University（山东大学）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； University of California, Santa Barbara（加州大学圣芭芭拉分校）

AI总结提出一种轻量级屏幕空间后处理方法NIST，通过隐式变形图像空间轮廓并重新分配外观，减少低多边形渲染中的面状轮廓，实现接近基于镶嵌的平滑效果，且每帧成本几乎恒定。

AI中文摘要

我们提出神经图像空间镶嵌效应（NIST），一种轻量级的屏幕空间后处理方法，用于减少低多边形渲染中的面状轮廓。NIST不进行图元镶嵌、创建新几何体或修改底层网格，而是利用低多边形渲染结果和简单的辅助G缓冲区属性，学习在图像空间中几何引导的对象轮廓平滑。其核心是，NIST首先隐式变形图像空间轮廓，然后学习在整个图像空间（包括变形区域）重新分配外观，保持纹理连续性并避免接缝伪影。实验表明，NIST减少了视觉上明显的几何面状化，并产生接近基于镶嵌的平滑参考的平滑、连贯轮廓，在我们测试的设置中每帧成本几乎恒定。据我们所知，NIST是第一个将低多边形轮廓面状化解决方案从渲染前几何阶段转移到渲染后屏幕空间阶段的工作。

英文摘要

We present Neural Image Space Tessellation effect (NIST), a lightweight screen-space post-processing approach for reducing the faceted silhouettes of low-poly renderings. Instead of tessellating primitives, creating new geometry, or modifying the underlying mesh, NIST uses the low-poly rendering result together with simple auxiliary G-buffer attributes to learn geometry-guided smoothing of object contours in image space. At its core, NIST first deforms image-space contours implicitly and then learns to reassign appearance in the whole image-space, including the deformed regions, preserving texture continuity and avoiding seam artifacts. Experiments show that NIST reduces visually apparent geometric faceting and produces smooth, coherent silhouettes close to tessellation-based smoothing references, with a nearly constant per-frame cost in our tested settings. To the best of our knowledge, NIST is the first work to move the solution of low-poly silhouette faceting from the pre-rendering geometry stage to a post-rendering screen-space stage.

2602.22096 2026-05-28 cs.CV 版本更新

WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

WeatherCity: 可控多天气变换的城市场景重建

Wenhua Wu, Huai Guan, Zhe Liu, Hesheng Wang

发表机构 * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University（自动化与智能感知学院，上海交通大学）

AI总结提出WeatherCity框架，利用文本引导的图像编辑、天气高斯表示和物理驱动模型，实现高保真、时间一致的4D城市场景重建与多天气编辑。

AI中文摘要

可编辑的高保真4D场景对于自动驾驶至关重要，因为它们可以应用于端到端训练和闭环仿真。然而，现有的重建方法主要局限于复制观察到的场景，缺乏多样化的天气模拟能力。而图像级别的天气编辑方法往往引入场景伪影，并且对天气效果的可控性较差。为了解决这些限制，我们提出了\textbf{WeatherCity}，一个用于4D城市场景重建和天气编辑的新框架。具体来说，我们利用文本引导的图像编辑模型来实现图像天气背景的灵活编辑。为了应对多天气建模的挑战，我们引入了一种基于共享场景特征和专用天气解码器的新型天气高斯表示。这种表示进一步通过内容一致性优化得到增强，确保不同天气条件下的连贯建模。此外，我们设计了一个物理驱动模型，通过粒子和运动模式模拟动态天气效果。在多个数据集和各种场景上的大量实验表明，WeatherCity在4D重建和天气编辑中实现了灵活的可控性、高保真度和时间一致性。我们的框架不仅能够对天气条件（例如小雨和大雪）进行细粒度控制，还支持场景内的物体级操作。代码已发布在https://github.com/IRMVLab/WeatherCity。

英文摘要

Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation, while image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose \textbf{WeatherCity}, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene. Code is released at https://github.com/IRMVLab/WeatherCity.

2602.18647 2026-05-28 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

Noise Scheduling as Information-Guided Allocation in Diffusion Training

噪声调度作为扩散训练中的信息引导分配

Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni

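Affiliations: Tilburg University & JADS; Sony AI; University of Cambridge; Radboud University; Sony Group Corporation

AI summary: Proposes InfoNoise, an online adaptive noise-scheduling method that estimates a conditional-entropy-rate profile to dynamically adjust the training noise distribution toward the most informative denoising tasks. It matches or exceeds baselines on image, DNA, and language generation and saves up to 3x training compute.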

AI中文摘要

我们引入了InfoNoise，一种用于扩散训练的在线自适应噪声调度，它将优化努力重新分配到去噪最具信息量的噪声水平上。与损失加权一起，噪声调度在去噪问题之间诱导出有效的分配，而这种分配通常在知道信息性噪声水平之前就已固定。InfoNoise通过从训练期间的去噪损失中估计条件熵率剖面，使这种分配具有数据自适应性，无需辅助模型或离线搜索。通过I--MMSE，该剖面识别出噪声观测在何处能快速减少关于干净样本的不确定性，并指导训练噪声分布的适应。它只改变这个分布，保持目标、加权和参数化不变。在图像基准测试中，调度已被广泛调整，InfoNoise匹配或略微超过强基线，并且可以用更少的更新达到相同的质量。在表示、序列和模态转换（包括DNA和语言生成）上，InfoNoise优于固定和自适应基线，并且达到目标质量所需的训练计算量最多减少3倍。这些结果确立了条件熵率剖面作为噪声调度设计的数据依赖目标，并使在线自适应成为手动调度搜索的实用替代方案。

英文摘要

We introduce InfoNoise, an online adaptive noise schedule for diffusion training that reallocates optimization effort toward noise levels where denoising is most informative. Together with loss weighting, a noise schedule induces an effective allocation across denoising problems, often fixed before informative noise levels are known. InfoNoise makes this allocation data-adaptive by estimating a conditional-entropy-rate profile from denoising losses during training, without auxiliary models or offline search. Through I-MMSE, this profile identifies where noisy observations rapidly reduce uncertainty about the clean sample and guides adaptation of the training noise distribution. It changes only this distribution, keeping the objective, weighting, and parameterization fixed. On image benchmarks, where schedules have been extensively tuned, InfoNoise matches or slightly exceeds strong baselines and can reach the same quality with fewer updates. On representation, sequence, and modality shifts, including DNA and language generation, InfoNoise improves over fixed and adaptive baselines and reaches target quality with up to $3\times$ less training compute. These results establish the conditional-entropy-rate profile as the data-dependent target for noise schedule design and make online adaptation a practical alternative to manual schedule search.
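
To make the idea of a data-adaptive noise allocation concrete, here is a minimal sketch. It is not the InfoNoise algorithm: it assumes noise levels are discretized into bins, uses an exponential moving average of per-bin denoising loss as a crude stand-in for the paper's conditional-entropy-rate profile, and mixes in a uniform floor so no bin is starved. The class name `AdaptiveNoiseSampler` and all parameters (`num_bins`, `decay`, `floor`) are hypothetical.

```python
import numpy as np

class AdaptiveNoiseSampler:
    """Hypothetical helper: samples noise-level bins in proportion to a running loss profile."""

    def __init__(self, num_bins=10, decay=0.9, floor=0.05, seed=0):
        self.num_bins = num_bins
        self.decay = decay            # EMA decay for the per-bin profile
        self.floor = floor            # weight of the uniform mixture component
        self.profile = np.ones(num_bins)
        self.rng = np.random.default_rng(seed)

    def probabilities(self):
        # Normalize the profile, then mix with a uniform distribution.
        p = self.profile / self.profile.sum()
        uniform = np.full(self.num_bins, 1.0 / self.num_bins)
        return (1.0 - self.floor) * p + self.floor * uniform

    def sample(self, batch_size):
        # Draw one noise-level bin index per training example.
        return self.rng.choice(self.num_bins, size=batch_size, p=self.probabilities())

    def update(self, bins, losses):
        # Fold the mean loss observed in each visited bin into its running average.
        for b in np.unique(bins):
            mean_loss = losses[bins == b].mean()
            self.profile[b] = self.decay * self.profile[b] + (1.0 - self.decay) * mean_loss

# Usage with placeholder losses (a real loop would use the model's denoising losses).
sampler = AdaptiveNoiseSampler(num_bins=8)
bins = sampler.sample(batch_size=64)
fake_losses = 1.0 / (1.0 + bins)
sampler.update(bins, fake_losses)
```

Only the sampling distribution over noise levels changes here; the loss, its weighting, and the model are untouched, which mirrors the constraint stated in the abstract.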

2602.16872 2026-05-28 cs.CV 版本更新

DODO: Discrete OCR Diffusion Models

DODO: 离散OCR扩散模型

Sean Man, Gilad Deutch, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

Affiliations: Technion - Israel Institute of Technology, Haifa, Israel; Amazon Web Services

AI summary: To address slow autoregressive decoding in OCR, proposes DODO, the first VLM to use block discrete diffusion, achieving up to 5x faster inference while maintaining high accuracy.

AI中文摘要

光学字符识别（OCR）是数字化信息的基础任务，是视觉数据与文本理解之间的关键桥梁。虽然现代视觉语言模型（VLM）在该领域取得了高精度，但它们主要依赖自回归解码，这需要为每个生成的token进行顺序前向传播，因此在处理长文档时计算成本高且速度慢。我们发现了一个克服这一瓶颈的关键机会：与开放式生成不同，OCR是一个高度确定性的任务，视觉输入严格决定了唯一的输出序列，理论上可以通过扩散模型实现高效的并行解码。然而，我们表明现有的掩码扩散模型未能利用这一潜力；它们引入了结构不稳定性，这在灵活任务（如字幕生成）中无害，但对于OCR的刚性精确匹配要求则是灾难性的。为了弥合这一差距，我们引入了DODO，这是首个利用块离散扩散并释放其OCR加速潜力的VLM。通过将生成分解为块，DODO减轻了全局扩散的同步误差。实验上，我们的方法在实现接近最先进精度的同时，与自回归基线相比，推理速度提高了5倍。

英文摘要

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 5x faster inference compared to autoregressive baselines.

2602.13748 2026-05-28 cs.CL cs.CV 版本更新

RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

RMPL：基于关系感知的多任务渐进学习与分阶段训练的多媒体事件抽取

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

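Affiliations: School of Computer Science and Technology, Soochow University

AI summary: Proposes the RMPL framework, which uses stage-wise training to combine heterogeneous supervision from unimodal event extraction and multimedia relation extraction, enabling multimedia event extraction under low-resource conditions and yielding consistent improvements on the M2E2 benchmark.

Comments Accepted by ACM ICMR 2026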

DOI: 10.1145/3805622.3810577

AI中文摘要

多媒体事件抽取（MEE）旨在从包含文本和图像的文档中识别事件及其论元。它需要跨不同模态对事件语义进行 grounding。MEE 的进展受到缺乏标注训练数据的限制。M2E2 是唯一已建立的基准，但它仅提供评估用的标注。这使得直接监督训练不切实际。现有方法主要依赖于跨模态对齐或使用视觉-语言模型（VLM）进行推理时提示。这些方法没有显式学习结构化的事件表示，并且通常在多模态设置中产生较弱的论元 grounding。为解决这些限制，我们提出了 RMPL，一种用于低资源条件下 MEE 的基于关系感知的多任务渐进学习框架。RMPL 通过分阶段训练整合了来自单模态事件抽取和多模态关系抽取的异构监督。模型首先使用统一模式进行训练，以学习跨模态的共享事件中心表示。然后，使用混合文本和视觉数据对模型进行微调，以进行事件提及识别和论元角色抽取。在 M2E2 基准上使用多个 VLM 进行的实验表明，在不同模态设置下均取得了一致的改进。

英文摘要

Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

2602.12843 2026-05-28 cs.CV 版本更新

MMRad-22K: A Structured Multimodal Evidence Dataset for Chest X-ray Report Generation

MMRad-22K：用于胸部X光报告生成的结构化多模态证据数据集

Yichen Zhao, Zelin Peng, Fenghe Tang, Piao Yang, Yu Huang, Wei Shen

Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University; School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC); Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC; Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine

AI summary: To address the fragmented supervision signals in existing resources for chest X-ray report generation, introduces MMRad-22K, a structured multimodal evidence dataset, and adapts a unified LVLM backbone with it, showing that structured multimodal evidence outperforms text-only or bounding-box evidence on language and clinical metrics.

AI中文摘要

胸部X光（CXR）报告遵循基于区域的临床工作流程，放射科医生检查解剖区域并将局部发现整合到最终报告中。然而，现有的CXR报告生成资源以碎片化形式提供这些监督信号。我们引入MMRad-22K，一个将区域文本观察、解剖定位坐标、局部图像证据和报告目标组织成结构化多模态证据单元的数据集，用于CXR报告生成。为了推动这一构想，我们首先比较了不同证据格式对报告生成的影响，发现结构化多模态证据通常比纯文本或基于边界框的证据更有用。然后，我们使用MMRad-22K适配统一的LVLM骨干，并证明多模态证据适配在语言和临床导向指标上均优于文本证据适配和端到端适配。在相同的评估协议下，适配模型也达到了与几个开源LVLM参考相当的性能水平。这些结果共同支持MMRad-22K作为实用的结构化多模态资源，用于训练和评估与临床阅读工作流程一致的CXR报告生成。

英文摘要

Chest X-ray (CXR) reporting follows a region-based clinical workflow in which radiologists inspect anatomical regions and integrate localized findings into a final report. However, existing resources for CXR report generation provide these supervision signals in fragmented forms. We introduce MMRad-22K, a dataset that organizes regional textual observations, anatomical grounding coordinates, localized image evidence, and report targets into structured multimodal evidence units for CXR report generation. To motivate this formulation, we first compare different evidence formats for report generation and find that structured multimodal evidence is generally more useful than text-only or bounding box-based evidence. We then adapt a unified LVLM backbone using MMRad-22K and show that adaptation with multimodal evidence outperforms both textual-evidence adaptation and end-to-end adaptation on language and clinically oriented metrics. Under the same evaluation protocol, the adapted model also reaches a performance level comparable to several open-source LVLM references. Together, these results support MMRad-22K as a practical structured multimodal resource for training and evaluating CXR report generation aligned with clinical reading workflows.

2602.11564 2026-05-28 cs.CV 版本更新

LUVE: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts

LUVE：基于双频专家的潜在级联超高分辨率视频生成

Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, Ying Tai

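Affiliations: Nanjing University; Nanyang Technological University

AI summary: Proposes LUVE, a three-stage latent-cascaded architecture (low-resolution motion generation, latent-space upsampling, high-resolution content refinement) combined with dual frequency experts, to address the difficulties of motion modeling, semantic planning, and detail synthesis in ultra-high-resolution video generation.

Comments ICML 2026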

AI中文摘要

近期视频扩散模型在视觉质量上取得了显著进步，但超高分辨率（UHR）视频生成由于运动建模、语义规划和细节合成的复合困难，仍然是一个严峻挑战。为解决这些限制，我们提出了LUVE，一个基于双频专家的潜在级联UHR视频生成框架。LUVE采用三阶段架构，包括用于运动一致潜在合成的低分辨率运动生成、直接在潜在空间进行分辨率上采样以减少内存和计算开销的视频潜在上采样，以及集成低频和高频专家以共同增强语义连贯性和细粒度细节生成的高分辨率内容精炼。大量实验表明，我们的LUVE在UHR视频生成中实现了卓越的照片真实感和内容保真度，全面的消融研究进一步验证了每个组件的有效性。项目可在https://unicornanrocinu.github.io/LUVE_web/获取。

英文摘要

Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at https://unicornanrocinu.github.io/LUVE_web/.

2511.18894 2026-05-28 cs.CV cs.AI 版本更新

Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels

并非所有像素都平等：面向含噪标签医学分割的像素级元学习

Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng

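Affiliations: Xidian University; University of Science and Technology of China

AI summary: Proposes the MetaDCSeg framework, which dynamically learns pixel-wise weights and introduces a Dynamic Center Distance mechanism to model boundary uncertainty, suppressing the influence of noisy labels and improving boundary segmentation.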

AI中文摘要

医学图像分割对于临床应用至关重要，但常常受到噪声标注和模糊解剖边界的干扰，限制了其在现实场景中的应用。现有方法通常直接适应为实例分类设计的噪声标签学习技术，忽视了医学分割中像素级异质性及其空间和解剖上的难度差异。因此，全局假设或简单的置信度指标无法解决这些局部变化，导致边界模糊问题未得到解决。为解决这一问题，我们提出MetaDCSeg，一个鲁棒的框架，动态学习最优像素级权重以抑制噪声标签的影响，同时保留可靠标注。通过动态中心距离（DCD）机制显式建模边界不确定性，我们的方法利用前景、背景和边界中心的加权特征距离，引导模型关注模糊边界附近的难分割像素。该策略能够更精确地处理结构边界（这些边界常被现有方法忽略），并显著提升分割性能。在四个不同噪声水平的基准数据集上的大量实验表明，MetaDCSeg优于现有最先进方法。

英文摘要

Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, limiting its application in real-world scenarios. Existing methods often directly adapt noisy label learning techniques designed for instance classification, overlooking the pixel-wise heterogeneity in medical segmentation with its spatially and anatomically varying difficulties. Consequently, global assumptions or simple confidence metrics fail to address these local variations, leaving boundary ambiguities unresolved. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg outperforms existing state-of-the-art methods.

2602.07574 2026-05-28 cs.CV cs.CL 版本更新

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

ViCA：仅视觉交叉注意力的高效多模态大语言模型

Wenjie Liu, Hao Wu, Xin Qiu, Xudong Wang, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

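Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Munich Center for Machine Learning, LMU Munich

AI summary: Proposes the ViCA architecture, which uses vision-only cross-attention to cut visual-token computation, preserving 98% of baseline accuracy while reducing visual-side computation to 4% and delivering substantial speedups.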

AI中文摘要

现代多模态大语言模型（MLLMs）采用统一的自我注意设计，在每个Transformer层处理视觉和文本令牌，导致大量计算开销。在这项工作中，我们重新审视了这种密集视觉处理的必要性，并表明投影的视觉嵌入已经与语言空间良好对齐，而有效的视觉-语言交互仅发生在少数层中。基于这些见解，我们提出了ViCA（仅视觉交叉注意力），一种最小的MLLM架构，其中视觉令牌绕过所有自我注意和前馈层，仅通过稀疏的交叉注意力在选定层与文本交互。在三个MLLM骨干、九个多模态基准和26个基于剪枝的基线上的广泛评估表明，ViCA在将视觉侧计算减少到4%的同时保持了98%的基线准确率，始终实现了优越的性能-效率权衡。此外，ViCA提供了一个规则的、硬件友好的推理流水线，在单批推理中实现了超过3.5倍的加速，在多批推理中实现了超过10倍的加速，与仅文本的LLM相比，将视觉定位减少到接近零的开销。它还与令牌剪枝方法正交，可以无缝结合以进一步提高效率。我们的代码可在https://github.com/EIT-NLP/ViCA获取。

英文摘要

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
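
The interaction pattern described above, where text reads from visual tokens through cross-attention while the visual tokens themselves are never updated, can be sketched as follows. This is a single-head illustration under stated assumptions, not ViCA's implementation: the function `cross_attend`, the projection matrices `Wq`, `Wk`, `Wv`, and the residual form are hypothetical, and layer selection is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(text, visual, Wq, Wk, Wv):
    """Text tokens (queries) read from visual tokens (keys/values).

    text:   (num_text, d) hidden states, updated by this layer
    visual: (num_visual, d) projected visual embeddings, left unchanged
    """
    q = text @ Wq
    k = visual @ Wk
    v = visual @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_text, num_visual)
    return text + softmax(scores) @ v          # residual update of text states only

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(3, d))
visual = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
updated_text = cross_attend(text, visual, Wq, Wk, Wv)
```

Because the visual tokens never pass through self-attention or feed-forward blocks in this pattern, the cost per selected layer scales with the product of text length and visual length, not with the square of the combined sequence length.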

2412.01004 2026-05-28 cs.CV 版本更新

Take Only What You Need: Rank Minimization as an Implicit Forgetting Regularizer in Continual Learning

只取所需：秩最小化作为持续学习中的隐式遗忘正则化器

Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong

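Affiliations: University of New South Wales; CSIRO

AI summary: Proposes CoDyRA, which uses rank minimization as an implicit forgetting regularizer to balance plasticity and stability in continual learning, outperforming existing methods on multiple benchmarks.

Comments Preprint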

AI中文摘要

持续学习中的核心张力是可塑性（获取新知识）与稳定性（保留先前知识）之间的权衡。我们研究如何通过容量控制（即调节每次参数更新的有效秩，这是LoRA更新中可直接控制的逐步骤量）来持续更新预训练骨干网络，使其吸收新知识的同时保留现有能力。对模块和任务间LoRA秩和放置的受控探测揭示了一致的权衡，存在一个随放置和任务变化的中等秩最佳点，没有普遍最优的固定秩；一个形式化界限表明遗忘随秩增长。基于这些发现，我们提出了持续动态秩选择LoRA（CoDyRA），该方法通过在每个组件重要性权重上施加稀疏性促进正则化，联合训练每个LoRA更新与秩最小化。监督目标驱动可塑性；秩最小化正则化遗忘。我们证明秩最小化在持续学习机制中充当隐式遗忘正则化器，通过控制相对于当前模型状态的遗忘，同时保护通用能力和先前任务知识。在MTIL、X-TAIL和TRACE（CLIP、LLaMA、Gemma）上，CoDyRA在新知识学习和遗忘方面优于先前的持续学习方法，实现了强大的可塑性-稳定性平衡。代码可在https://github.com/jeff024/codyra获取。

英文摘要

The central tension in continual learning (CL) is the trade-off between plasticity (acquiring new knowledge) and stability (retaining prior knowledge). We study how a pre-trained backbone can be continually updated to absorb new knowledge while preserving existing capabilities, via capacity control: regulating the effective rank of each parameter update, a per-step quantity directly controllable inside a LoRA update. A controlled probe of LoRA rank and placement across modules and tasks reveals a consistent trade-off, with a moderate-rank sweet spot that varies by placement and task, leaving no universally optimal fixed rank; a formal bound shows forgetting grows with rank. Building on these findings, we propose Continual Dynamic Rank-Selective LoRA (CoDyRA), which jointly trains each LoRA update with rank minimization via sparsity-promoting regularization on per-component importance weights. The supervised objective drives plasticity; rank minimization regularizes forgetting. We show that rank minimization serves as an implicit forgetting regularizer in the CL regime, protecting general capability and prior-task knowledge simultaneously by controlling forgetting against the current model state. Across MTIL, X-TAIL, and TRACE (CLIP, LLaMA, Gemma), CoDyRA outperforms prior CL methods on new knowledge learning and forgetting, achieving a strong plasticity-stability balance. Code is available at https://github.com/jeff024/codyra.
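
One way to picture rank minimization through sparse per-component importance weights is sketched below. The abstract does not give the exact parameterization or optimizer, so this assumes the update is written as `B @ diag(w) @ A` and that the L1 penalty is handled with a soft-thresholding (proximal) step; the function names and the threshold `tau` are hypothetical.

```python
import numpy as np

def lora_delta(B, A, w):
    """Weight update as a sum of rank-one components scaled by importance weights w."""
    return B @ np.diag(w) @ A

def soft_threshold(w, tau):
    """Proximal step for an L1 penalty: shrinks weights and zeroes the small ones."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def effective_rank(w, tol=1e-12):
    """Number of components whose importance weight is still nonzero."""
    return int(np.sum(np.abs(w) > tol))

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 4
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))

w = np.array([0.9, 0.02, -0.5, 0.01])   # importance weights before the sparsity step
w_sparse = soft_threshold(w, tau=0.05)   # the two small weights are zeroed
delta = lora_delta(B, A, w_sparse)       # the update now uses only two components
```

Components whose weights are driven to zero drop out of the update entirely, so the rank used at each step is chosen by the data instead of being fixed in advance.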

2602.03668 2026-05-28 cs.RO cs.CV 版本更新

MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

MVP-LAM：通过跨视角重建学习以动作为中心的潜在动作

Jung Min Lee, Dohyeok Lee, Seokhun Ju, Taehyun Cho, Jin Woo Koo, Li Zhao, Sangwoo Hong, Jungwoo Lee

Affiliations: Seoul National University, Seoul, South Korea; Konkuk University, Seoul, South Korea; Microsoft Research Asia, Beijing, China; HodooAI Labs, Seoul, South Korea

AI summary: Proposes MVP-LAM, which uses a cross-viewpoint reconstruction objective on multi-view videos to learn latent actions that are highly informative about ground-truth actions, improving action prediction and downstream manipulation performance.

AI中文摘要

从多样化人类视频中学习的潜在动作作为视觉-语言-动作（VLA）预训练的伪标签，但只有当它们对底层真实动作保持信息量时才能提供有效监督。为了有效监督，潜在动作应包含关于底层动作的信息，尽管这些信息不可直接获取。我们提出多视角潜在动作模型（MVP-LAM），该模型从多视角视频中学习与真实动作高度相关的潜在动作。MVP-LAM通过跨视角重建目标训练潜在动作，使得一个视角的潜在动作必须解释另一个视角的未来，从而减少对视角特定线索的依赖。在Bridge V2上，MVP-LAM生成更以动作为中心的潜在动作，与真实动作的互信息更高，动作预测性能提升，包括在分布外评估下。最后，使用MVP-LAM潜在动作预训练VLA模型提高了各种基准上的下游操作性能。代码和训练好的检查点可在https://jmsnu.github.io获取。

英文摘要

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jmsnu.github.io.

2602.03491 2026-05-28 cs.CV cs.CL 版本更新

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

解耦骨架与血肉：基于解缠对齐和结构感知引导的高效多模态表格推理

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang

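Affiliations: Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China

AI summary: Proposes DiSCo, a disentangled structure-content alignment framework, and Table-GLS, a global-to-local structure-guided reasoning framework, which efficiently enhance LVLMs' table understanding and reasoning without expensive supervision or external tools.

Comments Accepted as a Spotlight Paper at ICML 2026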

AI中文摘要

由于复杂的布局和紧密耦合的结构-内容信息，对表格图像进行推理对于大型视觉语言模型（LVLM）仍然具有挑战性。现有解决方案通常依赖于昂贵的监督训练、强化学习或外部工具，限制了效率和可扩展性。这项工作解决了一个关键问题：如何以最少的标注且无需外部工具来使LVLM适应表格推理？具体来说，我们首先引入了DiSCo，一种解缠结构-内容对齐框架，在多模态对齐期间明确分离结构抽象和语义基础，高效地将LVLM适应于表格结构。在DiSCo的基础上，我们进一步提出了Table-GLS，一种全局到局部结构引导推理框架，通过结构化探索和基于证据的推理来执行表格推理。跨多个基准的大量实验表明，我们的框架高效地增强了LVLM的表格理解和推理能力，特别是泛化到未见过的表格结构。我们的数据和代码可在https://github.com/AAAndy-Zhu/TableVLM获取。

英文摘要

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures. Our data and code are available at https://github.com/AAAndy-Zhu/TableVLM.

2602.02259 2026-05-28 cs.LG cs.CV 版本更新

Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

聚焦分割：在干扰物存在下引导潜在动作模型

Marcus Fechner, Hamza Adnan, Constantin C. Lüth, Matthew T. Jackson, Alexey Zakharov, J. Marius Zöllner

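Affiliations: Karlsruhe Institute of Technology; University of Oxford

AI summary: To address the failure of latent action models under action-correlated visual distractors, proposes MaskLAM, which obtains agent masks zero-shot from segmentation foundation models (e.g., SAM) and restricts the reconstruction objective to agent pixels, forcing latent actions to encode endogenous dynamics and substantially improving downstream policy performance.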

AI中文摘要

潜在动作模型（LAMs）为在大规模无动作视频上预训练具身智能体提供了一条有前景的路径。它们推断连续观测之间的潜在动作，之后可以使用少量标签解码为真实动作。然而，近期工作表明，在真实世界视频中常见的动作相关视觉干扰物（如动态背景、相机抖动或其他移动物体）存在时，这一方法会失败。在这些场景中，标准重建目标会驱使潜在动作编码外源运动而非智能体控制的动态，导致微调后的策略性能不佳。然而，我们观察到内源和外源因素通常在像素空间中是空间分离的：控制相关的变化集中在智能体上，而干扰物运动发生在别处。我们利用这一观察，将重建目标限制在智能体像素上，迫使潜在动作解释智能体控制的动态而非外源动态。我们将该方法称为MaskLAM；它从现成的分割基础模型（如SAM）中零样本获取智能体掩码，并且在预训练期间不需要架构更改、辅助损失或动作标签。在两个连续控制基准（Distracting Control Suite、Distracting Meta-World）上，MaskLAM将归一化线性探针MSE降低了最多$3.51\times$，并将归一化回报提高了最多$4.97\times$，相比LAPO，同时缩小了与依赖真实动作监督的LAOM-Labels之间的差距。

英文摘要

Latent action models (LAMs) offer a promising path to pre-training embodied agents on large amounts of action-free video. They infer latent actions between consecutive observations that can later be decoded to ground-truth actions using a small number of labels. However, recent work has shown that this recipe fails in the presence of action-correlated visual distractors common in real-world video, such as dynamic backgrounds, camera shake, or other moving objects. In these scenarios, the standard reconstruction objective drives latent actions to encode exogenous motion instead of agent-controlled dynamics, resulting in policies that underperform when fine-tuned. We observe, however, that endogenous and exogenous factors are typically spatially separated in pixel space: control-relevant change is concentrated on the agent, while distractor motion occurs elsewhere. We exploit this observation by restricting the reconstruction objective to agent pixels, forcing latent actions to explain agent-controlled dynamics rather than exogenous ones. We call this method MaskLAM; it obtains the agent mask zero-shot from off-the-shelf segmentation foundation models (e.g., SAM) and requires no architectural changes, auxiliary losses, or action labels during pre-training. Across two continuous-control benchmarks (Distracting Control Suite, Distracting Meta-World), MaskLAM reduces normalized linear-probe MSE by up to $3.51\times$ and improves normalized return by up to $4.97\times$ over LAPO, while narrowing the gap to LAOM-Labels, which relies on ground-truth action supervision.
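
Restricting a reconstruction objective to agent pixels amounts to a masked loss. The sketch below is a minimal illustration, assuming a binary per-pixel agent mask is already available (for example from a segmentation model) and that the loss is a mean squared error averaged over masked pixels; the function name and the normalization are hypothetical choices, not details taken from the paper.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, agent_mask, eps=1e-8):
    """Mean squared error computed only where agent_mask == 1.

    pred, target: (H, W, C) predicted and true next frames
    agent_mask:   (H, W) binary mask, 1 on agent pixels and 0 elsewhere
    """
    mask = agent_mask[..., None].astype(pred.dtype)   # broadcast over channels
    sq_err = (pred - target) ** 2 * mask               # zero out non-agent pixels
    denom = mask.sum() * pred.shape[-1] + eps          # number of masked values
    return sq_err.sum() / denom

# Example: a prediction that is wrong only in the background incurs no loss.
H, W, C = 4, 4, 3
target = np.zeros((H, W, C))
agent_mask = np.zeros((H, W), dtype=int)
agent_mask[1:3, 1:3] = 1
pred = np.ones((H, W, C))
pred[1:3, 1:3, :] = 0.0   # correct on the agent, off by one everywhere else
loss_background_error = masked_reconstruction_loss(pred, target, agent_mask)
```

With this form, motion in the background contributes nothing to the gradient, so the latent action has no incentive to encode it.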

2601.21666 2026-05-28 cs.AI cs.CV 版本更新

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

SONIC-O1：用于评估多模态大语言模型在音视频理解上的真实世界基准

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

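Affiliations: Vector Institute for Artificial Intelligence; University of Groningen; York University

AI summary: Introduces the SONIC-O1 benchmark, 60 hours of human-verified audio-video data, to evaluate multimodal large language models on open-ended summarization, multiple-choice question answering, and temporal localization, finding substantial performance gaps and demographic disparities in temporal localization.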

AI中文摘要

多模态大语言模型（MLLMs）是近期AI研究的主要焦点。然而，大多数先前工作集中于静态图像理解，而它们处理序列音视频数据的能力仍未充分探索。这一差距凸显了需要一个高质量基准来系统评估MLLM在真实世界场景中的性能。我们介绍了SONIC-O1，一个全面的、完全人工验证的基准，包含60小时（231个片段）跨越13个真实世界对话领域的数据，带有4,958个注释和人口统计元数据。SONIC-O1评估三种能力：开放摘要、多项选择题（MCQ）回答以及带有支持理由（推理）的时序定位。在闭源和开源模型中，我们发现MCQ准确率显示模型家族之间的差距最小，但最好的闭源模型在时序定位上比最好的开源模型高出22.6%。我们进一步观察到不同人口统计组在时序定位上的准确率差距高达21.4%，表明模型行为存在持续差异。SONIC-O1为基于时序和人口统计鲁棒的多模态理解提供了一个开放评估套件。SONIC-O1公开可用于研究：项目页面（https://vectorinstitute.github.io/sonic-o1/）、数据集（https://huggingface.co/datasets/vector-institute/sonic-o1）、GitHub（https://github.com/vectorinstitute/sonic-o1）、排行榜（https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard）。

英文摘要

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark of 60 hours (231 clips) spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates three capabilities: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Across closed- and open-source models, we find that the MCQ accuracy shows the smallest gap between model families, but the best closed-source model outperforms the best open-source model by 22.6% on temporal localization. We further observe accuracy gaps of up to 21.4% on temporal localization across demographic groups, indicating persistent disparities in model behaviour. SONIC-O1 provides an open evaluation suite for temporally grounded and demographically robust multimodal understanding. SONIC-O1 is publicly available for research: Project page (https://vectorinstitute.github.io/sonic-o1/), Dataset (https://huggingface.co/datasets/vector-institute/sonic-o1), GitHub (https://github.com/vectorinstitute/sonic-o1), Leaderboard (https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard).

2601.17737 2026-05-28 cs.CV cs.AI 版本更新

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

脚本即一切：一个用于长程对话到电影视频生成的智能体框架

Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus

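Affiliations: Tencent

AI summary: Proposes an end-to-end agentic framework in which a trained ScripterAgent converts dialogue into a fine-grained script and a DirectorAgent applies a cross-scene continuous generation strategy, producing coherent long-horizon dialogue-to-cinematic video and significantly improving script faithfulness and temporal fidelity.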

AI中文摘要

近期视频生成的进展产生了能够从简单文本提示合成惊艳视觉内容的模型。然而，这些模型难以从对话等高层概念生成连贯的长篇叙事，揭示了创意想法与其电影执行之间的“语义鸿沟”。为弥合这一鸿沟，我们引入了一个新颖的、端到端的智能体框架，用于对话到电影视频的生成。我们框架的核心是ScripterAgent，一个经过训练将粗略对话转化为精细、可执行的电影脚本的模型。为此，我们构建了ScriptBench，一个具有丰富多模态上下文的新大规模基准，通过专家引导的流程进行标注。生成的脚本随后指导DirectorAgent，它使用跨场景连续生成策略协调最先进的视频模型，以确保长程连贯性。我们的全面评估，包括一个AI驱动的CriticAgent和一个新的视觉-脚本对齐（VSA）指标，表明我们的框架在所有测试的视频模型上显著提高了脚本忠实度和时间保真度。此外，我们的分析揭示了当前SOTA模型在视觉奇观与严格脚本遵循之间的关键权衡，为自动化电影制作的未来提供了宝贵见解。

英文摘要

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2601.10714 2026-05-28 cs.CV cs.GR 版本更新

Alterbute: Editing Intrinsic Attributes of Objects in Images

Alterbute: 编辑图像中物体的内在属性

Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen

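Affiliations: Google; The Hebrew University of Jerusalem; Reichman University

AI summary: Proposes Alterbute, a diffusion-based method that combines a relaxed training objective with Visual Named Entities to edit intrinsic attributes such as color, texture, material, and shape while preserving object identity and scene context.

Comments ICML 2026. Project page is available at https://talreiss.github.io/alterbute/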

AI中文摘要

我们介绍了Alterbute，一种基于扩散的方法，用于编辑图像中物体的内在属性。我们允许改变物体的颜色、纹理、材质甚至形状，同时保持其感知身份和场景上下文。现有方法要么依赖无监督先验，往往无法保持身份，要么使用过度严格的监督，阻止有意义的内部变化。我们的方法依赖于：(i) 一个松弛的训练目标，允许模型在身份参考图像、描述目标内在属性的文本提示以及定义外在上下文的背景图像和物体掩码的条件下，改变内在和外在属性。在推理时，我们通过重用原始背景和物体掩码来限制外在变化，从而确保只改变所需的内在属性；(ii) 视觉命名实体（VNEs）——细粒度的视觉身份类别（例如“保时捷911 Carrera”），这些类别将共享身份定义特征的物体分组，同时允许内在属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和内在属性描述，从而实现可扩展的、保持身份的监督。Alterbute在保持身份的物体内在属性编辑方面优于现有方法。

英文摘要

We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

URL PDF HTML ☆

赞 0 踩 0

2601.10334 2026-05-28 cs.CV cs.LG 版本更新

An analytic theory of convolutional neural network inverse problems solvers

卷积神经网络逆问题求解器的解析理论

Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss

发表机构 * IRIT & CBI, CNRS & Université Toulouse, France ； Toulouse School of Economics, Université Toulouse Capitole, France

AI总结通过最小均方误差估计器引入平移等变性和有限感受野的归纳偏置，推导出局部等变MMSE的解析公式，并在多种逆问题、数据集和架构上验证其与神经网络输出高度一致。

详情

Journal ref: Forty-Third International Conference on Machine Learning, 2026

AI中文摘要

监督卷积神经网络（CNN）被广泛用于解决成像逆问题，在众多应用中取得了最先进的性能。然而，尽管取得了经验上的成功，这些方法从理论角度仍缺乏理解，常被视为黑箱。为弥合这一差距，我们通过最小均方误差（MMSE）估计器的视角分析训练后的神经网络，并引入捕获CNN两个基本归纳偏置（平移等变性和通过有限感受野的局部性）的功能约束。在经验训练分布下，我们推导出这种约束变体（称为局部等变MMSE，LE-MMSE）的解析、可解释且易于计算的公式。通过在不同逆问题（去噪、修复、去卷积）、数据集（FFHQ、CIFAR-10、FashionMNIST）和架构（U-Net、ResNet、PatchMLP）上的大量数值实验，我们证明了我们的理论与神经网络输出相匹配（PSNR $\gtrsim25$dB）。此外，我们提供了对物理感知和物理无关估计器之间差异、训练（补丁）分布中高密度区域的影响以及其他因素（数据集大小、补丁大小等）影响的见解。

英文摘要

Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR $\gtrsim 25$ dB). Furthermore, we provide insights into the differences between *physics-aware* and *physics-agnostic* estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).

URL PDF HTML ☆

赞 0 踩 0

2601.08617 2026-05-28 cs.CV 版本更新

SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

SoC: 测试时提示调优的语义正交校准

Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz

发表机构 * MICS, CentraleSupélec, Université Paris-Saclay（MICS，CentraleSupélec，巴黎萨克雷大学）； LIVIA, ILLS, ÉTS Montréal（LIVIA，ILLs，蒙特利尔ÉTS）

AI总结针对视觉语言模型测试时提示调优中校准被忽视的问题，提出基于Huber的正则化方法SoC，在保持语义邻近性的同时实现平滑的原型分离，从而改善校准性能并保持判别能力。

详情

Journal ref: CVPR 2026

AI中文摘要

随着视觉语言模型（VLM）在医疗或自动驾驶等关键决策系统中的日益普及，对其不确定性估计的校准变得至关重要。然而，这一维度在VLM测试时提示调优（TPT）文献中尚未得到充分探索，该领域主要侧重于提升其判别性能。最近的最先进方法主张对文本提示嵌入对实施完全正交约束以增强可分离性，从而改善校准。然而，正如我们在理论上所示，完全正交约束的固有梯度会强烈地将语义相关的类别推开，最终使模型过度自信。基于我们的发现，我们提出了语义正交校准（SoC），一种基于Huber的正则化器，它在保持语义邻近性的同时实现平滑的原型分离，从而相比于先前的基于正交性的方法改善了校准。通过全面的实证验证，我们证明SoC在保持竞争性判别能力的同时，持续改善了校准性能。

英文摘要

With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints strongly push semantically related classes apart, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance while maintaining competitive discriminative capabilities.
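The contrast between a full-orthogonality penalty and a Huber-style one can be illustrated on a single pair of text prototypes with cosine similarity `s`. The sketch below is only an illustration: the abstract does not give SoC's actual loss, so the function names, the squared penalty used as the "full orthogonality" baseline, and the threshold `delta` are all assumptions.

```python
def squared_penalty_grad(s: float) -> float:
    """Gradient of the assumed full-orthogonality penalty L(s) = s**2."""
    return 2.0 * s


def huber_penalty_grad(s: float, delta: float = 0.3) -> float:
    """Gradient of a Huber penalty on s (delta is a hypothetical threshold).

    L(s) = 0.5 * s**2                    if |s| <= delta
    L(s) = delta * (|s| - 0.5 * delta)   otherwise
    """
    if abs(s) <= delta:
        return s
    return delta if s > 0 else -delta


# Two semantically close classes (high similarity) versus two distant ones.
for s in (0.9, 0.1):
    print(s, squared_penalty_grad(s), huber_penalty_grad(s))
```

Under these assumptions the squared penalty pushes a highly similar pair apart with a force that grows with `s`, whereas the Huber gradient is capped at `delta`, which is one way to read "smooth prototype separation while preserving semantic proximity".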

URL PDF HTML ☆

赞 0 踩 0

2601.03549 2026-05-28 cs.CV cs.CL 版本更新

FEA-SLT: A Gloss-Free End-to-End Framework for Facial-Expression-Aware Sign Language Translation

FEA-SLT：一种面向面部表情感知的手语翻译的无gloss标注（gloss-free）端到端框架

Guobin Tu, Di Weng

发表机构 * School of Software Technology, Zhejiang University（浙江大学软件学院）

AI总结提出FEA-SLT框架，通过面部表情感知融合模块利用面部动态作为语义锚点，解决无gloss标注（gloss-free）手语翻译中手势歧义问题，在PHOENIX14T和CSL-Daily数据集上达到最优BLEU性能。

详情

AI中文摘要

手语翻译（SLT）是一项具有挑战性的跨模态任务，需要对手部动作和非手动信号进行联合建模。现有的无gloss标注（gloss-free）SLT方法有效捕捉手势动态，但常常未充分利用面部表情，而面部表情在语法和消除歧义中起着关键作用。当不同概念共享相似手部配置时，这一限制可能导致语义退化。为解决此问题，我们提出FEA-SLT（面部表情感知手语翻译），一种无gloss标注的端到端框架，利用面部动态作为语义锚点来消除手部歧义。FEA-SLT采用领域迁移的面部编码器提取表情敏感表示，并通过语言约束的面部表情感知融合（FEAF）模块将其与手部特征集成。FEAF通过双向调制捕捉手部和面部通道之间的相互依赖关系，增强句法保真度。在PHOENIX14T和CSL-Daily上的实验表明，FEA-SLT在无gloss标注方法中实现了最先进的BLEU性能，而针对性分析证实了其对面部敏感语句翻译的改进。代码可在[https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT)获取。

英文摘要

Sign Language Translation (SLT) is a challenging cross-modal task requiring joint modeling of manual articulations and non-manual signals. Existing gloss-free SLT methods effectively capture gestural dynamics but often underutilize facial expressions, which play crucial grammatical and disambiguating roles. This limitation can cause semantic degradation when distinct concepts share similar manual configurations. To address this issue, we propose FEA-SLT (**F**acial-**E**xpression-**A**ware **S**ign **L**anguage **T**ranslation), a gloss-free end-to-end framework that uses facial dynamics as semantic anchors for resolving manual ambiguity. FEA-SLT employs a domain-transferred facial encoder to extract expression-sensitive representations and integrates them with manual features through a linguistically constrained *Facial-Expression-Aware Fusion* (FEAF) module. FEAF captures reciprocal dependencies between manual and facial channels via bidirectional modulation, enhancing syntactic fidelity. Experiments on PHOENIX14T and CSL-Daily show that FEA-SLT achieves state-of-the-art BLEU performance among gloss-free methods, while targeted analyses confirm improved translation of facial-sensitive utterances. Code is available at [https://github.com/TuGuobin/FEA-SLT](https://github.com/TuGuobin/FEA-SLT).

URL PDF HTML ☆

赞 0 踩 0

2601.03048 2026-05-28 cs.CV cs.AI cs.CC 版本更新

On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

关于Transformer图像嵌入在非可解空间推理中的内在限制

Siyi Lyu, Quan Liu, Feng Yan

发表机构 * School of Electronic Science and Engineering, Nanjing University, Nanjing, China（电子科学与工程学院，南京大学，南京，中国）

AI总结本文通过将空间理解形式化为群同态问题，证明恒定深度Transformer由于TC⁰复杂度限制，无法在单次前向传播中捕获非可解群（如SO(3)）的空间结构。

详情

AI中文摘要

视觉Transformer（ViT）在语义识别方面表现出色，但在心理旋转等空间推理任务中却出现系统性失败。虽然这通常归因于数据规模，但本文认为该限制源于架构的内在电路复杂度。通过将空间理解形式化为学习一个群同态问题——其中潜在嵌入保留作用于图像的物理变换的代数结构——我们识别出一个基本的计算瓶颈。具体来说，对于非可解群（例如$\mathrm{SO}(3)$），维持这种保结构嵌入的下界由单词问题决定，该问题是$\mathsf{NC^1}$-完全的。相比之下，具有多项式精度的恒定深度ViT严格受限于复杂度类$\mathsf{TC^0}$。在标准猜想$\mathsf{TC^0} \subsetneq \mathsf{NC^1}$下，出现了一个复杂度边界：恒定深度架构缺乏在单次前向传播中捕获非可解空间结构所需的逻辑深度。为了实证验证这一理论差距，我们提出了潜在空间代数（LSA）基准，该基准揭示了随着非可解任务组合深度的增加，ViT表示出现显著退化。

英文摘要

Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as learning a Group Homomorphism Problem -- where latent embeddings preserve the algebraic structure of physical transformations acting on images -- we identify a fundamental computational bottleneck. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision are strictly bounded by the complexity class $\mathsf{TC^0}$. Under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, a complexity boundary emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures in a single forward pass. To empirically validate this theoretical gap, we propose the Latent Space Algebra (LSA) benchmark, which reveals a significant degradation in ViT representations as the compositional depth of non-solvable tasks increases.
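The Word Problem mentioned above can be made concrete on a small finite non-solvable group. The sketch below uses the symmetric group $S_5$ (non-solvable because it contains $A_5$) as a stand-in for the abstract's $\mathrm{SO}(3)$ example; it is not the paper's LSA benchmark, and the helper names are hypothetical. Given a word, i.e. a sequence of group elements, the task is to decide whether their product is the identity.

```python
from functools import reduce


def compose(p, q):
    """Return the permutation 'apply q first, then p' (both as index tuples)."""
    return tuple(p[q[i]] for i in range(len(p)))


def word_is_identity(word, n=5):
    """Decide the word problem: does the product of `word` equal the identity?"""
    identity = tuple(range(n))
    return reduce(compose, word, identity) == identity


# A 3-cycle on {0, 1, 2}, written as the image of each index.
a = (1, 2, 0, 3, 4)
print(word_is_identity([a, a, a]))  # a has order 3
print(word_is_identity([a, a]))
```

The loop here composes elements one after another, so its sequential depth grows with the length of the word; the abstract's argument is that a constant-depth model has no comparable way to track such a product as the word gets longer.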

URL PDF HTML ☆

赞 0 踩 0

2504.10079 2026-05-28 cs.CV 版本更新

Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

层次化关系增强表示泛化用于少样本动作识别

Hongyu Qu, Ling Xing, Jiachao Zhang, Rui Yan, Yazhou Yao, Xiangbo Shu

发表机构 * School of Computer Science and Engineering, Nanjing University of Science and Technology（南京理工大学计算机科学与工程学院）； Artificial Intelligence Industrial Technology Research Institute, Nanjing Institute of Technology（南京工程学院人工智能产业技术研究院）

AI总结提出HR2G-shot框架，通过统一帧间、视频间和任务间三种关系建模，从整体视角学习任务特定的时间模式，以提升少样本动作识别的性能。

详情

AI中文摘要

少样本动作识别（FSAR）旨在通过少量样本识别新动作类别。现有方法通常通过设计帧间时间建模策略或粗粒度视频级交互来学习每个视频的帧级表示。然而，它们孤立地处理每个情节任务，忽略了视频间的细粒度时间关系建模，因此无法捕获跨视频共享的细粒度时间模式，也无法重用历史任务的时间知识。鉴于此，我们提出了HR2G-shot，一种用于FSAR的层次化关系增强表示泛化框架，它统一了三种关系建模（帧间、视频间和任务间），从整体视角学习任务特定的时间模式。除了进行帧间时间交互外，我们进一步设计了两个组件分别探索视频间和任务间关系：i) 视频间语义相关性（ISC）以细粒度方式执行跨视频帧级交互，从而捕获任务特定的查询特征，并增强类内一致性和类间可分离性；ii) 任务间知识迁移（IKT）从存储历史情节任务中多样时间模式的库中检索和聚合相关时间知识。在五个基准上的大量实验表明，HR2G-shot优于当前领先的FSAR方法。

英文摘要

Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies or coarse, video-level inter-video interactions. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture fine-grained temporal patterns shared across videos and to reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Going beyond inter-frame temporal interactions, we further devise two components to explore inter-video and inter-task relationships, respectively: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from a bank that stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current leading FSAR methods.

URL PDF HTML ☆

赞 0 踩 0

2601.00501 2026-05-28 cs.CV 版本更新

CPPO: Contrastive Perception Policy Optimization for VLM Agents

CPPO: 面向VLM智能体的对比感知策略优化

Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Yong Zhang, Mohammad Akbari

发表机构 * Huawei Technologies Canada Co. Ltd.（华为技术加拿大有限公司）； Huawei Cloud（华为云）

AI总结提出一种自监督的对比感知策略优化方法CPPO，通过对比感知损失增强视觉语言模型的视觉基础能力，无需额外模型或标注，在感知关键任务中优于现有方法。

详情

AI中文摘要

我们引入了CPPO，一种用于微调视觉语言模型（VLM）的对比感知策略优化方法。可靠的感知是基于VLM的智能体在开放环境中推理和行动的核心要求：错误的视觉基础直接导致错误的行为、幻觉工具调用和不安全的决策。虽然强化学习（RL）显著提升了语言模型的推理能力，但将这些进展扩展到多模态智能体需要同时改进感知和推理。先前的工作主要通过显式感知奖励来解决这一挑战，这通常需要额外的LLM评判器、真实标注或强制将感知与推理分离。CPPO通过扩展RL目标，引入对比感知损失（CPL），以自监督方式解决了这一限制，为视觉基础提供了直接的学习信号。对比目标鼓励模型对输入的视觉信息更加敏感。为了有效应用这一信号，CPPO利用在扰动图像下模型输出分布中的熵移机制识别感知令牌，并在训练期间选择性地对这些令牌应用对比损失。实验表明，CPPO在避免额外模型的同时超越了先前方法，使训练更加高效和可扩展，并产生了更适合感知关键智能体任务的策略。

英文摘要

We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). Reliable perception is a core requirement for VLM-based agents that must reason and act in open-ended environments: faulty visual grounding cascades directly into faulty actions, hallucinated tool calls, and unsafe decisions. While reinforcement learning (RL) has significantly improved reasoning in language models, extending these advances to multimodal agents requires improving both perception and reasoning. Prior works address this challenge mainly through explicit perception rewards, which often require extra LLM judges, ground-truth annotations, or forced separation of perception from reasoning. CPPO addresses this limitation in a self-supervised manner by extending the RL objective with a Contrastive Perception Loss (CPL) that provides a direct learning signal for visual grounding. The contrastive objective encourages the model to become more sensitive to input visual information. To apply this signal effectively, CPPO identifies perception tokens using an entropy-shift mechanism in the model's output distributions under perturbed images and applies the contrastive loss selectively to those tokens during training. Experiments show that CPPO surpasses prior methods while avoiding extra models, making training more efficient and scalable, and yielding policies that are better suited to perception-critical agentic tasks.
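The entropy-shift idea can be sketched as follows: compare each output token's predictive entropy with the clean image against its entropy with a perturbed image, and treat the tokens whose entropy moves the most as perception tokens. This is a minimal sketch, not CPPO's actual procedure; the function names, the use of the entropy increase (rather than, say, its absolute value), and the `top_k` selection rule are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one token's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def perception_token_indices(clean_dists, perturbed_dists, top_k):
    """Indices of the top_k tokens whose entropy rises most under perturbation."""
    shifts = [entropy(q) - entropy(p) for p, q in zip(clean_dists, perturbed_dists)]
    ranked = sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy distributions over a 2-word vocabulary for three generated tokens.
clean = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]
perturbed = [[0.5, 0.5], [0.5, 0.5], [0.98, 0.02]]
print(perception_token_indices(clean, perturbed, top_k=1))
```

In this toy case only the first token becomes markedly less certain once the image is perturbed, so it is the one a selective contrastive loss would target.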

URL PDF HTML ☆

赞 0 踩 0

2512.16483 2026-05-28 cs.CV 版本更新

FasterVAR: Plug-and-Play Acceleration for Visual Autoregressive Models

FasterVAR：视觉自回归模型的即插即用加速

Senmao Li, Kai Wang, Salman Khan, Fahad Shahbaz Khan, Jian Yang, Yaxing Wang

发表机构 * PCA Lab, VCIP, College of Computer Science, Nankai University（南开大学计算机学院、VCIP、PCA实验室）； Program of Computer Science, City University of Hong Kong (Dongguan), China（香港城市大学（东莞）计算机系，中国）； City University of Hong Kong, HK SAR, China（香港城市大学，香港特别行政区，中国）； Mohamed bin Zayed University of Artificial Intelligence, UAE（穆罕默德·本·扎耶德人工智能大学，阿联酋）； Linkoping University, Sweden（林雪平大学，瑞典）； PCA Lab, School of Intelligence Science and Technology, Nanjing University（南京大学智能科学与技术学院、PCA实验室）； College of Artificial Intelligence, Jilin University（吉林大学人工智能学院）

AI总结针对VAR模型在大尺度步骤计算复杂度高的问题，提出一种基于阶段感知的即插即用加速框架FasterVAR，通过保留早期关键步骤并剪枝或近似后期细节步骤，实现最高3.4倍加速且几乎无性能损失。

Comments Accepted at ICML2026

详情

AI中文摘要

视觉自回归（VAR）建模通过下一尺度预测偏离了传统自回归（AR）模型的下一个标记预测范式，实现了高质量的图像生成。然而，VAR范式在大尺度步骤上面临计算复杂度和运行时间急剧增加的问题。尽管现有的加速方法减少了大尺度步骤的运行时间，但依赖于手动步骤选择，并忽略了生成过程中不同阶段的不同重要性。为了解决这一挑战，我们提出了FasterVAR，一个对VAR模型的系统研究和即插即用加速框架。我们的分析表明，早期步骤对于保持语义和结构一致性至关重要，应保持不变，而后期步骤主要细化细节，可以被剪枝或近似以加速。基于这些见解，FasterVAR引入了一种即插即用加速策略，利用后期计算中的语义无关性和低秩属性，无需额外训练。我们提出的FasterVAR实现了最高3.4倍的加速，且几乎没有性能损失，持续优于现有的加速基线。这些结果凸显了阶段感知设计作为高效视觉自回归图像生成的一个强大原则。

英文摘要

Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present FasterVAR, a systematic study and plug-and-play acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, FasterVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed FasterVAR achieves up to 3.4x speedup with almost no performance loss, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.

URL PDF HTML ☆

赞 0 踩 0

2512.00814 2026-05-28 cs.CV 版本更新

IRPO: Boosting Image Restoration via Post-training GRPO

IRPO：通过后训练GRPO提升图像恢复

Haoxuan Xu, Yi Liu, Tianfu Li, Ruolin Shen, Boyuan Jiang, Jinlong Peng, Donghao Luo, Xiaobin Hu, Shuicheng Yan, Haoang Li

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Tsinghua University（清华大学）； Technical University of Munich（慕尼黑工业大学）； Zhejiang University（浙江大学）； Shanghai Jiao Tong University（上海交通大学）； Fudan University（复旦大学）； National University of Singapore（新加坡国立大学）

AI总结提出IRPO框架，利用GRPO后训练优化确定性恢复模型，通过数据筛选和复合奖励建模，在域内和域外任务上显著提升性能。

详情

AI中文摘要

后训练在高层次生成任务中已变得有效，但在低层次视觉中的作用仍未被充分探索。现有的图像恢复方法通常依赖于对真实图像的固定逐像素拟合，这可能导致过度平滑和泛化能力弱。我们提出了IRPO，一个基于GRPO的后训练框架，用于确定性恢复模型。IRPO围绕两个轴构建：数据公式化和奖励建模。对于数据公式化，我们从预训练阶段选择表现最差的30%样本，这提高了准确性和训练效率。对于奖励建模，我们将面向保真度和面向质量的反馈与三个组件结合：用于结构保真度的通用奖励、使用视觉-语言模型作为粗粒度视觉质量评判的专家奖励，以及用于任务特定低级线索的恢复奖励。在六个域内和五个域外基准上的实验表明，IRPO在域内任务上将AdaIR基线提高了0.93 dB，在域外设置上提高了3.43 dB。我们的代码可在https://github.com/HaoxuanXU1024/IRPO查看。

英文摘要

Post-training has become effective for high-level generation, but its role in low-level vision remains underexplored. Existing image restoration methods often rely on fixed pixel-wise fitting to ground-truth images, which can lead to over-smoothing and weak generalization. We propose IRPO, a GRPO-based post-training framework for deterministic restoration models. IRPO is built around two axes: data formulation and reward modeling. For data formulation, we select the worst-performing 30% of samples from the pre-training stage, which improves both accuracy and training efficiency. For reward modeling, we combine fidelity-oriented and quality-aware feedback with three components: a General Reward for structural fidelity, an Expert Reward that uses a Vision-Language Model as a coarse visual-quality judge, and a Restoration Reward for task-specific low-level cues. Experiments on six in-domain and five out-of-domain (OOD) benchmarks show that IRPO improves the AdaIR baseline by 0.93 dB on in-domain tasks and 3.43 dB on OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
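The data-formulation step can be sketched as a simple ranking: score every training sample with the pre-trained model and keep the lowest-scoring 30% for post-training. The snippet below is a minimal sketch under stated assumptions; the abstract does not say which metric ranks the samples, so the per-sample PSNR values and the function name `select_underperforming` are hypothetical.

```python
def select_underperforming(scores, fraction=0.3):
    """Return indices of the lowest-scoring `fraction` of samples, worst first."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k]


# Hypothetical per-sample PSNR (dB) of the pre-trained restoration model.
psnr = [30.1, 25.4, 33.0, 28.7, 22.9, 31.5, 27.2, 35.8, 24.0, 29.9]
print(select_underperforming(psnr))
```

With ten samples the call keeps three indices, those of the samples the pre-trained model restores worst.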

URL PDF HTML ☆

赞 0 踩 0

2508.13544 2026-05-28 cs.CV cs.AI 版本更新

FLAIR: Frequency- and Locality-Aware Implicit Neural Representations

FLAIR: 频率与位置感知的隐式神经表示

Sukhun Ko, Seokhyun Youn, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh

AI总结针对隐式神经表示缺乏频率选择性和空间定位导致频谱偏差的问题，提出带限局部激活和小波能量引导编码，提升2D图像表示、3D形状重建和新视角合成性能。

Comments CVPR Findings 2026 (camera ready ver.). Please visit our project page at https://cmlab-korea.github.io/FLAIR/

详情

AI中文摘要

隐式神经表示利用神经网络将坐标映射到对应信号，实现连续且紧凑的表示。该范式推动了各种视觉任务的重大进展。然而，现有的隐式神经表示缺乏频率选择性和空间定位，导致过度依赖冗余信号分量。因此，它们表现出频谱偏差，倾向于早期学习低频分量，而难以捕捉精细的高频细节。为了解决这些问题，我们提出了FLAIR（频率与位置感知的隐式神经表示），它包含两个关键创新。第一个是带限局部激活（BLA），这是一种新颖的激活函数，设计用于在时频不确定性原理（TFUP）约束下进行联合频率选择和空间定位。通过结构化的频率控制和空间局部响应，BLA有效减轻了频谱偏差并增强了训练稳定性。第二个是小波能量引导编码（WEGE），它利用离散小波变换计算能量分数，并显式地将频率信息引导到网络，实现精确的频率选择和自适应频带控制。我们的方法在2D图像表示、3D形状重建和新视角合成方面始终优于现有的隐式神经表示。

英文摘要

Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity and spatial localization, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is Band-Localized Activation (BLA), a novel activation designed for joint frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). Through structured frequency control and spatially localized responses, BLA effectively mitigates spectral bias and enhances training stability. The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform to compute energy scores and explicitly guide frequency information to the network, enabling precise frequency selection and adaptive band control. Our method consistently outperforms existing INRs in 2D image representation, as well as 3D shape reconstruction and novel view synthesis.
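The energy scores that WEGE derives from the discrete wavelet transform can be illustrated with a 1D Haar transform: each decomposition level yields detail coefficients for one frequency band, and the sum of their squares is that band's energy. This is a sketch only, assuming an orthonormal Haar wavelet and a 1D signal whose length is divisible by `2**levels`; the abstract does not specify the wavelet or how FLAIR feeds the scores to the network, and the function names are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    approx, detail = [], []
    for a, b in zip(signal[0::2], signal[1::2]):
        approx.append((a + b) / math.sqrt(2))
        detail.append((a - b) / math.sqrt(2))
    return approx, detail


def band_energies(signal, levels):
    """Energy per band: detail bands from fine to coarse, then the final approximation."""
    energies = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_step(current)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(c * c for c in current))
    return energies


signal = [4, 2, 5, 5, 1, 3, 0, 2]
print(band_energies(signal, levels=2))
```

Because the transform is orthonormal, the band energies add up to the energy of the original signal, so they can be read as a partition of the signal's energy across frequency bands.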

URL PDF HTML ☆

赞 0 踩 0

2512.01988 2026-05-28 cs.CV 版本更新

Artemis: Structured Visual Reasoning for Perception Policy Learning

Artemis: 用于感知策略学习的结构化视觉推理

Wei Tang, Yanpeng Sun, Shan Zhang, Weihao Bo, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li

发表机构 * NJUST IMAG（南京理工大学图像所）； YZU ； SUTD IMPL（新加坡科技设计大学智能感知实验室）； Adelaide AIML（阿德莱德人工智能实验室）； Data61 ； CSIRO（澳大利亚联邦科学与工业研究组织）； ZJU（浙江大学）； UNSW Sydney（新南威尔士大学悉尼分校）； SenseTime Research（商汤科技研究院）

AI总结提出Artemis方法，通过结构化视觉推理（中间步骤表示为（标签，边界框）对）替代语言推理，提升视觉感知策略的性能，并统一处理多种感知任务。

详情

AI中文摘要

最近的视觉感知策略强化学习框架通常结合用自然语言表达的中间推理链。经验观察表明，这种纯语言中间推理通常会降低感知任务的性能。我们认为核心问题不在于推理本身，而在于推理的形式：虽然这些链在非结构化的语言空间中进行语义推理，但视觉感知需要在空间和以对象为中心的空间中进行推理。为此，我们引入了Artemis，一种感知策略学习方法，它执行结构化的视觉推理，其中每个中间步骤都表示为一个（标签，边界框）对，捕获可验证的视觉状态。这种设计能够显式跟踪中间状态，直接监督提议质量，并避免基于语言的推理引入的歧义。基于可验证和空间定位的推理链，Artemis为各种感知任务提供了统一的架构，无需依赖先前感知策略模型所依赖的任务特定设计。使用自然图像域中的定位和检测样本进行训练，Artemis泛化到计数和几何感知任务。其核心是空间定位的、以对象为中心的链式规则，为可扩展和通用的感知策略提供了原则性基础。

英文摘要

Recent reinforcement-learning frameworks for visual perception policy usually incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, **visual perception requires reasoning in a spatial and object-centric space**. In response, we introduce **Artemis**, a perception-policy learning method that performs structured visual reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision of proposal quality, and it avoids the ambiguity introduced by language-based reasoning. Building upon verifiable and spatially grounded reasoning chains, Artemis provides a unified architecture for diverse perceptual tasks, without requiring the task-specific designs relied upon by prior perceptual policy models. Trained using grounding and detection samples in natural image domains, Artemis generalizes to counting and geometric perception tasks. At its core, a spatially grounded, object-centric chain rule provides a principled foundation for scalable and general perceptual policies.

URL PDF HTML ☆

赞 0 踩 0

2511.20934 2026-05-28 cs.AI cs.CV cs.LG 版本更新

Guaranteed Optimal Compositional Explanations for Neurons

神经元的保证最优组合解释

Biagio La Rosa, Leilani H. Gilpin

发表机构 * Computer Science and Engineering Department, University of California, Santa Cruz, US（加州大学圣克鲁兹分校计算机科学与工程系）

AI总结提出首个框架，通过分解、启发式和算法，在完整状态空间上计算保证最优的组合解释，并证明10-40%的波束搜索解释在概念重叠时非最优。

Comments Accepted at ICML 2026 (Oral), 43 pages, 10 figures

详情

AI中文摘要

组合解释是一类方法，旨在通过逻辑规则描述神经元感受野激活与概念之间的空间对齐，通常通过搜索所有可能的概念组合来计算。由于在整个状态空间上计算空间对齐在计算上不可行，文献中通常采用与组合结构相关的假设和波束搜索来限制状态空间。然而，波束搜索无法提供任何最优性的理论保证，且当前解释与真正最优解的接近程度仍不清楚。在这篇理论性论文中，我们通过引入首个框架来解决这一差距，该框架在采用假设所涵盖的整个状态空间上计算保证最优的组合解释。具体而言，我们提出：(i) 一种识别影响空间对齐因素的分解方法，(ii) 一种在搜索任何阶段估计对齐的启发式方法，以及(iii) 第一个能够在与穷举波束搜索相当的时间内计算最优组合解释的算法。使用该框架，我们证明当涉及重叠概念时，先前通过波束搜索获得的10-40%的解释是次优的。最后，我们评估了一种由我们提出的分解和启发式方法引导的波束搜索变体，表明它在超参数和计算资源方面提供更大灵活性的同时，匹配或改进了先前方法的运行时间。

英文摘要

Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations and concepts through logical rules, typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts assumptions related to the structure of the combinations and beam search to restrict the state space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations over the entire state space spanned by the adopted assumptions. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations in a time comparable to exhaustive beam search. Using this framework, we demonstrate that 10-40% of explanations previously obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.
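The quantity being searched over can be sketched with sets: a neuron's thresholded activation mask and each concept's segmentation mask are sets of pixel indices, a logical formula combines concept masks with set operations, and the formula's spatial alignment is scored here by intersection over union (IoU). The masks, concept names, and the choice of IoU as the alignment score are assumptions made for illustration; the abstract does not spell out the paper's alignment measure or its search algorithm.

```python
def iou(a: set, b: set) -> float:
    """Intersection over union of two pixel-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical masks over a tiny image with pixels numbered 1..6.
neuron = {1, 2, 3, 4}
water = {1, 2}
river = {3, 4, 5}

print(iou(neuron, water))           # single concept
print(iou(neuron, water | river))   # formula: water OR river
print(iou(neuron, water - river))   # formula: water AND NOT river
```

The number of candidate formulas grows combinatorially with formula length, which is why the search is usually restricted with beam search and why a guarantee of optimality is non-trivial.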

URL PDF HTML ☆

赞 0 踩 0

2511.20439 2026-05-28 cs.CV cs.AI 版本更新

Object-Centric Vision Token Pruning for Vision Language Models

面向视觉语言模型的以对象为中心的视觉令牌剪枝

Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

发表机构 * Aalto University（阿尔托大学）； University of Electronic Science and Technology of China（电子科技大学）； Delft University of Technology（代尔夫特理工大学）

AI总结提出OC-VTP方法，通过轻量预训练以对象为中心的视觉令牌剪枝器，直接选择最具代表性的视觉令牌，在保持高精度的同时提升VLM推理效率。

详情

AI中文摘要

在视觉语言模型（VLM）中，与语言令牌相比，视觉令牌数量庞大但信息分散，因此消耗了大量不必要的计算。为了提升VLM推理效率，剪枝冗余视觉令牌的研究一直在进行，但现有方法都采用间接且无保证的方式。我们提出了OC-VTP，一种直接且有保证的方法，用于选择最具代表性的视觉令牌，以实现高效且保持精度的VLM推理。我们的OC-VTP仅需对一个小型的以对象为中心的视觉令牌剪枝器进行轻量预训练，然后即可将其插入现有VLM中，无需在任何数据集上微调任何模型。通过最小化从所选令牌重建原始未剪枝令牌的误差，保证保留最具代表性的视觉令牌。在任何视觉剪枝比例（即推理效率）下，我们的OC-VTP都能一致地帮助主流VLM保持最高的推理精度。我们的剪枝还展示了有趣的可解释性。我们的代码可在 https://github.com/GarryLarry010131/OC-VTP 获取。

英文摘要

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs without fine-tuning any model on any dataset. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.

URL PDF HTML ☆

赞 0 踩 0

2511.02558 2026-05-28 cs.CV cs.LG q-bio.NC 版本更新

Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction

预测未来解剖结构：纵向脑MRI到MRI的预测

Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka

发表机构 * A.I. Virtanen Institute for Molecular Sciences, University of Eastern Finland, Kuopio, Finland（A.I. Virtanen分子科学研究所，东芬兰大学，库奥皮奥，芬兰）

AI总结本文研究从基线MRI预测未来脑部MRI，采用五种深度学习架构（UNet、U2-Net、UNETR、时间嵌入UNet和ODE-UNet）在ADNI和AIBL数据集上实现高保真体素级预测，并验证了跨队列泛化能力。

详情

DOI: 10.1109/ISBI61048.2026.11515462
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), Apr. 2026

AI中文摘要

从基线磁共振图像（MRI）预测未来脑状态是神经影像学的一个核心挑战，对研究阿尔茨海默病（AD）等神经退行性疾病具有重要意义。大多数现有方法预测未来认知评分或临床结果，例如从轻度认知障碍向痴呆的转化。相反，本文研究纵向MRI图像到图像的预测，该预测可以预测参与者未来数年的整个脑部MRI，内在建模复杂的、空间分布的神经退行模式。我们在两个纵向队列（ADNI和AIBL）上实施并评估了五种深度学习架构（UNet、U2-Net、UNETR、时间嵌入UNet和ODE-UNet）。使用捕捉全局相似性和局部差异的指标，将预测的随访MRI与实际随访扫描直接进行比较。表现最佳的模型实现了高保真预测，并且所有模型都能很好地泛化到独立的外部数据集，展示了稳健的跨队列性能。我们的结果表明，深度学习可以在体素水平上可靠地预测参与者特定的脑部MRI，为个体化预后提供了新的机会。

英文摘要

Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.

URL PDF HTML ☆

赞 0 踩 0

2511.15390 2026-05-28 cs.CV 版本更新

Automatic Pruning Discovery for Large Language Models

大型语言模型的自动剪枝发现

Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang

发表机构 * Northeastern University, Shenyang, China（东北大学，沈阳）； Hebei Key Laboratory of Marine Perception Network and Data Processing, Northeastern University at Qinhuangdao 066004, Hebei Province, China（河北省海洋感知网络与数据处理重点实验室，东北大学秦皇岛分校，066004，河北省）； Sun Yat-sen University, Shenzhen, China（中山大学，深圳）； Hong Kong Baptist University, Hongkong, China（香港浸会大学）； Xidian University, Xian, China（西安电子科技大学）

AI总结提出AutoPrune方法，利用LLMs自动设计剪枝算法，并通过图驱动思维链优化提示，结合偏态感知动态稀疏分配解决高剪枝率下的异常值问题，在主流基准上超越现有方法。

Comments 15 pages, 10 figures

详情

AI中文摘要

大型语言模型（LLMs）在广泛任务上取得了显著性能，但由于其庞大的规模，阻碍了实际部署。现有的针对LLMs的剪枝方法（例如Wanda）严重依赖手动设计的剪枝算法，从而导致巨大的人力成本并需要专家知识。此外，我们首次识别出在高剪枝率下由均匀稀疏性导致的严重异常值问题，这引发了关于如何为LLMs设计自适应剪枝稀疏度的额外担忧。LLMs能否自行剪枝？在这项工作中，我们通过提出一种名为AutoPrune的新型剪枝方法给出了肯定答案，该方法首次通过利用LLMs自动为其自身设计最优剪枝算法，无需任何专家知识，从而克服了专家知识的限制。具体来说，为了缓解LLMs的黑箱性质，我们提出了一种图驱动思维链（GCoT）来优化提示，显著增强了学习剪枝算法中的推理过程，并使我们能够生成具有卓越性能和可解释性的下一代剪枝算法。最后，基于对异常值问题的洞察，我们引入了偏态感知动态稀疏分配（SDSA）来克服异常值问题，减轻高剪枝率下的性能下降。我们在主流LLMs基准上进行了广泛实验，证明了AutoPrune的优越性，它始终优于最先进的竞争对手。

英文摘要

Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, incurring high labor costs and requiring expert knowledge. Furthermore, we are the first to identify a serious outlier-value issue, caused by uniform sparsity, behind the dramatic performance degradation under high pruning ratios, which raises the additional question of how to design adaptive pruning sparsity for LLMs. Can LLMs prune themselves? In this work, we answer in the affirmative by proposing a novel pruning method called AutoPrune, which is the first to overcome the reliance on expert knowledge by leveraging LLMs to automatically design optimal pruning algorithms for themselves. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier-value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to address it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors.


2511.14558 2026-05-28 cs.CV version update

Explaining Digital Pathology Models via Clustering Activations

通过激活聚类解释数字病理学模型

Adam Bajger, Jan Obdržálek, Vojtěch Kůr, Rudolf Nenutil, Petr Holub, Vít Musil, Tomáš Brázdil

Affiliations: Faculty of Informatics, Masaryk University, Brno, Czech Republic; Institute of Computer Science, Masaryk University, Brno, Czech Republic; Masaryk Memorial Cancer Institute, Brno, Czech Republic

AI summary: Proposes an explainability method based on clustering the activations of convolutional neural networks; by showing the model's global behaviour and providing fine-grained information, it improves understanding of and trust in digital pathology models.

DOI: 10.1109/ISBI61048.2026
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

AI Chinese abstract

我们提出了一种基于聚类的可解释性技术，用于基于卷积神经网络的数字病理学模型。与常用的基于显著性图的方法（如遮挡、GradCAM或相关性传播）不同，这些方法突出显示对单个切片预测贡献最大的区域，而我们的方法展示了所考虑模型的全局行为，同时提供了更细粒度的信息。结果聚类不仅可以可视化以理解模型，还可以增加对其操作的信心，从而在临床实践中更快地采用。我们还评估了我们的技术在现有用于检测前列腺癌的模型上的性能，证明了其实用性。

English abstract

We present a clustering-based explainability technique for digital pathology models based on convolutional neural networks. Unlike commonly used methods based on saliency maps, such as occlusion, GradCAM, or relevance propagation, which highlight the regions that contribute most to the prediction for a single slide, our method shows the global behaviour of the model under consideration while also providing more fine-grained information. The resulting clusters can be visualised not only to understand the model but also to increase confidence in its operation, leading to faster adoption in clinical practice. We also evaluate the performance of our technique on an existing model for detecting prostate cancer, demonstrating its usefulness.
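
To make the idea of clustering activations concrete, here is a minimal sketch that groups per-patch activation vectors with plain k-means. The abstract does not say which clustering algorithm, layer, or feature dimensionality the authors use, so k-means, the two-dimensional toy vectors, and the first-k initialisation are all assumptions chosen for illustration.

```python
def kmeans(points, k, iters=20):
    """Plain k-means on a list of equal-length float vectors."""
    # Deterministic init for the example: the first k points are the centroids
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical activation vectors, one per image patch
activations = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0],
               [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
labels, centroids = kmeans(activations, k=2)
```

Each cluster label can then be mapped back to the patches it came from, which gives a model-wide view of what the network groups together rather than a per-slide heat map.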


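2510.27266 2026-05-28 cs.CV version update

IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

IAR2：通过语义-细节关联令牌预测改进自回归视觉生成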

Ran Yi, Teng Hu, Zihan Su, Jiangning Zhang, Lizhuang Ma

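Affiliations: Shanghai Jiao Tong University; Zhejiang University

AI summary: Proposes the IAR2 framework, which uses a semantic-detail associated dual codebook and a hierarchical prediction mechanism to generate images from coarse to fine, reaching a leading FID of 1.50 on ImageNet.

AI Chinese abstract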

自回归模型已成为视觉内容创建的有力范式，但常常忽略视觉数据的内在结构特性。我们之前的工作IAR通过基于嵌入相似性重新组织视觉码本，开启了解决这一问题的方向，从而提高了生成的鲁棒性。然而，它受到预训练码本的刚性和硬均匀聚类的不准确性的限制。为了克服这些限制，我们提出了IAR2，一种先进的自回归框架，实现了层次化的语义-细节合成过程。IAR2的核心是一种新颖的语义-细节关联双码本，它将图像表示解耦为用于全局语义信息的语义码本和用于细粒度细节的细节码本。它将量化能力从线性扩展到多项式规模，显著增强了表达能力。为了适应这种双重表示，我们提出了一种语义-细节自回归预测方案，结合局部上下文增强自回归头，执行分层预测——先预测语义令牌，再预测细节令牌——同时利用局部上下文窗口增强空间连贯性。此外，对于条件生成，我们引入了一种渐进式注意力引导的自适应CFG机制，该机制根据每个令牌与条件的相关性及其在生成序列中的时间位置动态调节引导尺度，在不牺牲真实性的情况下改善条件对齐。大量实验表明，IAR2在自回归图像生成上树立了新的最先进水平，在ImageNet上实现了1.50的FID。我们的模型不仅在性能上超越了先前的方法，而且展示了卓越的计算效率，突显了我们结构化、从粗到细生成策略的有效性。

English abstract

Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction (first the semantic token, then the detail token) while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state of the art for autoregressive image generation, achieving an FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.
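
The claim that a dual codebook expands capacity from a linear to a polynomial scale can be illustrated with simple counting: two codebooks store a number of entries equal to the sum of their sizes, while the (semantic, detail) pairs they can express grow as the product. The sketch below assumes one token from each codebook per position; the codebook sizes and helper names are hypothetical and not taken from the paper.

```python
def joint_index(semantic_id, detail_id, n_detail):
    """Map a (semantic, detail) token pair to one composite index."""
    return semantic_id * n_detail + detail_id

def split_index(index, n_detail):
    """Inverse of joint_index: recover (semantic_id, detail_id)."""
    return divmod(index, n_detail)

N_SEMANTIC, N_DETAIL = 1024, 1024        # hypothetical codebook sizes

stored_entries = N_SEMANTIC + N_DETAIL   # embeddings kept in memory
distinct_pairs = N_SEMANTIC * N_DETAIL   # codes a token pair can express

pair = split_index(joint_index(3, 7, N_DETAIL), N_DETAIL)
```

Under these assumed sizes, 2,048 stored embeddings address about a million distinct codes, which is the sense in which capacity grows faster than storage.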


2502.17832 2026-05-28 cs.LG cs.AI cs.CR cs.CV version update

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

MM-PoisonRAG：通过局部和全局投毒攻击破坏多模态RAG

Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji

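Affiliations: University of Illinois Urbana-Champaign; University of California Los Angeles

AI summary: Proposes the MM-PoisonRAG framework, which systematically studies the vulnerability of multimodal retrieval-augmented generation (RAG) to knowledge poisoning through two strategies, Localized Poisoning Attack (LPA) and Globalized Poisoning Attack (GPA); experiments show attack success rates of up to 56% and that both attacks bypass existing defenses.

Comments Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG

AI Chinese abstract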

检索增强生成（RAG）已成为多模态大语言模型（MLLM）中增强事实基础并减少幻觉的常见做法。然而，其对检索的依赖使MLLM面临知识投毒攻击，攻击者故意将恶意多模态内容注入外部知识库，以引导模型生成不正确甚至有害的响应。我们提出MM-PoisonRAG框架，系统研究多模态RAG在知识投毒下的脆弱性。具体地，我们设计了两种新颖的攻击策略：局部投毒攻击（LPA），植入针对特定查询的多模态错误信息以操纵输出至攻击者控制的响应；以及全局投毒攻击（GPA），使用单一、非定向的对抗性注入广泛破坏推理并降低所有查询的生成质量。在多样化任务、多模态RAG组件和攻击者访问级别上的大量实验揭示了严重的脆弱性：LPA即使在受限访问下也能达到高达56%的攻击成功率，并且无需重新优化对抗样本即可在四种不同的检索器之间有效迁移。GPA仅需一个投毒内容即可完全破坏模型生成，使准确率降至0%。此外，LPA和GPA均能绕过现有防御，突显了多模态RAG的脆弱性，并将MM-PoisonRAG确立为未来保护RAG框架免受多模态知识投毒研究的基础。

English abstract

Retrieval-augmented generation (RAG) has become a common practice in multimodal large language models (MLLMs) to enhance factual grounding and reduce hallucination. Yet its reliance on retrieval exposes MLLMs to knowledge poisoning attacks, in which adversaries deliberately inject malicious multimodal content into external knowledge bases to steer models toward generating incorrect or even harmful responses. We present MM-PoisonRAG, a framework to systematically study the vulnerability of multimodal RAG under knowledge poisoning. Specifically, we design two novel attack strategies: Localized Poisoning Attack (LPA), which implants targeted, query-specific multimodal misinformation to manipulate outputs toward attacker-controlled responses, and Globalized Poisoning Attack (GPA), which uses a single, untargeted adversarial injection to broadly corrupt reasoning and collapse generation quality across all queries. Extensive experiments on diverse tasks, multimodal RAG components, and attacker access levels reveal severe vulnerabilities: LPA achieves up to 56% attack success rate even under restricted access, and transfers effectively across four different retrievers without re-optimizing the adversaries. GPA completely disrupts model generation, driving accuracy to 0% with just one piece of poisoned content. Moreover, both LPA and GPA bypass existing defenses, underscoring the fragility of multimodal RAG and establishing MM-PoisonRAG as a foundation for future research on securing RAG frameworks against multimodal knowledge poisoning.


2508.21046 2026-05-28 cs.CV cs.RO version update

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

CogVLA: 通过指令驱动路由与稀疏化实现认知对齐的视觉-语言-动作模型

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

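Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen

AI summary: Proposes the CogVLA framework, which uses instruction-driven routing and sparsification to reach success rates of 97.4% on the LIBERO benchmark and 70.0% on real-world robotic tasks while reducing training cost by 2.5-fold and inference latency by 2.8-fold.

Comments Accepted to NeurIPS 2025, Project Page: https://jiutian-vl.github.io/CogVLA-page

AI Chinese abstract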

最近基于预训练视觉-语言模型（VLM）构建的视觉-语言-动作（VLA）模型需要大量后训练，导致计算开销高，限制了可扩展性和部署。我们提出CogVLA，一个认知对齐的视觉-语言-动作框架，利用指令驱动路由和稀疏化来提高效率和性能。CogVLA受人类多模态协调启发，引入了一个3阶段渐进式架构。1）基于编码器-FiLM的聚合路由（EFA-Routing）将指令信息注入视觉编码器，以选择性聚合和压缩双流视觉标记，形成指令感知的潜在表示。2）基于这种紧凑的视觉编码，基于LLM-FiLM的剪枝路由（LFP-Routing）通过剪枝与指令无关的视觉接地标记将动作意图引入语言模型，从而实现标记级稀疏性。3）为确保压缩的感知输入仍能支持准确且连贯的动作生成，我们引入了V-L-A耦合注意力（CAtten），它将因果视觉-语言注意力与双向动作并行解码相结合。在LIBERO基准和真实机器人任务上的大量实验表明，CogVLA实现了最先进的性能，成功率分别为97.4%和70.0%，同时与OpenVLA相比，训练成本降低了2.5倍，推理延迟降低了2.8倍。CogVLA已开源，可在https://github.com/JiuTian-VL/CogVLA获取。

English abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
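
The FiLM in EFA-Routing and LFP-Routing presumably refers to feature-wise linear modulation, in which a conditioning signal predicts a per-channel scale and shift that are applied to another feature stream. A minimal sketch of that operation follows; the instruction embedding, weights, and helper names are hypothetical, and how CogVLA wires FiLM into its encoder and language model is not specified in the abstract.

```python
def linear(vec, weight, bias):
    """Tiny dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def film(tokens, gamma, beta):
    """Feature-wise linear modulation: scale and shift every channel."""
    return [[g * x + b for x, g, b in zip(token, gamma, beta)] for token in tokens]

# Hypothetical 2-d instruction embedding conditioning 3-channel visual tokens
instruction = [1.0, -1.0]
gamma = linear(instruction, [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]], [1.0, 1.0, 1.0])
beta = linear(instruction, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0])

visual_tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
modulated = film(visual_tokens, gamma, beta)
```

Because `gamma` and `beta` depend only on the instruction, the same visual tokens are rescaled differently for different instructions, which is one plausible way to make a visual representation instruction-aware.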


2503.11906 2026-05-28 cs.CV cs.AI version update

A Survey on SAR ship classification using Deep Learning

基于深度学习的SAR船舶分类综述

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Emanuele Salerno

Affiliations: PhD School in Computer Science, University of Pisa; Institute of Information Science and Technologies, National Research Council of Italy; National Biodiversity Future Center - NBFC

AI summary: This survey reviews deep learning for SAR ship classification, establishes a taxonomy based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning, and discusses future research directions.

Comments in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

DOI: 10.1109/JSTARS.2026.3695704

AI Chinese abstract

深度学习（DL）已成为合成孔径雷达（SAR）船舶分类的强大工具。本综述全面分析了该领域使用的各种DL技术。我们识别了关键趋势和挑战，强调了整合手工特征、利用公共数据集、数据增强、微调、可解释性技术以及促进跨学科合作以提高DL模型性能的重要性。本综述建立了首个基于DL模型、手工特征使用、SAR属性利用和微调影响的分类法，用于对相关研究进行分类。我们讨论了SAR船舶分类任务中使用的方法论以及不同技术的影响。最后，本综述探讨了未来研究的潜在方向，包括解决数据稀缺问题、探索新型DL架构、融入可解释性技术以及建立标准化性能指标。通过应对这些挑战并利用DL的进步，研究人员可以为开发更准确和高效的船舶分类系统做出贡献，最终增强海上监视及相关应用。

English abstract

Deep learning (DL) has emerged as a powerful tool for Synthetic Aperture Radar (SAR) ship classification. This survey comprehensively analyzes the diverse DL techniques employed in this domain. We identify critical trends and challenges, highlighting the importance of integrating handcrafted features, utilizing public datasets, data augmentation, fine-tuning, explainability techniques, and fostering interdisciplinary collaborations to improve DL model performance. This survey establishes a first-of-its-kind taxonomy for categorizing relevant research based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning. We discuss the methodologies used in SAR ship classification tasks and the impact of different techniques. Finally, the survey explores potential avenues for future research, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. By addressing these challenges and leveraging advancements in DL, researchers can contribute to developing more accurate and efficient ship classification systems, ultimately enhancing maritime surveillance and related applications.


2503.09675 2026-05-28 cs.CV version update

Feature-Space Oversampling for Addressing Class Imbalance in SAR Ship Classification

特征空间过采样解决SAR舰船分类中的类别不平衡问题

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

Affiliations: ISTI-CNR & University of Pisa, Pisa, Italy; Cardiff University, Cardiff, UK

AI summary: To address the class imbalance caused by long-tailed datasets in SAR ship classification, proposes two feature-space oversampling algorithms, M2m$_f$ and M2m$_u$, based on the Major-to-minor (M2m) method; with ViT, VGG16, and ResNet50 as feature extractors, the average F1-score improves by 8.82% on FuSARShip and 4.44% on OpenSARShip.

Comments Accepted and presented at IGARSS

DOI: 10.1109/IGARSS55030.2025.11242334
Journal ref: IGARSS 2025 - 2025 IEEE International Geoscience and Remote Sensing Symposium, Brisbane, Australia, 2025, pp. 2010-2014

AI Chinese abstract

SAR舰船分类面临长尾数据集的挑战，这使得对代表性不足的类别的分类变得复杂。过采样方法已被证明在解决光学数据中的类别不平衡问题方面有效。在本文中，我们评估了在特征空间中进行过采样对SAR舰船分类的影响。我们提出了两种受Major-to-minor (M2m)方法启发的新算法M2m$_f$和M2m$_u$。这些算法在两个公开数据集OpenSARShip（6类）和FuSARShip（9类）上进行了测试，使用三种最先进的模型作为特征提取器：ViT、VGG16和ResNet50。此外，我们还分析了过采样方法对不同类别大小的影响。结果表明，我们的新方法优于原始的M2m和基线方法，在FuSARShip上平均F1分数提高了8.82%，在OpenSARShip上提高了4.44%。

English abstract

SAR ship classification faces the challenge of long-tailed datasets, which complicates the classification of underrepresented classes. Oversampling methods have proven effective in addressing class imbalance in optical data. In this paper, we evaluate the effect of oversampling in the feature space for SAR ship classification. We propose two novel algorithms inspired by the Major-to-minor (M2m) method: M2m$_f$ and M2m$_u$. The algorithms are tested on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models as feature extractors: ViT, VGG16, and ResNet50. Additionally, we analyze the impact of oversampling methods on different class sizes. The results demonstrate the effectiveness of our novel methods over the original M2m and baselines, with an average F1-score increase of 8.82% for FuSARShip and 4.44% for OpenSARShip.
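
As a generic illustration of oversampling in feature space, the sketch below synthesises extra minority-class feature vectors by interpolating between random pairs of existing ones, in the style of SMOTE. This is deliberately not M2m, M2m$_f$, or M2m$_u$, whose procedures are not described in the abstract; the function name and the toy two-dimensional features are assumptions.

```python
import random

def oversample_minority(features, n_new, seed=0):
    """Create n_new synthetic vectors on segments between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(features, 2)   # two distinct minority samples
        t = rng.random()                 # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical feature vectors of an underrepresented ship class,
# e.g. taken from the penultimate layer of a frozen feature extractor
minority = [[0.0, 1.0], [1.0, 3.0], [0.5, 2.0]]
extra = oversample_minority(minority, n_new=5)
balanced = minority + extra
```

Working on extracted features rather than raw SAR images means a classifier head can be retrained on the balanced set without regenerating any imagery.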


2507.06999 2026-05-28 cs.CV cs.CL cs.LG version update

Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

有意学习，直觉行动：解锁多模态大语言模型的测试时推理能力

Yahan Yu, Yuyang Dong, Masafumi Oyamada

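Affiliations: Kyoto University; Initial S; NEC Corporation, Japan

AI summary: Proposes the D2I framework: during training, deliberate reasoning supervised by rule-based format rewards strengthens modality alignment; during inference, the explicit strategies are removed in favor of intuitive reasoning. This improves the reasoning ability of multimodal LLMs without extra annotations or complex rewards.

Comments 22 pages, 24 figures

AI Chinese abstract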

推理对于大型语言模型（LLMs）至关重要，尤其是在数学问题求解等复杂任务中。然而，多模态推理在模态对齐和训练可扩展性方面仍面临挑战，因为许多现有方法依赖于额外的标注或复杂的基于规则的奖励。为了解决这些问题，我们提出了“有意到直觉”推理框架（D2I），该框架无需额外标注或复杂奖励即可提升多模态大语言模型（MLLMs）的理解和推理能力。在训练过程中，D2I使用仅由基于规则的格式奖励监督的有意推理策略来增强模态对齐。在推理过程中，它通过移除这些显式策略转向直觉推理，使模型能够在其响应中隐式应用所获得的能力。D2I在域内和域外基准测试中均优于基线，突显了格式奖励在培养可迁移多模态推理技能方面的有效性，并表明将训练时的推理深度与测试时的响应灵活性解耦是有益的。

English abstract

Reasoning is essential for large language models (LLMs), especially in complex tasks such as mathematical problem solving. However, multimodal reasoning still faces challenges in modality alignment and training scalability, as many existing methods rely on additional annotations or complex rule-based rewards. To address these issues, we propose the Deliberate-to-Intuitive reasoning framework (D2I), which improves the understanding and reasoning abilities of multimodal LLMs (MLLMs) without extra annotations or complex rewards. During training, D2I uses deliberate reasoning strategies supervised only by rule-based format rewards to enhance modality alignment. During inference, it shifts to intuitive reasoning by removing these explicit strategies, allowing the model to implicitly apply the acquired abilities in its responses. D2I outperforms baselines on both in-domain and out-of-domain benchmarks, highlighting the effectiveness of format rewards in fostering transferable multimodal reasoning skills and suggesting the benefit of decoupling training-time reasoning depth from test-time response flexibility.


2505.21771 2026-05-28 cs.CV cs.AI version update

MMTABREAL: Real-World Benchmark for Multimodal Table Understanding

MMTABREAL：多模态表格理解的真实世界基准

Prasham Titiya, Jainil Trivedi, Chitta Baral, Vivek Gupta

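Affiliations: Arizona State University

AI summary: For multimodal table understanding, builds MMTABREAL, a human-curated benchmark of 500 real-world tables and 4,021 question-answer pairs; evaluation reveals gaps in visual grounding, spatial alignment, and multi-step reasoning, with 20-40% performance drops relative to existing benchmarks.

AI Chinese abstract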

多模态表格，即与图表、地图、图标和颜色编码交织的表格布局，在实际应用中无处不在，但对多模态大语言模型（MLLMs）来说仍然困难。尽管在文本和图像理解方面取得了进展，但对以表格为中心的多模态推理的系统评估仍然有限。我们引入了MMTABREAL，一个多模态表格基准，包含人工筛选的500个真实世界表格及其对应的4021个问答对。MMTABREAL涵盖四种问题类型、五种推理类别和八种结构原型。对最先进模型的评估揭示了显著差距，特别是在视觉定位、空间对齐和多步推理方面，相对于现有基准性能下降了20-40%。这些结果强调了需要更紧密融合视觉与表格结构并支持显式数值/逻辑运算的架构。MMTABREAL仅用于评估，提供了一个严谨、可复现的测试平台，反映了真实世界多模态表格的语言、结构和推理复杂性。

English abstract

Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet remain difficult for Multimodal Large Language Models (MLLMs). Despite advances in text and image understanding, systematic evaluation of table-centric multimodal reasoning is limited. We introduce MMTABREAL, a multimodal table benchmark consisting of a human-curated suite of 500 real-world tables paired with 4,021 question-answer pairs. MMTABREAL spans four question types, five reasoning categories, and eight structural archetypes. Evaluations of state-of-the-art models reveal substantial gaps, especially in visual grounding, spatial alignment, and multi-step inference, with 20-40% performance drops relative to existing benchmarks. These results highlight the need for architectures that more tightly fuse vision with tabular structure and support explicit numeric/logical operations. MMTABREAL is released for evaluation only, providing a rigorous, reproducible testbed that reflects the linguistic, structural, and reasoning complexity of real-world multimodal tables.


2502.05242 2026-05-28 cs.CL cs.AI cs.CV cs.LG version update

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

超越外部监控：增强大型语言模型的透明度以便于监控

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

Affiliations: Shanghai Artificial Intelligence Laboratory; ICISEE, Shanghai Jiao Tong University; School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University; King Abdullah University of Science and Technology

AI summary: Proposes TELLME, a method that makes the internal representations of large language models more transparent so that monitors can identify unsuitable and sensitive behaviors, and validates it on detoxification tasks.

Comments 28 pages, 8 figures, 15 tables

AI Chinese abstract

大型语言模型（LLMs）的能力日益增强，但其思维和决策过程的机制仍不清楚。思维链（CoTs）常被用来外化LLMs的思维，但这一策略未能准确反映LLMs的思维过程。基于LLMs隐藏表征的技术提供了内部视角，以改善对其潜在思维的可监控性。然而，以往的方法仅尝试开发外部模块，而非使LLMs本身更易于监控。本文提出了一种新方法TELLME，提高了LLMs的透明度，并帮助监控者识别不合适和敏感的行为。此外，我们在去毒化任务上展示了TELLME的有效性，LLMs在多模态测试集、不同架构和不同参数规模上均取得了一致的改进。我们进一步从最优传输理论和实证角度分析了TELLME对LLMs泛化能力的提升。

English abstract

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.


2503.02857 2026-05-28 cs.CV cs.AI cs.CY version update

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Deepfake-Eval-2024：2024年传播的深度伪造多模态野外基准

Nuria Alina Chandra, Hannah Lee, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Changyeon Lee, Jongwook Choi, Sejin Paik, Aerin Kim, Oren Etzioni

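Affiliations: University of Washington; Allen Institute for Artificial Intelligence; University of Maryland; Chung-Ang University; Georgetown University; Miraflow AI

AI summary: Because existing academic benchmarks are outdated and do not reflect real-world deepfakes, introduces Deepfake-Eval-2024, a multimodal benchmark of deepfakes collected in 2024 from social media and from users of a deepfake detection platform; open-source models degrade sharply on it, while commercial and fine-tuned models perform better but still fall short of expert analysts.

AI Chinese abstract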

在生成式人工智能日益逼真的时代，稳健的深度伪造检测对于减少欺诈和虚假信息至关重要。尽管许多深度伪造检测器在学术数据集上报告了高准确率，但我们表明这些学术基准已经过时，不能代表现实世界的深度伪造。我们引入了Deepfake-Eval-2024，这是一个新的深度伪造检测基准，由2024年从社交媒体和深度伪造检测平台用户收集的野外深度伪造组成。Deepfake-Eval-2024包含45小时的视频、56.5小时的音频和1,975张图像，涵盖了最新的操纵技术。该基准包含来自52种不同语言、88个不同网站的多样化媒体内容。我们发现，在Deepfake-Eval-2024上评估时，开源最先进的深度伪造检测模型的性能急剧下降，与之前的基准相比，视频模型的AUC下降了50%，音频模型下降了48%，图像模型下降了45%。我们还评估了商业深度伪造检测模型和在Deepfake-Eval-2024上微调的模型，发现它们比现成的开源模型性能更优，但尚未达到深度伪造取证分析师的准确率。数据集可在https://github.com/nuriachandra/Deepfake-Eval-2024获取。

English abstract

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.
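
Since the headline results are reported as changes in AUC, a short reminder of what that metric measures may help: the area under the ROC curve equals the probability that a randomly chosen fake receives a higher detector score than a randomly chosen real item, with ties counted as one half. The sketch below computes it by direct pair counting on made-up labels and scores; it is not the benchmark's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical detector outputs: 1 = deepfake, 0 = real
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking, which is why large drops in AUC on in-the-wild media matter in practice.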


2504.04924 2026-05-28 cs.CV eess.IV version update

Inter-event Interval Microscopy for Event Cameras

事件相机的事件间间隔显微术

Changqing Su, Yanqin Chen, Zihan Lin, Zhen Cheng, You Zhou, Bo Xiong, Zhaofei Yu, Tiejun Huang

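Affiliations: National Key Laboratory for Multimedia Information Processing; Westlake Laboratory of Life Sciences and Biomedicine; School of Automation; Department of Automation; Nanjing University Medical School

AI summary: Proposes Inter-event Interval Microscopy (IEIM) for event cameras, which reconstructs intensity for both static and dynamic scenes by quantifying the time interval between consecutive events, achieving high spatial and temporal resolution and high dynamic range in fluorescence microscopy.

DOI: 10.1364/PRJ.562782

AI Chinese abstract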

事件相机是一种创新的仿生传感器，与传统相机不同，它通过感知强度变化而非直接感知强度，并将这些变化记录为连续的“事件”流。从这些稀疏事件中重建强度一直是一个具有挑战性的问题。以往的方法主要集中在将运动诱发的事件转换为视频，或通过在事件相机采集端集成调制器件来实现静态场景的强度成像。在本文中，我们首次利用静态事件相机在荧光显微镜中实现了静态和动态场景的事件到强度转换。与主要依赖事件积分的传统方法不同，所提出的帧间间隔显微术（IEIM）量化了每个像素处连续事件之间的时间间隔。在事件相机中，由于阈值固定，时间间隔可以精确表示强度。在硬件层面，所提出的IEIM在配备事件相机的显微镜中集成了脉冲光调制器件，称为基于脉冲调制的事件驱动荧光显微镜。此外，我们收集了包含高动态范围和高速度场景的IEIMat数据集。在IEIMat数据集上的实验结果表明，所提出的IEIM在空间和时间分辨率、动态范围方面优于其他方法，且带宽更低。代码和IEIMat数据集将公开提供。

English abstract

Event cameras, innovative bio-inspired sensors, differ from traditional cameras in that they sense changes in intensity rather than perceiving intensity directly, and they record these variations as a continuous stream of "events". Reconstructing intensity from these sparse events has long been a challenging problem. Previous approaches mainly focused on transforming motion-induced events into videos or achieving intensity imaging for static scenes by integrating modulation devices at the event camera acquisition end. In this paper, for the first time, we achieve event-to-intensity conversion using a static event camera for both static and dynamic scenes in fluorescence microscopy. Unlike conventional methods that primarily rely on event integration, the proposed Inter-event Interval Microscopy (IEIM) quantifies the time interval between consecutive events at each pixel. With a fixed threshold in the event camera, the time interval can precisely represent the intensity. At the hardware level, the proposed IEIM integrates a pulse light modulation device within a microscope equipped with an event camera, termed Pulse Modulation-based Event-driven Fluorescence Microscopy. Additionally, we have collected the IEIMat dataset under various scenes, including high dynamic range and high-speed scenarios. Experimental results on the IEIMat dataset demonstrate that the proposed IEIM achieves superior spatial and temporal resolution, as well as a higher dynamic range, with lower bandwidth compared to other methods. The code and the IEIMat dataset will be made publicly available.
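
The statement that the inter-event interval can represent intensity can be made concrete with a toy model. Assume, purely for illustration, that under modulated illumination a pixel fires each time a fixed threshold is reached and that the time needed to reach it is inversely proportional to the local intensity. The exact relationship used by IEIM is not given in the abstract, so the formula, function name, and timestamps below are assumptions.

```python
def intensity_from_events(timestamps, threshold=1.0):
    """Estimate relative intensity at one pixel from its event timestamps.

    Toy model (assumption): interval = threshold / intensity,
    so intensity = threshold / interval.
    """
    intervals = [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]
    return [threshold / dt for dt in intervals if dt > 0]

# Hypothetical event times (seconds) at a single pixel:
# the gaps shrink from 0.5 s to 0.25 s as the pixel gets brighter
events = [0.0, 0.5, 1.0, 1.25]
estimates = intensity_from_events(events)
```

Each pair of consecutive events yields one intensity estimate, so in this toy model brighter pixels are sampled more often, and no accumulation of events over a time window is needed.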


2504.20736 2026-05-28 cs.RO cs.CV version update

A Survey on Event-based Optical Marker Systems

基于事件的光学标记系统综述

Nafiseh Jabbari Tofighi, Maxime Robic, Fabio Morbidi, Pascal Vasseur

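Affiliations: MIS laboratory, University of Picardie Jules Verne; DART Lab, Politecnico di Milano

AI summary: This survey reviews Event-Based Optical Marker Systems (EBOMS), analyzes their asynchronous operating principle and robustness, and covers applications in object detection, pose estimation, and optical communication.

Comments 11 pages, 6 figures, 2 tables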

2504.04540 2026-05-28 cs.CV cs.AI version update

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

点、视觉与文本：点云能否提升大语言模型的空间推理能力？一项偏差控制研究

Weichen Zhang, Ruiying Peng, Xin Zeng, Jianjie Fang, Ziyou Wang, Kaiyuan Li, Heng Dong, Wei Li, Chen Gao, Xin Wang, Xinlei Chen, Yong Li

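Affiliations: Tsinghua University; ByteDance Seed

AI summary: Introduces ScanReQA, a 3D spatial reasoning benchmark covering text, vision, and point cloud modalities, to evaluate the spatial reasoning of LLMs across modalities; finds that models using point cloud and visual inputs outperform text-only models, and reveals an attention sink phenomenon in 3D LLMs.

AI Chinese abstract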

利用点云中空间信息进行3D空间推理的3D大语言模型（LLMs）引起了广泛关注。尽管取得了一些有希望的结果，但点云相对于其他模态的优势仍不明确。此外，现有的3D基准不足以公平评估多模态大语言模型理解空间概念的能力。为了解决这些挑战，我们引入了ScanReQA，一个涵盖文本、视觉和点云模态的3D空间推理基准。然后，我们评估了文本、2D和3D大语言模型在该基准上的性能，以比较不同模态在理解空间概念方面的有效性。此外，我们分析了使用点云的3D大语言模型背后的推理机制。我们的发现表明：1）二元空间推理对当前的3D大语言模型仍然具有挑战性；2）基于点云和视觉模态的多模态大语言模型展现出比大语言模型更强的空间推理能力；3）3D大语言模型表现出类似于2D大语言模型中的注意力下沉现象，这损害了空间推理。我们认为这些结论有助于3D大语言模型的下一步发展，并为其他模态的基础模型提供见解。我们在项目页面发布了数据集和代码：https://github.com/EmbodiedCity/ScanReQA.code。

English abstract

3D Large Language Models (LLMs) that leverage spatial information in point clouds for 3D spatial reasoning have attracted great attention. Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs (MLLMs) to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing text, vision, and point cloud modalities. We then evaluate the performance of text, 2D, and 3D LLMs on the benchmark to compare the effectiveness of different modalities in understanding spatial concepts. Furthermore, we analyze the reasoning mechanisms behind 3D LLMs using point clouds. Our findings reveal that: 1) binary spatial reasoning remains challenging for current 3D LLMs, 2) MLLMs based on point cloud and visual modalities demonstrate stronger spatial reasoning capabilities than LLMs, and 3) 3D LLMs exhibit an attention sink phenomenon similar to that in 2D LLMs, which impairs spatial reasoning. We believe these conclusions can guide the next steps for 3D LLMs and also offer insights for foundation models in other modalities. We release the datasets and code on the project page: https://github.com/EmbodiedCity/ScanReQA.code.


2501.04144 2026-05-28 cs.CV cs.GR version update

Chirpy3D: Part-Aware Multi-View Diffusion for Creative Fine-Grained Object Generation

Chirpy3D: 面向创意细粒度物体生成的部件感知多视角扩散

Kam Woh Ng, Jing Yang, Jia Wei Sii, Chee Seng Chan, Jiankang Deng, Yi-Zhe Song, Tao Xiang, Xiatian Zhu

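Affiliations: University of Surrey; University of Cambridge; Universiti Malaya; Imperial College London

AI summary: Proposes Chirpy3D, a part-aware multi-view diffusion framework that learns a hierarchical part latent space from unposed 2D images, enabling part-level swapping, interpolation, and zero-shot composition without 3D data or manual annotation.

Comments 20 pages. Code at https://github.com/kamwoh/chirpy3d

AI Chinese abstract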

理解并生成物体的细粒度结构——例如具有物种特异性喙、翅膀和尾巴的鸟类——是计算机视觉中长期存在的挑战。我们提出Chirpy3D，一种部件感知多视角扩散框架，它从无姿态的2D图像中学习层次化部件潜在空间，仅使用现成的2D部件分割掩码作为空间指导——无需任何3D数据、相机姿态或手动部件标注。该潜在空间支持直观的部件级交换、插值和零样本组合。自监督特征一致性损失进一步促进跨视角的结构对齐，即使在混合或未见过的部件组合下也能实现连贯生成。我们的核心贡献是可控制的部件感知潜在空间和多视角扩散模型。通过任何可微分渲染器（如NeRF）支持下游3D生成，但这与主框架正交，使Chirpy3D成为在缺乏结构化3D数据时进行创意物体生成的灵活基础。代码已发布在https://github.com/kamwoh/chirpy3d。

English abstract

Understanding and generating the fine-grained structure of objects -- such as birds with species-specific beaks, wings, and tails -- is a long-standing challenge in computer vision. We propose Chirpy3D, a part-aware multi-view diffusion framework that learns a hierarchical part latent space from unposed 2D images, using only off-the-shelf 2D part segmentation masks as spatial guidance -- without requiring any 3D data, camera poses, or manual part annotations. This latent space enables intuitive part-level swapping, interpolation, and zero-shot composition. A self-supervised feature consistency loss further encourages structural alignment across views, allowing coherent generation even with hybrid or unseen part combinations. Our core contribution is the controllable part-aware latent space and multi-view diffusion model. Downstream 3D generation is supported via any differentiable renderer such as NeRF but is orthogonal to the main framework, making Chirpy3D a flexible foundation for creative object generation in the absence of structured 3D data. Code is released at https://github.com/kamwoh/chirpy3d.


2503.22655 2026-05-28 cs.AI cs.CV cs.MM version update

Text-Only Data Synthesis for Vision Language Model Training

仅文本数据合成用于视觉语言模型训练

Xiaomin Yu, Wenjie Zhang, Ziyue Qiao, Chengwei Qin, Hui Xiong

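Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Great Bay University

AI summary: Proposes a cross-integrated three-stage multimodal data synthesis framework that generates high-quality multimodal training data from text alone, for pretraining and instruction-tuning vision-language models.

AI Chinese abstract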

训练视觉语言模型（VLM）通常需要大规模、高质量的图像-文本对，但收集或合成此类数据成本高昂。相比之下，文本数据丰富且廉价，这引发了一个问题：能否仅从文本中合成高质量的多模态训练数据？为解决这一问题，我们提出了一个跨集成的三阶段多模态数据合成框架，生成了两个数据集：Unicorn-1.2M和Unicorn-471K-Instruction。在第一阶段：多样化字幕数据合成，我们通过使用大语言模型（LLM）扩展稀疏字幕种子，构建了120万语义多样的高质量字幕。在第二阶段：指令微调数据生成，我们进一步将47.1万个字幕处理为多轮指令微调任务，以支持复杂推理。最后，在第三阶段：模态表示迁移，这些文本字幕表示被转换为视觉表示，从而产生多样化的合成图像表示。这一三阶段过程使我们能够构建用于预训练的Unicorn-1.2M和用于指令微调的Unicorn-471K-Instruction，而无需依赖真实图像。通过消除对真实图像的依赖，同时保持数据质量和多样性，我们的框架为VLM训练提供了一种成本效益高且可扩展的解决方案。

English abstract

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse, high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, the textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training.


2503.04863 2026-05-28 cs.CV cs.AI version update

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Manboformer: 通过时空注意力机制学习高斯表示

Ziyue Zhao, Qining Qi, Jianfa Ma

AI总结针对自动驾驶3D语义占用预测中高斯表示性能不足的问题，提出利用时空自注意力机制优化GaussianFormer，以提升模型性能。

Comments After careful self-check, we found several unnoticed deficiencies and incomplete discussions in this manuscript. To ensure the rigor and accuracy of the academic results, we have decided to withdraw this preprint. A refined, complete, and rigorous version will be submitted soon.

2405.09586 2026-05-28 eess.IV cs.AI cs.CV version update

Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation

Kang Liu, Zhuoqi Ma, Mengmeng Liu, Zhicheng Jiao, Xiaolu Kang, Qiguang Miao, Kun Xie

Affiliations: School of Computer Science and Technology, Xidian University; Xi’an Key Laboratory of Big Data and Intelligent Vision; Key Laboratory of Collaborative Intelligence Systems, Ministry of Education; School of Artificial Intelligence, Xidian University; Department of Diagnostic Imaging, Brown University

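AI summary: Proposes FSE, a two-stage Factual Serialization Enhancement method that uses factuality-guided contrastive learning and evidence-driven report generation to improve the clinical accuracy and natural-language quality of chest X-ray report generation.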

Comments Code is available at https://github.com/mk-runner/FSE

Details

DOI: 10.1016/j.eswa.2026.132550

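AI abstract (translated from Chinese)

A radiology report contains presentation-style vocabulary, which ensures clarity and organization, and factual vocabulary, which gives accurate, objective descriptions based on observable findings. Writing these reports by hand is time-consuming and labor-intensive, and automatic report generation offers a promising alternative. A key step in this process is aligning radiographs with their corresponding reports. Existing methods, however, usually rely on complete reports for alignment and overlook the influence of presentation-style vocabulary. To address this, we propose FSE, a two-stage Factual Serialization Enhancement method. In Stage 1, we introduce factuality-guided contrastive learning for visual representation, maximizing the semantic correspondence between radiographs and their factual descriptions. In Stage 2, we propose evidence-driven report generation, which improves diagnostic accuracy by integrating structured factual-serialization insights from similar historical cases. Experiments on the MIMIC-CXR and IU X-ray datasets, covering specific and general scenarios, show that FSE outperforms state-of-the-art methods on both natural language generation and clinical efficacy metrics. Ablation studies further highlight the positive effect of factual serialization in Stage 1 and Stage 2. Code is available at https://github.com/mk-runner/FSE.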

English abstract

A radiology report comprises presentation-style vocabulary, which ensures clarity and organization, and factual vocabulary, which provides accurate and objective descriptions based on observable findings. While manually writing these reports is time-consuming and labor-intensive, automatic report generation offers a promising alternative. A critical step in this process is to align radiographs with their corresponding reports. However, existing methods often rely on complete reports for alignment, overlooking the impact of presentation-style vocabulary. To address this issue, we propose FSE, a two-stage Factual Serialization Enhancement method. In Stage 1, we introduce factuality-guided contrastive learning for visual representation by maximizing the semantic correspondence between radiographs and corresponding factual descriptions. In Stage 2, we present evidence-driven report generation that enhances diagnostic accuracy by integrating insights from similar historical cases structured as factual serialization. Experiments on MIMIC-CXR and IU X-ray datasets across specific and general scenarios demonstrate that FSE outperforms state-of-the-art approaches in both natural language generation and clinical efficacy metrics. Ablation studies further emphasize the positive effects of factual serialization in Stage 1 and Stage 2. The code is available at https://github.com/mk-runner/FSE.
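
The Stage 1 idea in this abstract, pulling each radiograph embedding toward the embedding of its own factual description and away from the others in the batch, is commonly realised with a symmetric InfoNCE objective. The sketch below illustrates that generic loss in NumPy; it is not taken from the FSE code, and the function name `symmetric_info_nce`, the `temperature` value, and the assumption that both encoders already output fixed-size vectors are all illustrative choices.

```python
import numpy as np


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch where row i of each array is a matched pair.

    Hypothetical helper for illustration only; the name and default
    temperature are assumptions, not the FSE authors' settings.
    """
    # L2-normalise rows so that dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs.
    logits = img @ txt.T / temperature

    def matched_pair_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # Negative mean log-probability assigned to the matched (diagonal) entry.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (
        matched_pair_cross_entropy(logits) + matched_pair_cross_entropy(logits.T)
    )
```

Minimising this loss raises the similarity of matched image/description pairs relative to every mismatched pair in the batch, which is one way to read "maximizing the semantic correspondence" in the abstract.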
