2605.22767 2026-05-22 cs.CV updated version

Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu, Wei Xu

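Affiliations: Western University; University of Toronto

AI summary: This study examines whether synthetic data alone is enough to overcome data scarcity in pediatric rare disease recognition. Its experiments find that high-fidelity synthetic data can approximate clinically meaningful distributions, which makes it a privacy-preserving visual resource for genetic counseling.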

Comments CVPR 2026 CV4CHL workshop

Abstract

Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that synthetic-only training achieves performance comparable to real-data-only baselines at sufficient scale across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. These findings together further enable the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, supporting clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children's healthcare.

2605.22751 2026-05-22 cs.CV updated version

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

Affiliations: Faculty of Mathematics and Computer Science, Jagiellonian University; Doctoral School of Exact and Natural Sciences, Jagiellonian University; Centre for Credible AI, Warsaw University of Technology

AI summary: This paper proposes CEDAR, which uses sparse disentanglement to reveal the compositional structure of pretrained embeddings without increasing dimensionality, improving the interpretability of vision-language models and their alignment with human perception.

Abstract

Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
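
The idea of an invertible transform followed by a top-k bottleneck can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the rotation is a hand-picked 2D rotation rather than a learned one, and the helper names (`rotation_2d`, `top_k`, `encode_decode`) are hypothetical, not taken from the paper.

```python
import math

def rotation_2d(theta):
    """2x2 rotation matrix; it is orthogonal, so its inverse is its transpose."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def top_k(vec, k):
    """Keep the k largest-magnitude coordinates and zero out the rest."""
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(vec)]

def encode_decode(x, rot, k):
    z = top_k(matvec(rot, x), k)        # sparse code with the same dimension as x
    x_hat = matvec(transpose(rot), z)   # undo the rotation to reconstruct x
    return z, x_hat

# Hypothetical example: [1, 1] is dense in the original basis but becomes
# 1-sparse after a hand-picked (not learned) rotation by -45 degrees.
x = [1.0, 1.0]
z, x_hat = encode_decode(x, rotation_2d(-math.pi / 4), k=1)
```

In this toy case the code `z` has a single nonzero coordinate and the reconstruction `x_hat` still matches `x` up to floating-point error, which is the sense in which a change of basis can expose sparsity without adding dimensions.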

2605.22678 2026-05-22 cs.CV cs.AI updated version

Seeing Poetry: Image-Semantic Detection of AI-Generated Modern Chinese Poetry Based on Large Language Models

Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong

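Affiliations: Department of Computer and Information Science, University of Macau; University of Rochester; Sichuan University; Department of Portuguese, Faculty of Arts and Humanities, University of Macau

AI summary: This paper proposes an image-semantic guided poetry detection method that combines image content with the poem text to improve the performance of large language models in detecting modern Chinese poetry. Experiments show that the method outperforms traditional methods on multiple datasets.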

Abstract

Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on data generated by multiple LLMs prove the effectiveness of our method.

2605.22651 2026-05-22 cs.CV updated version

What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

Hyejin Go, Semi Lee, Hyesong Choi

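Affiliations: Soongsil University

AI summary: This paper studies how counterfactual phrase intervention can improve compositional data selection in vision-language pretraining. It proposes CPI to address the failure of global filtering signals in existing methods, improving model performance on relation recognition tasks.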

Comments 11 pages, 2 figures, 4 tables. Preprint

Abstract

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
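
A rough sketch of phrase sensitivity under nonce substitution might look like the following. It is only a toy under stated assumptions: `toy_score` is a hypothetical word-overlap stand-in for a real image-text similarity model, the image is represented by a hand-written tag set, and the nonce token `blicket` is an arbitrary choice. None of these come from the paper.

```python
def toy_score(image_tags, caption):
    """Hypothetical stand-in for an image-text score: the fraction of
    caption words that also appear among the image's tags."""
    words = caption.lower().split()
    return sum(w in image_tags for w in words) / len(words)

def phrase_sensitivity(image_tags, caption, phrases, nonce="blicket"):
    """Score drop when each phrase is replaced by a nonce token."""
    base = toy_score(image_tags, caption)
    return {p: base - toy_score(image_tags, caption.replace(p, nonce))
            for p in phrases}

# Hypothetical image whose content is summarized by three tags.
image_tags = {"dog", "red", "ball"}
caption = "a dog chasing a red ball"
scores = phrase_sensitivity(image_tags, caption, ["dog", "red ball", "chasing"])
```

In this toy setup, replacing "chasing" leaves the score unchanged, so the phrase gets zero sensitivity, while replacing "dog" or "red ball" lowers the score. A curation step could then rank captions by such scores; how the paper aggregates them is not specified here.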

2605.22649 2026-05-22 cs.CV cs.LG updated version

From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

Yilin Zhang, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar

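Affiliations: School of Electronics and Computer Science; University of Southampton; MRC Lifecourse Epidemiology Centre; University of Southampton, Southampton General Hospital; Computer Science University of Southampton

AI summary: This paper proposes a metadata-conditioned causal hierarchical variational autoencoder for causally consistent generation of spine DXA images from the UK Biobank. Causal consistency is evaluated in a baseline-to-follow-up setting, showing strong agreement for key vertebral morphometry variables under age intervention and supporting anatomically plausible DXA image synthesis.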

Comments 7 pages, 4 figures, 3 tables. Accepted at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

Abstract

Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction-action-prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.
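
The three AAP steps can be sketched on a toy additive model. This is a minimal illustration only: the linear mechanism, the coefficients, and the numbers below are invented for the example and have nothing to do with the CHVAE or UK Biobank measurements.

```python
# Hypothetical mechanism: value = INTERCEPT + SLOPE * age + noise,
# where `noise` is the individual-specific term recovered by abduction.
INTERCEPT, SLOPE = 30.0, -0.05

def abduct(observed_value, age):
    """Abduction: infer the individual's noise term from the baseline."""
    return observed_value - (INTERCEPT + SLOPE * age)

def predict(age, noise):
    """Prediction: run the mechanism forward with the abducted noise."""
    return INTERCEPT + SLOPE * age + noise

baseline_age, baseline_value = 60.0, 27.5
noise = abduct(baseline_value, baseline_age)      # step 1: abduction
followup_age = 65.0                               # step 2: action, set age
counterfactual = predict(followup_age, noise)     # step 3: prediction
```

With these made-up numbers the abducted noise is 0.5 and the counterfactual follow-up value is 27.25. In an evaluation like the one described, such a prediction would then be compared against the measurement actually observed at the repeat visit.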

2605.22631 2026-05-22 cs.CV updated version

Decoupling Ego-Motion and Target Dynamics via Dual-Interval Motion Cues for UAV Detection

Liuyang Wang, Feitian Zhang

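Affiliations: Department of Robotics, School of Advanced Manufacturing and Robotics; State Key Laboratory of Turbulence and Complex Systems; Peking University; Great Bay University

AI summary: This paper proposes a vision-only motion-guided detection framework that uses a dual-interval motion extraction strategy and a lightweight motion-guided attention module to decouple target motion from camera-induced disturbances, improving UAV detection under severe ego-motion.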

Abstract

Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
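
Why two differencing intervals can help is easy to see in a toy sketch. The assumptions here are loud ones: frames are tiny hand-made grids, they are taken to be already aligned (the homography-based compensation step is not implemented), and the interval lengths 1 and 3 and all function names are hypothetical.

```python
def abs_diff(frame_a, frame_b):
    """Per-pixel absolute difference of two equally sized grayscale frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]

def dual_interval_cues(frames, t, short=1, long=3):
    """Return (short-interval, long-interval) difference maps for frame t."""
    return (abs_diff(frames[t], frames[t - short]),
            abs_diff(frames[t], frames[t - long]))

def make_frame(col, width=6):
    """Hypothetical 1 x width frame with one bright pixel at column `col`."""
    return [[1 if c == col else 0 for c in range(width)]]

# Frames are assumed to be already aligned by global motion compensation.
# The target moves from column 1 to column 2 and then pauses.
frames = [make_frame(1), make_frame(1), make_frame(2), make_frame(2)]
short_map, long_map = dual_interval_cues(frames, t=3)
```

Here the short-interval map is all zeros because the target did not move between the last two frames, while the long-interval map still marks columns 1 and 2. A single interval would miss one kind of motion or the other.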

2605.22591 2026-05-22 cs.CV updated version

Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure

Zitong Li, Haoyu Wang

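Affiliations: Department of Biostatistics and Health Informatics

AI summary: Using a benchmark that spans five medical datasets, three backbones, two noise types, and five noise rates, this paper re-evaluates noisy-label learning methods in the frozen-feature regime, exposes the limits of the small-loss assumption in high-risk settings, and proposes a feature-space selector to guide practical use.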

Abstract

Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $\chi^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53-61%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs. 13pp precision drop). On ISIC2019 with asymmetric 40% noise, Co-Teaching reaches 68% overall accuracy while collapsing to 35.1% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
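
The gap between overall accuracy and balanced accuracy reported above follows from how the two are defined. The sketch below uses made-up labels (not ISIC2019 data) to show a classifier that always predicts the majority class; balanced accuracy is computed as the mean of per-class recall.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Made-up imbalanced labels: eight samples of class 0, one each of 1 and 2.
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 10  # a degenerate model that always predicts the majority class

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

On this toy data overall accuracy is 0.8 while balanced accuracy is one third, because two of the three classes have zero recall.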

2605.22581 2026-05-22 cs.CV cs.AI cs.LG updated version

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

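Affiliations: Cornell University; Kempner Institute, Harvard University

AI summary: This paper proposes a method for floorplan localization in the wild that grounds the task in a reconstructed 3D representation of the scene, addressing the limited applicability of existing methods to large-scale buildings and rasterized floorplans.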

Comments Project Page: https://Cornell-VAILab.github.io/SceneAligner

Abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
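
Two geometric pieces of the pipeline described above, the top-down density map and the 2D similarity transform, can be sketched as follows. This is a simplified illustration: the grid cell size, the sample points, and the function names are assumptions, and the learned cross-modal matching step is not represented at all.

```python
import math
from collections import Counter

def density_map(points, cell=1.0):
    """Count 3D points per (x, y) grid cell, dropping the gravity-aligned z axis."""
    return Counter((math.floor(x / cell), math.floor(y / cell))
                   for x, y, _z in points)

def similarity_transform(xy, scale, theta, tx, ty):
    """Apply a 2D similarity transform: uniform scale, rotation, translation."""
    x, y = xy
    c, s = math.cos(theta), math.sin(theta)
    return (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)

# Hypothetical gravity-aligned points; heights differ but are ignored.
points = [(0.2, 0.3, 0.0), (0.4, 0.1, 2.5), (1.5, 0.2, 1.0)]
dm = density_map(points)

# Map a proxy coordinate into floorplan coordinates with made-up parameters.
moved = similarity_transform((1.0, 0.0), scale=2.0, theta=math.pi / 2, tx=1.0, ty=0.0)
```

A similarity transform has only four degrees of freedom (scale, rotation angle, and two translations), which is why aligning the proxy to the floorplan can in principle be recovered from a small number of correspondences.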

2605.22578 2026-05-22 cs.CV updated version

Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping

Chouaib Bencheikh Lehocine, Adam Lilja, Junsheng Fu, Lars Hammarstrand

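Affiliations: Zenseact AB; Chalmers University of Technology

AI summary: This paper proposes a granular, order-aware evaluation metric for online mapping methods. By introducing sequence optimal sub-pattern assignment (SOSPA) and, for multi-instance evaluation, polyline localisation and detection (PLD), it improves on conventional Chamfer-distance-based evaluation and shows that detection capability is the main bottleneck in current methods.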

Abstract

Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies predominantly on mean average precision (mAP) based on thresholded Chamfer distance (CD). This framework lacks sensitivity to point ordering and provides limited granularity in assessing geometric quality, making it difficult to distinguish which methods truly excel over others. In this work, we address these limitations on two fronts. For the single-instance similarity measure, we introduce sequence optimal sub-pattern assignment (SOSPA), an order-aware metric that enables fine-grained evaluation of individual geometries while satisfying all metric axioms. For the multi-instance evaluation framework, we propose polyline localisation and detection (PLD), a soft metric that jointly captures detection quality and geometric accuracy, replacing the hard thresholding of mAP with a principled soft assignment. Through evaluations on nuScenes, we demonstrate that PLD effectively ranks SOTA online mapping methods (MapTRv2, StreamMapNet, MapTracker) while providing a decomposed error analysis. This analysis identifies detection capability as the dominant bottleneck in current methods, revealing a performance trend that mAP fails to capture. Code for evaluation using our metrics will be released.
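
The insensitivity of Chamfer distance to point ordering can be checked directly. The sketch below uses one common (symmetric, averaged, unsquared) form of Chamfer distance, which may differ from the exact variant used in mAP benchmarks. The `indexwise` function is only a naive order-aware contrast added for illustration; it is not SOSPA.

```python
import math

def chamfer(a, b):
    """Symmetric Chamfer distance: average nearest-neighbor distance both ways."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def indexwise(a, b):
    """Mean distance between points sharing an index (naive order-aware
    contrast for equal-length polylines; this is not SOSPA)."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reversed_line = line[::-1]  # same points, opposite direction of travel

cd = chamfer(line, reversed_line)
ordered = indexwise(line, reversed_line)
```

The Chamfer distance between the polyline and its reversal is exactly zero, since both contain the same point set, whereas the index-wise distance is 4/3. For directed map elements such as lane centerlines, that difference is the kind of error a set-based metric cannot see.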

2605.22572 2026-05-22 cs.CV updated version

GeoWeaver: Grounding Visual Tokens with Geometric Evidence Before Scene Reasoning

Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang

Affiliations: Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; Pazhou Lab (Huangpu); Hainan University; University of California at Merced

AI summary: This paper proposes GeoWeaver, a framework that grounds visual tokens in geometric evidence before scene reasoning, improving spatial reasoning while retaining multimodal capabilities.

Abstract

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .
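
One generic way to read "token-adaptive retrieval followed by a residual update" is as soft attention over a bank of vectors. The sketch below is that generic reading only, under the assumption that dot-product softmax weighting is used; the abstract does not state the actual allocation rule, and the bank, token, and function names here are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ground_token(token, bank):
    """Weight bank entries by similarity to the token, then add the
    weighted evidence back onto the token (residual update)."""
    weights = softmax([dot(token, entry) for entry in bank])
    evidence = [sum(w * entry[i] for w, entry in zip(weights, bank))
                for i in range(len(token))]
    return [t + e for t, e in zip(token, evidence)], weights

# Hypothetical 2D token and a two-entry geometry bank.
token = [10.0, 0.0]
bank = [[1.0, 0.0], [0.0, 1.0]]
grounded, weights = ground_token(token, bank)
```

Because the weights depend on the token, two different tokens would pull different mixtures from the same bank, which is the contrast with broadcasting a single shared geometric signal to all tokens.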

2605.22552 2026-05-22 cs.CV cs.MM updated version

FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

Haokun Wen, Xuemeng Song, Xinghao Xie, Xiaolin Chen, Xiangyu Zhao, Weili Guan

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); Department of Data Science, City University of Hong Kong; Department of Computer Science and Engineering, Southern University of Science and Technology; School of Artificial Intelligence, Nanjing University; Institute of Data Science, National University of Singapore; School of Information Science and Technology, Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute

AI summary: This paper proposes FashionLens, a framework that achieves versatile fashion image retrieval through task-adaptive learning, addressing the inability of existing methods to handle diverse retrieval needs.

Abstract

Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
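
Spherical linear interpolation (slerp), which the calibrator above is said to build on, moves between two unit vectors along the arc of the sphere rather than along the chord. The sketch below is the textbook formula only; how the interpolation weight is chosen adaptively is not described in the abstract, so `t` is simply passed in, and inputs are assumed to be unit-norm.

```python
import math

def slerp(p, q, t):
    """Spherical linear interpolation between unit vectors p and q.

    Assumes p and q are unit-norm. The antipodal case (angle of pi) is
    not handled, because the interpolation path is then not unique.
    """
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(cos_omega)
    if omega < 1e-8:          # vectors nearly identical: nothing to interpolate
        return list(p)
    sin_omega = math.sin(omega)
    w_p = math.sin((1 - t) * omega) / sin_omega
    w_q = math.sin(t * omega) / sin_omega
    return [w_p * a + w_q * b for a, b in zip(p, q)]

# Halfway between two orthogonal unit vectors.
mid = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

Unlike plain linear interpolation, the result stays on the unit sphere, which matters when retrieval uses cosine similarity: the midpoint here has norm 1 rather than about 0.707.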

2605.22550 2026-05-22 cs.CV updated version

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

Varun A. Paturkar, Shankar Gangisetty, C. V. Jawahar

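Affiliations: CVIT, IIIT-Hyderabad

AI summary: This paper presents the MOTOR dataset for studying two-wheeler rider behavior in dense, unstructured traffic. Its fusion of multi-view, multimodal data provides a new research foundation for driving assistance systems.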

Abstract

Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 1,629 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code are available at https://varuniiith.github.io/MOTOR-Dataset/

2605.22538 2026-05-22 cs.CV updated version

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou

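Affiliations: Shenzhen International Graduate School, Tsinghua University; Department of Automation, Tsinghua University

AI summary: This paper proposes the SAMOSA framework, which explicitly leverages motion, geometry, and semantic cues to improve SAM 2 on complex nonlinear visual object tracking, yielding a more robust and generalizable tracker.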

Abstract

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.

2605.22536 2026-05-22 cs.CV cs.CL updated version

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong

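Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; University of Electronic Science and Technology of China; Chongqing University; The University of Tokyo; Beihang University; Northwestern Polytechnical University

AI summary: This paper introduces SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. A physically grounded degradation synthesis engine generates nine degradation types, which are used to evaluate the spatial reasoning of multimodal large language models under visual degradation and to show that fine-tuning under degraded conditions improves model robustness.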

详情

AI中文摘要

多模态大语言模型（MLLMs）在空间智能方面取得了快速进展，但现有空间推理基准大多假设纯净的视觉输入，忽略了现实部署中常见的退化现象，如运动模糊、低光照、恶劣天气、镜头畸变和压缩伪影。这提出了一个根本性问题：当前MLLMs在视觉观察不完美时的空间智能鲁棒性如何？为回答这个问题，我们引入SpaceDG，首个大规模退化感知空间理解数据集。它通过物理基础的退化合成引擎将退化形成过程嵌入3D高斯点散布（3DGS）渲染，能够真实模拟九种退化类型。所生成的数据集包含约100万对QA问题，来自近1000个室内场景。我们进一步引入SpaceDG-Bench，一个经人类验证的基准，包含11种推理类别和9种视觉退化类型的1102个问题，产生超过10000个VQA实例。评估25个开源和闭源MLLMs发现，视觉退化一致且显著损害空间推理能力，暴露出关键的鲁棒性差距。最后，我们展示在SpaceDG上微调可显著提高退化鲁棒性，并且在退化条件下甚至可以超越人类性能，而不会在清晰图像上造成性能下降，突显了退化感知训练在鲁棒空间智能方面的潜力。

英文摘要

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

2605.22504 2026-05-22 cs.AI cs.CV (updated version)

LACO: Adaptive Latent Communication for Collaborative Driving

Tianhao Chen, Yuheng Wu, Dongman Lee

Affiliations: Korea Advanced Institute of Science & Technology

AI summary: This paper proposes LACO, a training-free latent communication paradigm that uses Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to address the latency and information loss of language-based communication in collaborative driving. Experiments show that it lowers communication and inference latency while maintaining strong collaborative driving performance.

Abstract

Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

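2605.22492 2026-05-22 cs.CV (updated version)

Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline

Sebastian Cavada, Francesco Pelosin, Lapo Faggi

Affiliations: Covision Lab

AI summary: This paper proposes a training-free, two-stage framework for fine-grained semantic segmentation in low-data regimes. It generates mushroom masks from macro-category prompts and then assigns fine-grained labels by prototype matching in an embedding space, improving scalability and lowering segmentation cost.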

Comments Accepted at the 13th Workshop on Fine-Grained Visual Categorization, CVPR 2026

2605.22484 2026-05-22 cs.CV (updated version)

Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

David Méndez, Roberto Confalonieri, Natalia Díaz Rodríguez

Affiliations: Department of Computer Science and Artificial Intelligence, DaSCI Institute, University of Granada, Granada, Spain; Department of Mathematics "Tullio Levi-Civita", University of Padova, Padova, Italy

AI summary: This paper proposes using the classification heads of pretrained vision models as semantic prototypes, recycling their weights to achieve vision-language alignment and to improve performance on cross-modal retrieval, zero-shot classification, and few-shot classification.

Abstract

Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.

2605.22469 2026-05-22 cs.CV (updated version)

MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

Patryk Bartkowiak, Lennart Petersen, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki

Affiliations: Adam Mickiewicz University; Kiel University; ArtCollect; KAUST

AI summary: This paper proposes MaSC, a masked similarity metric for evaluating concept preservation and prompt following in single-concept personalized generation with text-to-image diffusion models. Using externally provided foreground concept masks, it decomposes evaluation into subject-specific concept preservation and background-based prompt following.

Comments 20 pages, 2 figures, 7 tables

Abstract

Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding. On DreamBench++ human ratings, MaSC achieves Krippendorff alpha = 0.471 for concept preservation, outperforming all tested non-LLM baselines and GPT-4V, and approaching GPT-4o. On ORIDa, a real-photo identity-preservation benchmark across physical environments, MaSC achieves AUC = 0.992, nearly perfectly distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++. These results show that spatially decomposed aggregation is a strong design principle for evaluating concept-driven generation.
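
The masked max-cosine matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: patch features are plain lists of floats rather than SigLIP2 features, the names `cosine` and `masked_max_cosine` are hypothetical, and averaging the per-patch maxima is an assumption, since the abstract does not say how the matches are aggregated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def masked_max_cosine(ref_patches, ref_mask, gen_patches):
    """Average, over foreground reference patches, of the best cosine match
    among all generated-image patches.

    ref_patches: list of feature vectors for the reference image
    ref_mask:    list of booleans, True where a reference patch is foreground
    gen_patches: list of feature vectors for the generated image
    Mean aggregation is an assumption made for this sketch.
    """
    foreground = [p for p, keep in zip(ref_patches, ref_mask) if keep]
    if not foreground or not gen_patches:
        return 0.0
    best = [max(cosine(p, g) for g in gen_patches) for p in foreground]
    return sum(best) / len(best)


# Toy usage with 2-D "features": the middle reference patch is background
# and is ignored by the score.
score = masked_max_cosine(
    ref_patches=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    ref_mask=[True, False, True],
    gen_patches=[[1.0, 0.0], [-1.0, 0.0]],
)
```

Restricting the outer loop to foreground reference patches is what keeps background similarity from inflating the concept-preservation score.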

2605.22467 2026-05-22 cs.CV (updated version)

SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data

Patryk Bartkowiak, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki

Affiliations: Adam Mickiewicz University; ArtCollect; KAUST; Kiel University

AI summary: This paper proposes SADGE, a quantitative similarity metric that predicts how well synthetic image datasets perform on common computer vision tasks without downstream model training. Existing evaluation metrics (e.g., PSNR, FID, CLIP) mainly measure semantic alignment between real and synthetic images (Appearance Similarity Score), while structural similarity is used to assess the domain gap (Geometric Similarity Score). Across a range of synthetic datasets and downstream tasks, the paper shows that neither appearance nor geometric similarity alone reliably predicts downstream performance; their non-linear interaction determines the utility of synthetic data. Across five public synthetic-to-real benchmark families and 15 dataset variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank-based criteria.

Abstract

We propose SADGE, a quantitative similarity metric that predicts the performance of synthetic image datasets for common computer vision tasks without downstream model training. Estimating whether a synthetic dataset will lead to a model that performs well on real-world data remains a bottleneck in model development. Existing evaluation metrics (e.g., PSNR, FID, CLIP) primarily measure semantic alignment between real and synthetic images (Appearance Similarity Score). Less commonly, structural similarity between images is considered to assess the domain gap (Geometric Similarity Score). However, to the best of our knowledge, there exist no studies that evaluate which similarity metric is the best downstream predictor for a given synthetic dataset. In this paper, we show over a wide variety of different synthetic datasets and downstream tasks that neither appearance nor geometry alone can reliably predict downstream performance; rather, it is their non-linear interplay that dictates synthetic data utility. Specifically, we measure how commonly used Appearance and Geometric Similarity metrics computed between synthetic and real images correlate with downstream performance in object detection, semantic segmentation, and pose estimation. Across five public synthetic-to-real benchmark families and 15 dataset-level variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank-based criteria, reaching Pearson r=0.88 and Spearman rho=0.77. For each combination of geometry-based and appearance-based methods, we compute SADGE scores across all benchmark families. The best configuration is obtained by fusing DINOv3 appearance similarity with MASt3R geometric consistency through a constrained bilinear interaction, outperforming both the strongest geometry-only baseline and the strongest appearance-only baseline.
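
The two criteria quoted above, Pearson r (linear) and Spearman rho (rank-based), can be computed as in the following sketch. It only illustrates the standard definitions; the helper names and the toy numbers are hypothetical, and nothing here reproduces the SADGE score itself or the paper's data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up metric scores and downstream accuracies for four dataset variants.
metric_scores = [1.0, 2.0, 3.0, 4.0]
downstream = [1.0, 4.0, 9.0, 16.0]
r = pearson(metric_scores, downstream)
rho = spearman(metric_scores, downstream)
```

In this toy case the relationship is monotonic but not linear, so rho is exactly 1 while r stays below 1, which is why reporting both criteria is informative.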

2605.22455 2026-05-22 cs.CV cs.AI cs.LG physics.optics (updated version)

Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light

Valeria Pais, Malena Mendilaharzu, Daniele Faccio, Luis Oala, Christoph Clausen, Bruno Sanguinetti

Affiliations: University of Glasgow; Dotphoton

AI summary: This paper proposes a synthetic RAW augmentation method for evaluating person detection models more accurately in low light. It generates low-light samples that match the camera sensor's noise model in order to improve data coverage for benchmarks.

Comments Accepted non-archival paper at the CVPR 2026 AUTOPILOT Workshop (Autonomous Understanding Through Open-world Perception and Integrated Language Models for On-road Tasks)

Abstract

Real-world deployment of AI vision models is both fueled and limited by the data available for training and testing. Real datasets are sparse and uneven: long-tailed or unbalanced distributions hinder generalization, and the low number of samples in low density regions makes it hard to run evaluations. Synthetic data can fill these gaps, providing us with a way to sample the input space more continuously and improve data coverage for benchmarks. Focusing on the autonomous driving safety-critical case of pedestrian detection in the dark, we show how synthetic low-light samples can be used to better characterize the performance of a state-of-the-art object detection model as a function of the scene illumination. We use a synthetic RAW image augmentation technique to generate low-light samples that match the noise model of the camera sensor. Performance metrics on real and synthetic low-light data are similar, indicating that the AI model finds it hard to distinguish between them.
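
One common way to make dimmed RAW samples follow a sensor noise model is a Poisson-Gaussian model: signal-dependent shot noise plus signal-independent read noise. The sketch below assumes that model; the abstract does not specify the authors' technique, so the model choice, the function names, and every parameter (`gain`, `read_noise`, `black_level`, `white_level`) are assumptions for illustration. It also assumes the input frame is well exposed, so its own noise is negligible after dimming.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw a Poisson sample. Knuth's method for small means, a rounded
    Gaussian approximation for large means."""
    if lam <= 0.0:
        return 0
    if lam > 500.0:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def simulate_low_light(raw, dim_factor, gain=1.0, read_noise=2.0,
                       black_level=0.0, white_level=1023.0, seed=0):
    """Dim a list of RAW pixel values (digital numbers) and re-apply noise.

    dim_factor: fraction of light retained (0.1 means ten times darker)
    gain:       digital numbers per photo-electron (assumed sensor parameter)
    read_noise: standard deviation of Gaussian read noise, in digital numbers
    """
    rng = random.Random(seed)
    output = []
    for value in raw:
        # Expected photo-electron count after dimming the scene.
        electrons = max(value - black_level, 0.0) / gain * dim_factor
        shot = poisson_sample(electrons, rng)
        noisy = shot * gain + rng.gauss(0.0, read_noise) + black_level
        output.append(min(max(noisy, 0.0), white_level))
    return output


# Toy usage: a flat patch at 100 DN dimmed to 10 percent of its light.
dark_patch = simulate_low_light([100.0] * 2000, dim_factor=0.1, read_noise=0.0)
```

Because `dim_factor` is a continuous parameter, sweeping it is one way to sample illumination levels more densely than a real dataset allows.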

2605.22446 2026-05-22 cs.CV cs.AI cs.RO (updated version)

Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

Affiliations: Beihang University; Tsinghua University; Peking University; JDT AI Infra; Zhejiang University

AI summary: This paper proposes Pre-VLA, a unified runtime verification architecture that assesses action validity before physical execution or world-model imagination, improving the reliability of vision-language-action and world-model rollouts.

Abstract

While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
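
The focal classification term mentioned above is a standard way to handle class imbalance. The sketch below shows the usual binary focal loss; it is not taken from the Pre-VLA code, and the default values `gamma=2.0` and `alpha=0.25` are commonly used settings from the focal loss literature, assumed here rather than reported in the abstract.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example.

    p: predicted probability of the positive class
    y: ground-truth label, 0 or 1
    gamma: focusing parameter; larger values down-weight easy examples more
    alpha: weight of the positive class (the negative class gets 1 - alpha)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than a confident mistake.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With `gamma=0` the focusing factor disappears and the expression reduces to class-weighted cross-entropy, which makes the effect of the focusing term easy to isolate.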

2605.22425 2026-05-22 eess.IV cs.CV (updated version)

Time-varying rPPG signal separation via block-sparse signal model

Kosuke Kurihara, Yoshihiro Maeda, Daisuke Sugimura, Takayuki Hamamoto

Affiliations: Tokyo University of Science; Shibaura Institute of Technology; Tokyo Metropolitan University

AI summary: This paper proposes an rPPG signal extraction method that exploits the quasi-periodic nature of rPPG signals. It builds a time-varying signal separation framework that adapts the separation to illumination changes, and experiments validate the method's effectiveness.

Comments Accepted by IEEE International Conference on Image Processing (ICIP 2026)

2605.22422 2026-05-22 cs.CV cs.AI (updated version)

FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers

Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet

Affiliations: LITIS

AI summary: This paper proposes FastTab, a grid-based table structure recognition model that recovers table structure efficiently with a lightweight Tiny Recursive Module and axial 1D Transformer encoders, showing low latency and good robustness on several benchmarks.

Abstract

Table structure recognition (TSR) requires both table-level coherence (row/column counts, headers, spanning cells) and precise separator localization. We introduce FastTab, a grid-centric TSR model that avoids autoregressive HTML decoding by combining (i) a lightweight Tiny Recursive Module (TRM) for global reasoning and (ii) axial 1D Transformer encoders that capture long-range dependencies along rows and columns. The model predicts row/column counts, header rows, and separators to construct a grid, then infers rowspan/colspan using ROI-aligned cell features. Across four benchmarks (PubTabNet, FinTabNet, PubTables-1M, and SciTSR), FastTab achieves competitive structure recovery performance while operating at low-latency inference. We further study robustness under pixel-level anonymisation and show an extension to curved separators for camera-captured documents. The source code will be made publicly available at https://github.com/hamdilaziz/FastTab .

2605.22420 2026-05-22 cs.CV cs.AI cs.RO (updated version)

Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun

Affiliations: Waabi; University of Toronto; University of Illinois Urbana-Champaign

AI summary: This paper proposes GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction. By learning generative priors across diverse scenes, it efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints, enabling robust and scalable sensor simulation for autonomous driving.

Comments ICRA 2026. Project page: https://waabi.ai/genre

Abstract

Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting their applicability to closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes its deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints (e.g., lane changes). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

2605.22414 2026-05-22 cs.CV cs.AI (updated version)

Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

Xingyue Wang, Bo Liu, Meng Wang, Zhixuan Zhang, Chengcheng Zhu, Huazhu Fu, Jiang Liu

Affiliations: Department of Computer Science and Engineering, Southern University of Science and Technology; The Hong Kong Polytechnic University; National University of Singapore; University of Washington; Institute of High Performance Computing, Agency for Science, Technology and Research

AI summary: This paper proposes the FundusGround benchmark, which improves the clinical interpretability of ophthalmic VQA through spatially grounded lesion evidence. A three-stage pipeline collects lesion-annotated fundus images, and multiple vision-language models are evaluated on answer accuracy and lesion-level reasoning.

Abstract

Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 meticulously annotated image-level lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated spanning four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general and medical large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.
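
The mapping to nine regions can be illustrated with the standard ETDRS grid: three concentric circles of 1 mm, 3 mm, and 6 mm diameter centered on the fovea, with the inner and outer rings each split into superior, nasal, inferior, and temporal quadrants. The sketch below is an illustration under stated assumptions, not the benchmark's code: the function name is hypothetical, lesion positions are assumed to be given in millimetres relative to the fovea with +x pointing to the image right and +y pointing up, and the image is assumed to be a non-mirrored fundus photograph, where the nasal side lies to the image right for a right eye.

```python
import math


def etdrs_region(x_mm, y_mm, eye="right"):
    """Return the ETDRS region of a point, or None if it lies outside the grid.

    x_mm, y_mm: offset from the fovea in millimetres (+x image right, +y up)
    eye: "right" or "left"; decides which horizontal side is nasal
    """
    radius = math.hypot(x_mm, y_mm)
    if radius <= 0.5:          # central subfield, 1 mm diameter
        return "central"
    if radius > 3.0:           # beyond the 6 mm outer circle
        return None
    ring = "inner" if radius <= 1.5 else "outer"

    # Quadrant boundaries run along the 45-degree diagonals.
    angle = math.degrees(math.atan2(y_mm, x_mm)) % 360.0
    if 45.0 <= angle < 135.0:
        quadrant = "superior"
    elif 225.0 <= angle < 315.0:
        quadrant = "inferior"
    else:
        points_right = angle < 45.0 or angle >= 315.0
        nasal_is_right = eye == "right"
        quadrant = "nasal" if points_right == nasal_is_right else "temporal"
    return f"{ring} {quadrant}"
```

One central subfield plus four inner and four outer quadrants gives the nine regions; points beyond the 6 mm circle return `None` so that a caller can decide how to handle lesions outside the grid.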

2605.22413 2026-05-22 cs.CV (updated version)

From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

Yandi Wang, Libin Zhan, Ziwei Huang, Tiancheng Luo, Yuxuan Jiang, Wang Dong, Leilei Gan, Jun Chen

Affiliations: Zhejiang University

AI summary: This paper proposes ReceiptBench, a benchmark that organizes receipt information extraction into four hierarchical sub-tasks, together with a two-stage training framework incorporating Metric-Aware GRPO, which turns evaluation constraints into reinforcement learning signals to improve structural consistency. Experiments show strong performance on complex reasoning tasks.

Abstract

Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at https://github.com/wwwT0ri/ReceiptBench.
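
The core of Group Relative Policy Optimization is that each sampled output is scored relative to the other outputs sampled for the same input. The sketch below shows that normalization step only. The reward function is a hypothetical stand-in (fraction of exactly matching fields); the abstract does not define the paper's metric-aware reward, and the use of the population standard deviation is likewise an assumption.

```python
import statistics


def field_match_reward(predicted, gold):
    """Hypothetical reward: fraction of gold fields reproduced exactly."""
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return hits / len(gold)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input:
    advantage_i = (reward_i - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Toy usage: four sampled extractions for one receipt.
gold = {"total": "12.50", "date": "2024-01-02"}
samples = [
    {"total": "12.50", "date": "2024-01-02"},
    {"total": "12.50", "date": "2024-01-01"},
    {"total": "1250", "date": "2024-01-01"},
    {"total": "12.50", "date": "2024-01-02"},
]
rewards = [field_match_reward(sample, gold) for sample in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, a sample is reinforced only when it beats its siblings, which removes the need for a separate learned value function.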

2605.22403 2026-05-22 cs.CV (updated version)

Translating Signals to Languages for sEMG-Based Activity Recognition

Ming Wang, Haoxuan Qu, Qiuhong Ke, Wei Zhou, Hossein Rahmani, Jun Liu

Affiliations: Lancaster University; Monash University; Cardiff University

AI summary: This paper proposes an LLM-based framework for sEMG activity recognition that converts continuous sEMG signals into a language form to improve recognition accuracy.

Abstract

Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Within this framework, we design a language-oriented mapping mechanism that converts continuous sEMG sequences into sEMG language, integrating several strategies to further facilitate the signal-to-language mapping process. Extensive experiments demonstrate that the proposed framework achieves highly accurate sEMG signal-based activity recognition using large language models.

2605.22366 2026-05-22 cs.CV (updated version)

QuantSR+: Pushing the Limits of Quantized Image Super-Resolution Networks

Haotong Qin, Xudong Ma, Xianglong Liu, Jie Luo, Jinyang Guo, Michele Magno, Yulun Zhang

Affiliations: ETH Zurich; Shanghai Jiao Tong University; Beihang University

AI summary: This paper proposes QuantSR+, a framework that improves quantization operators, network design, and training optimization to achieve a better trade-off between accuracy and efficiency. To counter the performance drop at ultra-low precision, it makes three key technical contributions: Redistribution-driven Bit Determination, Quantized Slimmable Architecture, and Slimming-guided Function-localized Distillation.

Abstract

Low-bit quantization is widely used to compress super-resolution (SR) models and reduce storage and computation costs for deployment on resource-limited devices. However, when SR models are pushed to ultra-low precision (2-4 bits), performance can drop sharply due to diminished representational capacity and the detail-sensitive nature of SR. To address these issues, we propose QuantSR+, a unified framework that improves quantization operators, network design, and training optimization, achieving better trade-offs between accuracy and efficiency than prior low-bit SR methods. QuantSR+ mainly relies on three technical contributions: (1) Redistribution-driven Bit Determination (RBD), which reshapes quantization distributions in both forward and backward passes to preserve representation fidelity; (2) Quantized Slimmable Architecture (QSA), which begins with an over-parameterized model and progressively prunes less critical blocks to meet efficiency budgets while pushing the accuracy performance; and (3) Slimming-guided Function-localized Distillation (SFD), which enforces block-aware feature alignment via a direct loss and a progressive, function-local training schedule to capture quantization effects better and speed up convergence. Extensive experiments show that QuantSR+ achieves state-of-the-art performance against both specialized quantized SR methods and generic quantization approaches. For SwinIR-S on Urban100 (x4), it improves PSNR by 0.29 dB over the 2-bit SOTA baseline. Meanwhile, it delivers strong efficiency gains at 2-bit, reducing operations by up to 87.9% and storage by 89.4%. QuantSR+ is effective for both convolutional and transformer-based SR models, indicating broad applicability.
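
To make the 2-4 bit setting concrete, the sketch below applies plain uniform quantization to a list of values and computes the ideal storage saving relative to 32-bit storage. It is a generic baseline for illustration only, not the RBD quantizer proposed in the paper; the function names and the 32-bit reference width are assumptions.

```python
def quantize_uniform(values, bits):
    """Snap each value to one of 2**bits evenly spaced levels spanning
    the range [min(values), max(values)]."""
    steps = 2 ** bits - 1
    low, high = min(values), max(values)
    if high == low:
        return list(values)
    scale = (high - low) / steps
    return [low + round((v - low) / scale) * scale for v in values]


def ideal_storage_reduction(bits, full_bits=32):
    """Fraction of storage saved if every value shrinks from full_bits to bits."""
    return 1.0 - bits / full_bits


# Toy usage: at 2 bits only four distinct levels remain.
weights = [0.0, 0.1, 0.4, 0.9, 1.0]
quantized = quantize_uniform(weights, bits=2)
saving = ideal_storage_reduction(2)
```

With four levels, nearby values such as 0.0 and 0.1 collapse onto the same level, which illustrates the loss of representational capacity the abstract refers to. The ideal saving computed here is 93.75%, an upper bound that assumes every stored value is quantized; the 89.4% storage reduction reported above is below that bound.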


2605.22344 2026-05-22 cs.CV cs.AI cs.MM Version update

Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

Affiliations: Bernini Team

AI summary: This paper proposes Bernini, a framework that uses a multimodal large language model for semantic planning and a diffusion model for pixel generation, unifying video generation and editing and improving generalization on editing tasks.

Comments Project Page: https://bernini-ai.github.io/

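AI-generated abstract (translated from Chinese)

Multimodal large language models (MLLMs) and diffusion models have each reached notable maturity: MLLMs offer strong semantic grounding when handling heterogeneous multimodal inputs, while diffusion models generate images and videos with photorealistic fidelity. We argue that the two families can be unified through a simple division of labor: MLLMs handle semantic planning, and diffusion models generate pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer generates pixels from this plan, combined with text features and, for editing, source VAE features to preserve detail. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, retaining the pretrained strengths of both while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE) and add chain-of-thought reasoning to the planner to better turn understanding into generation. Bernini achieves state-of-the-art performance on a wide range of video generation and editing benchmarks, and the MLLM's pretrained understanding yields strong generalization on challenging editing tasks.

English abstract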

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.


2605.22342 2026-05-22 cs.CV cs.AI Version update

4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting

Sifan Zhou, Hang Zhang, Yuhang Wang, Ming Li

Affiliations: Southeast University; Guangdong Laboratory of Artificial Intelligence and Digital Economy; University of Chinese Academy of Sciences

AI summary: This paper proposes 4D-GSW, a kinematic-aware, spatio-temporally consistent watermarking technique that embeds robust copyright information in 4D Gaussian Splatting while preserving high spatio-temporal consistency.

Comments 9 pages main paper, 7 figures, 18 pages in total

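AI-generated abstract (translated from Chinese)

Although 4D Gaussian Splatting (4DGS) has transformed high-fidelity dynamic reconstruction, protecting the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often ignore the underlying kinematic manifold, causing non-physical artifacts such as severe temporal flickering and "FVD collapse". To address this, we propose 4D-GSW, a kinematic-aware watermarking framework that embeds robust copyright information while preserving high spatio-temporal consistency. Unlike previous 4D steganography, our method explicitly handles the physical consistency of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify "dynamic instants" and adaptively gate watermark gradient injection, protecting critical motion manifolds from non-physical perturbations. To ensure global consistency under complex deformations, we propose a joint HMM-MRF energy minimization model that synchronizes watermark phases along temporal trajectories and within spatial neighborhoods. In addition, an anisotropic gradient routing mechanism keeps watermark embedding strictly decoupled from photometric reconstruction fidelity. Extensive experiments show that our method hides watermarks robustly, resists a variety of attacks, and maintains high rendering quality and spatio-temporal consistency.

English abstract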

While 4D Gaussian Splatting (4DGS) has revolutionized high-fidelity dynamic reconstruction, safeguarding the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often neglect the underlying kinematic manifolds, triggering non-physical artifacts such as severe temporal flickering and "FVD collapse". To address this, we propose 4D-GSW, a kinematic-aware watermarking framework designed to embed robust copyright information while preserving high spatio-temporal consistency. Unlike prior 4D steganography that primarily focuses on opacity-guided invisibility, our approach explicitly addresses the physical coherence of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify "Dynamic Instants," adaptively gating watermark gradient injection to shield critical motion manifolds from non-physical perturbations. To ensure global coherence across complex deformations, we formulate a joint HMM-MRF energy minimization model that synchronizes watermark phases within both temporal trajectories and spatial neighborhoods. Furthermore, an anisotropic gradient routing mechanism ensures that watermark embedding remains strictly decoupled from photometric reconstruction fidelity. Extensive experiments have demonstrated the superior performance of our method in robustly hiding watermarks while resisting various attacks and maintaining high rendering quality and spatio-temporal consistency.


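2605.22328 2026-05-22 cs.CV Version update

3D LULC classification using multispectral LiDAR and deep learning: current and prospective schemes

Narges Takhtkeshha, Aldino Rizaldy, Markus Hollaus, Juha Hyyppä, Fabio Remondino, Gottfried Mandlburger

Affiliations: 3D Optical Metrology (3DOM) Unit, Bruno Kessler Foundation (FBK); Department of Geodesy and Geoinformation, TU Wien; Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology (HIF); Freie Universität Berlin, Remote Sensing and Geoinformatics; Department of Remote Sensing and Photogrammetry, Finnish Geospatial Research Institute FGI, The National Land Survey of Finland

AI summary: This paper presents a 3D land use/land cover classification approach based on multispectral LiDAR and deep learning. It introduces NMCA-aligned L1 and L2 classification schemes and a new multispectral LiDAR benchmark dataset, evaluates seven state-of-the-art deep learning models, and shows that spectral information improves classification performance.

AI-generated abstract (translated from Chinese)

Land use/land cover (LULC) classification is essential for national 3D mapping, geospatial analysis, and sustainable planning. Multispectral (MS) LiDAR provides synchronized spatial-spectral information, and deep learning (DL) enables semantic segmentation of 3D point clouds; however, adoption is limited by the lack of publicly available urban and suburban MS LiDAR datasets aligned with the classification schemes of National Mapping and Cadastral Agencies (NMCAs). This paper fills these gaps by introducing L1 and L2 NMCA-aligned LULC classification schemes and a new multispectral LiDAR dataset. We evaluate seven state-of-the-art deep learning models and run spectral ablation studies at both levels of detail. The results show that Point Transformer V3, using a dual-wavelength LiDAR system (532 nm and 1064 nm), achieves mIoU of 79.4% at L1 (8 classes) and 58.9% at L2 (20 classes). The ablations show that multispectral information improves performance over geometric information alone, by 1.1 percentage points at L1 and 7.8 percentage points at L2. These results highlight the value of LiDAR reflectance for fine-grained material recognition and support the evolution of NMCA LULC schemes toward greater semantic detail. The Loosdorf-MSL dataset provides a new benchmark for consistent national and international LULC mapping.

English abstract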

Land Use Land Cover (LULC) classification is essential for national 3D mapping, geospatial analysis, and sustainable planning. Multispectral (MS) LiDAR provides synchronized spatial-spectral information, and deep learning (DL) enables 3D point cloud semantic segmentation; however, adoption is limited by the lack of publicly available urban and suburban MS LiDAR datasets aligned with National Mapping and Cadastral Agencies (NMCAs) classification schemes. This study addresses these gaps by introducing L1 and L2 NMCA-aligned LULC classification schemes and a new benchmark MS LiDAR dataset. We evaluate seven state-of-the-art DL models and perform spectral ablation studies at both levels of detail. Results show that Point Transformer V3 achieves the best performance, with mIoU of 79.4% (L1, 8 classes) and 58.9% (L2, 20 classes) using a dual-wavelength LiDAR system (532 nm and 1064 nm). Ablation results show that multispectral information improves performance over geometry-only inputs, with gains of 1.1 percentage points at L1 and 7.8 points at L2. These results highlight the value of LiDAR reflectance for fine-grained material discrimination and support the evolution of NMCA LULC schemes toward higher semantic detail. The Loosdorf-MSL dataset contributes a new benchmark for consistent national and international LULC mapping.
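The mIoU figures quoted in the abstract use the standard mean intersection-over-union metric. The sketch below shows that definition for per-point integer labels; `mean_iou` is a hypothetical helper written for this example, not the authors' evaluation code, and skipping classes that appear in neither the labels nor the predictions is an assumed convention.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes for per-point labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:                 # ignore classes absent from both inputs
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four points.
labels = [0, 0, 1, 1]
predictions = [0, 1, 1, 1]
print(mean_iou(labels, predictions, num_classes=2))
```

A gain reported in percentage points, such as the 7.8 points at L2, is the plain difference between two such scores expressed in percent.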


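2605.22327 2026-05-22 cs.CV physics.med-ph Version update

Robustness of breast lesion segmentation under MRI undersampling improves with k-space-aware deep learning

Lukas T. Rotkopf, Marco Schlimbach, Julius C. Holzschuh, Heinz-Peter Schlemmer, Jens Kleesiek, Moritz Rempe

Affiliations: Institute for AI in Medicine (IKIM), University Hospital Essen; Department of Physics, Technical University Dortmund; Division of Radiology, German Cancer Research Center (DKFZ); Cancer Research Center Cologne Essen (CCCE), University Medicine Essen; RACOON Study Group, Site Essen; German Cancer Consortium (DKTK), Partner Site Essen; Medical Faculty and Faculty of Computer Science, University of Duisburg-Essen

AI summary: This paper studies whether learning breast lesion segmentation directly from acquired MRI k-space improves robustness under acceleration or noise. Comparing several models, it finds that k-space-aware deep learning performs better under undersampling and noise.

AI-generated abstract (translated from Chinese)

Purpose: To assess whether breast lesion segmentation can be learned directly from acquired MRI k-space, and whether doing so improves robustness when data are accelerated or noisy. Materials and Methods: This retrospective study used public breast dynamic contrast-enhanced MRI (DCE-MRI) datasets containing acquired and synthetic k-space, together with a within-dataset synthetic control. We compared four 3D U-Net variants: a hybrid k-space-to-image model, a native k-space model, and magnitude and complex image-space baselines. Models were evaluated under increasing undersampling and added complex Gaussian k-space noise. The primary outcome was the patient-level Dice similarity coefficient under cross-validation, with the hybrid model prespecified as the main comparison against the magnitude image-space baseline. Results: At full sampling, the hybrid and image-space models performed similarly. As acceleration increased, the hybrid model retained substantially more segmentation accuracy and significantly outperformed the magnitude image-space baseline at moderate to high undersampling levels. The same pattern appeared when noise was added directly to k-space: the hybrid model degraded more slowly, whereas the image-space baseline failed under heavier noise. This advantage was reproduced in the within-dataset synthetic control. Feature analysis indicated that the k-space and image-space stages played complementary roles, with frequency-domain filtering concentrated before image-domain lesion localization. Conclusion: K-space-aware deep learning improves the robustness of breast lesion segmentation under MRI undersampling and k-space noise, while matching image-space methods at full sampling.

English abstract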

Purpose: To assess whether breast lesion segmentation can be learned directly from acquired MRI k-space, and whether doing so improves robustness when data are accelerated or noisy. Materials and Methods: This retrospective study used public breast dynamic contrast-enhanced MRI (DCE-MRI) datasets with acquired and synthetic k-space, together with a within-dataset synthetic control. We compared four 3D U-Net variants: a hybrid k-space-to-image model, a native k-space model, and magnitude and complex image-space baselines. Models were evaluated under increasing undersampling and added complex Gaussian k-space noise. The primary outcome was patient-level Dice similarity coefficient under cross-validation, with the hybrid model prespecified as the main comparison against the magnitude image-space baseline. Results: At full sampling, the hybrid and image-space models performed similarly. As acceleration increased, the hybrid model retained substantially more segmentation accuracy and significantly outperformed the magnitude image-space baseline across moderate to high undersampling levels. The same pattern was observed when noise was added directly to k-space: the hybrid model degraded more slowly, whereas the image-space baseline failed under heavier noise. This advantage was reproduced in the within-dataset synthetic control. Feature analysis suggested that the k-space stage and image-space stage played complementary roles, with frequency-domain filtering concentrated before image-domain lesion localization. Conclusion: K-space-aware deep learning improves the robustness of breast lesion segmentation under MRI undersampling and k-space noise, while matching image-space methods at full sampling.
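The primary outcome above is a patient-level Dice similarity coefficient. A minimal sketch of that metric on flattened binary masks follows; the function names, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration and are not taken from the study.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0        # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / total


def patient_level_dice(cases):
    """Average Dice over (pred, target) pairs, one pair per patient."""
    scores = [dice_coefficient(pred, target) for pred, target in cases]
    return sum(scores) / len(scores)


# Two hypothetical patients with four-voxel masks.
cases = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),   # Dice 0.5
    ([0, 1, 1, 0], [0, 1, 1, 0]),   # Dice 1.0
]
print(patient_level_dice(cases))
```

Averaging per patient, rather than pooling all voxels, keeps large lesions from dominating the score.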


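2605.22311 2026-05-22 cs.CV Version update

PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models

Jose Edgar Hernandez Cancino Estrada, Mauro Díaz Lupone, Žiga Emeršič, Vitomir Štruc, Peter Peer, Darian Tomašević

Affiliations: University of Ljubljana, Faculty of Computer and Information Science; University of Ljubljana, Faculty of Electrical Engineering

AI summary: This paper studies identity unlearning in ID-conditioned diffusion models and proposes PIU, a proximity-guided identity unlearning framework. PIU removes an identity by reassigning the source identity to a selected anchor identity in the learned identity space, uses an anchor selection strategy based on the geometry of ArcFace representations, and achieves effective unlearning by locally fine-tuning a small number of identity-sensitive cross-attention layers.

AI-generated abstract (translated from Chinese)

Identity-conditioned diffusion models can generate high-quality, identity-consistent face images, but they also raise serious privacy concerns, because a model may keep synthesizing an individual even after that person should have been forgotten. Although machine unlearning has been widely studied for concept and data removal, identity unlearning remains little explored, especially for models conditioned directly on identity embeddings rather than text prompts. In this paper, we study Arc2Face, a state-of-the-art identity-conditioned latent diffusion model for face generation, and introduce Proximity-guided Identity Unlearning (PIU), an anchor-guided identity unlearning framework. Specifically, we formulate identity removal as an identity replacement objective that reassigns the source identity to a selected anchor identity in the learned identity space, complemented by a proximity-based anchor selection strategy inspired by the geometry of ArcFace representations. We further show that effective unlearning can be achieved by locally fine-tuning a small number of identity-sensitive cross-attention layers. Experiments on many target identities show that our framework effectively suppresses generation of the target identity while preserving realism and identity consistency for retained identities, as confirmed by improved unlearning and image-quality metrics and by qualitative evaluation. The source code of the PIU framework is publicly available at https://github.com/edgarcancinoe/piu_unlearning .

English abstract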

Identity-conditioned diffusion models enable high-quality and identity-consistent face generation, but they also raise severe privacy concerns, as models may continue to synthesize individuals despite their right to be forgotten. While machine unlearning has been extensively studied for concept and data removal, identity unlearning remains largely unexplored, particularly in models conditioned directly on identity embeddings rather than text prompts. In this work, we study identity unlearning in Arc2Face, a state-of-the-art identity-conditioned latent diffusion model for face generation, and introduce Proximity-guided Identity Unlearning (PIU), an anchor-guided framework for identity unlearning. Specifically, we formulate identity removal as an identity replacement objective that reassigns the source identity to a selected anchor identity in the learned identity space, and we complement it with a proximity-based anchor selection strategy motivated by the geometry of ArcFace representations. We further show that effective unlearning can be achieved through localized fine-tuning of a small subset of identity-sensitive cross-attention layers. Experiments across many target identities show that our framework effectively suppresses generation of the target identity while preserving realism and identity consistency for retained identities, as validated by improved performance on unlearning and image-quality metrics, together with qualitative evaluation. The source code for the PIU framework is publicly available at https://github.com/edgarcancinoe/piu_unlearning .


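2605.22290 2026-05-22 cs.CV Version update

Detection of Virus and Small Cell Patches in Foci Images Using Switchable Convolution and Feature Pyramid Networks

Amrita Singh, Snehasis Mukherjee

AI summary: This paper proposes an improved YOLOv2 detector that combines a Feature Pyramid Network with a switchable atrous convolution mechanism to improve detection of virus patches and small cell patches in biomedical foci images. Experiments show markedly improved mAP at different IoU thresholds.

AI-generated abstract (translated from Chinese)

Accurately detecting and counting virus patches in focus-forming unit (FFU) images is essential for quantifying viral infection and analyzing cellular structures. The task is challenging because biomedical targets often differ markedly in size, density, contrast, and shape. This paper proposes an enhanced YOLOv2 detector that integrates a Feature Pyramid Network (FPN) to improve multi-scale feature representation. We also introduce a switchable atrous convolution mechanism to adapt the receptive field to fine-grained targets in dense microscopy images. The proposed method is evaluated on biomedical foci image datasets for detecting virus patches and small cell patches. For small cell patch detection, the model reaches a mean average precision (mAP) of 40.5% at a 25% Intersection over Union (IoU) threshold. For FFU virus patch detection, the model reaches 68% mAP. These results indicate that combining FPN feature fusion with switchable convolution improves the suitability of YOLOv2 for specialized biomedical object detection tasks.

English abstract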

Accurate detection and counting of virus patches in focus-forming unit (FFU) images, also known as foci images, are important for quantifying viral infection and analyzing cellular structures. This task is challenging because biomedical targets often vary substantially in size, density, contrast, and shape. In this paper, we propose an enhanced YOLOv2-based detector that integrates a Feature Pyramid Network (FPN) to improve multi-scale feature representation. We also incorporate a switchable atrous convolution mechanism to adapt the receptive field for fine-grained targets in dense microscopy images. The proposed method is evaluated on biomedical foci image datasets for virus patch and small cell patch detection. For small cell patch detection, the model achieves a mean average precision (mAP) of 40.5% at a 25% Intersection over Union (IoU) threshold. For FFU virus patch detection, the model achieves an mAP of 68%. These results indicate that combining FPN-based feature fusion with switchable convolution improves the suitability of YOLOv2 for specialized biomedical object detection tasks.
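The 25% IoU threshold mentioned in the abstract decides when a predicted box counts as a match for a ground-truth box. The sketch below shows that check for axis-aligned boxes given as `(x1, y1, x2, y2)`; the box format and the helper names `box_iou` and `is_match` are assumptions for this example, not details from the paper.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.25):
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold


print(is_match((0, 0, 2, 2), (0, 1, 2, 3)))   # IoU = 1/3, above 0.25
print(is_match((0, 0, 2, 2), (1, 1, 3, 3)))   # IoU = 1/7, below 0.25
```

A looser threshold such as 0.25 tolerates small localization errors, which matter more for tiny targets where a shift of a few pixels changes IoU sharply.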


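D3Seg: Dependency-Aware Diffusion Model for Brain Tumor Segmentation with Missing Modalities

Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan

Affiliations: The University of Western Australia; The University of Melbourne

AI summary: This paper proposes D3Seg, which addresses brain tumor segmentation with missing MRI modalities through multi-hop modality graph fusion, a lightweight diffusion-based imputation mechanism, and probability-space decision refinement, improving segmentation performance while remaining computationally efficient.

AI-generated abstract (translated from Chinese)

Accurate brain tumor segmentation from multiparametric MRI is essential for effective treatment planning. In clinical settings, however, acquiring all MRI sequences is not always possible. Missing MRI modalities cause substantial performance degradation in existing segmentation methods, which usually rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose D3Seg, a new segmentation model designed to maintain stable performance in missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher-order cross-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and decision refinement in probability space to reduce overconfidence in dominant classes and improve delineation of underrepresented tumor subregions. Extensive evaluation on the BraTS 2023 dataset shows that D3Seg consistently improves segmentation performance under missing-modality configurations. Compared with the current state-of-the-art model, it improves Dice by about 1.5-2.0% on enhancing tumor (ET) and about 1.0% on tumor core (TC) across multiple missing-modality configurations, while remaining computationally efficient.

English abstract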

Accurate brain tumor segmentation using multiparametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model D3Seg which is designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and probability-space decision refinement to mitigate dominant class overconfidence and improve delineation of underrepresented tumor subregions. Extensive evaluation on BraTS 2023 dataset demonstrates that our D3Seg model consistently improves segmentation performance under missing modality configurations. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing modality configurations compared to the current state-of-the-art model, while maintaining computational efficiency.


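2605.22231 2026-05-22 cs.CV Version update

Ultra-High-Definition Image Quality Assessment via Graph Representation Learning

Shaode Yu, Enqi Chen, Ming Huang, Xuemin Ren, Songnan Zhao, Zhicheng Zhang, Qiurui Sun

Affiliations: School of Information and Communication Engineering, Communication University of China, Beijing 100024, China; College of Engineering, Northeastern University, Silicon Valley, San Jose, CA 95113, USA; JancsiLab, JancsiTech, Hong Kong 999077, China; Center of Information & Network Technology, Beijing Normal University, Beijing 100875, China

AI summary: This paper proposes UHD-GCN-BIQA, a graph representation learning framework that improves blind quality assessment of ultra-high-definition images by explicitly modeling structural dependencies among sampled image regions, enabling efficient and accurate image quality prediction.

AI-generated abstract (translated from Chinese)

Blind image quality assessment (BIQA) remains challenging for ultra-high-definition (UHD) images: native-resolution inference is computationally expensive, while aggressive resizing or isolated cropping can suppress scale-sensitive distortions and weaken the link between local artifacts and global scene context. This paper aims to improve UHD-BIQA by explicitly modeling the structural dependencies among sampled image regions instead of treating them as independent views. The proposed graph representation learning framework, UHD-GCN-BIQA, samples aspect-ratio-aligned patches from each UHD image, encodes them as graph nodes, and builds a hybrid k-nearest-neighbor graph from spatial proximity and feature similarity. Residual graph convolution propagates contextual information between regions, and gated attention pooling aggregates patch-level evidence into an image-level quality prediction. A multi-objective loss with exponential-moving-average normalization stabilizes the joint optimization of regression, correlation, and ranking objectives. On the UHD-IQA benchmark, UHD-GCN-BIQA achieves PLCC = 0.7784, SRCC = 0.8019, and RMSE = 0.0519, giving competitive correlation performance and the lowest RMSE among the compared methods. These results indicate that graph-based modeling of region relations is effective for UHD image quality assessment, especially for improving absolute quality score estimation on high-resolution visual content.

English abstract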

Blind image quality assessment (BIQA) for ultra-high-definition (UHD) images remains challenging because native-resolution inference is computationally expensive, whereas aggressive resizing or isolated cropping may suppress scale-sensitive distortions and weaken the relationship between local artifacts and global scene context. This paper aims to improve UHD-BIQA by explicitly modeling the structural dependencies among sampled image regions rather than treating them as independent views, and proposes a graph representation learning framework, UHD-GCN-BIQA. The framework samples aspect-ratio-aligned patches from each UHD image, encodes them as graph nodes, and constructs a hybrid k-nearest-neighbor graph using spatial proximity and feature similarity. Residual graph convolution is used to propagate contextual information across regions, and gated attention pooling aggregates patch-level evidence into an image-level quality prediction. An exponential-moving-average-normalized multi-objective loss function is adopted to stabilize the joint optimization of regression, correlation, and ranking objectives. Experiments on the UHD-IQA benchmark show that UHD-GCN-BIQA achieves PLCC = 0.7784, SRCC = 0.8019, and RMSE = 0.0519, obtaining competitive correlation performance and the lowest RMSE among the compared methods. These results indicate that graph-based region relation modeling is effective for UHD image quality assessment, particularly for improving absolute quality score estimation under high-resolution visual content.
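One plausible reading of the hybrid k-nearest-neighbor graph described above is the union of each patch's k spatially closest patches and its k most feature-similar patches. The sketch below implements that reading; the union rule, the cosine similarity measure, and the name `hybrid_knn_edges` are assumptions for illustration and may differ from the construction used in UHD-GCN-BIQA.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_knn_edges(positions, features, k):
    """Undirected edges linking each node to its k spatial and k feature neighbors."""
    n = len(positions)
    edges = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # k closest patch centers in image coordinates
        spatial = sorted(others, key=lambda j: math.dist(positions[i], positions[j]))[:k]
        # k most similar patch embeddings
        similar = sorted(others, key=lambda j: -cosine(features[i], features[j]))[:k]
        for j in spatial + similar:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Four hypothetical patches: two on the left, two far to the right.
positions = [(0, 0), (1, 0), (10, 0), (11, 0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 1.0]]
print(hybrid_knn_edges(positions, features, k=1))
```

In this toy case the feature edges connect distant patches that look alike, which is how such a graph can relate local artifacts to global context.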


2605.22190 2026-05-22 cs.CV Version update

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

Matteo Balice, Yanik Kunzi, Chenyangguang Zhang, Matteo Matteucci, Marc Pollefeys, Sungwhan Hong

Affiliations: Politecnico di Milano; ETH Zürich; ETH AI Center

AI summary: This paper proposes NoPo4D, the first pose-free feed-forward system that handles dynamic content, multi-view input, and unknown camera poses together. Velocity decomposition and bidirectional motion encoding improve performance, and the system outperforms existing methods.

Comments https://bralani.github.io/nopo4d_html/

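AI-generated abstract (translated from Chinese)

Recent feed-forward 3D Gaussian splatting methods have made notable progress on individual aspects of 3D scene reconstruction, but no existing method handles dynamic content, multi-view input, and unknown camera poses together in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods handle only static scenes; and per-scene optimization methods close these gaps at a cost of minutes to hours per scene. We introduce NoPo4D, the first feed-forward system to cover this combination. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, it introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane displacement and depth change, so the 2D component can be supervised directly by pseudo-ground-truth optical flow. This avoids both the differentiable rendering that ties earlier posed methods to pose accuracy and the 3D motion ground truth that earlier pose-free methods need. The system also uses a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity to mitigate Gaussian misalignment across views and time steps. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms existing feed-forward baselines and, with an optional post-optimization stage, surpasses per-scene optimization methods while running orders of magnitude faster.

English abstract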

Recent feed-forward 3D Gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps but at minutes-to-hours cost per scene. We introduce NoPo4D, the first feed-forward system that addresses this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage surpasses per-scene optimization methods, while running orders of magnitude faster.
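To make the idea of splitting motion into an image-plane shift and a depth change concrete, the sketch below recombines the two parts into a 3D displacement under a pinhole camera. This is an assumed, simplified setting (single static camera, camera-frame coordinates, known intrinsics `fx`, `fy`, `cx`, `cy`); the function names are hypothetical and the abstract does not specify NoPo4D's actual parameterization.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth to camera-frame 3D coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def motion_3d(u, v, depth, flow, depth_change, fx, fy, cx, cy):
    """3D displacement implied by a 2D pixel shift plus a change in depth."""
    du, dv = flow
    start = unproject(u, v, depth, fx, fy, cx, cy)
    end = unproject(u + du, v + dv, depth + depth_change, fx, fy, cx, cy)
    return tuple(e - s for s, e in zip(start, end))


# Hypothetical intrinsics and a point at the principal point, 2 units away.
fx = fy = 100.0
cx = cy = 50.0
sideways = motion_3d(50, 50, 2.0, (10, 0), 0.0, fx, fy, cx, cy)
forward = motion_3d(50, 50, 2.0, (0, 0), 1.0, fx, fy, cx, cy)
print(sideways, forward)
```

The 2D part `flow` is the quantity an optical-flow estimator can supervise directly, while `depth_change` carries the motion component that flow alone cannot observe.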


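2605.22186 2026-05-22 cs.CV Version update

Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

Senyan Xu, Zhijing Sun, Kean Liu, Xin Lu, Ruixuan Jiang, Mingyang Huang, Xueyang Fu, Zheng-Jun Zha

Affiliations: University of Science and Technology of China

AI summary: This paper proposes the EIC-LIE framework, which uses an event-illumination collaborative module and an illumination-aware event filter to address the neglect of global illumination and the sensitivity to real-world event noise in low-light image enhancement. It also builds the first high-resolution real-world event dataset, and experiments show it outperforms existing methods on multiple datasets.

AI-generated abstract (translated from Chinese)

Event-based low-light image enhancement (LIE) methods mainly focus on integrating high dynamic range (HDR) information, while overlooking global illumination in images and the inherent noise sensitivity of event signals in real-world scenes. To address these problems, we propose EIC-LIE, an event-illumination collaborative LIE framework. Specifically, we first design an Event-Illumination Collaborative Interaction (EICI) module with two key processes: forward gathering, which collects HDR features under different lighting conditions, and backward injection, which provides complementary content for the illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise according to brightness statistics derived from the image. In addition, we build a beam-splitter-based hybrid imaging system to collect high-quality, temporally synchronized event-image pairs from dynamic scenes, providing the first high-resolution real-world event-based LIE dataset. Extensive experiments show that EIC-LIE outperforms existing methods on five real-world and synthetic datasets, with gains of up to 1.24 dB in PSNR and 0.069 in SSIM. The code and dataset are released at https://github.com/QUEAHREN/EIC-LIE.

English abstract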

Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24dB in PSNR and 0.069 in SSIM. The code and dataset are released at https://github.com/QUEAHREN/EIC-LIE.


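2605.22185 2026-05-22 cs.CV cs.LG Version update

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari

Affiliations: Verizon Connect

AI summary: This study improves the perception and reasoning of multimodal large language models in safety-critical driving scenarios by fusing downsampled video frames with synchronized high-frequency telematics data and semantic information from specialized computer vision models, so that safety-critical events in real-world driving are identified and described more accurately.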

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

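AI-generated abstract (translated from Chinese)

In recent years, multimodal large language models (MLLMs) have shown impressive performance in general visual understanding. However, their use in safety-critical driving scenarios is limited by their inability to accurately perceive and reason about rare, high-risk dynamic events such as collisions or near-collisions. To address this, we propose a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic information from specialized computer vision models. The pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, designed to train MLLMs to identify and describe safety-critical events (SCEs) in real-world driving. We demonstrate the approach by fine-tuning the open-source QwenVL-2.5 model with DoRA adapters: experiments show significantly improved ability to identify and explain safety-critical events with fewer than 50M trainable parameters and a limited compute budget.

English abstract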

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.


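2605.22169 2026-05-22 cs.CV Version update

Balancing Uncertainty and Diversity of Samples: Leveraging Diversity of Least, High Confidence Samples for Effective Active Learning

Vipul Arya, S. H. Shabbeer Basha, Srikrishna U N, Sunainha Vijay, Snehasis Mukherjee

Affiliations: School of Computer Science and Engineering, RV University; School of Engineering & Technology, Vidyashilp University; Shiv Nadar Institution of Eminence

AI summary: This paper proposes new hybrid sampling methods that select both easy and hard samples while enforcing diversity, to make active learning more effective. Experiments show that the proposed Least Confident and Diverse (LCD) method outperforms existing methods; selecting uncertain and diverse instances helps the model learn more distinct features.

AI-generated abstract (translated from Chinese)

Deep learning models, including convolutional neural networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on computer vision tasks such as object classification, detection, segmentation, and generation. However, these models are data-hungry, because they need large amounts of training data to learn millions or billions of parameters. For supervised learning in particular, collecting large numbers of labeled samples for training is expensive and time-consuming. Active learning (AL) has been used to address this problem for many years. Existing active learning methods select samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Selecting samples this way may hinder model performance, because the pooling relies on a single dimension: either diversity or uncertainty. In this paper, we propose four novel hybrid sampling methods that pool both easy and hard samples that are also diverse. To verify their effectiveness, we run extensive experiments using high- and low-confidence samples separately. The experiments show that the proposed hybrid sampling method Least Confident and Diverse (LCD) consistently outperforms state-of-the-art methods. Selecting uncertain and diverse instances is observed to help the model learn more distinct features. The code for this study will be available at https://github.com/XXX/LCD.

English abstract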

Deep learning models, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on various computer vision tasks such as object classification, detection, segmentation, generation, and many more. However, these models are data-hungry, as they require large amounts of training data to learn millions or billions of parameters. Especially for supervised learning tasks, curating a large number of labeled samples for model training is expensive and time-consuming. Active Learning (AL) has been used to address this problem for many years. Existing active learning methods aim to choose samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Choosing such samples may hinder the model's performance because the pooling relies on a single dimension, i.e., either diversity or uncertainty. In this paper, we propose four novel hybrid sampling methods for pooling both easy and hard samples that are also diverse. To verify the efficacy of the proposed methods, extensive experiments are conducted using high- and low-confidence samples separately. We observe from our experiments that the proposed hybrid sampling method, Least Confident and Diverse (LCD), consistently performs better than state-of-the-art methods. It is observed that selecting uncertain and diverse instances helps the model learn more distinct features. The code related to this study will be available at https://github.com/XXX/LCD.
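A simple way to combine uncertainty and diversity, in the spirit of the abstract, is to shortlist the least confident samples and then pick a spread-out subset of them. The sketch below does this with greedy farthest-point selection; it is an illustrative stand-in, not the authors' LCD algorithm, and `least_confident_and_diverse`, `pool_size`, and `budget` are names assumed for this example.

```python
import math


def least_confident_and_diverse(probs, features, pool_size, budget):
    """Shortlist low-confidence samples, then greedily pick mutually distant ones."""
    # Confidence of a sample = its highest class probability; lowest first.
    order = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    pool = order[:pool_size]
    if not pool or budget <= 0:
        return []
    selected = [pool[0]]                      # start from the least confident sample
    while len(selected) < min(budget, len(pool)):
        remaining = [i for i in pool if i not in selected]
        # Farthest-point step: maximize distance to the closest already-chosen sample.
        best = max(remaining,
                   key=lambda i: min(math.dist(features[i], features[s])
                                     for s in selected))
        selected.append(best)
    return selected


# Hypothetical softmax outputs and 2-D features for four unlabeled samples.
probs = [[0.5, 0.5], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 9.0]]
print(least_confident_and_diverse(probs, features, pool_size=3, budget=2))
```

Here sample 1 is nearly as uncertain as sample 0 but almost identical to it in feature space, so the second pick goes to sample 2 instead.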


2605.22158 2026-05-22 cs.AI cs.CV Version update

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding

Affiliations: Tsinghua University; Shenzhen University; Xidian University

AI summary: This paper proposes the ST-SimDiff framework, which improves video understanding efficiency by balancing spatio-temporal similarity and difference, using a spatio-temporal graph and a dual-selection strategy to cut computational cost and improve performance.

Comments Accepted by ICLR 2026

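AI-generated abstract (translated from Chinese)

Multimodal large language models (MLLMs) incur significant computational overhead on long videos because of the large number of visual tokens they must process. To improve efficiency, existing methods mainly reduce redundancy by pruning or merging tokens, but they overlook a key dimension of video content, namely changes and turning points, and lack a joint model of spatio-temporal relationships. We therefore propose a new perspective: similarity identifies redundancy, while difference captures key events. On this basis, we design a training-free framework called ST-SimDiff. We first build a spatio-temporal graph over the visual tokens to model their complex associations in a unified way. We then apply a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens and compress static information; 2) temporal-difference-based selection pinpoints content change points to retain tokens that capture key dynamic changes. This preserves both static and dynamic content with a minimal number of tokens. Extensive experiments show that our method significantly outperforms existing state-of-the-art methods while greatly reducing computational cost. Our code is available at https://github.com/bingjunluo/ST-SimDiff.

English abstract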

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we designed a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows it to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://github.com/bingjunluo/ST-SimDiff.
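The dual-selection idea can be sketched in a few lines: keep one representative per group of similar tokens, and separately keep any token that changed sharply from the previous frame. The code below is a simplified illustration only. It replaces the community detection named in the abstract with greedy threshold grouping, and the function `select_tokens` and both thresholds are assumptions, not values or code from ST-SimDiff.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(frames, sim_threshold=0.9, diff_threshold=0.5):
    """frames[t][p] is the token vector at frame t, position p.

    Returns sorted (frame, position) pairs kept by either selection rule.
    """
    # Similarity rule: a token is kept only if no earlier representative resembles it.
    representatives = []
    for t, frame in enumerate(frames):
        for p, token in enumerate(frame):
            if not any(cosine(token, frames[rt][rp]) >= sim_threshold
                       for rt, rp in representatives):
                representatives.append((t, p))
    kept = set(representatives)
    # Difference rule: keep tokens that differ sharply from the same position one frame back.
    for t in range(1, len(frames)):
        for p, token in enumerate(frames[t]):
            if cosine(token, frames[t - 1][p]) < diff_threshold:
                kept.add((t, p))
    return sorted(kept)


# Three hypothetical frames of two tokens; position 0 changes in the last frame.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
print(select_tokens(frames))
```

The token at frame 2, position 0 resembles an existing representative, so similarity alone would drop it; the difference rule keeps it because it marks a change point.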


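2605.22147 2026-05-22 cs.CV Version update

Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution

Jiangwei Mo, Xi Lu, Hanlin Wu

Affiliations: School of Information Science and Technology, Beijing Foreign Studies University

AI summary: This paper proposes the FlowGS framework, which achieves arbitrary-scale super-resolution of remote sensing images through flow matching and Gaussian splatting, improving generation efficiency and quality.

AI-generated abstract (translated from Chinese)

High-resolution remote sensing images (RSIs) are essential for Earth observation applications, but acquiring them is often limited by sensor constraints and cost. In recent years, generative super-resolution (SR) methods, especially diffusion models, have made notable progress. However, they usually require slow iterative inference with 40-1000 steps and offer limited flexibility in continuous-scale SR settings. To address these problems, we propose FlowGS, a generative reconstruction framework for arbitrary-scale RSI super-resolution. FlowGS models the high-frequency detail representation between high- and low-resolution images and, through flow matching (FM) constrained by shortcut consistency, learns a continuous probability flow from noise to detail priors, reducing generative complexity and improving inference efficiency. In addition, we use 2D Gaussian splatting to build a continuous feature field, enabling flexible reconstruction at arbitrary query locations. Experiments show that FlowGS delivers perceptual quality competitive with existing methods in both continuous-scale and fixed-scale SR settings, with substantially better inference efficiency.

English abstract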

High-resolution remote sensing images (RSIs) are crucial for Earth observation applications, yet acquiring them is often limited by sensor constraints and costs. In recent years, generative super-resolution (SR) methods, particularly diffusion models, have made significant progress. However, they typically require slow iterative inference with 40-1000 steps and exhibit limited flexibility in continuous-scale SR settings. To address these issues, we propose FlowGS, a generative reconstruction framework for arbitrary-scale SR of RSIs. FlowGS models the high-frequency detail representations between high- and low-resolution images and learns a continuous probability flow from noise to detail priors via flow matching (FM) constrained by shortcut consistency, thereby reducing generative complexity and improving inference efficiency. Additionally, we employ 2D Gaussian splatting to construct a continuous feature field, thereby enabling flexible reconstruction at arbitrary query locations. Experimental results show that FlowGS delivers competitive perceptual quality compared with existing methods in both continuous-scale and fixed-scale SR settings, with substantially improved inference efficiency.

2605.22144 2026-05-22 cs.CV 版本更新

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

一句话，一出戏剧：通过多智能体系统实现个性化短剧生成

Yufei Shi, Weilong Yan, Naixuan Huang, Yucheng Chen, Chenyu Zhang, Tao He, Si Yong Yeo, Ming Li

发表机构 * MedVisAI Lab, Lee Kong Chian School of Medicine, Nanyang Technological University（MedVisAI实验室，李光前医学院，南洋理工大学）； National University of Singapore（新加坡国立大学）； Beijing Institute of Technology（北京理工大学）； Tsinghua University（清华大学）； University of Electronic Science and Technology of China（电子科技大学）； Guangming Laboratory（光明实验室）

AI总结本文提出了一种多智能体框架，通过结构化中间模块和迭代优化，将用户的单句想法转化为完整短剧，解决了短剧生成中的叙事节奏、空间一致性及生产质量控制问题。

详情

AI中文摘要

现有的数字短剧制作方法通常依赖单次生成的LLM脚本和松散耦合的流程，无法满足短剧生成的三个关键要求：(1) 叙事节奏，导致钩子弱、情节不足和不吸引人的结局；(2) 空间一致性，导致场景布局漂移和跨片段角色位置不一致；(3) 生产级质量控制，需要在脚本和视觉阶段进行大量手动审查和修正。我们提出了One Sentence, One Drama，一种分层多智能体框架，通过结构化中间模块和迭代优化，将用户的单句想法转化为完整短剧。我们的方法基于三个关键组件：(1) 基于多智能体辩论的故事生成模块，强制短剧节奏和叙事连贯性；(2) 3D基础的第一帧生成机制，建立共享的空间参考，确保跨片段的一致性角色定位和场景布局；(3) 多阶段评审循环，在脚本、视觉和视频生成阶段进行全面的错误检测和有针对性的修订。我们还引入了场景级BGM匹配和场景转换规划，以提高观众的沉浸体验。为了系统评估该任务，我们引入了Short-Drama-Bench，一个扩展标准视频质量指标的基准，包含短剧特定的评估标准。实验结果表明，我们的方法在叙事质量、跨片段一致性以及整体观看体验上显著优于现有流程。

英文摘要

Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.

2605.22139 2026-05-22 cs.CV 版本更新

感知还是偏见：大语言模型能否超越个性的第一印象？

Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang

发表机构 * The University of Tokyo（东京大学）； Shanda AI Research Tokyo（Shanda AI 研究所东京）； Dalian University of Technology（大连理工大学）

AI总结本文探讨了多模态大语言模型（MLLMs）在感知个性方面的能力，提出了一种新的任务Grounded Personality Reasoning（GPR），并构建了一个新的数据集MM-OCEAN，通过三重评估体系揭示了MLLMs在人格推理中的偏见问题。

详情

AI中文摘要

多模态大语言模型（MLLMs）正在越来越多地应用于需要感知个性的人类交互角色中，但现有的基准测试仅评估其对大五人格特质分数的预测能力，未能确定模型是通过行为理解真正感知个性，还是仅通过表面模式匹配进行偏见判断。我们通过三个贡献填补了这一空白：（i）一个新的任务：我们正式定义了Grounded Personality Reasoning（GPR），要求MLLMs通过一系列评分、推理和锚定过程，将每个大五评分与可观察的证据联系起来；（ii）一个新的数据集：我们发布了MM-OCEAN（1,104个视频，5,320个多项选择题），由多代理流程生成，包含时间戳行为观察、证据支持的特质分析以及七类线索锚定多项选择题；（iii）基准测试和分析：我们设计了一个三级评估体系（评分、推理、锚定）以及四个样本级失败模式指标：偏见率（PR）、编造率（CR）、整合失败率（IR）和整体锚定率（HR），并基准测试了27个MLLMs（13个封闭式，14个开放式）。分析揭示了一个显著的偏见差距：在所有正确评分中，51%的评分没有基于检索到的线索进行锚定，而整体锚定率仅在0-33.5%之间。这些发现揭示了获得正确分数与为正确原因推理之间的脱节，为MLLMs中的扎根社会认知绘制了路线图。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

2605.22104 2026-05-22 cs.CV 版本更新

OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

OPERA: 一种用于图像修复的智能体，通过端到端联合规划-执行优化

Feng Zhu, Shuyang Xie, Yihan Zeng, Ming Liu, Wangmeng Zuo

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Huawei Noah’s Ark Lab（华为诺亚实验室）

AI总结该研究提出OPERA框架，通过端到端联合优化修复规划和工具执行，解决图像修复中复杂混合退化问题，优于现有方法和统一模型。

详情

AI中文摘要

现实中的图像修复因复杂的、相互作用的混合退化而具有挑战性。最近的基于智能体的方法通过组合多个任务特定的修复工具来解决这个问题。然而，实证分析表明，其性能根本上受到隐式约束的规划空间和独立预训练工具之间缺乏协调的限制。为了解决这些问题，我们提出了OPERA（优化规划-执行修复智能体），一种以端到端方式联合优化修复规划和工具执行的框架。在规划方面，OPERA使用强化学习在组合式规划空间上直接优化工具组合，并以最终修复质量作为奖励。在执行方面，OPERA引入了智能体引导的修复工具协同训练，使它们能够在顺序组合下学习合作行为。在多退化基准和真实世界数据集上的大量实验表明，OPERA在多样且复杂的退化场景中始终优于一体化（all-in-one）修复模型和现有基于智能体的方法。

英文摘要

Real-world image restoration is challenging due to complex and interacting mixed degradations. Recent agent-based approaches address this problem by composing multiple task-specific restoration tools. However, empirical analysis reveals that their performance is fundamentally limited by implicitly constrained planning spaces and the lack of coordination among independently pretrained tools. To address these issues, we propose OPERA (Optimized Planning-Execution Restoration Agent), a framework that jointly optimizes restoration planning and tool execution in an end-to-end manner. On the planning side, OPERA uses reinforcement learning to directly optimize tool composition over a combinatorial plan space, with the final restoration quality as the reward. On the execution side, OPERA introduces agent-guided co-training of restoration tools, enabling them to learn cooperative behaviors under sequential composition. Extensive experiments on multi-degradation benchmarks and real-world datasets demonstrate that OPERA consistently outperforms both all-in-one restoration models and existing agent-based methods across diverse and complex degradation scenarios.

2605.22098 2026-05-22 cs.CV cs.AI cs.LG 版本更新

TextTeacher: What Can Language Teach About Images?

TextTeacher: 语言能教会我们关于图像什么？

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

发表机构 * RPTU University Kaiserslautern-Landau（莱茵兰-普法尔茨工业大学凯撒斯劳滕-兰道）； German Research Center for Artificial Intelligence (DFKI)（德国人工智能研究中心（DFKI））

AI总结该研究提出TextTeacher方法，通过将语言模型的语义知识注入到图像分类训练中，提升视觉模型的性能，同时保持推理时的模型简洁性。

Comments Published at TMLR

详情

Journal ref: Transactions on Machine Learning Research, ISSN 2835-8856, 2026

AI中文摘要

ParaVT: 平衡工具先验悖论以实现代理视频强化学习中的并行工具使用

Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing

发表机构 * MiroMind ； NTU（南洋理工大学）； HKU（香港大学）； HKUST(GZ)（香港科技大学（广州））； THU（清华大学）； LMMs-Lab（LMMs实验室）

AI总结本文提出ParaVT，一种用于并行视频工具调用的端到端强化学习框架，通过引入PARA-GRPO机制解决工具先验悖论，提升了长视频理解的性能。

Comments Project Page: https://evolvinglmms-lab.github.io/ParaVT/

详情

AI中文摘要

通过强化学习（RL）训练大型多模态模型（LMMs）以原生调用视频处理工具（如裁剪）已成为实现长视频理解的有前景途径。然而，现有原生RL方法按顺序调度工具调用（即每回合一个）：单个错误的裁剪会传播错误而无法得到同伴纠正，多回合工具调用会破坏上下文，且推理成本与回合数成线性关系。我们引入ParaVT，首个多智能体端到端RL训练框架用于并行视频工具调用，通过单个回合内调度多个时间窗口裁剪以获得更干净的上下文和更好的容错能力。然而，将标准RL应用于ParaVT揭示了一个我们称之为工具先验悖论的障碍：预训练的工具先验能够促进工具探索，但也破坏了冷启动的结构格式并暴露了在温度采样下的跳过工具奖励捷径。一个较弱先验LMM的跨模型对比支持这一观点：格式保持稳定但RL触发零工具调用，表明先验强度是格式崩溃和工具探索的共同驱动因素。我们提出PARA-GRPO（Parseability-Anchored和Ratio-gAted GRPO），它通过两种互补机制增强标准RL：（i）仅在最易崩溃的结构标记位置应用目标格式奖励；（ii）每提示帧预算随机化，创建训练提示，其中调用工具会提供可测量的奖励信号，而跳过工具则不会。在六个长视频理解基准测试中，ParaVT在平均上比Qwen3-VL基线提升了7.9%，而PARA-GRPO将训练时间格式合规性从0.13提升到0.64。随着工具能力在现代LMMs中日益内部化，RL必须与由此产生的先验合作，ParaVT提供了一种通用的代理RL配方。代码、数据和模型权重已公开可用。

英文摘要

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

2605.19578 2026-05-22 cs.CV cs.AI 版本更新

Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

Lens Privacy Sealing: 一种新的基准和方法用于物理隐私保护的动作识别

Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School（北京大学深圳研究生院通用人工智能国家重点实验室）； Department of Computer Science and Engineering, State University of New York at Buffalo（纽约州立大学布法罗分校计算机科学与工程系）

AI总结本文提出了一种名为Lens Privacy Sealing (LPS)的硬件解决方案，通过可调节的贴膜物理遮挡摄像头镜头，实现低成本的预传感器隐私保护，并引入P$^3$AR数据集用于隐私保护的动作识别，同时提出MSPNet框架以应对LPS带来的视频退化问题，实验表明MSPNet在动作识别准确率和隐私保护方面具有优势。

Comments Accepted by IEEE Transactions on Image Processing (TIP), 2026

详情

AI中文摘要

基于RGB摄像头的监控系统能够为公共安全和医疗保健提供人类动作识别，但引发了严重的隐私问题。现有方法依赖于事后捕获算法，这些算法在数据采集过程中无法保护隐私。我们提出Lens Privacy Sealing (LPS)，一种简单的硬件解决方案，通过可调节的贴膜物理遮挡摄像头镜头，以最低的成本提供预传感器隐私保护。与软件方法或昂贵的工程光学不同，LPS通过随机多层散射实现强隐私保护，这种散射是物理不可逆的。我们引入了P$^3$AR数据集用于隐私保护的动作识别，该数据集包含大规模回放捕获（P$^3$AR-NTU，114K视频）和现实世界收集（P$^3$AR-PKU）的子集，并带有隐私属性注释。为处理LPS带来的视频退化，我们提出MSPNet，一种单阶段框架，结合了帧间噪声抑制器（IFNS）和跨帧语义聚合器（CFSA），并借助对比语言-图像预训练进行增强的语义提取。大量实验表明，与基线方法相比，MSPNet结合IFNS和CFSA几乎将动作识别准确率提高了一倍，同时抑制身份识别到低水平。全面验证显示，LPS在隐私-效用权衡方面优于现有最先进的硬件方法，能够抵御包括PSF反向计算和数据驱动恢复在内的重建攻击，并在不同光学配置和挑战性环境中具有良好的泛化能力。代码可在https://github.com/wangzy01/MSPNet上获得。

英文摘要

RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.

2605.19354 2026-05-22 eess.IV cs.CV 版本更新

HumanSplatHMR: 闭合人体网格恢复与高斯点绘肖像之间的循环

Yeheng Zong, Pou-Chun Kung, Yike Pan, Seth Isaacson, Yizhou Chen, Ram Vasudevan, Katherine A. Skinner

发表机构 * University of Michigan（密歇根大学）

AI总结本文提出HumanSplatHMR方法，通过闭合几何姿态估计与可微渲染之间的循环，改进人体姿态恢复和高斯点绘肖像的生成，提升在新视角和新姿态下的渲染质量。

Comments Project page: https://scottyehengz.github.io/HumanSplat/

详情

AI中文摘要

从视频中准确恢复人体姿态和外观是场景重建的关键组成部分，应用于动作捕捉、动作预测、虚拟现实和数字孪生等领域。尽管对从视频中构建逼真人类肖像已有大量研究，本文证明现有方法无法准确恢复人类的3D几何结构。基于ViT的方法不一致可靠且可能过度拟合2D视角，而基于NeRF和高斯点绘的肖像将姿态和外观分开，限制了对新姿态的渲染泛化能力。为解决这些问题，本文提出HumanSplatHMR，一种联合优化框架，通过同时优化3D人体姿态并学习高保真的肖像，以实现新视角和新姿态的合成。我们的关键见解是闭合几何姿态估计与可微渲染之间的循环。不同于以往依赖运动捕捉系统或离线优化获得的准确人体姿态的人形肖像方法，在野外场景中不实用，我们的方法仅使用最先进的姿态估计器得到的人体网格估计，以更好地反映现实情况。因此，不同于将人体姿态仅作为变形先验使用，HumanSplatHMR通过可微渲染将光度、分割和深度损失反向传播到姿态参数和全局位置。这种耦合在时间上优化全局3D姿态，提高精度和对齐性，同时产生更高质量的新视角渲染。实验显示，与省略图像级优化的姿态恢复基线和将姿态估计与肖像重建解耦的肖像基线相比，有持续的改进。

英文摘要

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.

2605.02098 2026-05-22 cs.CV 版本更新

Skarimva：基于骨架的动作识别是一种多视图应用

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

发表机构 * Institute for Software and Systems Engineering, University of Augsburg（软件与系统工程研究所，奥格斯堡大学）

AI总结本文研究了基于骨架的动作识别中多视图应用的重要性，指出通过多摄像头视图三角化获得更准确的3D骨架数据，可以显著提升现有动作识别模型的性能，表明输入数据质量是限制模型性能的关键因素，未来研究应将多视图应用作为标准设置。

2602.20845 2026-05-22 cs.CV 版本更新

FLIM Networks with Bag of Feature Points

具有特征点袋的FLIM网络

João Deltregia Martinelli, Marcelo Luis Rodrigues Filho, Felipe Crispim da Rocha Salvagnini, Gilson Junior Soares, Jefersson A. dos Santos, Alexandre X. Falcão

发表机构 * Institute of Computing, UNICAMP, Campinas, Brazil（坎皮纳斯州立大学计算研究所，巴西坎皮纳斯）； School of Computer Science, University of Sheffield, Sheffield, United Kingdom（谢菲尔德大学计算机科学学院，英国谢菲尔德）

AI总结本文提出FLIM-BoFP，一种更高效的滤波器估计方法，用于显微镜图像中的寄生虫检测，相较于FLIM-Cluster和其他先进基线，在效率、效果和泛化能力上均有优势。

Comments Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer

详情

DOI: 10.1007/978-3-032-23176-5_19

AI中文摘要

卷积网络需要大量的图像标注，这可能成本高昂且耗时。通过从少量代表性图像上用户绘制的标记中估计编码器滤波器（即核权重），特征学习从图像标记（FLIM）解决了这一挑战，而无需传统优化。这种编码器与自适应解码器结合构成了一个完全训练而无需反向传播的FLIM网络。先前研究已证明其在显著物检测（SOD）中的有效性，比现有轻量模型显著更轻。本研究重新审视FLIM SOD，并引入FLIM-Bag of Feature Points（FLIM-BoFP），一种显著更快的滤波器估计方法。先前方法FLIM-Cluster通过每个编码器块的补丁聚类来推导滤波器，导致计算开销和对滤波器位置的控制减少。FLIM-BoFP通过在输入块进行一次聚类，创建特征点袋，并在所有块上直接从映射的特征点定义滤波器。论文评估了FLIM-BoFP与FLIM-Cluster和其他最先进的基线在寄生虫检测中的效率、效果和泛化能力的益处。

英文摘要

Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder's block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.

2602.17517 2026-05-22 cs.CV 版本更新

Depth Augmented and FE Free 3D/2D Liver Registration for Laparoscopic Liver AR

深度增强和无有限元分析的3D/2D肝脏注册用于腹腔镜肝脏AR

Hanyuan Zhang, Lucas He, Runlong He, Weixi Yi, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson

发表机构 * UCL Hawkes Institute, University College London, London WC1E 6BT, UK（伦敦大学学院UCL霍克斯研究所）； Division of Surgery and Interventional Science, University College London, London WC1E 6BT, UK（伦敦大学学院外科与介入科学系）； Unit for Lifelong Health and Ageing at UCL, University College London, London WC1E 7HB, UK（伦敦大学学院终身健康与老龄化研究单元）； Medtronic plc., London, UK（美敦力公司，伦敦）

AI总结本研究提出了一种深度增强且无需有限元分析的3D/2D肝脏注册方法，通过结合鲁棒的刚性初始化和患者特定的非刚性细化，以提高腹腔镜肝脏手术AR中的3D到2D注册精度。

详情

AI中文摘要

增强现实（AR）在腹腔镜肝脏手术中的引导需要准确地将术前3D模型与术中2D视频进行注册，但因部分可见性、镜面反射和组织变形而具有挑战性。现有方法通常依赖于基于轮廓的刚性初始化和有限元（FE）模型进行可变形注册，增加了建模和工程复杂性。我们提出了一种深度增强且无有限元分析的3D-2D注册流程，结合了鲁棒的刚性初始化和患者特定的非刚性细化。对于刚性对齐，我们通过使用多类轮廓图和单目深度来适应FoundationPose的RefineNet模块以适应腹腔镜肝脏场景，以实现相对姿态的细化。对于可变形对齐，我们从非刚性ICP（NICP）对应关系中构建患者特定的统计变形模型，并使用粗到细的L-BFGS-B策略优化姿态和形状参数。在公开的临床腹腔镜肝脏数据集上，所提出的方法在受控的手动轮廓设置下实现了平均目标注册误差（TRE）为14.73毫米。消融研究显示，单目深度在轮廓输入上提高了刚性初始化，而肿瘤映射分析表明良好的表面对齐并不一定转化为更低的目标定位误差。在没有地面真实数据的外部数据集上，该方法产生视觉上合理的叠加以进行定性评估。这些结果表明，深度增强的姿态细化和无有限元分析的统计变形建模为受控的3D-2D肝脏注册在手术AR中提供了一个有前景的替代方案。

英文摘要

Augmented reality (AR) guidance in laparoscopic liver surgery requires accurate registration of preoperative 3D models to intraoperative 2D video, but remains challenging due to partial visibility, specularities, and tissue deformation. Existing methods often rely on contour-based rigid initialization and finite-element (FE) models for deformable registration, increasing modeling and engineering complexity. We present a depth-augmented, FE-free 3D-2D registration pipeline that combines robust rigid initialization with patient-specific non-rigid refinement. For rigid alignment, we adapt the RefineNet module of FoundationPose to laparoscopic liver scenes by using multi-class contour maps and monocular depth for relative pose refinement. For deformable alignment, we construct a patient-specific statistical deformation model from non-rigid ICP (NICP) correspondences and optimize pose and shape parameters using a coarse-to-fine L-BFGS-B strategy. On a public clinical laparoscopic liver dataset, the proposed method achieves a mean target registration error (TRE) of 14.73 mm under a controlled manual-contour setting designed to isolate registration performance. Ablation studies show that monocular depth improves rigid initialization over contour-only inputs, while tumor-mapping analysis indicates that good surface alignment does not necessarily translate into lower target localization error. On an external dataset without ground truth, the method produces visually plausible overlays for qualitative assessment. These results suggest that depth-augmented pose refinement and FE-free statistical deformation modeling provide a promising alternative to FE-based pipelines for controlled 3D-2D liver registration in surgical AR.
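
For readers unfamiliar with the metric, mean target registration error is commonly computed as the average Euclidean distance between registered target points and their ground-truth positions. The sketch below assumes that common definition; the function name `mean_tre` and the toy coordinates are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mean_tre(registered: np.ndarray, ground_truth: np.ndarray) -> float:
    """Mean Euclidean distance between corresponding 3D target points.

    Both arrays have shape (K, 3) and use the same unit (e.g. millimetres).
    """
    distances = np.linalg.norm(registered - ground_truth, axis=1)
    return float(distances.mean())

# Toy example: one perfectly registered target and one that is 5 mm off.
registered = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
ground_truth = np.zeros((2, 3))
tre_mm = mean_tre(registered, ground_truth)
```

Because TRE is measured at internal targets such as tumors rather than on the visible surface, a low surface error does not guarantee a low TRE, which is consistent with the tumor-mapping observation in the abstract.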

2602.12952 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Transporting Task Vectors across Different Architectures without Training

在不同架构间传输任务向量而无需训练

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

发表机构 * AImageLab, University of Modena and Reggio Emilia（AImageLab，摩德纳和雷焦艾米利亚大学）

AI总结本文提出Theseus方法，通过功能匹配在不同宽度模型间传输任务更新，无需训练或反向传播，展示了在视觉和语言模型上的改进效果。

Comments Accepted at the International Conference on Machine Learning (ICML), 2026

详情

AI中文摘要

适应大型预训练模型以完成下游任务时，通常会产生针对特定任务的参数更新，这些更新对于每个模型变体重新学习都很昂贵。尽管最近的研究表明，这些更新可以在具有相同架构的模型之间转移，但跨不同宽度的模型转移仍鲜有探索。在本文中，我们引入Theseus，一种无需训练的方法，用于在异构宽度模型间传输任务更新。与其匹配参数，我们通过其在中间表示上诱导的功能效应来表征任务更新。我们正式将任务向量传输定义为在观察到的激活上进行的功能匹配问题，并显示在通过正交Procrustes分析对齐表示空间后，它允许一个稳定的闭式解，该解保留了更新的几何结构。我们在不同宽度的视觉和语言模型上评估Theseus，显示在不进行额外训练或反向传播的情况下，相对于基线有持续的改进。我们的结果表明，当任务身份通过功能而非参数定义时，任务更新可以有意义地在不同架构间转移。代码可在https://github.com/apanariello4/merge-and-rebase获取。

英文摘要

Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains unexplored. In this work, we introduce Theseus, a training-free method for transporting task updates across heterogeneous-width models. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically. Code is available at https://github.com/apanariello4/merge-and-rebase.
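
The alignment step named in this abstract, orthogonal Procrustes analysis, has a standard closed-form solution via the singular value decomposition. The sketch below shows only that textbook solution and assumes both activation matrices have the same feature dimension; it is not the Theseus method itself, which targets models of different widths and adds a functional-matching step not reproduced here. The names `source`, `target`, and the synthetic data are illustrative.

```python
import numpy as np

def orthogonal_procrustes(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix R minimising ||source @ R - target||_F.

    source, target: activation matrices of shape (num_samples, dim).
    Equal `dim` on both sides is an assumption of this sketch.
    """
    # SVD of the cross-covariance; the minimiser is the product U @ V^T.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Synthetic check: rotate random activations by a known orthogonal matrix.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 4))
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
target = source @ true_rotation
recovered = orthogonal_procrustes(source, target)
```

Restricting the map to be orthogonal preserves distances and angles between activations, which matches the abstract's emphasis on keeping the geometry of the update intact.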

2602.06995 2026-05-22 cs.RO cs.CV cs.IT cs.MA math.IT 版本更新

When Simultaneous Localization and Mapping Meets Wireless Communications: A Survey

当同时定位与建图遇见无线通信：一篇综述

Konstantinos Gounis, Sotiris A. Tegos, Dimitrios Tyrovolas, Panagiotis D. Diamantoulakis, George K. Karagiannidis

发表机构 * Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki（塞萨洛尼基亚里士多德大学电气与计算机工程系）

AI总结本文综述了SLAM与无线通信交汇领域的最新进展，重点探讨了视觉SLAM（V-SLAM）整合中的双向影响，总结了无线信号传播、几何信道建模、基于射频（RF）的定位与感知等关键概念，以及图像处理技术如何检测地标并预测无线信道的最优路径，同时分析了SLAM与无线通信交叉领域的技术、挑战和未来方向。

详情

英文摘要

This paper surveys the state-of-the-art in the nexus of SLAM and Wireless Communications, attributing the bidirectional impact of each with a focus on visual SLAM (V-SLAM) integration. We provide an overview of key concepts related to wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing. In addition to this, we show image processing techniques that can detect landmarks, proactively predicting optimal paths for wireless channels. Several dimensions are considered, including the prerequisites, techniques, background, and future directions and challenges of the intersection between SLAM and wireless communications. We analyze estimation and control approaches such as Bayesian filters, feature-based pose estimation, perception-aware motion control, spatial methods for signal processing such as vector fields, and key technological aspects. We expose techniques and items towards enabling a highly effective retrieval of the autonomous robot state. Among other interesting findings, we observe that monocular V-SLAM would benefit from RF relevant information, as the latter can serve as a proxy for the scale ambiguity resolution. Conversely, we find that wireless communications in the context of 5G and beyond can potentially benefit from visual odometry that is central in SLAM. Moreover, we examine other sources besides the camera for SLAM and describe the twofold relation with wireless communications. Finally, integrated solutions performing joint communications and SLAM appear to be in their infancy: theoretical and practical advancements are required to add higher-level localization and semantic perception capabilities to RF and multi-antenna technologies.

2602.06676 2026-05-22 cs.CV 版本更新

通过单目法线图增强基于事件的目标检测

Mingjie Liu, Hanqing Liu, Luoping Cui, Chuang Zhu

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications（人工智能学院，北京邮电大学）

AI总结本文提出NRE-Net框架，结合法线图的结构先验、RGB图像的外观上下文和事件的高频动态，通过自适应双流融合模块和事件模态感知融合模块提升自动驾驶中复杂光照下的目标检测性能。

详情

AI中文摘要

自动驾驶中的目标检测常受到复杂光照条件的干扰。虽然事件相机提供了一种稳健的解决方案，但它们容易受到突然的对比度变化（如反射）的影响，这通常会触发密集且误导性的事件信号。为了解决这个问题，我们利用RGB衍生的表面法线图作为显式的几何约束。关键在于，即使RGB退化，它们也保留了低频的结构先验，这有助于事件检测。因此，我们提出了NRE-Net，一个三模态框架，该框架整合了来自表面法线图的结构先验、来自RGB图像的外观上下文以及来自事件的高频动态。自适应双流融合模块（ADFM）首先对几何和外观线索进行对齐，随后是事件模态感知融合模块（EAFM），它选择性地整合事件动态。在DSEC-Det-sub和PKU-DAVIS-SOD上的大量评估表明，结合几何先验相比双模态基线在AP50上获得了额外的3.0%提升，并且我们的方法始终优于SFNet（+2.7%）和SODFormer（+7.1%）等融合方法。

英文摘要

Object detection in autonomous driving is frequently compromised by complex illumination. While event cameras offer a robust solution, they are susceptible to sudden contrast changes such as reflections which often trigger dense, misleading event signals. To overcome this, we leverage RGB-derived surface normal maps as explicit geometric constraints. Crucially, even when RGB degrades, they preserve low-frequency structural priors that effectively assist in event-based detection. Consequently, we present NRE-Net, a trimodal framework that integrates structural priors from surface Normal maps, appearance context from RGB images, and high-frequency dynamics from Events. The Adaptive Dual-stream Fusion Module (ADFM) first aligns geometric and appearance cues, followed by the Event-modality Aware Fusion Module (EAFM) which selectively integrates event dynamics. Extensive evaluations on DSEC-Det-sub and PKU-DAVIS-SOD demonstrate that incorporating geometric priors yields an additional 3.0% AP50 gain over dual-modal baselines, while our approach consistently outperforms fusion methods such as SFNet (+2.7%) and SODFormer (+7.1%).


2505.16416 2026-05-22 cs.CV cs.AI 版本更新

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Circle-RoPE: 用于大视觉-语言模型的锥形解耦旋转位置嵌入

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

发表机构 * Huawei Noah's Ark Lab.（华为诺亚实验室）； City University of Hong Kong.（香港城市大学）； University of Sydney.（悉尼大学）； State Key Lab of General AI, School of Intelligence Science and Technology, Peking University（北京大学人工智能国家重点实验室，智能科学与技术学院）

AI总结本文提出Circle-RoPE，通过将图像标记坐标映射到与文本位置轴正交的圆环上，实现跨模态位置解耦，同时保留图像内部空间结构，并通过交替几何编码增强跨模态位置解耦和细粒度图像空间结构保留。

Comments Accepted at ICML 2026

详情

AI中文摘要

旋转位置嵌入（RoPE）在大型语言模型中被广泛采用，但应用于视觉-语言模型（VLMs）时会耦合文本和图像位置索引，并可能引入虚假的跨模态相对位置偏差。我们提出Per-Token Distance（PTD）来量化跨模态位置解耦，并证明PTD = 0是消除RoPE引起的几何注意力偏差的充分条件。基于此准则，我们引入Circle-RoPE，将2D图像标记坐标映射到与文本位置轴正交的圆环上，得到一种锥形几何结构，其中每个文本标记到所有图像标记等距，同时保留图像内部空间结构。我们进一步提出交替几何编码（AGE）以通过在层之间交替Circle-RoPE的解耦几何和标准RoPE的网格先验来结合互补的几何先验。这种设计在保持细粒度图像空间结构的同时实现了跨模态位置解耦。在多种VLM后端和多模态基准测试中的实验显示，在空间定位和视觉推理方面均取得了稳定的提升。代码可在https://github.com/lose4578/CircleRoPE上获得。

英文摘要

Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and prove that PTD = 0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to combine complementary geometric priors by alternating the decoupled geometry of Circle-RoPE and the grid-based prior of standard RoPE across layers. This design enables cross-modal positional disentanglement while preserving fine-grained intra-image spatial structure. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning. The code is available at https://github.com/lose4578/CircleRoPE.
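The cone-like geometry described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the text position axis is the x-axis, places image tokens at a single hypothetical `radius` on a circle in the plane x = 0, and flattens the grid in row-major order, which keeps only the ordering of tokens rather than the full 2D layout.

```python
import math

def circle_positions(height, width, radius=1.0):
    """Place the height*width image tokens evenly on a circle in the plane x = 0.

    The text position axis is assumed to be the x-axis, so this plane is
    orthogonal to it. Row-major grid order becomes angular order.
    """
    n = height * width
    return [
        (0.0,
         radius * math.cos(2 * math.pi * k / n),
         radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

def distances_to_text(text_index, image_points):
    """Euclidean distance from the text token at (text_index, 0, 0) to each image token."""
    text_point = (float(text_index), 0.0, 0.0)
    return [math.dist(text_point, p) for p in image_points]

image_points = circle_positions(height=3, width=4, radius=2.0)
for t in (1, 5, 9):
    d = distances_to_text(t, image_points)
    # Every image token lies sqrt(t^2 + radius^2) from text token t,
    # so the spread between the nearest and farthest image token is ~0.
    print(t, max(d) - min(d))
```

Because every image token sits at the same distance from a given text token, no image token is favoured by relative position alone, which is the intuition behind the equidistance property stated in the abstract.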


2410.19787 2026-05-22 cs.CV cs.LG 版本更新

Leveraging Multi-Temporal Sentinel 1 and 2 Satellite Data for Leaf Area Index Estimation With Deep Learning

利用多时相哨兵1和2卫星数据进行叶面积指数估计的深度学习方法

Clement Wang, Antoine Debouchage, Valentin Goldité, Aurélien Wery, Jules Salzinger

发表机构 * Austrian Institute of Technology - Vienna, Austria（奥地利技术研究所-维也纳，奥地利）

AI总结本文提出了一种基于多时相哨兵1雷达数据和哨兵2多谱段数据的深度学习方法，用于像素级叶面积指数预测，通过多U-Net网络结构和共同潜在空间实现不同输入模态的互补信息融合，最终在公开数据上取得了0.06 RMSE和0.93 R2分数。

详情

DOI: 10.2760/46796
Journal ref: Proc. 2023 Conference on Big Data from Space (BiDS'23), Publications Office of the European Union, Luxembourg, 2023

AI中文摘要

叶面积指数（LAI）是理解生态系统健康和植被动态的关键参数。在本文中，我们提出了一种新的像素级LAI预测方法，通过利用多时间戳的哨兵1雷达数据和哨兵2多谱段数据的互补信息。我们的方法基于多个针对此任务定制的多U-Net深度神经网络。为处理不同输入模态的复杂性，该方法由多个预先训练的模块组成，以在共同的潜在空间中表示所有输入数据。然后，我们通过一个共同的解码器进行端到端微调，该解码器还考虑了季节性因素，我们发现季节性在其中起重要作用。我们的方法在公开可用数据上实现了0.06 RMSE和0.93 R2分数。我们的贡献可在https://github.com/valentingol/LeafNothingBehind上获得，供未来工作进一步改进当前进展。

英文摘要

The Leaf Area Index (LAI) is a critical parameter to understand ecosystem health and vegetation dynamics. In this paper, we propose a novel method for pixel-wise LAI prediction by leveraging the complementary information from Sentinel 1 radar data and Sentinel 2 multi-spectral data at multiple timestamps. Our approach uses a deep neural network based on multiple U-nets tailored specifically to this task. To handle the complexity of the different input modalities, it is comprised of several modules that are pre-trained separately to represent all input data in a common latent space. Then, we fine-tune them end-to-end with a common decoder that also takes into account seasonality, which we find to play an important role. Our method achieved 0.06 RMSE and 0.93 R2 score on publicly available data. We make our contributions available at https://github.com/valentingol/LeafNothingBehind for future works to further improve on our current progress.
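For readers unfamiliar with the two reported metrics, the following sketch shows the standard definitions of RMSE and the coefficient of determination (R2). The toy values are hypothetical and unrelated to the paper's data; whether the authors aggregate per pixel or per image is not stated in the abstract.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical LAI-like values for four pixels.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

RMSE is expressed in the units of the target, while R2 is unitless, so the two numbers answer different questions: typical error size versus the share of variance explained.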


2404.05307 2026-05-22 cs.CV cs.RO 版本更新

4D Radar Semantic Segmentation of People in Field Conditions Using Temporal Multi-View Networks

利用时序多视角网络进行野外条件下4D雷达的人体语义分割

Mikael Skog, Oleksandr Kotlyar, Vladimír Kubelka, Martin Magnusson

发表机构 * Center for Advanced Autonomous Sensor Systems (AASS)（先进自主传感器系统中心）

AI总结本文提出TMVA4D网络，利用4D雷达数据进行人体语义分割，通过多视角投影区分背景与人体，在低能见度条件下实现75.9%的Dice系数和61.2%的IoU指标。

详情

AI中文摘要

可靠的人员检测对于移动机器人和重型车辆在道路和工业环境（如采矿和建筑）中的安全自主至关重要。然而，常规传感器如摄像头或激光雷达在尘埃、雾或烟等恶劣条件下容易失效，限制了其在现实机器人系统中的应用。雷达在广泛的环境条件下提供稳健的测量。特别是现代高分辨率4D成像雷达提供跨距离、方位和仰角的4D点云，以及每个点的多普勒速度数据，非常适合机器人感知。我们提出TMVA4D，一种基于CNN和ConvLSTM编码器的神经网络架构家族，利用4D雷达模态进行语义分割。这些架构被训练以区分背景和人体类别，使用一系列2D投影的4D雷达数据，涵盖仰角、方位、距离和多普勒速度维度。在多个操作站点评估中，我们的模型在低能见度条件下实现了有希望的性能（Dice 75.9%，IoU 61.2% for class person）。数据和代码将在发表后公开发布。

英文摘要

Reliable people detection is crucial for the safe autonomy of mobile robots and heavy vehicles, both on roads and in industrial settings like mining and construction. However, common sensors like cameras or lidars are prone to failure in adverse conditions such as dust, fog, or smoke, which limits their use in real-world robotic systems. Radar, on the other hand, delivers robust measurements in a wide range of environmental conditions. In particular, modern high-resolution 4D imaging radars provide 4D point clouds across range, azimuth, and elevation, as well as per-point Doppler velocity data, well suited for robot perception. We propose TMVA4D, a family of artificial neural network architectures based on CNN and ConvLSTM encoders that leverage the 4D radar modality for semantic segmentation. The architectures are trained to distinguish between background and person classes using a series of 2D projections of the 4D radar data, encompassing elevation, azimuth, range, and Doppler velocity dimensions. Evaluated across several operational sites, our models achieve promising performance (Dice 75.9%, IoU 61.2% for class person) even in low-visibility conditions. The data and code will be made publicly available upon publication.
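The two reported segmentation scores are related. For a single predicted mask and ground-truth mask, Dice and IoU are linked by Dice = 2·IoU / (1 + IoU). The sketch below uses hypothetical pixel counts chosen to give an IoU of 0.612; the paper may average its metrics differently (for example per frame), so the identity is only a plausibility check.

```python
def iou(tp, fp, fn):
    """Intersection over union from true-positive, false-positive, false-negative counts."""
    return tp / (tp + fp + fn)

def dice(tp, fp, fn):
    """Dice coefficient (F1 score over pixels) from the same counts."""
    return 2 * tp / (2 * tp + fp + fn)

def dice_from_iou(j):
    """Convert an IoU value to the Dice value it implies for one mask pair."""
    return 2 * j / (1 + j)

# Hypothetical pixel counts for the person class.
tp, fp, fn = 612, 194, 194
print(iou(tp, fp, fn), dice(tp, fp, fn), dice_from_iou(iou(tp, fp, fn)))
```

Under this identity an IoU of 0.612 corresponds to a Dice of about 0.759, which agrees with the pair of figures quoted in the abstract.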


2209.03358 2026-05-22 cs.NE cs.AI cs.CR cs.CV cs.LG 版本更新

Attacking the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples

攻击尖峰：关于脉冲神经网络对抗示例的转移性和安全性

Nuo Xu, Kaleel Mahmood, Haowen Fang, Ethan Rathbun, Caiwen Ding, Wujie Wen

发表机构 * Lehigh University（莱文大学）； University of Minnesota Twin Cities（明尼苏达大学双城分校）； North Carolina State University（北卡罗来纳州立大学）； University of Rhode Island（罗德岛大学）； Northeastern University（东北大学）

AI总结本文研究了脉冲神经网络（SNN）在对抗示例中的鲁棒性，揭示了对抗攻击的转移性，并提出了混合动态脉冲估计（MDSE）攻击方法，以提高SNN和非SNN模型的对抗示例生成效果。

Comments Accepted manuscript. Published in *Neurocomputing*, Volume 656, 2025, Article 131506. Available online 12 September 2025. DOI: 10.1016/j.neucom.2025.131506

详情

DOI: 10.1016/j.neucom.2025.131506
Journal ref: Neurocomputing, Volume 656, 2025, 131506

AI中文摘要

脉冲神经网络（SNNs）因其高能效和最近在分类性能上的进展而受到广泛关注。然而，与传统深度学习方法不同，SNN对对抗示例的鲁棒性研究仍相对薄弱。在本文中，我们通过三个贡献推进了SNN的对抗攻击研究。首先，我们表明对SNN的成功白盒对抗攻击高度依赖于底层的替代梯度估计器，即使对于对抗训练的SNN也是如此。其次，使用最佳的单一替代梯度估计器，我们分析了对抗攻击在SNN、视觉Transformer（ViTs）和CNN之间的可转移性。我们的分析揭示了两个关键差距：现有的白盒攻击没有利用多个替代梯度估计器来攻击SNN，且没有单个模型攻击能够可靠地生成同时欺骗SNN和非SNN模型的对抗示例。作为我们的第三个贡献，我们开发了混合动态脉冲估计（MDSE）攻击来解决这些问题。MDSE使用动态梯度估计方案，充分利用多个替代梯度估计器函数，生成能够同时欺骗SNN和非SNN模型的对抗示例。MDSE在SNN/ViT模型集合上比传统白盒攻击如Auto-PGD有效多达91.4%，在对抗训练的SNN集合上提供了3倍的提升。实验覆盖了三个数据集（CIFAR-10、CIFAR-100、ImageNet）和十九个分类器模型（每个CIFAR数据集七个，ImageNet五个）。我们的MDSE实现和评估的模型在https://github.com/nuoxuxxx/attacking-the-spike-mdse上公开可用。

英文摘要

Spiking neural networks (SNNs) have attracted much attention for their high energy efficiency and recent advances in classification performance. However, unlike traditional deep learning approaches, the study of SNN robustness to adversarial examples remains relatively underdeveloped. In this work, we advance the adversarial attack side of SNNs through three contributions. First, we show that successful white-box adversarial attacks on SNNs are highly dependent on the underlying surrogate gradient estimator, even for adversarially trained SNNs. Second, using the best single surrogate gradient estimator, we analyze the transferability of adversarial attacks across SNNs, Vision Transformers (ViTs) and CNNs. Our analysis reveals two key gaps: no existing white-box attack exploits multiple surrogate gradient estimators for SNNs, and no single-model attack reliably generates adversarial examples that simultaneously fool both SNN and non-SNN models. For our third contribution, we develop the Mixed Dynamic Spiking Estimation (MDSE) attack to address these issues. MDSE uses a dynamic gradient estimation scheme to fully exploit multiple surrogate gradient estimator functions and generates adversarial examples capable of fooling SNN and non-SNN models simultaneously. MDSE is up to 91.4% more effective on SNN/ViT model ensembles and provides a 3x boost on adversarially trained SNN ensembles compared to conventional white-box attacks like Auto-PGD. Experiments cover three datasets (CIFAR-10, CIFAR-100, ImageNet) and nineteen classifier models (seven per CIFAR dataset, five for ImageNet). Our implementation of MDSE and the evaluated models is publicly available at https://github.com/nuoxuxxx/attacking-the-spike-mdse.


1709.03806 2026-05-22 cs.CV 版本更新

Do Vision Models Encode Object-Level Semantic Relatedness? A Cognitive Psychology-Inspired Benchmark

视觉模型是否编码物体层面的语义相关性？一种受认知心理学启发的基准

Hansang Lee, Haeil Lee, Junmo Kim

发表机构 * Department of Computer Science（计算机科学系）； Seoul Women’s University（首尔女子大学）； LG Energy Solution（LG能源解决方案）； School of Electrical Engineering（电气工程学院）； KAIST（韩国科学技术院）

AI总结本文通过一种受认知心理学启发的基准，探讨了视觉模型是否能编码物体层面的语义相关性，研究了两种仅基于图像的测试集，并揭示了分类准确率之外的表征特性。

详情

AI中文摘要

现代视觉模型在物体识别任务上取得了显著的性能，但尚不清楚其表示是否编码物体层面的语义相关性，即支持人类视觉认知的对象概念之间的有意义联系。现有的基准主要针对类别预测或依赖图像-文本匹配，忽略了视觉表示本身的研究。受认知心理学启发，我们将语义相关性重新定义为三元组排序任务，并研究了两个仅基于图像的测试集：POPORO，一个已有的400个三元组心理刺激集，重新用于表示评估；以及PoporoIN，一个新构建并人工编写的1000个三元组ImageNet验证扩展集。每个三元组沿两个正交轴进行注释：一个相关目标轴区分类别相关性（CR，分类学）和上下文相关性（TR，主题性），一个干扰轴区分颜色匹配干扰项（CD）和形状匹配干扰项（SD）。二十种预训练模型，涵盖监督、自监督、视觉-语言和生成范式，在仅推理的协议下通过余弦相似度进行评估。基于变换器的表示在PoporoIN上比卷积表示高出高达18.30个百分点，且在可比的ImageNet准确率下，视觉-语言编码器在POPORO上比视觉-only编码器高出高达22.50个百分点。在所有范式中，模型在分类学目标上比主题性目标更可靠地识别，且更容易被形状匹配干扰项所误导，而不是颜色匹配干扰项。这些基准揭示了分类准确率之外的表征特性，连接了认知心理学和视觉表征评估。

英文摘要

Modern vision models have achieved strong object-recognition performance, yet it remains unclear whether their representations encode object-level semantic relatedness, the meaningful connection between object concepts that supports human visual cognition. Existing benchmarks predominantly target category prediction or rely on image-text matching, leaving the visual representation itself underexamined. Drawing on cognitive psychology, we recast semantic relatedness as a triplet-ranking task and study two image-only test beds: POPORO, an existing 400-triplet psychological stimulus set repurposed for representation evaluation, and PoporoIN, a newly constructed and manually curated 1,000-triplet ImageNet-validation extension. Each triplet is annotated along two orthogonal axes: a related-target axis distinguishing Categorical Relatedness (CR, taxonomic) from conTextual Relatedness (TR, thematic), and a distractor axis distinguishing Color-matched Distractors (CD) from Shape-matched Distractors (SD). Twenty pretrained models spanning supervised, self-supervised, vision-language, and generative paradigms were evaluated by cosine similarity in an inference-only protocol. Transformer-based representations exceeded convolutional counterparts by up to 18.30 percentage points on PoporoIN at comparable ImageNet accuracy, and vision-language encoders exceeded vision-only counterparts by up to 22.50 percentage points under matched ImageNet accuracy on POPORO. Across paradigms, models recognized taxonomic targets more reliably than thematic ones and were more easily misled by shape-matched than by color-matched distractors. The benchmarks expose representational properties that classification accuracy alone does not fully predict, bridging cognitive psychology and visual representation evaluation.
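A triplet-ranking evaluation by cosine similarity can be sketched as follows. The scoring rule is an assumption, since the abstract does not spell it out: a triplet counts as correct when the reference embedding is more similar to the related target than to the distractor. The embeddings are hypothetical two-dimensional toy vectors, not model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_accuracy(triplets):
    """Fraction of (reference, related_target, distractor) triplets ranked correctly."""
    correct = sum(cosine(r, t) > cosine(r, d) for r, t, d in triplets)
    return correct / len(triplets)

toy_triplets = [
    # Related target is closer to the reference than the distractor: correct.
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    # Distractor is closer to the reference than the related target: wrong.
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),
]
print(triplet_accuracy(toy_triplets))
```

Because the protocol is inference-only, the same function could be applied to frozen embeddings from any encoder, and accuracy could be reported separately for each cell of the CR/TR by CD/SD annotation grid.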


2605.22086 2026-05-22 cs.CV 版本更新

COCOTree: 一个用于开放树状视觉分解的数据集和基准

Junhyub Lee, Seunghun Chae, Hyosu Kim

发表机构 * Chung-Ang University（Chung-Ang大学）

AI总结本文提出COCOTree数据集和基准，通过自动化生成管道和开放词汇空间，实现了对复杂物理组装的长尾分布的捕捉，并提出了Open Tree Quality (OTQ)评估指标。

详情

AI中文摘要

我们正式化并启用了开放树分解任务，该任务将图像分割为具有无约束粒度和灵活性的层次树状视觉组件。具体而言，我们为这一新范式提供了基础基准，有三个关键贡献：首先，通过开发一个完全自动化的生成管道，结合大视觉-语言模型的语义推理与SAM 3的精确几何定位，克服了手动标注的高认知和物理瓶颈；其次，利用该管道构建了COCOTree大规模基准，包含超过21,000张图像和180万个结构节点，通过超过3,500个唯一标签的开放词汇空间，成功捕捉了复杂物理组装的长尾分布；最后，我们通过提出Open Tree Quality (OTQ)指标建立了标准化评估协议，该指标联合评估掩码精度、标签准确性和结构一致性。我们已发布数据集和基准代码：https://github.com/melonkick3090/COCOTree.

英文摘要

We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at https://github.com/melonkick3090/COCOTree.


2605.22066 2026-05-22 cs.CV cs.AI 版本更新

GA-VLN: 用于高效视觉-语言导航的几何感知鸟瞰图表示

Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences（人工智能安全国家重点实验室，计算技术研究所，中国科学院）； University of Chinese Academy of Sciences（中国科学院大学）； Robbyant ； School of Computing, National University of Singapore（新加坡国立大学计算机学院）； The Hong Kong University of Science and Technology（香港科技大学）

AI总结本文提出GA-VLN框架，通过引入几何感知的鸟瞰图表示（GA-BEV），整合显式和隐式几何信息，提升视觉-语言导航的效率和性能，实验表明其在仅使用导航数据的情况下取得了最先进的结果。

详情

AI中文摘要

尽管在视觉-语言导航（VLN）领域取得了显著进展，现有方法仍依赖密集的RGB视频，产生过多的片段标记且缺乏显式的空间结构，导致计算开销大且空间推理能力有限。为了解决这些问题，我们引入了几何感知的鸟瞰图（GA-BEV）-一种紧凑且3D基础的特征表示，将显式和隐式的几何线索整合到多模态大语言模型（MLLM）导航系统中。我们通过将视觉特征投影到3D空间并聚合为以代理为中心的布局来构建BEV空间地图，该布局在保持几何一致性的同时减少标记冗余。为了进一步丰富几何理解，我们将预训练的3D基础模型的特征融入BEV空间，注入从大规模3D重建任务中学习到的结构先验。这些互补的线索-基于深度的显式投影和隐式学习的先验-产生紧凑但空间表达能力强的表示，显著提高了导航效率和性能。实验表明，我们的方法仅使用导航数据即可取得最先进的结果，无需DaGger增强或混合VQA训练，证明了所提GA-VLN框架的鲁棒性和数据效率。

英文摘要

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.


2605.22035 2026-05-22 cs.CV cs.CL 版本更新

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

HyLoVQA: 动态超网络生成低秩适应用于连续视觉问答

Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li

发表机构 * School of Computer Science, Hubei University, Wuhan 430062, China（湖北大学计算机学院，武汉430062，中国）； Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China（湖北省大数据智能分析与应用重点实验室（湖北大学），武汉430062，中国）； Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China（智能感知系统与安全重点实验室（湖北大学），教育部，武汉430062，中国）

AI总结 HyLoVQA通过动态超网络生成低秩适应，解决连续视觉问答中任务干扰问题，提升模型对当前任务和对象的适应能力。

Comments Accepted by IJCAI 2026

详情

AI中文摘要

连续视觉问答（VQA）需要在非稳态的视觉输入和问题流中学习，同时保持过去知识。大多数先前方法通过更新大量共享参数集来适应，这通常导致跨层任务干扰，阻碍对当前任务和对象的准确适应。为了解决这一限制，我们提出了HyLoVQA。它维护一个具有漂移鲁棒性的锚点记忆库。该库存储视觉对象的内容和文本任务的内容，并使用当前输入特征进行更新。基于检索到的锚点，超网络生成轻量级低秩适应（LoRA）适配器。这确保了参数效率，使模型能够动态适应每个任务和对象。此外，我们提出了一个对齐损失，将特征空间中的语义差异与参数空间中的功能变化对齐，从而约束LoRA适配器保持专注于当前任务和对象。在VQA v2和NExT-QA上广泛实验表明，HyLoVQA在标准和组合设置下优于先前最先进的方法。

英文摘要

Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.


2605.22034 2026-05-22 cs.CV cs.AI 版本更新

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

AgroVG：一个大规模多源基准用于农业视觉 grounding

Haocheng Li, Juepeng Zheng, Zenghao Yang, Kaiqi Du, Guilong Xiao, Gengmeng Pu, Haohuan Fu, Jianxi Huang

发表机构 * China Agricultural University（中国农业大学）； Sun Yat-sen University（中山大学）； Tianjin University（天津大学）； Tsinghua University（清华大学）； Southwest Jiaotong University（西南交通大学）； National Supercomputing Center in Shenzhen（深圳国家超算中心）

AI总结本文提出AgroVG基准，用于评估农业视觉 grounding能力，通过多源数据集和任务特定协议，评估模型在多目标、多实例和无目标场景下的性能，揭示了现有模型在农业视觉 grounding任务中的不足。

Comments 45 pages,12 figures

详情

AI中文摘要

视觉 grounding，即根据自然语言描述定位物体的任务，是农业人工智能系统的基础能力，可应用于选择性除草、疾病监测和定向收获。农业视觉 grounding的可靠评估具有挑战性，因为农业目标往往小、重复、被遮挡或形状不规则，且指令可能指一个、多个或没有物体。因此，评估此能力需要联合测试定位精度、目标集完整性和存在感知的回避。为了解决这些挑战，我们引入了AgroVG，一个多源基准，将农业 grounding 视为广义集合预测：给定一张图像和一个指称表达，模型必须返回所有匹配的目标实例或在没有目标时回避。AgroVG包含来自十个数据集的10,071个注释-图像查询对，涵盖六个目标类别：作物/杂草、水果、小麦头、害虫、植物疾病和树冠。它支持所有六个类别上的边界框 grounding（T1）和具有可靠实例级像素注释的数据源上的实例掩码 grounding（T2），查询涵盖单目标、多目标和无目标场景。AgroVG进一步提供任务特定的协议用于框集匹配和查询级掩码覆盖。对26种模型配置的零样本评估揭示了持续的差距：最好的多目标Set-F1仅达到0.35，最好的正查询掩码成功率在IoU@0.75下仍低于0.17。数据和代码可在https://anonymous.4open.science/r/AgroVG-5172/上获得。

英文摘要

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/.


2605.22031 2026-05-22 cs.CV 版本更新

SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction

SO-Mamba：用于展开MRI重建的态所有权Mamba

Pengcheng Fang, Hongli Chen, Fangfang Tang, Feng Liu, Xiaohao Cai, Shanshan Shan

发表机构 * University of Southampton（南安普顿大学）； University of Queensland（昆士兰大学）； Soochow University（苏州大学）

AI总结本文提出SO-Mamba，一种用于展开MRI重建的态所有权Mamba正则化器，通过分配每个Mamba阶段的重建证据到递归驻留、态接口访问和非态输出校正，以提升重建质量与效率。

详情

AI中文摘要

加速MRI重建需要在大空间区域内恢复缺失细节的同时保持解剖学一致的结构。状态空间模型如Mamba提供高效的长距离建模，使其成为展开重建中的有吸引力的学得正则化器。然而，在数据一致性耦合的展开求解器中，不同阶段操作于不同的重建迭代，其中驻留载体应在不同阶段保持一致的重建内容，而阶段依赖的非驻留证据则与当前更新相关。将这些角色统一处理会将持久驻留载体证据和更新依赖的非驻留证据置于相同的递归内容路由中。为此，我们提出了SO-Mamba，一种态所有权Mamba正则化器，该正则化器将每个Mamba阶段的重建证据分配到递归驻留、态接口访问和非态输出校正。SO-Mamba通过State-Ownership Router (SOR)实现这一所有权规则，构建递归内容的驻留载体，并将非驻留证据路由到B/C态接口的仿射调制和输出校正出口。驻留载体提供Mamba内容路由，而非驻留证据流调整态接口并通过输出出口贡献，而无需进入递归内容路由。我们进一步引入了两级外带泄漏诊断，通过测量选择性扫描状态轨迹中的外带能量和扫描后Mamba读取中的外带能量，将隐藏状态存储与读取表达分开。在五个公开的MRI重建基准上进行的实验表明，SO-Mamba在具有竞争性计算效率的CNN、Transformer和Mamba基线中表现一致提升。

英文摘要

Accelerated MRI reconstruction requires recovering missing details while preserving anatomically coherent structures across large spatial regions. State-space models such as Mamba provide efficient long-range modeling, making them attractive learned regularizers for unrolled reconstruction. However, in a data-consistency-coupled unrolled solver, different stages operate on different reconstruction iterates, where the resident carrier should preserve coherent reconstruction content across stages while stage-dependent non-resident evidence is tied to the current update. Treating these roles uniformly can place persistent resident-carrier evidence and update-dependent non-resident evidence into the same recurrent content route. We therefore propose SO-Mamba, a state-ownership Mamba regularizer that assigns reconstruction evidence within each Mamba stage to recurrent residency, state-interface access, and non-state output correction. SO-Mamba implements this ownership rule with a State-Ownership Router (SOR), which constructs a resident carrier for recurrent content and routes non-resident evidence to affine modulation of the B/C state interfaces and an output correction outlet. The resident carrier supplies the Mamba content route, while the non-resident evidence stream adapts the state interfaces and contributes through the output outlet without entering the recurrent content route. We further introduce a two-level outer-band leakage diagnostic that separates hidden-state storage from readout expression by measuring outer-band energy in the selective-scan state trajectory and the post-scan Mamba readout. Experiments on five public MRI reconstruction benchmarks spanning diverse anatomies, sampling patterns, and coil configurations show that SO-Mamba consistently improves over CNN-, Transformer-, and Mamba-based baselines with competitive computational efficiency.


2605.22017 2026-05-22 cs.CV 版本更新

Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

多样且一致：基于能量的联合细化的上下文引导扩散用于多智能体运动预测

Lei Chu, Yuhuan Zhao

发表机构 * University of Southern California（南加州大学）

AI总结本文提出了一种基于扩散的框架，通过利用历史轨迹中的丰富上下文信息来改进多智能体运动预测，通过引导机制增强预测动作的多样性和表达性，并引入基于能量的公式来细化联合轨迹分布，同时保持个体轨迹的合理性，实验表明该方法在多个基准数据集上均优于现有方法。

Comments MEIS-- CVPR

详情

AI中文摘要

深度生成模型由于其能够捕捉多模态分布和表示多样化的人类行为的能力，已成为人类运动预测的有希望的方法。然而，生成在相互作用代理之间既多样又联合一致的预测仍然具有挑战性。此外，大多数现有方法主要使用单代理（边缘）度量进行评估，这无法充分反映多代理互动的联合动态。我们提出了一种基于扩散的框架，通过利用历史轨迹中的丰富上下文信息来改进多代理运动预测。这种信息通过引导机制进行整合，以增强预测动作的多样性和表达性。为了进一步强制交互一致性，我们引入了基于能量的公式，通过细化联合轨迹分布的同时保持个体轨迹的合理性。在四个基准数据集上的大量实验表明，我们的方法在多个指标上均优于现有方法。值得注意的是，我们的方法在ETH/UCY上显著提高了边缘（ADE/FDE）和联合（JADE/JFDE）度量，与先前的联合预测方法相比，它在保持竞争性联合性能的同时，显著提高了边缘度量。

英文摘要

Deep generative models have become a promising approach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human behaviors. However, generating predictions that are both diverse and jointly consistent among interacting agents remains challenging. In addition, most existing approaches are primarily evaluated using single-agent (marginal) metrics, which fail to fully reflect the joint dynamics of multi-agent interactions. We propose a diffusion-based framework that improves multi-agent motion prediction by leveraging rich contextual information from historical trajectories. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consistency, we introduce an energy-based formulation that refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experiments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. Notably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.
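The difference between marginal and joint metrics can be made concrete with a short sketch. It assumes the commonly used best-of-K definitions, which the abstract does not state explicitly: marginal ADE/FDE pick the best sample separately for each agent, whereas joint JADE/JFDE pick one sample for the whole scene. The trajectories are hypothetical toy data.

```python
import math

def ade(pred, truth):
    """Average displacement error over all timesteps of one trajectory."""
    errors = [math.dist(p, g) for p, g in zip(pred, truth)]
    return sum(errors) / len(errors)

def fde(pred, truth):
    """Final displacement error at the last timestep of one trajectory."""
    return math.dist(pred[-1], truth[-1])

def marginal_metric(samples, truth, metric):
    """Best sample chosen independently per agent, then averaged over agents."""
    n_agents = len(truth)
    best = [min(metric(s[a], truth[a]) for s in samples) for a in range(n_agents)]
    return sum(best) / n_agents

def joint_metric(samples, truth, metric):
    """One sample chosen for the whole scene: minimum over samples of the agent average."""
    n_agents = len(truth)
    return min(sum(metric(s[a], truth[a]) for a in range(n_agents)) / n_agents
               for s in samples)

# samples[k][a] is the trajectory of agent a in sample k (two samples, two agents).
truth = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]
samples = [
    [[(0, 0), (1, 0)], [(0, 2), (1, 2)]],    # exact for agent 0, off by 1 for agent 1
    [[(0, -1), (1, -1)], [(0, 1), (1, 1)]],  # off by 1 for agent 0, exact for agent 1
]
print(marginal_metric(samples, truth, ade), joint_metric(samples, truth, ade))
```

In this toy scene the marginal ADE is zero because each agent has one perfect sample, yet no single sample is right for both agents at once, so the joint ADE stays at 0.5. This is the gap that single-agent metrics fail to expose.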


2605.22015 2026-05-22 cs.CV cs.AR 版本更新

ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

ORBIS: 通过分布感知匹配的输出引导标记减少以加速视频扩散

Hangyeol Lee, Joo-Young Kim

发表机构 * KAIST（韩国科学技术院）

AI总结本文提出ORBIS，一种针对视频扩散Transformer的SW-HW协同设计加速器，通过利用前一时间步的输出激活获得更准确的token相似性，从而提高匹配质量并实现更高的标记减少比例，同时引入分布感知标记匹配算法和专用硬件设计，实现比现有方法更高的标记减少率、更快的速度和更低的能耗。

详情

AI中文摘要

扩散Transformer（DiT）已发展为生成高质量图像和视频的强大模型架构。在视频DiT中，3D空间时间注意力使token长度与帧数成正比，显著增加计算成本。标记减少方法通过利用空间冗余来缓解这一成本，但现有方法依赖于不准确的相似性估计和轻量级匹配算法，导致匹配质量差且仅带来微小的加速效果。为克服这些限制，我们提出了ORBIS，一种为视频DiT设计的SW-HW协同加速器。ORBIS利用前一时间步的输出激活以获得更准确的token间相似性，显著提高匹配质量并实现更高的token减少比例。我们进一步引入了分布感知标记匹配（DATM）算法，该算法捕捉全局token分布并显式最小化token对损失以获得额外收益。为了完全隐藏DATM延迟，我们设计了专用、深度流水线化的硬件并通过量化来最小化其硬件成本，仅占用总面积的2.4%，且精度损失可忽略不计。大量实验表明，ORBIS的token减少比例比最先进的方法AsymRnR高约2倍，相比NVIDIA A100 GPU实现了高达4.5倍的速度提升和79.3%的能耗降低。

英文摘要

Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, an SW-HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.


2605.22013 2026-05-22 cs.CV cs.GR cs.LG 版本更新

PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

PointLLM-R: 通过链式推理增强3D点云推理

Chaoqi Chen, Qile Xu, Wenjun Zhou, Hui Huang

发表机构 * Visual Computing Research Center (VCC), College of Computer Science（视觉计算研究中心（VCC），计算机科学学院）； Software Engineering (CSSE) Shenzhen University China（软件工程（CSSE）深圳大学中国）； VCC, CSSE Shenzhen University China（VCC，CSSE 深圳大学中国）； Shenzhen University（深圳大学）

AI总结本文提出了一种数据驱动的框架，用于构建大规模链式推理监督，以改进3D点云理解。通过两阶段流程优化点文本指令数据，并合成高质量推理路径，构建了包含55K样本的PoCoTI数据集，训练PointLLM-R实现3D多模态语言模型的推理能力，实验表明其在生成3D分类和描述任务中达到最先进的性能。

详情

AI中文摘要

通过语言理解3D点云仍然是计算机图形学和视觉计算中的基本挑战，由于点云数据的不规则结构和现有3D多模态模型中缺乏显式推理。尽管链式推理（CoT）在LLM和基于图像的MLLM中表现出强大的有效性，但其在3D理解中的扩展仍鲜有探索。本文提出了一种数据驱动的框架，用于构建大规模CoT监督，专门针对3D点云理解。我们的框架由一个两阶段流程组成，首先通过基于视觉语言模型的质量评估和参考引导细化点文本指令数据，然后通过人机协同提示优化（HiLPO）合成高质量的推理路径。使用这种方法，我们构建了PoCoTI，一个包含55K样本的CoT增强点文本指令遵循数据集。在PoCoTI上微调PointLLM，得到PointLLM-R，一个具备推理能力的3D多模态语言模型。在生成3D分类和描述任务上的大量实验表明，PointLLM-R在生成3D分类和描述任务中达到了最先进的性能，并且能够稳健地推广到现实世界扫描点云和多轮对话场景中。

英文摘要

Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.


2605.22012 2026-05-22 cs.CL cs.CV (version update)

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

Affiliations: School of AI, Shanghai Jiao Tong University; Kling Team, Kuaishou Technology; Peking University; HKUST; CASIA; Nanjing University; Renmin University of China; Tsinghua University

AI summary: This paper proposes LatentOmni, a framework that performs omni-modal reasoning in a unified audio-visual latent space. It uses feature-level supervision and Omni-Sync Position Embedding (OSPE) to keep latent audio and visual states temporally consistent, and achieves the best performance among the evaluated open-source models on multiple audio-visual reasoning benchmarks.

Comments: 21 pages, 15 figures

Abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.

2605.22011 2026-05-22 cs.CV (version update)

Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

Hangyeol Lee, Hyojeong Lee, Joo-Young Kim

Affiliations: KAIST

AI summary: This paper proposes DiTo, an output-centric token reduction method that uses output-token similarity across adjacent timesteps to establish token correspondences, reducing computational cost while improving generation quality.

Abstract

Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.
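Illustrative sketch (not from the paper): the abstract names a selection-frequency penalty in token matching but does not define it. The code below shows one possible reading, in which a destination token becomes less attractive each time it has already been selected, so matches spread out instead of piling onto one token. The function name `frequency_aware_match`, the greedy order, and the `penalty` value are all assumptions made for illustration.

```python
def frequency_aware_match(similarity, penalty=0.1):
    """Greedily match each source token to a destination token.

    similarity[i][j] is the similarity between source token i and
    destination token j. The score of destination j is reduced by
    `penalty` for every earlier source token already matched to it.
    """
    counts = [0] * len(similarity[0])  # how often each destination was chosen
    matches = []
    for row in similarity:
        best = max(range(len(row)), key=lambda j: row[j] - penalty * counts[j])
        counts[best] += 1
        matches.append(best)
    return matches


# Three source tokens that all slightly prefer destination 0.
sim = [[0.9, 0.85], [0.9, 0.85], [0.9, 0.85]]
plain = frequency_aware_match(sim, penalty=0.0)
spread = frequency_aware_match(sim, penalty=0.1)
```

With no penalty every source token collapses onto the same destination; with the penalty the second source token is redirected, which is the kind of localized over-reuse the abstract says the penalty is meant to limit.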

2605.22002 2026-05-22 cs.CV (version update)

ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation

Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas

Affiliations: Institute of Mathematics, Statistics and Scientific Computing, Department of Applied Mathematics, University of Campinas

AI summary: This paper proposes ConvNeXt-FD, a fractal-based deep learning model for robust biomedical image segmentation. It combines the Dice coefficient with a boundary-aware regularization term to make the model more sensitive to object boundaries and shape fidelity.

Abstract

Biomedical image segmentation is a critical task in medical diagnosis and treatment planning, enabling precise delineation of anatomical structures and pathological regions. Despite significant advancements, challenges persist due to the inherent variability, noise, and complex morphology present in diverse medical imaging modalities. This paper introduces ConvNeXt-FD, a novel deep learning architecture for robust biomedical image segmentation, built upon a U-Net-like encoder-decoder framework leveraging the powerful ConvNeXt backbone. Our approach integrates a hybrid loss function combining the Dice coefficient with a boundary-aware regularization term inspired by a differentiable formulation of Fractal Dimension, designed to enhance the model's sensitivity to object boundaries and shape fidelity. We rigorously evaluate ConvNeXt-FD across six distinct biomedical datasets: BUSI (Breast Ultrasound Images), DDTI (Thyroid Ultrasound Images), FluoCells (Fluorescent Cell Images), IDRiD (Diabetic Retinopathy Images for Optic Disc Segmentation), ISIC2018 (Skin Lesion Images), and MoNuSeg (Nuclei Segmentation). Experimental results demonstrate that ConvNeXt-FD, particularly when initialized with ImageNet pre-trained weights, achieves competitive and often superior performance compared to existing state-of-the-art methods across various metrics, including Dice, Jaccard, Accuracy, Sensitivity, Specificity, and False Positive Rate. The integration of ConvNeXt as a strong encoder, coupled with the boundary-aware regularization, proves effective in capturing both high-level semantic features and fine-grained boundary details, leading to more accurate and reliable segmentations in challenging biomedical contexts.
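Illustrative sketch (not from the paper): the hybrid loss pairs a Dice term with a regularizer inspired by Fractal Dimension, but the abstract does not give the differentiable formulation. The code below therefore shows only the two standard building blocks under stated assumptions: `soft_dice_loss` is the usual soft Dice loss, and `box_counting_dimension` is the classical, non-differentiable box-counting estimate, included to convey what a fractal-dimension term measures. Both function names, the `eps` constant, and the box sizes are hypothetical.

```python
import math


def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for two equally sized 2D masks with values in [0, 1]."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def box_counting_dimension(mask, sizes=(1, 2, 4, 8)):
    """Estimate fractal dimension as the slope of log N(s) against log(1/s),
    where N(s) is the number of s-by-s boxes containing foreground."""
    h, w = len(mask), len(mask[0])
    xs, ys = [], []
    for s in sizes:
        count = 0
        for i in range(0, h, s):
            for j in range(0, w, s):
                rows = range(i, min(i + s, h))
                cols = range(j, min(j + s, w))
                if any(mask[a][b] for a in rows for b in cols):
                    count += 1
        if count > 0:
            xs.append(math.log(1.0 / s))
            ys.append(math.log(count))
    if len(xs) < 2:
        raise ValueError("need foreground pixels and at least two box sizes")
    # Ordinary least-squares slope.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


filled = [[1] * 8 for _ in range(8)]            # a solid 8x8 region
line = [[1] * 8] + [[0] * 8 for _ in range(7)]  # a one-pixel-wide line
empty = [[0] * 8 for _ in range(8)]
```

A solid region has box-counting dimension 2 and a straight line has dimension 1, so the estimate tracks how boundary-like or area-like a mask is; a training loss would need a differentiable surrogate of this quantity, which the abstract does not describe.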

2605.22000 2026-05-22 cs.CV cs.AI (version update)

Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography

Anthony Song, Boyan Zhou, Mayank Golhar, Marisa Morakis, Alex Baras, Nicholas Durr

Affiliations: Department of Biomedical Engineering, Johns Hopkins University; Department of Pathology, Johns Hopkins Hospital

AI summary: This paper introduces HistoBIT3D, the first voxel-wise paired dataset of BIT volumes and fluorescence-labeled nuclei, which enables quantitative evaluation of structural preservation in unsupervised virtual staining. Using this dataset, the authors propose a virtual staining framework that uses bidirectional multiscale content consistency and cross-domain style reuse to translate BIT volumes with shift-variant contrast into realistic H&E volumes, improving 3D nuclei segmentation accuracy and boundary preservation.

Abstract

Three-dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in-vivo assessment. Back-illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non-destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable H&E images remains challenging, particularly due to shift-variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel-wise paired BIT and fluorescence-labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground-truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift-variant contrast into realistic H&E volumes by leveraging bidirectional multiscale content consistency and cross-domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state-of-the-art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero-shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual H&E staining, advancing the paradigm of slide-free, volumetric computational histopathology. Our data and code are available at: https://github.com/aasong113/HistoBIT3D_VirtualStaining.

2605.21988 2026-05-22 cs.CV cs.AI (version update)

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

Affiliations: Hong Kong University of Science and Technology; Tencent

AI summary: This paper proposes CRPO, a counterfactual reinforcement learning method that makes Video LLMs more sensitive to spatiotemporal dynamics. By constructing counterfactual videos and introducing a Counterfactual Relation Reward, it suppresses shortcut policies that rely on static cues and improves spatiotemporal sensitivity on the DyBench benchmark.

Comments: Project website: https://ddz16.github.io/crpo.github.io/

Abstract

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/.
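Illustrative sketch (not from the paper): the abstract describes counterfactual videos built by horizontal flips and temporal reversals, a reward on the relation between the two branches' answers, and a strict pair-accuracy metric. The code below is a toy rendering of those three ideas. A video is modeled as a list of frames and a frame as a list of pixel rows; the binary reward values and the "both answers must be correct" reading of pair accuracy are assumptions, and all function names are hypothetical.

```python
def horizontal_flip(video):
    """Mirror every frame left-to-right."""
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video):
    """Play the frames in reverse order."""
    return video[::-1]


def counterfactual_relation_reward(answer_orig, answer_cf, is_dynamic):
    """Assumed binary reward: 1.0 when the answers differ for a dynamic
    question or agree for a static one, else 0.0."""
    changed = answer_orig != answer_cf
    return 1.0 if changed == is_dynamic else 0.0


def pair_accuracy(pairs):
    """Strict pair accuracy: a pair scores only if both predictions are right.

    Each item is (pred_orig, gold_orig, pred_cf, gold_cf).
    """
    if not pairs:
        return 0.0
    hits = sum(1 for po, go, pc, gc in pairs if po == go and pc == gc)
    return hits / len(pairs)


video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # two 2x2 frames
# A policy that always answers "left" gets one answer of each pair right ...
fixed_answer_pairs = [("left", "left", "left", "right"),
                      ("left", "right", "left", "left")]
# ... yet earns no credit under the strict pair metric.
fixed_answer_score = pair_accuracy(fixed_answer_pairs)
```

Under per-question accuracy the fixed-answer policy above would score 50%; under the pair metric it scores zero, which is the inflation the abstract says the metric is designed to prevent.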

2605.21981 2026-05-22 cs.CV (version update)

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Le Zhang, Ning Mang, Aishwarya Agrawal

Affiliations: Mila – Québec AI Institute, UdeM; Utrecht University; Canada CIFAR AI Chair

AI summary: This study examines how well vanilla Diffusion Transformers generate images in representation space. It finds that a pretrained representation space is more favorable for flow-matching learning, and the resulting model outperforms DiT-DH-XL on ImageNet.

Abstract

Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d} \approx 33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet $256 \times 256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs. 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
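Illustrative sketch (not from the paper): with $x$-prediction the network outputs an estimate of the clean data point, and the sampler still needs a velocity to integrate the ODE. Assuming the linear path z_t = (1 - t) * noise + t * x, with t running from 0 (noise) to 1 (data), the velocity implied by an estimate x_hat is (x_hat - z_t) / (1 - t). The path convention, the function names, and the use of Euler steps instead of the Heun steps mentioned in the abstract are all assumptions made to keep the example short.

```python
def velocity_from_x_prediction(z_t, x_hat, t):
    """Velocity implied by a clean-data estimate on the assumed linear path
    z_t = (1 - t) * noise + t * x."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(xh - z) / (1.0 - t) for xh, z in zip(x_hat, z_t)]


def euler_sample(predict_x, noise, num_steps):
    """Integrate dz/dt = v from t = 0 (noise) to t = 1 (data) with Euler steps.

    predict_x(z, t) stands in for the trained network and returns an
    estimate of the clean data point.
    """
    z = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_from_x_prediction(z, predict_x(z, t), t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


target = [1.0, -2.0, 0.5]


def oracle(z, t):
    """A perfect predictor, used only to check the sampler."""
    return target


sample = euler_sample(oracle, noise=[0.3, 0.1, -0.7], num_steps=5)
```

With a perfect predictor the implied velocity is constant along the path, so even a few coarse steps land on the target; this is only a sanity check of the conversion, not a claim about how the paper's sampler is implemented.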

2605.21980 2026-05-22 cs.CV cs.AI (version update)

Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

Chengsheng Zhang, Chenghao Sun, Zhining Xie, Xinmei Tian

Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science; AIPD, Tencent

AI summary: This paper proposes a steering-vector-based causal attribution framework for descriptive emotional reasoning. Using a purpose-built dataset, it reveals the emotional circuits behind a three-stage "Adapt-Aggregate-Execute" mechanism: visual emotional cues are aggregated in middle layers by sentiment-specific attention heads and then translated into narrative generation in deep layers through emotion-general pathways. Regulating the emotional information routing to strengthen attention flow and amplify semantic activation improves performance and mitigates emotional hallucinations.

Comments: Accepted by ICML 2026

Abstract

Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage "Adapt-Aggregate-Execute" mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.

2605.21977 2026-05-22 cs.CV cs.AI (version update)

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

Zhengcen Li, Chenyang Jiang, Liangxu Su, Tong Shao, Shiyang Zhou, Ming Tao, Jingyong Su

Affiliations: Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; Shenzhen Loop Area Institute

AI summary: To address the cross-modal gap in AI-generated content detection, this study proposes VINA, a framework that trains jointly on image and video data, uses video frames as natural augmentations, and introduces a cross-modal supervised contrastive objective, yielding unified AI-generated content detection with improved robustness and transferability.

Abstract

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.

2605.21973 2026-05-22 cs.CV (version update)

Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

Zelin Zheng, Xinyan Liu, Ruixin Li, Antoni B. Chan, Guorong Li, Qingming Huang, Laiyun Qing

Affiliations: Qwen3-VL-8B-Instruct

AI summary: This paper proposes F2G, a video temporal grounding framework that reformulates grounding as a verifiable Identify-then-Measure problem and combines Predictive Temporal Perception with Evidence-Driven Reasoning to improve grounding accuracy and robustness.

Comments: Accepted by ICML 2026

Abstract

Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.

2605.21970 2026-05-22 eess.IV cs.CV (version update)

Entropy-Guided Self-Supervised Learning for Medical Image Classification

Joao Florindo, Viviane Moura

Affiliations: Institute of Mathematics, Statistics and Scientific Computing; Department of Applied Mathematics, University of Campinas

AI summary: This paper proposes a deep learning framework that combines self-supervised learning and transfer learning, using an entropy-guided Masked Autoencoder and an ImageNet-pretrained model to improve the performance and robustness of medical image classification.

Abstract

Accurate and robust medical image classification is paramount for early disease diagnosis and treatment planning. However, challenges such as limited annotated data, high intra-class variability, and subtle inter-class differences often hinder the performance of deep learning models. This paper introduces a synergistic deep learning framework that leverages the strengths of self-supervised learning and transfer learning for enhanced medical image classification. Our approach employs two distinct ConvNeXt-Tiny models: one pre-trained on a large-scale natural image dataset (ImageNet) and another pre-trained using an entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both models are then fine-tuned on specific medical image classification tasks. A final ensemble strategy, based on averaging predicted probabilities, is utilized to combine the complementary insights from these two models. Rigorous experimental validation across four diverse medical imaging datasets (Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC) 2018, Kvasir, and COVID) demonstrates the superior performance and robustness of our ensemble approach. The MAE pre-training significantly improves feature learning on domain-specific data, while the ImageNet pre-training provides strong generalizable features. The ensemble consistently achieves state-of-the-art results, outperforming individual models and existing methods, highlighting the efficacy of combining diverse pre-training strategies for challenging medical image analysis.
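Illustrative sketch (not from the paper): the final ensemble step averages the class probabilities predicted by the two fine-tuned models. The code below shows that averaging on toy numbers. The function names, the two-class setup, and the probability values are made up for illustration and are not results from the paper.

```python
def average_ensemble(prob_sets):
    """Average class-probability vectors across models, sample by sample.

    prob_sets[m][n] is the probability vector that model m predicts
    for sample n.
    """
    n_models = len(prob_sets)
    return [
        [sum(class_probs) / n_models for class_probs in zip(*per_sample)]
        for per_sample in zip(*prob_sets)
    ]


def predict(probabilities):
    """Pick the most probable class index for every sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]


# Hypothetical outputs of the two branches for two samples, two classes.
imagenet_branch = [[0.6, 0.4], [0.2, 0.8]]
mae_branch = [[0.2, 0.8], [0.4, 0.6]]

averaged = average_ensemble([imagenet_branch, mae_branch])
labels = predict(averaged)
```

On the first sample the two branches disagree, and the averaged probabilities follow the more confident one; this is how averaging can combine complementary predictions without any extra trained parameters.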

2605.21957 2026-05-22 cs.CV (version update)

Bounding-Box Trajectories Matter for Video Anomaly Detection

Inpyo Song, Jangwon Lee

Affiliations: Sungkyunkwan University

AI summary: This paper proposes TrajVAD, a framework that models multi-class bounding-box trajectories to learn normal kinematic patterns. Using bounding-box trajectories as the primary anomaly cue, it outperforms the compared pose-based methods on ShanghaiTech.

Comments: 17 pages, 3 figures

Abstract

Video anomaly detection is critical for public safety and security, yet remains highly challenging despite extensive research due to large variations in appearance, viewpoint, and scene dynamics. Among existing approaches, human pose-based methods have emerged as a major line of research, showing strong performance since many anomalies in public datasets involve humans and pose representations are robust to appearance changes while providing compact motion descriptions. However, these methods often overlook bounding-box trajectories, although such information is inherently available in pose-based pipelines. In this paper, we explicitly leverage these trajectories as a primary anomaly cue. We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech, highlighting bounding-box trajectories as an effective yet underexplored modality for video anomaly detection.

2605.21954 2026-05-22 cs.CV cs.AI (version update)

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Dazhao Du, Liao Duan, Jian Liu, Tao Han, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

Affiliations: Hong Kong University of Science and Technology; Xi’an Jiaotong University; Tencent

AI summary: This paper studies the gap between perception and generation in video temporal grounding with multimodal large language models (MLLMs). It proposes an inference-time read-then-regenerate framework that uses attention cues to improve grounding accuracy, improving MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three video temporal grounding benchmarks.

Comments: Project Website: https://ddz16.github.io/mllmsknowwhen.github.io/

Abstract

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates and architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.
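Illustrative sketch (not from the paper): the "read" step turns prefill attention into a debiased frame-level relevance signal and extracts the interval it highlights. The abstract does not say how debiasing or interval extraction are done, so the code below assumes one simple option for each: subtract a query-independent baseline per frame, then grow an interval outward from the peak frame while scores stay above a fraction of the peak. The function names, the baseline, and the `ratio` parameter are hypothetical.

```python
def debias(attention, baseline):
    """Subtract a query-independent baseline per frame and clip at zero
    (an assumed form of debiasing)."""
    return [max(a - b, 0.0) for a, b in zip(attention, baseline)]


def high_attention_interval(relevance, ratio=0.5):
    """Return inclusive (start, end) frame indices of the contiguous run
    around the peak whose scores stay at or above ratio * peak."""
    if not relevance:
        raise ValueError("relevance must not be empty")
    peak = max(range(len(relevance)), key=relevance.__getitem__)
    threshold = ratio * relevance[peak]
    start = end = peak
    while start > 0 and relevance[start - 1] >= threshold:
        start -= 1
    while end < len(relevance) - 1 and relevance[end + 1] >= threshold:
        end += 1
    return start, end


# Per-frame attention from the query; the last frame is visually salient
# regardless of the query, which the baseline captures.
attention = [0.2, 0.2, 0.3, 0.9, 1.0, 0.8, 0.2, 0.6]
baseline = [0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

relevance = debias(attention, baseline)
interval = high_attention_interval(relevance)
```

The returned frame range is what a second pass would keep, by cropping the video or masking attention outside it, before asking the model to answer again.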

2605.21931 2026-05-22 cs.CV (version update)

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Shiqi Huang, Ziyue Wang, Zhongrong Zuo, Han Qiu, Qi She, Bihan Wen

Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University; ByteDance

AI summary: This paper proposes EvoVid, a temporal-centric self-evolving framework that lets Video-LLMs improve directly from unannotated videos. With two complementary temporal-centric rewards, a temporal-aware Questioner reward and a temporal-grounded Solver reward, EvoVid improves over both the base models and existing self-evolving baselines across four base models and six benchmarks, showing that temporal-centric self-evolution is effective for video understanding and reasoning.

Comments: Project page: https://huangshiqi128.github.io/EvoVid.io/

Abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.

2605.21924 2026-05-22 cs.CV version update

Visual-Advantage On-Policy Distillation for Vision-Language Models

Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, Shu Wu

Affiliations * Institute of Automation, CAS; School of Advanced Interdisciplinary Sciences, UCAS; Hello Group Inc.; Sun Yat-sen University

AI summary: This paper proposes a visual-advantage on-policy distillation method that strengthens a vision-language model's reliance on visual input. It introduces a visual advantage metric to separate vision-critical tokens from language tokens, which improves distillation.

Abstract

On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens. We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
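The two uses of visual advantage described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names, the fixed `threshold` used to split high-VA from low-VA tokens, and the equal-weight sum of the two group means are all assumptions made here for illustration, and the log-probabilities and KL values are made-up numbers.

```python
from statistics import mean


def visual_advantage(logp_with_detail, logp_without_detail):
    """Per-token VA: teacher log-prob with fine-grained visual detail minus without."""
    return [w - wo for w, wo in zip(logp_with_detail, logp_without_detail)]


def rollout_weight(va):
    """Rollout-level weight: the trajectory-averaged VA (normalization is left out)."""
    return mean(va)


def grouped_token_kl(token_kl, va, threshold):
    """Average per-token KL inside the high-VA and low-VA groups separately.

    Summing the two group means (an assumption) keeps a handful of high-VA
    tokens from being diluted by the many low-VA language tokens.
    """
    high = [kl for kl, v in zip(token_kl, va) if v >= threshold]
    low = [kl for kl, v in zip(token_kl, va) if v < threshold]
    return sum(mean(group) for group in (high, low) if group)


# Toy rollout of four tokens; only the second token depends on visual detail.
logp_with = [-0.1, -0.2, -0.5, -0.3]
logp_without = [-0.1, -2.2, -0.5, -0.4]
va = visual_advantage(logp_with, logp_without)

token_kl = [0.2, 1.0, 0.4, 0.6]
grouped = grouped_token_kl(token_kl, va, threshold=1.0)
plain = mean(token_kl)
print(grouped, plain)
```

In this toy rollout the single high-VA token contributes its full KL to the grouped objective, whereas in the plain token average it is weighted by only one quarter.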

2605.21919 2026-05-22 cs.CV cs.AI version update

SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

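Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu

Affiliations * Nanyang Technological University

AI summary: This paper proposes SDGBiasBench, a large-scale benchmark suite for evaluating and mitigating vision-language model biases in the Sustainable Development Goals. It analyzes decision-level and estimation-level bias and proposes the CADE method to reduce bias and improve model accuracy and reliability.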

Abstract

Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision-Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG-specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of fairer and more reliable AI systems for sustainable development.

2605.21917 2026-05-22 cs.CV cs.AI version update

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

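Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali

Affiliations * NVIDIA

AI summary: This paper proposes MAVEN, a multi-stage agentic annotation pipeline that generates multi-task training data with chain-of-thought reasoning traces for video event reasoning. Its core is a multi-scale spatio-temporal event description. The pipeline supports agent-driven domain adaptation, improves data quality through a hierarchical refinement loop, and is validated on multiple datasets.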

Comments CVPR 2026 Workshop

Abstract

Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a +38.8-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by +10.7 MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.

2605.21913 2026-05-22 cs.CV version update

Multi-scale interaction network for stereo image super-resolution

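Liyi Xu, Lin Qi

Affiliations * Ocean University of China

AI summary: This paper proposes a multi-scale interaction network for stereo image super-resolution that improves intra-view feature extraction and inter-view matching accuracy, yielding better super-resolution results.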

2605.21907 2026-05-22 cs.CV version update

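MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Reference Resolution in 3D Dialogue

Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow

Affiliations * KTH Royal Institute of Technology

AI summary: This paper proposes a multimodal dataset and benchmark for context-aware reference resolution in dynamic 3D environments. It introduces a benchmark built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, together with a two-stage reference resolution pipeline that improves reference resolution in dialogue.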

Comments Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis

DOI: 10.63317/37fzwjphsb9y
Journal ref: Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), Palma de Mallorca, Spain

Abstract

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.

2605.21788 2026-05-22 cs.CV cs.RO version update

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

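Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

Affiliations * University of Colorado Boulder

AI summary: This paper proposes SceneGraphGrounder, a framework that recasts 3D grounding as structured graph matching over a scene graph. A visual marker prompting strategy infers object-object relations from 2D views, and these relations are encoded persistently in a 3D scene graph. The method is competitive with existing zero-shot approaches on the ScanRefer benchmark, and a real-robot deployment validates its robust spatial reasoning in long-horizon physical environments.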

Abstract

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.
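As a rough illustration of grounding by graph matching, the sketch below aligns a small query graph with a scene graph by brute force. It is not the paper's algorithm: the data layout (label dictionaries plus relation triples), the function name `match_query`, and the exhaustive search are assumptions chosen to keep the example short, and the toy scene is invented.

```python
from itertools import permutations


def match_query(query_nodes, query_edges, scene_nodes, scene_edges):
    """Return every assignment of query variables to scene object ids that
    respects both the object labels and the required relations.

    query_nodes: {variable: label}, query_edges: {(variable, relation, variable)}
    scene_nodes: {object_id: label}, scene_edges: {(object_id, relation, object_id)}
    """
    variables = list(query_nodes)
    matches = []
    for ids in permutations(scene_nodes, len(variables)):
        assignment = dict(zip(variables, ids))
        # Label constraint: each variable must map to an object of the right class.
        if any(scene_nodes[assignment[v]] != query_nodes[v] for v in variables):
            continue
        # Relation constraint: every query edge must exist in the scene graph.
        if all((assignment[a], rel, assignment[b]) in scene_edges
               for a, rel, b in query_edges):
            matches.append(assignment)
    return matches


# Toy scene: two chairs, one table, one lamp.
scene_nodes = {1: "chair", 2: "chair", 3: "table", 4: "lamp"}
scene_edges = {(1, "next to", 3), (4, "on", 3)}

# Query: "the chair next to the table".
query_nodes = {"target": "chair", "anchor": "table"}
query_edges = {("target", "next to", "anchor")}

result = match_query(query_nodes, query_edges, scene_nodes, scene_edges)
print(result)
```

Because the relation triples are explicit, the match itself shows why chair 1 rather than chair 2 was selected, which is the kind of interpretability the abstract attributes to structured matching. Exhaustive search is only practical for tiny graphs.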

2605.21766 2026-05-22 cs.CV cs.GR version update

BodyReLux: Temporally Consistent Full-Body Video Relighting

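Li Ma, Mingming He, Xueming Yu, David M. George, Ahmet Levent Taşel, Paul Debevec, Julien Philip

Affiliations * Eyeline Labs

AI summary: This paper proposes BodyReLux, a video diffusion-based framework for temporally consistent relighting of full-body human performances. It is trained on a hybrid dataset that combines traditional static one-light-at-a-time capture with a novel dynamic performance capture, and it introduces a new lighting conditioning representation and a data augmentation pipeline to achieve high-quality, robust, temporally consistent video relighting.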

Comments Siggraph 2026 Journal Track. Project page: https://eyeline-labs.github.io/bodyrelux/

DOI: 10.1145/3811352

Abstract

Being able to relight human performance is a fundamental task for post production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances and viewpoints. To acquire such dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.

2605.21747 2026-05-22 cs.CV cs.RO version update

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

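Steven Chen, Shivesh Khaitan, Nemanja Djuric

Affiliations * Aurora Innovation, Inc.

AI summary: This paper proposes using vision-language models to infer vehicle information and thereby improve the accuracy of 3D vehicle labeling in self-driving. Zero-shot inference of vehicle information, combined with vehicle make and model recognition, improves labeling efficiency and quality.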

Comments To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation

Abstract

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

2605.21728 2026-05-22 cs.CV cs.CL cs.LG version update

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

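Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

Affiliations * Instituto Superior Técnico; INESC-ID; Instituto de Telecomunicações

AI summary: This paper proposes BEiTScore, a reference-free image captioning evaluation method. An efficient cross-encoder model addresses the computational cost and limited sensitivity of existing metrics. The paper introduces a new metric and validates its strong performance across diverse scenarios.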

Abstract

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

2605.21714 2026-05-22 cs.CV cs.RO version update

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

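Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan

Affiliations * Meta Reality Labs

AI summary: This paper proposes AVI-HT, an adaptive vision-IMU fusion method that tracks 3D hand pose by jointly modeling egocentric images and on-glove 6-DoF IMU signals. Its core components are synchronized multi-modal training data pairing and a cross-sensor deep attention mechanism. The main contribution is improved accuracy and availability in hand-object interaction scenarios.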

Abstract

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.

2605.21671 2026-05-22 eess.IV cs.CV version update

HyperBench: Standardizing and Scaling Synthetic Evaluation for Hyperspectral Super-Resolution

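Ritik Shah, Marco F. Duarte

Affiliations * University of Massachusetts Amherst

AI summary: This paper proposes HyperBench, a framework that standardizes and scales synthetic experiments for evaluating hyperspectral super-resolution methods. It addresses the inconsistent configurations and the hard-to-compare, hard-to-reproduce results of existing evaluations.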

Abstract

Hyperspectral super-resolution (HSR) reconstructs a high-spatial-resolution hyperspectral image by fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI). In the absence of real-world paired data, HSR methods are evaluated almost exclusively on synthetic experiments derived from hyperspectral datasets through Wald's protocol. Despite the protocol's widespread adoption, its practical implementation varies markedly across research works, typically relying on a single (usually Gaussian) or very few point spread functions (PSFs), one or two spectral response functions (SRFs), and a couple of spatial downsampling factors. As a result, reported performance figures are difficult to compare across the literature, in addition to being often difficult to reproduce; furthermore, they may not generalize across realistic sensing conditions. We introduce HyperBench, a unified and extensible framework that standardizes synthetic experimentation for HSR. HyperBench supports diverse degradation configurations spanning ten PSFs, four SRFs derived from operational multispectral sensors, configurable spatial downsampling factors, and matched additive white Gaussian noise; its goal is to automate large-scale evaluation and structured logging. By decoupling model development from experimental design, the framework enables reproducible, apples-to-apples cross-method comparison with minimal friction. We use HyperBench to evaluate six recently proposed HSR methods across a 70-configuration sweep on four widely used hyperspectral scenes and observe that the inter-method PSNR spread widens from approximately 5 dB on the easiest PSF to over 13 dB on the hardest - a fragility that is structurally invisible to the prevailing single-configuration evaluation protocol. HyperBench code is available at https://github.com/ritikgshah/HyperBench .
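The synthetic degradation setup the abstract describes (Wald's protocol with a PSF, an SRF, a downsampling factor, and additive white Gaussian noise) could be sketched roughly as follows. This is a minimal pure-Python illustration, not HyperBench's code: the box-shaped PSF, the function names, the nested-list cube layout `[band][row][col]`, and the use of the same SNR for both outputs are assumptions made here.

```python
import math
import random


def spatial_degrade(band, factor):
    """Box-PSF blur plus decimation: average each factor x factor block."""
    h, w = len(band), len(band[0])
    return [[sum(band[r + i][c + j] for i in range(factor) for j in range(factor))
             / factor ** 2
             for c in range(0, w - factor + 1, factor)]
            for r in range(0, h - factor + 1, factor)]


def spectral_degrade(cube, srf):
    """Apply an SRF matrix (rows: multispectral bands, columns: hyperspectral bands)."""
    h, w = len(cube[0]), len(cube[0][0])
    return [[[sum(wt * cube[b][r][c] for b, wt in enumerate(row)) for c in range(w)]
             for r in range(h)]
            for row in srf]


def add_awgn(cube, snr_db, rng):
    """Add white Gaussian noise whose power is set by the target SNR in dB."""
    values = [v for band in cube for row in band for v in row]
    signal_power = sum(v * v for v in values) / len(values)
    sigma = math.sqrt(signal_power / 10 ** (snr_db / 10))
    return [[[v + rng.gauss(0.0, sigma) for v in row] for row in band] for band in cube]


def wald_pair(hr_hsi, srf, factor, snr_db, seed=0):
    """Simulate the (LR-HSI, HR-MSI) observation pair from a reference HR-HSI."""
    rng = random.Random(seed)
    lr_hsi = [spatial_degrade(band, factor) for band in hr_hsi]
    hr_msi = spectral_degrade(hr_hsi, srf)
    return add_awgn(lr_hsi, snr_db, rng), add_awgn(hr_msi, snr_db, rng)


# Toy reference cube: 4 spectral bands of 4 x 4 pixels, band b filled with b + 1.
hr_hsi = [[[float(b + 1)] * 4 for _ in range(4)] for b in range(4)]
srf = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
lr_hsi, hr_msi = wald_pair(hr_hsi, srf, factor=2, snr_db=30)
print(len(lr_hsi), len(lr_hsi[0]), len(hr_msi), len(hr_msi[0]))
```

A sweep in the spirit of the abstract would call `wald_pair` once per combination of PSF, SRF, downsampling factor, and noise level, then score each method's reconstruction against `hr_hsi`.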

2605.21669 2026-05-22 cs.CV cs.AI version update

MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast

Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim

Affiliations * Department of Bioengineering, University of Pittsburgh; School of Medicine, University of Pittsburgh; Department of Radiology, University of Pittsburgh; Department of Psychiatry, University of Pittsburgh

AI summary: This study proposes MRecover, a conditional generative model that uses AI-generated contrast to recover motion-corrupted MR images. Autoregressive slice conditioning provides volumetric consistency, improving the accuracy and generalization of hippocampal subfield segmentation.

Abstract

Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, using autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from the synthesized and the as-acquired images closely matched (n=416, r=0.87-0.97), and the synthesized images yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). Owing to the larger sample size, the synthesized images also achieved larger effect sizes for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus ε² = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: https://jinghangli98.github.io/MRecover/
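The reported 31.8% figure follows directly from the two subject counts in the abstract (593 vs 450). A short check, with `relative_gain` being a helper name chosen here:

```python
def relative_gain(after, before):
    """Relative increase of `after` over `before`, as a fraction."""
    return (after - before) / before


gain = relative_gain(593, 450)  # 143 additional subjects over a base of 450
print(f"{gain:.1%}")  # prints 31.8%
```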

2605.21661 2026-05-22 cs.LG cs.AI cs.CV version update

Hierarchical Variational Policies for Reward-Guided Diffusion

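Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt

Affiliations * Department of Computer Science; University of California Irvine

AI summary: This paper proposes a hierarchical variational framework that amortizes control into a lightweight yet expressive stochastic policy, generating high-quality reward-aligned samples at reduced inference cost. On 4x super-resolution, it achieves better perceptual quality with more than 5x faster inference than the best-performing baseline.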

Abstract

Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality-speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

2605.21642 2026-05-22 cs.CV version update

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

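Tianyi Zhang, Mahtab Bigverdi, Ranjay Krishna

Affiliations * University of Washington

AI summary: This paper proposes Ablate-to-Validate, a diagnostic principle instantiated as the Token Replacement Test (TRT), which tests whether vision-language models genuinely use the content of continuous tokens. It finds that performance gains may stem from the mere presence of the tokens rather than from their content.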

Abstract

Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Despite improved task accuracy, this alone does not show that models actually use these tokens for reasoning -- gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
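The content-replacement ablations named in the abstract (zero, random, first-repeat, oracle) can be sketched as a single helper that swaps token content while keeping the token count fixed. This is an illustrative guess at the mechanics rather than the authors' code: `replace_tokens` is a name chosen here, "random" is assumed to mean standard Gaussian noise, and "first-repeat" is assumed to mean repeating the first latent token.

```python
import random


def replace_tokens(latent_tokens, mode, oracle_tokens=None, seed=0):
    """Return a same-length sequence of latent tokens with replaced content.

    latent_tokens: list of equal-length float vectors produced by the model.
    The token budget (count and dimension) is preserved in every mode, so any
    accuracy change can be attributed to content rather than presence.
    """
    n, dim = len(latent_tokens), len(latent_tokens[0])
    if mode == "original":
        return [list(t) for t in latent_tokens]
    if mode == "zero":
        return [[0.0] * dim for _ in range(n)]
    if mode == "random":
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    if mode == "first_repeat":
        return [list(latent_tokens[0]) for _ in range(n)]
    if mode == "oracle":
        if oracle_tokens is None or len(oracle_tokens) != n:
            raise ValueError("oracle mode needs one ground-truth token per slot")
        return [list(t) for t in oracle_tokens]
    raise ValueError(f"unknown mode: {mode}")


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for mode in ("original", "zero", "random", "first_repeat"):
    print(mode, replace_tokens(tokens, mode))
```

Under this reading, a method passes the test only if accuracy drops under the zero, random, and first-repeat replacements and ideally rises with the oracle one. If accuracy is unchanged, the gain is more plausibly explained by token presence.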

2605.21633 2026-05-22 eess.IV cs.CV version update

VRXU-net: A Deep Learning Approach for Brain Ischemic Stroke Lesion Detection and Segmentation in T1W MRI

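Sayed Amir Mousavi Mobarakeh

Affiliations * Sayed Amir Mousavi Mobarakeh

AI summary: This study proposes VRU-Net, an architecture based on visual features, residual connections, and a U-shaped network, for detecting and segmenting ischemic stroke lesions in 3D MRI scans. A modified VGG model and a U-shaped segmentation model process each anatomical plane independently, and the results are aggregated to improve segmentation accuracy and processing speed.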

详情

Abstract

When the blood supply to the brain is obstructed by a clot, oxygen delivery to brain tissues becomes insufficient, leading to cellular necrosis. In healthcare settings, accurately identifying and delineating ischemic lesion boundaries is essential for treatment and surgical planning. However, ischemic stroke lesions vary widely in shape, size, and location, and in grayscale MRI modalities such as T1W they may resemble surrounding brain structures. This makes lesion detection and segmentation a challenging task for clinicians. This study introduces a novel VRU-Net architecture, derived from visual features, residual connections, and a U-shaped network, for detecting and segmenting ischemic stroke lesions in 3D magnetic resonance imaging scans. The proposed method first uses a modified VGG model to identify ischemic stroke in separate 2D slices. Then, a U-shaped segmentation model with residual blocks segments the lesion in each slice. This procedure is applied independently to the axial, sagittal, and coronal planes, and the final output is generated by aggregating the three segmentation results. To improve both performance and processing speed, a high-performance classifier is applied before the segmentation model in a sequential framework. This strategy reduces unnecessary segmentation of non-lesion slices and improves overall accuracy. In addition, decomposing 3D images into 2D slices reduces model complexity while allowing information from three anatomical planes to support more accurate lesion localization. The proposed model is trained on the Anatomical Tracings of Lesions After Stroke dataset and outperforms state-of-the-art models in terms of accuracy and Dice coefficient. Moreover, the segmentation output provides feedback that helps the classification model reduce false-positive predictions.
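
The abstract mentions aggregating the axial, sagittal, and coronal segmentations and reporting the Dice coefficient, but it does not say how the three results are combined. The sketch below assumes a simple majority vote per voxel, which is only one plausible choice, and uses the standard Dice definition. The function names and the toy masks are hypothetical.

```python
from typing import List

def dice(pred: List[int], target: List[int]) -> float:
    """Dice coefficient 2|A and B| / (|A| + |B|) for flat binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total

def majority_vote(axial: List[int], sagittal: List[int], coronal: List[int]) -> List[int]:
    """Mark a voxel as lesion when at least two of the three planes agree."""
    return [1 if a + s + c >= 2 else 0 for a, s, c in zip(axial, sagittal, coronal)]

# Toy flattened volume of six voxels, one prediction per anatomical plane.
axial    = [1, 1, 0, 0, 1, 0]
sagittal = [1, 0, 0, 0, 1, 0]
coronal  = [1, 1, 1, 0, 0, 0]
truth    = [1, 1, 0, 0, 1, 0]

fused = majority_vote(axial, sagittal, coronal)
print(fused, dice(fused, truth))
```

In this toy case each single-plane mask contains an error that the other two planes outvote, which is the intuition behind using three anatomical planes.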

2605.21625 2026-05-22 cs.CV cs.AI cs.CL version update

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

Affiliations: Cornell University; Cornell Tech; MBZUAI; UC Berkeley

AI summary: This paper introduces Flat-Pack Bench, a benchmark for evaluating the spatio-temporal understanding of large vision-language models in complex video scenarios, and finds that current models fall well short on fine-grained spatio-temporal reasoning.

Comments CVPR 2026

Abstract

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, tracking parts, and understanding spatial interactions such as physical contact.

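2605.21611 2026-05-22 cs.CV cs.LG version update

Don't Compress Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

Rahul D Ray

Affiliations: Department of Electronics and Electrical Engineering

AI summary: This paper proposes GOEN, which combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head to improve OOD detection. It finds that CenterLoss degrades OOD detection, and that GOEN-NoCenterLoss outperforms the other baselines on CIFAR-10 benchmarks.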

Abstract

The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.
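
To make the scoring idea concrete, here is a minimal single-scale sketch of Mahalanobis-style OOD scoring on L2-normalised features. It assumes a shared diagonal covariance to stay dependency-free, which is a simplification: the abstract refers to the full covariance structure, multi-scale features, and a calibration head, none of which are modelled here. All names and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vec = List[float]

def l2_normalise(v: Vec) -> Vec:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fit(features: Dict[str, List[Vec]], eps: float = 1e-6) -> Tuple[Dict[str, Vec], Vec]:
    """Per-class means and a shared diagonal variance on normalised features."""
    normed = {c: [l2_normalise(f) for f in fs] for c, fs in features.items()}
    dim = len(next(iter(normed.values()))[0])
    means = {c: [sum(f[d] for f in fs) / len(fs) for d in range(dim)]
             for c, fs in normed.items()}
    n = sum(len(fs) for fs in normed.values())
    var = [sum((f[d] - means[c][d]) ** 2 for c, fs in normed.items() for f in fs) / n + eps
           for d in range(dim)]
    return means, var

def ood_score(x: Vec, means: Dict[str, Vec], var: Vec) -> float:
    """Smallest squared distance to any class mean (higher means more OOD)."""
    z = l2_normalise(x)
    return min(sum((zi - mi) ** 2 / vi for zi, mi, vi in zip(z, m, var))
               for m in means.values())

train = {
    "cat": [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]],
    "dog": [[0.1, 1.0], [0.0, 0.9], [0.2, 1.1]],
}
means, var = fit(train)
print(ood_score([1.0, 0.05], means, var))    # near the "cat" cluster
print(ood_score([-1.0, -1.0], means, var))   # far from both clusters
```

Because the distance is divided by the estimated variance, anything that changes the within-class spread also changes the scale of the score, which is one way to read the abstract's remark that overly tight clusters can distort the statistics the detector relies on.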

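2605.21079 2026-05-22 cs.CV version update

VDFP: Video Deflickering with Flicker-banding Priors

Zhiyi Zhou, Libo Zhu, Zihan Zhou, Yulun Zhang, Xiaokang Yang

Affiliations: Shanghai Jiao Tong University

AI summary: This paper proposes VDFP, a video deflickering framework based on flicker-banding priors. By building the DeViD dataset and introducing the DFM and CPP modules, it addresses banding artifacts in screen capture; experiments show it outperforms existing methods in deflickering quality and spatio-temporal consistency.

Comments Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP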

Abstract

Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We first construct DeViD, a real-world dataset covering various scenes, to address the lack of available datasets. We then propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM), which can synthesize complex multi-banding scenarios. Second, we present spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding while preserving high-fidelity spatial details and temporal consistency. Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP.

2605.20302 2026-05-22 cs.LG cs.CV version update

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

Panagiotis Koromilas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, Yannis Panagakis

Affiliations: The Cyprus Institute; University of Athens; Archimedes AI/Athena Research Center; University of Cyprus

AI summary: This paper studies Neural Collapse (NC), the theoretical optimum of supervised classification, and points out that the two dominant paradigms, cross entropy (CE) and supervised contrastive learning (SCL), fail to reach it in practice. The authors improve both by contrasting prototypes on the hypersphere, coming closer to NC on multiple benchmarks.

Comments 43rd International Conference on Machine Learning (ICML 2026); Code: https://github.com/pakoromilas/nc_by_design

Abstract

Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method that contrasts prototypes on the unit hypersphere, and that closing the gap requires fixing each at its point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE's converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and improved robustness to corruptions on ImageNet-C. Our work recasts supervised learning as prototype learning on the hypersphere, with NC reached by design.
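
The abstract describes a classifier whose weights are the class mean embeddings on the unit hypersphere. A minimal sketch of such a nearest-prototype classifier follows. Re-normalising the class mean and scoring by cosine similarity are assumptions made here for illustration, and the function names and toy embeddings are hypothetical rather than taken from the authors' code.

```python
import math
from typing import Dict, List

Vec = List[float]

def normalise(v: Vec) -> Vec:
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def class_prototypes(embeddings: Dict[str, List[Vec]]) -> Dict[str, Vec]:
    """Prototype of each class: the mean of its unit-norm embeddings, re-normalised."""
    protos = {}
    for label, vecs in embeddings.items():
        unit = [normalise(v) for v in vecs]
        mean = [sum(col) / len(unit) for col in zip(*unit)]
        protos[label] = normalise(mean)
    return protos

def predict(x: Vec, protos: Dict[str, Vec]) -> str:
    """Pick the class whose prototype has the highest cosine similarity to x."""
    z = normalise(x)
    return max(protos, key=lambda c: sum(a * b for a, b in zip(z, protos[c])))

train = {
    "a": [[2.0, 0.1], [1.5, -0.2]],
    "b": [[-0.1, 3.0], [0.3, 1.0]],
}
protos = class_prototypes(train)
print(predict([4.0, 0.5], protos), predict([0.2, 0.6], protos))
```

No separate classifier is trained in this sketch: the prototypes fall directly out of the embeddings, which mirrors the abstract's point that a linear probing phase can be redundant.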

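2605.17837 2026-05-22 cs.CV cs.AI version update

OmniShotCut: Holistic Relational Shot Boundary Detection with a Shot-Query Transformer

Boyang Wang, Guangyi Xu, Jiahui Zhang, Zhipeng Tang, Zezhou Cheng

Affiliations: University of Virginia; University of Massachusetts Amherst

AI summary: This paper proposes OmniShotCut, which uses a shot-query-based dense video Transformer to formulate shot boundary detection as structured relational prediction, jointly estimating intra-shot and inter-shot relations. It targets the non-interpretable boundaries, missed subtle yet harmful discontinuities, and reliance on noisy, low-diversity annotations and outdated benchmarks seen in existing methods.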

Abstract

Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD has been widely studied in the literature, existing methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut, which formulates SBD as structured relational prediction, jointly estimating shot ranges together with intra-shot and inter-shot relations using a shot-query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation. Experiments on the benchmarks demonstrate the effectiveness and generality of our method.

2604.17623 2026-05-22 cs.CV cs.GR version update

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

Honglin Chen, Karran Pandey, Rundi Wu, Matheus Gadelha, Yannick Hold-Geoffroy, Ayush Tewari, Niloy J. Mitra, Changxi Zheng, Paul Guerrero

Affiliations: Columbia University; University of Toronto; Adobe Research; University of Cambridge; University College London

AI summary: This paper proposes ViPS, a feedforward framework that discovers the distribution of valid poses for auto-rigged meshes by distilling motion priors from a video diffusion model, supporting sampling of diverse shape variations, inverse kinematics, and animation and keyframing.

Comments Project page: https://honglin-c.github.io/vips/

Abstract

Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

2604.15003 2026-05-22 cs.CV version update

Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang, Guanjie Wang, Weiming Zhang

Affiliations: Anhui Province Key Laboratory of Digital Security (School of Cyber Science and Technology, University of Science and Technology of China)

AI summary: This paper proposes a proactive temporal forensics method for image-to-video generation that traces how pixels flow and transform through a video, addressing the shortcomings of traditional spatial forensics along the temporal dimension.

Abstract

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

2602.17186 2026-05-22 cs.CV version update

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee, Sangheum Hwang

Affiliations: Department of Data Science, Seoul National University of Science and Technology

AI summary: This paper proposes selective training of large vision language models guided by the Visual Information Gain (VIG) metric, improving visual grounding and reducing language bias by prioritizing high-VIG samples and tokens.

Comments Accepted at ICML 2026

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
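
The abstract calls VIG a perplexity-based measure of how much the image reduces prediction uncertainty, but gives no formula. One plausible reading, sketched below, is the log-ratio of text-only perplexity to image-conditioned perplexity at the sample level, and the per-token log-probability difference at the token level. This exact form, the function names, and the toy log-probabilities are all assumptions rather than the paper's definition.

```python
import math
from typing import List

def perplexity(logprobs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def token_gain(lp_with_image: List[float], lp_text_only: List[float]) -> List[float]:
    """Per-token gain: how much the image raises each token's log-probability."""
    return [w - t for w, t in zip(lp_with_image, lp_text_only)]

def sample_gain(lp_with_image: List[float], lp_text_only: List[float]) -> float:
    """Sample-level gain as the log-ratio of text-only to image-conditioned perplexity."""
    return math.log(perplexity(lp_text_only) / perplexity(lp_with_image))

# Hypothetical log-probabilities for the answer tokens "the", "red", "cup".
with_image = [-0.10, -0.20, -0.30]
text_only = [-0.12, -2.50, -0.90]

gains = token_gain(with_image, text_only)
print(gains, sample_gain(with_image, text_only))
```

Under this reading the colour word receives by far the largest token-level gain, consistent with the abstract's claim that the metric highlights visually grounded elements; a selective training scheme could then keep or up-weight only the high-gain samples and tokens.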

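2602.13294 2026-05-22 cs.CV cs.AI version update

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

Affiliations: University of Waterloo; Autodesk AI Lab; Independent Researcher

AI summary: This paper proposes VisPhyWorld, a framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations, and introduces the VisPhyBench benchmark to assess how well models reconstruct appearance and simulate physical motion. It finds that state-of-the-art MLLMs struggle to infer physical parameters accurately and to simulate consistent physical dynamics.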

Abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs before fallback. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics. Our code is available at https://github.com/TIGER-AI-Lab/VisPhyWorld

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell

Affiliations: Mercedes-Benz AG; University of Tübingen; Tübingen AI Center; TU Munich; Karlsruhe Institute of Technology; University of Stuttgart; UCLA

AI summary: This paper proposes SpaceDrive, a framework that encodes spatial information as explicit positional encodings to strengthen a VLM-based driving system's understanding of fine-grained 3D spatial relationships, improving planning accuracy and open-loop performance.

Abstract

End-to-end autonomous driving methods built on vision language models (VLMs) have developed rapidly, driven by the universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed on the corresponding 2D visual tokens to augment them. They also serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and to regress trajectory coordinates directly rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark among existing VLM-based methods. Code is available at: https://github.com/zhenghao2519/SpaceDrive.

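2510.20814 2026-05-22 cs.CV version update

SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution

Ritik Shah, Marco F Duarte

Affiliations: University of Massachusetts Amherst

AI summary: This study proposes SpectraMorph, a physics-guided self-supervised fusion framework that achieves hyperspectral super-resolution through a structured latent space. It fuses multispectral and hyperspectral images, produces interpretable intermediates, trains quickly, and remains robust even with a single-band multispectral image.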

DOI: 10.1109/ICASSP55912.2026.11461602
Journal ref: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract

Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
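
The unmixing bottleneck in this abstract can be pictured with a small numeric sketch: a pixel's spectrum is a linear mix of endmember signatures weighted by abundances, and the self-supervised signal comes from projecting that spectrum back to MSI bands through the spectral response function (SRF). The endmember values, the SRF matrix, and the fixed abundances below are made-up placeholders; in the described method the abundance-like maps come from an MLP and the endmembers from the low-resolution HSI.

```python
from typing import List

Matrix = List[List[float]]

def mix(endmembers: Matrix, abundances: List[float]) -> List[float]:
    """Linear mixing: spectrum[b] = sum_k abundances[k] * endmembers[k][b]."""
    bands = len(endmembers[0])
    return [sum(a * e[b] for a, e in zip(abundances, endmembers)) for b in range(bands)]

def apply_srf(srf: Matrix, spectrum: List[float]) -> List[float]:
    """Project a hyperspectral spectrum onto MSI bands with the response matrix."""
    return [sum(w * s for w, s in zip(row, spectrum)) for row in srf]

def msi_loss(pred_msi: List[float], observed_msi: List[float]) -> float:
    """Mean squared error between re-projected and observed MSI pixel values."""
    return sum((p - o) ** 2 for p, o in zip(pred_msi, observed_msi)) / len(observed_msi)

# Two hypothetical endmembers over four hyperspectral bands.
endmembers = [[0.1, 0.2, 0.6, 0.8],
              [0.9, 0.7, 0.3, 0.1]]
# Hypothetical SRF mapping four hyperspectral bands to two MSI bands.
srf = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]

abundances = [0.25, 0.75]  # placeholder for the MLP output at one pixel
spectrum = mix(endmembers, abundances)
print(spectrum, apply_srf(srf, spectrum))
```

Because the loss is computed in MSI space, no high-resolution hyperspectral ground truth is needed, and the abundances remain inspectable intermediates.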

2510.08759 2026-05-22 cs.CV cs.RO version update

Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis

Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Yizhe Zhu, Shiji Xin, Yijian Huang, Boce Hu, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Haojie Huang, Lawson L. S. Wong

Affiliations: Northeastern University, Boston, MA, USA; The Chinese University of Hong Kong, Hong Kong, China; Peking University, Beijing, China; Westlake University, Hangzhou, China; Harvard University, Cambridge, MA, USA; Purdue University, West Lafayette, IN, USA; University of Oxford, Oxford, United Kingdom

AI summary: This paper proposes the BEAR benchmark, which decomposes embodied tasks into 14 atomic skills for fine-grained evaluation, finds that perceptual capabilities are the main bottleneck behind reasoning failures, and proposes BEAR-Agent, a multimodal conversational agent that substantially improves embodied skill performance.

Comments Accepted to ICML 2026

Abstract

Understanding the capability bottlenecks of embodied multimodal large language models (MLLMs) is crucial for improving embodied agents. However, existing embodied benchmarks mainly focus on task-level evaluation and fail to provide actionable insights into the underlying causes of model failures. To address this limitation, we introduce BEAR, a benchmark that decomposes embodied tasks into 14 atomic skills for fine-grained skill-level evaluation. BEAR comprises 4,469 interleaved image-video-text samples spanning 14 skills across 6 categories, ranging from low-level perception to high-level planning. We evaluate 20 MLLMs on BEAR under a hierarchical skill-level diagnosis framework and uncover two key findings: (1) perceptual capabilities are major bottlenecks behind reasoning failures, and (2) current models suffer from unstable spatiotemporal modeling that remains largely unexposed in prior benchmarks. Motivated by these findings, we further propose BEAR-Agent, a multimodal conversational agent that augments MLLMs with visual and spatial reasoning tools. BEAR-Agent substantially improves performance across embodied skills, achieving a relative improvement of 17.5% on GPT-5 over the base model on BEAR, while also outperforming strong baselines in both simulation and real-world robotic experiments. Project page: https://bear-official66.github.io/

2509.17086 2026-05-22 cs.CV version update

SFN-YOLO: Towards Free-Range Poultry Detection via Scale-aware Fusion Networks

Jie Chen, Yuhong Feng, Tao Dai, Hao Wang, Hongtao Chen, Zhaoxi He, Mingzhe Liu, Jiancong Bai

Affiliations: Shenzhen University; The Hong Kong University of Science

AI summary: This paper proposes SFN-YOLO, a poultry detection method that uses scale-aware fusion to improve detection in complex environments, and introduces the M-SCOPE dataset designed for free-range conditions. Experiments show the model reaches 80.7% mAP with only 7.2M parameters, 35.1% fewer than the baseline, while retaining good generalization.

Abstract

Detecting and localizing poultry is essential for advancing smart poultry farming. Despite the progress of detection-centric methods, challenges persist in free-range settings due to multiscale targets, obstructions, and complex or dynamic backgrounds. To tackle these challenges, we introduce an innovative poultry detection approach named SFN-YOLO that utilizes scale-aware fusion. This approach combines detailed local features with broader global context to improve detection in intricate environments. Furthermore, we have developed a new expansive dataset (M-SCOPE) tailored for varied free-range conditions. Comprehensive experiments demonstrate our model achieves an mAP of 80.7% with just 7.2M parameters, which is 35.1% fewer than the benchmark, while retaining strong generalization capability across different domains. The efficient and real-time detection capabilities of SFN-YOLO support automated smart poultry farming.
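
As a quick sanity check on the parameter figures, the snippet below works out the baseline size implied by "7.2M parameters, which is 35.1% fewer than the benchmark". It assumes the percentage is measured relative to the baseline model's parameter count and that "benchmark" here means that baseline; the abstract does not name the baseline, and the helper function is purely illustrative.

```python
def implied_baseline(params_m: float, reduction_pct: float) -> float:
    """Baseline size (in millions) if params_m is reduction_pct percent smaller."""
    return params_m / (1.0 - reduction_pct / 100.0)

baseline = implied_baseline(7.2, 35.1)
print(f"{baseline:.2f}M baseline, {baseline - 7.2:.2f}M parameters saved")
```

Under that assumption the baseline would have roughly 11.1M parameters, so the reported saving is a little under 4M parameters.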

2507.17640 2026-05-22 cs.CV 版本更新

Not All Starting Points Are Equal: Pre-trained Priors and Their Outsized Impact on Person Identification

Thomas M. Metz, Matthew Q. Hill, Alice J. O'Toole

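Affiliations: School of Behavioral and Brain Sciences; The University of Texas at Dallas; Richardson, Texas, USA

AI summary: This paper studies how pre-training methods affect person identification (re-id) tasks, finds that pre-trained weights act as a strong prior during domain adaptation, and shows that simple domain adaptation of large vision foundation models achieves SOTA results.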

Abstract

Recent years have seen an explosion of diverse general purpose pre-training methodologies for computer vision. However, the impact that these pre-training methodologies have on person identification tasks (re-id) remains under-explored. We show that under equated domain adaptation pipelines, there is dramatic variance in person identification outcomes using different starting models (architectures and pre-trained weights). We show that a range of intuitive explanations for differing downstream performance on a range of re-id tests are insufficient and propose that pre-trained weights serve as a strong prior to the weights learned during domain adaptation. This framework allows for domain adapted solutions to be viewed as a maximum probability point estimate of the Gibbs posterior with the pre-trained weights acting as a prior. Under this framework, we show that large, pre-trained foundation models with simple domain adaptation achieve SOTA solutions on a range of re-id datasets (Market, PRCC, DeepChange, BTS) with solutions that are very close in the parameter space to the starting parameters. Moreover, we perform ablations on these solutions and show that they can be reached with small transfer sets and with varying transfer datasets but are sensitive to choice of optimizer, weight-decay, and loss function. Ultimately, we propose that the simple approach of direct fine-tuning using large vision foundation models (CLIP, Dino, EVA, AIM, etc.) needs to serve as an important baseline for future work in re-id.
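The prior framing in this abstract can be written out as a short sketch. The temperature $\lambda$, the empirical transfer loss $\hat{L}_D$, and the Gaussian form of the prior are illustrative assumptions; the abstract does not specify any of them.

```latex
% Gibbs posterior over weights \theta given transfer data D and prior \pi_0
\pi(\theta \mid D) \;\propto\; \exp\!\bigl(-\lambda\, \hat{L}_D(\theta)\bigr)\, \pi_0(\theta)

% Maximum-probability (MAP) point estimate of that posterior
\hat{\theta} \;=\; \arg\max_{\theta}\, \pi(\theta \mid D)
            \;=\; \arg\min_{\theta}\, \bigl[\lambda\, \hat{L}_D(\theta) - \log \pi_0(\theta)\bigr]

% ASSUMPTION: Gaussian prior centered at the pre-trained weights \theta_0
\pi_0(\theta) = \mathcal{N}(\theta_0, \sigma^2 I)
\;\Longrightarrow\;
\hat{\theta} = \arg\min_{\theta}\, \Bigl[\lambda\, \hat{L}_D(\theta)
             + \tfrac{1}{2\sigma^2}\, \lVert \theta - \theta_0 \rVert_2^2\Bigr]
```

Under that assumed prior, the penalty term pulls the adapted weights toward $\theta_0$, which is one way to read the observation that the reported solutions stay close to the starting parameters.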

2507.13339 2026-05-22 eess.IV cs.CV Version update

SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution

Ritik Shah, Marco F. Duarte

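Affiliations: University of Massachusetts Amherst

AI summary: This study proposes SpectraLift, a self-supervised spectral-inversion network that fuses hyperspectral and multispectral images using the multispectral image's spectral response function, without point spread function calibration or ground-truth high-resolution hyperspectral images, and outperforms existing methods on PSNR, SAM, SSIM, and RMSE.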

DOI: 10.1109/WHISPERS69515.2025.11501599
Journal ref: 2025 15th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS)

Abstract

High-spatial-resolution hyperspectral images (HSI) are essential for applications such as remote sensing and medical imaging, yet HSI sensors inherently trade spatial detail for spectral richness. Fusing high-spatial-resolution multispectral images (HR-MSI) with low-spatial-resolution hyperspectral images (LR-HSI) is a promising route to recover fine spatial structures without sacrificing spectral fidelity. Most state-of-the-art methods for HSI-MSI fusion demand point spread function (PSF) calibration or ground truth high resolution HSI (HR-HSI), both of which are impractical to obtain in real world settings. We present SpectraLift, a fully self-supervised framework that fuses LR-HSI and HR-MSI inputs using only the MSI's Spectral Response Function (SRF). SpectraLift trains a lightweight per-pixel multi-layer perceptron (MLP) network using (i) a synthetic low-spatial-resolution multispectral image (LR-MSI) obtained by applying the SRF to the LR-HSI as input, (ii) the LR-HSI as the output, and (iii) an $\ell_1$ spectral reconstruction loss between the estimated and true LR-HSI as the optimization objective. At inference, SpectraLift uses the trained network to map the HR-MSI pixel-wise into a HR-HSI estimate. SpectraLift converges in minutes, is agnostic to spatial blur and resolution, and outperforms state-of-the-art methods on PSNR, SAM, SSIM, and RMSE benchmarks.
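The training recipe described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single hidden layer, the plain subgradient descent, the toy array sizes, and the random data are all hypothetical.

```python
import numpy as np

def make_training_pairs(lr_hsi, srf):
    """Build per-pixel (input, target) pairs from the LR-HSI alone.

    lr_hsi: (H, W, B) low-resolution hyperspectral cube.
    srf:    (M, B) spectral response function of the MSI sensor.
    """
    targets = lr_hsi.reshape(-1, lr_hsi.shape[-1])  # (N, B) true LR-HSI spectra
    inputs = targets @ srf.T                        # (N, M) synthetic LR-MSI pixels
    return inputs, targets

def l1_loss(pred, target):
    """Mean over pixels of the per-pixel l1 spectral error."""
    return np.abs(pred - target).sum(axis=1).mean()

def forward(params, x):
    """One-hidden-layer ReLU MLP applied independently to each pixel."""
    w1, b1, w2, b2 = params
    hidden_pre = x @ w1 + b1
    hidden = np.maximum(hidden_pre, 0.0)
    return hidden @ w2 + b2, hidden, hidden_pre

def train_mlp(x, y, hidden=32, steps=300, lr=0.05, seed=0):
    """Fit the MLP with subgradient descent on l1_loss (assumed optimizer)."""
    rng = np.random.default_rng(seed)
    params = [
        rng.normal(0.0, 0.1, (x.shape[1], hidden)), np.zeros(hidden),
        rng.normal(0.0, 0.1, (hidden, y.shape[1])), np.zeros(y.shape[1]),
    ]
    for _ in range(steps):
        pred, h, h_pre = forward(params, x)
        g_pred = np.sign(pred - y) / x.shape[0]      # subgradient of l1_loss
        g_h = (g_pred @ params[2].T) * (h_pre > 0)   # backprop through ReLU
        grads = [x.T @ g_h, g_h.sum(axis=0), h.T @ g_pred, g_pred.sum(axis=0)]
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Toy data (random, for shape checking only).
rng = np.random.default_rng(1)
lr_hsi = rng.random((8, 8, 31))            # 31-band low-resolution HSI
srf = rng.random((3, 31))
srf /= srf.sum(axis=1, keepdims=True)      # each MSI band is a weighted band average
hr_msi = rng.random((32, 32, 3))           # 3-band high-resolution MSI

x, y = make_training_pairs(lr_hsi, srf)
params = train_mlp(x, y)
# Inference: map every HR-MSI pixel to a full spectrum.
hr_hsi_est = forward(params, hr_msi.reshape(-1, 3))[0].reshape(32, 32, 31)
```

Because the network only ever sees individual pixels, nothing in the sketch depends on the spatial blur or the resolution ratio between the two images, which is consistent with the abstract's claim of being agnostic to both.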

2506.23808 2026-05-22 cs.CV Version update

Towards Initialization-free Calibrated Bundle Adjustment

Carl Olsson, Amanda Nilsson

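Affiliations: Lund University

AI summary: This paper proposes an initialization-free calibrated SfM method that exploits known camera calibration, introducing pairwise relative rotation estimates to obtain near-metric reconstructions and improve 3D reconstruction accuracy.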

Abstract

A recent series of works has shown that initialization-free BA can be achieved using pseudo Object Space Error (pOSE) as a surrogate objective. The initial reconstruction-step optimizes an objective where all terms are projectively invariant and it cannot incorporate knowledge of the camera calibration. As a result, the solution is only determined up to a projective transformation of the scene and the process requires more data for successful reconstruction. In contrast, we present a method that is able to use the known camera calibration thereby producing near metric solutions, that is, reconstructions that are accurate up to a similarity transformation. To achieve this we introduce pairwise relative rotation estimates that carry information about camera calibration. These are only invariant to similarity transformations, thus encouraging solutions that preserve metric features of the real scene. Our method can be seen as integrating rotation averaging into the pOSE framework striving towards initialization-free calibrated SfM. Our experimental evaluation shows that we are able to reliably optimize our objective, achieving convergence to the global minimum with high probability from random starting solutions, resulting in accurate near metric reconstructions.

2503.00747 2026-05-22 cs.CV cs.RO eess.IV Version update

LFX: Towards Unified Light Field Dense Semantic Segmentation and Salient Object Detection

Fei Teng, Lingxin Huang, Buyin Deng, Kai Luo, Boyuan Zheng, Zheng Fang, Hong Zheng, Kunyu Peng, Jiaming Zhang, Yaonan Wang, Kailun Yang

Affiliations: School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, China; China Mobile Group Hunan Company Ltd., China; Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany

AI summary: This paper proposes LFX, a framework whose unified, representation-invariant feature modulation space adapts to multiple light field representations and perception tasks, achieving state-of-the-art results on three light field benchmarks and clearly outperforming representation-specific methods.

Comments The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX

Abstract

Light field cameras capture multi-view observations within a single exposure. However, existing studies are typically tailored to specific LF representations, leaving the field without a unified learning framework. To bridge this gap, we present LFX, the first unified framework for LF perception. LFX establishes a representation-invariant feature modulation space, enabling it to adapt to heterogeneous LF representations and diverse perception tasks. Specifically, we propose Field-of-Parallax Angular Subspace Modeling (FoP-ASM), which assigns an independent angular marker to each auxiliary view, enabling view-wise independent modeling. Meanwhile, shared manifold subspace constraints and regularization losses enforce globally consistent semantic modulation across views. Extensive evaluations across three LF benchmarks show that LFX achieves state-of-the-art results across distinct LF representations, outperforming representation-specific methods by up to 12% and 20% with 0.029/0.027 MAE for salient object detection, and achieving 84.37 mIoU for semantic segmentation. The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX.

2501.00677 2026-05-22 cs.LG cs.CV cs.IT cs.NA math.IT math.NA stat.ML Version update

Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery

HanQin Cai, Chandra Kundu, Jialin Liu, Wotao Yin

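Affiliations: School of Data, Mathematical, and Statistical Sciences and the Department of Computer Science, University of Central Florida; School of Data, Mathematical, and Statistical Sciences, University of Central Florida; Damo Academy, Alibaba US

AI summary: This paper proposes Learned Robust Matrix Completion (LRMC), a scalable and learnable non-convex method for large-scale robust matrix completion. It has low computational complexity and linear convergence, its free parameters are learned via deep unfolding to achieve optimum performance, and experiments on synthetic datasets and real applications confirm its strong empirical performance.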

DOI: 10.1109/TPAMI.2026.3659041
Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(6): 6541-6556, 2026

Abstract

Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from a fixed number of iterations to infinite iterations. The superior empirical performance of LRMC is verified with extensive experiments against state-of-the-art methods on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.

2403.16552 2026-05-22 cs.NE cs.AI cs.CV Version update

QKFormer: Hierarchical Spiking Transformer using Q-K Attention

Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, Yonghong Tian

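Affiliations: Pengcheng Laboratory; Harbin Institute of Technology; Peking University

AI summary: This paper proposes QKFormer, a hierarchical spiking transformer based on Q-K attention. A new spike-form Q-K attention mechanism, a hierarchical structure, and a versatile patch embedding module improve spiking neural network performance on image classification, reaching 85.65% top-1 accuracy on ImageNet-1K.

Comments Accepted by NeurIPS 2024 (Spotlight). Code and Model: https://github.com/zhouchenlin2096/QKFormer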

Abstract

Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mechanism, tailored for SNNs, which efficiently models the importance of token or channel dimensions through binary vectors with linear complexity. ii) We incorporate the hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representation. iii) We design a versatile and powerful patch embedding module with a deformed shortcut specifically for spiking transformers. Together, we develop QKFormer, a hierarchical spiking transformer based on Q-K attention with direct training. QKFormer shows significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, with comparable size to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1K, substantially outperforming Spikformer by 10.84 percentage points. To the best of our knowledge, this is the first time that directly trained SNNs have exceeded 85% accuracy on ImageNet-1K. The code and models are publicly available at https://github.com/zhouchenlin2096/QKFormer

2311.04938 2026-05-22 cs.CV cs.AI cs.LG Version update

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

Prasad Gabbur

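Affiliations: Independent Researcher; Apple

AI summary: This paper proposes using a Gaussian mixture model as the reverse transition operator in the DDIM framework, constraining the GMM parameters to match the moments of the DDPM forward marginals, which improves sample quality when few sampling steps are used. Experiments show the GMM kernel outperforms the Gaussian kernel on FID and IS.

Comments 34 pages, 12 figures; Accepted to TMLR; Code open sourced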

Journal ref: Transactions on Machine Learning Research, 05/2026

Abstract

We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on COYO700M datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.
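The moment-matching idea in this abstract can be illustrated with a scalar toy example. The sketch below is a generic construction, not the paper's parameterization: the function names, the shared component variance, and the choice of symmetric offsets are assumptions made only for illustration. It relies on the standard identities that a mixture's mean is the weighted sum of component means and its variance is the weighted sum of component variances plus squared mean offsets.

```python
import numpy as np

def gmm_moments(weights, means, variances):
    """Mean and variance (second central moment) of a 1-D Gaussian mixture."""
    weights, means, variances = map(np.asarray, (weights, means, variances))
    mean = np.sum(weights * means)
    var = np.sum(weights * (variances + (means - mean) ** 2))
    return mean, var

def moment_matched_gmm(target_mean, target_var, weights, offsets):
    """Build a mixture whose first two moments equal the targets.

    Components sit at target_mean + offsets and share one variance.
    Requires sum(weights * offsets) == 0 so the mean is preserved, and
    sum(weights * offsets**2) < target_var so the shared variance is positive.
    """
    weights = np.asarray(weights, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    if not np.isclose(np.sum(weights * offsets), 0.0):
        raise ValueError("weighted offsets must sum to 0")
    spread = np.sum(weights * offsets ** 2)   # variance contributed by the means
    if spread >= target_var:
        raise ValueError("offsets are too large for the target variance")
    means = target_mean + offsets
    variances = np.full_like(offsets, target_var - spread)
    return weights, means, variances

# Two-component mixture matching a Gaussian with mean 0.3 and variance 0.5.
w, mu, var = moment_matched_gmm(0.3, 0.5, [0.5, 0.5], [0.4, -0.4])
mean, variance = gmm_moments(w, mu, var)
```

The mixture here has the same mean and variance as the single Gaussian it replaces but a different, bimodal shape, which is the extra freedom a mixture kernel has over a Gaussian kernel once the first two moments are fixed.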
