2606.05409 2026-06-18 cs.CV cs.CL (version update)

Would you still call this Dax? Novel Visual References in VLMs and Humans

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

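Affiliations: McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair

AI summary: Introduces the Novel Visual References Dataset (NVRD) and compares how VLMs and humans generalize novel visual concepts, finding that models struggle to acquire new concepts that contradict prior knowledge and that they overgeneralize.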

Abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

2601.13836 2026-06-18 cs.CL cs.CV cs.MM (version update)

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

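Affiliations: Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

AI summary: Proposes the FutureOmni benchmark for evaluating how well multimodal LLMs forecast future events from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.

Comments: Accepted by ICML 2026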

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2603.11417 2026-06-18 cs.CV cs.LG (version update)

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska

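Affiliations: Department of Electrical and Computer Engineering, NYU Tandon School of Engineering

AI summary: Studies zero-shot cross-city transfer of end-to-end autonomous driving models and finds that, compared with supervised pretraining, self-supervised pretraining (e.g., I-JEPA, DINOv2, MAE) can substantially reduce displacement and collision degradation and improves out-of-distribution PDMS in closed-loop evaluation.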

Abstract

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real-world domain shifts when generalizing to new locations. In this work, we formulate zero-shot cross-city transfer as a controlled representation-level stress test for end-to-end autonomous driving and ask how visual pretraining affects transfer behavior under geographic domain shift. We conduct a comprehensive study by integrating self-supervised backbones I-JEPA, DINOv2, and MAE into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models across cities with different road topologies, traffic conventions, and visual environments. In open-loop evaluation, a supervised backbone exhibits severe degradation when transferring between cities, yet some domain-specific self-supervised methods can substantially reduce both displacement and collision degradation. In closed-loop evaluation, self-supervised pretraining improves average out-of-distribution PDMS in several single-city training settings. Our results provide empirical evidence that representation learning influences the robustness of cross-city planning and motivate zero-shot geographic transfer as an important stress test for evaluating end-to-end autonomous driving systems.

2606.17030 2026-06-18 cs.CV (version update)

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

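Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, and achieves top results on several benchmarks.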

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.17846 2026-06-18 cs.RO cs.CV cs.LG (version update)

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

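Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotManip, which uses a unified alignment framework (across representation, motion, and behavioral dimensions) to enable large-scale co-training on heterogeneous multi-source manipulation data, builds a pretraining corpus of about 38,100 hours, and surpasses prior models in generalization capabilities such as zero-shot instruction following and cross-embodiment transfer.

Comments: 44 pages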

Abstract

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

2606.02045 2026-06-18 cs.CV cs.AI (version update)

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa, José Salvador Rubio-Asensio, Miguel A. Zamora Izquierdo, Juan Antonio Martínez Navarro, Antonio F. Skarmeta

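Affiliations: Department of Information and Communication Engineering; University of Murcia; Department of Irrigation, Centro de Edafología y Biología Aplicada del Segura CEBAS-CSIC

AI summary: Proposes a peach leaf damage classification approach based on attention mechanisms and transfer learning: a CBAM-enhanced EfficientNet model reaches 93.3% accuracy on the public dataset, and transfer learning yields a 93% macro F1-score on the local dataset, effectively addressing domain shift.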

Abstract

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in agricultural management. In peach orchards, climate change increases abiotic stress and biotic pressures, including pests and diseases, which often produce visually similar foliar symptoms. This overlap makes manual diagnosis difficult, especially across multiple fields with varying environmental conditions, highlighting the need for automated models with strong generalization ability. We propose an image-based classification approach for peach leaf damage detection. A benchmark dataset was created through manual annotation of publicly available images, consisting of 1,366 peach leaves across six damage categories. Several deep learning architectures were evaluated. EfficientNet models achieved the best results, with EfficientNetB0 reaching 92.9 percent accuracy, EfficientNetB3 achieving 91.5 percent, and EfficientNetB5 showing the strongest performance on minority classes. DenseNet121 reached 92.6 percent accuracy. The integration of the Convolutional Block Attention Module (CBAM) improved performance in several backbones, particularly EfficientNetB5 and InceptionV3, while showing limited or negative impact in others. The CBAM-enhanced EfficientNetB5 achieved the best overall accuracy of 93.3 percent. To evaluate robustness under realistic conditions, a local dataset of 180 images across four classes was collected, and transfer learning strategies were applied to address domain shift. Three fine-tuning strategies were tested. EfficientNetB3 combined with CBAM achieved the best performance in the local domain, reaching a 93 percent macro F1-score after transfer. Overall, attention-based models showed improved robustness for minority classes and better generalization across different field conditions.

2602.04401 2026-06-18 cs.RO cs.CV (version update)

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

Affiliations: QUT Centre for Robotics; School of Electrical Engineering and Robotics; Queensland University of Technology

AI summary: Proposes transferring thresholds via quantile normalisation to automatically select the operating point of a visual place recognition system, maximising recall at 100% precision without manual tuning.

Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

Abstract

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
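
The threshold transfer described in this abstract can be illustrated with a minimal Python sketch. It assumes one plausible reading of the abstract: the calibration threshold is the highest similarity score of any incorrect calibration match (so accepting only higher scores gives 100% precision on the calibration traversal), and the threshold is carried over by matching its empirical quantile in the deployment score distribution. The function name `transfer_threshold` and the exact quantile rule are illustrative assumptions, not the authors' implementation, which is available in the linked repository.

```python
from bisect import bisect_right


def transfer_threshold(cal_scores, cal_correct, deploy_scores):
    """Carry a 100%-precision calibration threshold over to deployment scores.

    Hypothetical helper: the paper's actual procedure may differ.
    Matches with a score strictly above the returned value are accepted.
    """
    # Highest-scoring incorrect calibration match; accepting only scores
    # above it yields 100% precision on the calibration traversal.
    wrong = [s for s, ok in zip(cal_scores, cal_correct) if not ok]
    if not wrong:
        return float("-inf")  # nothing had to be rejected during calibration
    t_cal = max(wrong)

    # Rank of the threshold: number of calibration scores <= t_cal.
    rank = bisect_right(sorted(cal_scores), t_cal)

    # Same empirical quantile in the deployment distribution, computed with
    # integer arithmetic as ceil(rank * n_deploy / n_cal).
    ordered = sorted(deploy_scores)
    count = -(-rank * len(ordered) // len(cal_scores))
    return ordered[min(count, len(ordered)) - 1]


# Toy data: deployment scores live on a different scale than calibration.
cal_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
cal_correct = [False, False, True, True, True]
deploy_scores = [10, 20, 30, 40, 50]

threshold = transfer_threshold(cal_scores, cal_correct, deploy_scores)
accepted = [s for s in deploy_scores if s > threshold]
```

Working with quantile ranks rather than raw score values is what lets the threshold move between environments whose similarity scores are scaled differently; the toy data above deliberately uses two unrelated scales.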

2511.20302 2026-06-18 cs.CV (version update)

Prior-Guided Multimodal Feature Fusion for Optical-SAR Image Change Detection

Xuanguang Liu, Lei Ding, Yujie Li, Chenguang Dai, Zhenchao Zhang, Mengmeng Li, Ziyi Yang, Yifan Sun, Yongqi Sun, Hanyun Wang, Lorenzo Bruzzone

Affiliations: Institute of Geospatial Information, Information Engineering University; Academy of Digital China (Fujian), Fuzhou University; The School of Electronics and Communication Engineering, Sun Yat-sen University; The Department of Information Engineering and Computer Science, University of Trento

AI summary: Proposes the STSF-Net framework, which jointly models modality-specific and spatio-temporal common features and uses semantic priors from visual foundation models to adaptively fuse multimodal features, achieving state-of-the-art performance on three datasets.

Abstract

Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing data, demonstrating significant application value in land use monitoring and sustainable urban development. However, existing MMCD approaches exhibit limitations in both cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes. To address these problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts multimodal feature importance based on semantic priors obtained from visual foundation models. Finally, we introduce the novel Delta-SN6 dataset, the first openly accessible multiclass MMCD benchmark consisting of very-high-resolution fully polarimetric SAR and optical images. Experimental results on the Delta-SN6, BRIGHT, and Wuhan datasets demonstrate that our method outperforms the state-of-the-art by 3.21%, 0.87%, and 1.32% in mIoU, respectively.

2606.08206 2026-06-18 cs.CV cs.LG (version update)

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

Maciej Wielgosz, Stefano Puliti, Rasmus Astrup

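Affiliations: Norwegian Institute of Bioeconomy Research (NIBIO)

AI summary: Proposes SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds that combines a Point Transformer v3 backbone, a lightweight semantic head, and a tree-focused cross-attention mask decoder; it reaches 90.5% precision and 80.2% recall on the FOR-instanceV2 test split and shows strong cross-domain generalization.

Comments: 25 pages, 6 figures, 10 tables. Corrected bibliography metadata and minor typographical issues; results unchanged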

Abstract

We present SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone with a lightweight semantic head and a tree-focused cross-attention mask decoder. Semantic predictions restrict instance decoding to tree-class voxels, while instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR-instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR-instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites further demonstrates strong cross-domain generalization.
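
As a quick consistency check on the reported numbers, the sketch below recomputes F1 from the stated precision and recall. It assumes F1 is the harmonic mean of the aggregate precision and recall; if the authors instead average F1 per scene or per plot, small deviations would be expected. `f1_score` is a hypothetical helper name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1_percent = round(100 * f1_score(0.905, 0.802), 1)
print(f1_percent)  # 85.0
```

Under that assumption the recomputed value agrees with the 85.0% F1 given in the abstract.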

2604.22476 2026-06-18 cs.CV cs.LG (version update)

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

Marco Pegoraro, Jonas Seng, Dustin Heller, Wil M. P. van der Aalst, Kristian Kersting

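Affiliations: Chair of Process and Data Science, RWTH Aachen University; Artificial Intelligence & Machine Learning Lab, Technical University of Darmstadt

AI summary: Proposes SnapLog, which extracts event data from videos by using image embeddings and frame-wise similarity matrices for temporal segmentation, combined with generalized few-shot classification, producing interpretable sequences of labeled, timestamped frames.

Comments: 18 pages, 6 figures, 1 table, 27 references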

Abstract

Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. Existing approaches rely on a dictionary of activity labels as input, cannot provide frame-by-frame labeling explanations, or rely on superseded computer vision techniques. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
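
The segmentation step mentioned in this abstract (frame embeddings, then a frame-wise similarity matrix, then temporal segments) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the embeddings are toy 2-D vectors rather than real image embeddings, similarity is cosine similarity, and a boundary is placed wherever consecutive frames fall below a fixed threshold. The names `cosine`, `similarity_matrix`, and `segment` and the threshold value are hypothetical; the paper's actual segmentation rule may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(frames):
    """Frame-wise similarity matrix: entry [i][j] compares frames i and j."""
    return [[cosine(u, v) for v in frames] for u in frames]


def segment(frames, threshold=0.8):
    """Split frame indices into contiguous (first, last) segments.

    A new segment starts wherever the similarity between consecutive
    frames drops below `threshold` (an assumed, illustrative rule).
    """
    if not frames:
        return []
    sim = similarity_matrix(frames)
    segments, start = [], 0
    for i in range(1, len(frames)):
        if sim[i - 1][i] < threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(frames) - 1))
    return segments


# Toy embeddings: two frames of one activity, then two frames of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = segment(frames)
```

Each resulting segment would then be passed to the few-shot classifier to receive an activity label and turned into a timestamped event.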

2606.06926 2026-06-18 cs.CV cs.MM (version update)

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

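Affiliations: Ulsan National Institute of Science and Technology

AI summary: To address the inability of existing methods to handle highlight detection in extremely long videos, proposes SVHighlights, the first such benchmark (320 sports videos averaging 2 hours each), and TF-SELECTOR, a training-free segment-based method that uses a large language model to fuse multimodal information and predict segment-level saliency scores, outperforming existing baselines on multiple metrics.

Comments: Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/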

DOI: 10.1145/3770855.3817564

Abstract

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +2.50 in HIT@1, +4.04 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.
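
The segment construction step described in this abstract, merging adjacent shots that share the same semantic content, can be illustrated with a short Python sketch. It assumes each shot already carries a semantic label as a plain string and that "same content" means identical labels; how TF-SELECTOR actually derives and compares shot semantics is not specified here, so `merge_adjacent_shots` and the toy labels are hypothetical.

```python
def merge_adjacent_shots(shots):
    """Merge consecutive shots with the same label into longer segments.

    `shots` is a time-ordered list of (start_sec, end_sec, label) tuples.
    Non-adjacent shots with the same label stay separate segments.
    """
    segments = []
    for start, end, label in shots:
        if segments and segments[-1][2] == label:
            prev_start = segments[-1][0]
            segments[-1] = (prev_start, end, label)  # extend previous segment
        else:
            segments.append((start, end, label))
    return segments


# Toy shot list: two consecutive "free kick" shots are merged, but the
# later "free kick" shot stays separate because "crowd" lies in between.
shots = [
    (0, 5, "free kick"),
    (5, 9, "free kick"),
    (9, 20, "crowd"),
    (20, 31, "free kick"),
]
segments = merge_adjacent_shots(shots)
```

Scoring whole segments instead of individual clips is what gives the language model the broader context the abstract says clip-level scoring lacks.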

2606.15632 2026-06-18 cs.CV (version update)

Open-World Video Segmentation

Qing Su, Kaiyang Li, Yuan Zhuang, Fei Miao, Shihao Ji

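Affiliations: University of Connecticut

AI summary: Proposes Savvy, a system that combines hierarchical mask discovery, deferred admission, and track consolidation for zero-shot open-world long-horizon video segmentation, and designs OGA, a granularity-aware evaluation suite with an n:1 matching protocol that addresses the unfair penalty traditional 1:1 matching imposes on open-world methods.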

Abstract

While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos with dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and structural diagnostics: identity persistence (IP) and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP, and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.
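
To make the difference between 1:1 and n:1 matching concrete, here is a minimal single-frame Python sketch. It is not the OGA protocol: the temporal parts (sever points, dominant coherent fragments, IP and IC) are left out, masks are plain sets of pixel coordinates, and the assignment rule (a prediction goes to the reference that contains the largest share of its pixels, if that share is at least `min_containment`) is an assumption made for illustration. All function names are hypothetical.

```python
def assign_n_to_1(predictions, references, min_containment=0.5):
    """Map each predicted mask to at most one reference mask (n:1).

    `predictions` and `references` map ids to sets of pixel coordinates.
    Returns {reference_id: [prediction_id, ...]}.
    """
    groups = {ref_id: [] for ref_id in references}
    for pred_id, pred in predictions.items():
        if not pred:
            continue
        best_id, best_share = None, 0.0
        for ref_id, ref in references.items():
            share = len(pred & ref) / len(pred)  # fraction of pred inside ref
            if share > best_share:
                best_id, best_share = ref_id, share
        if best_id is not None and best_share >= min_containment:
            groups[best_id].append(pred_id)
    return groups


def merged_iou(pred_ids, predictions, reference):
    """IoU between a reference mask and the union of its assigned predictions."""
    merged = set().union(*(predictions[p] for p in pred_ids))
    union = merged | reference
    return len(merged & reference) / len(union) if union else 0.0


# A chair annotated as one object but predicted as two finer-grained parts.
references = {"chair": {(0, 0), (0, 1), (0, 2), (0, 3)}}
predictions = {
    "seat": {(0, 0), (0, 1)},
    "legs": {(0, 2), (0, 3)},
    "background": {(5, 5)},
}

groups = assign_n_to_1(predictions, references)
iou_n_to_1 = merged_iou(groups["chair"], predictions, references["chair"])
iou_1_to_1 = max(merged_iou([p], predictions, references["chair"]) for p in predictions)
```

In this toy case the best single prediction covers only half of the chair, while the two part-level predictions together cover it completely, which is the kind of granularity mismatch the abstract says strict 1:1 matching penalizes.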

2502.07531 2026-06-18 cs.CV cs.AI cs.LG cs.MM (version update)

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu

Affiliations: School of Data Science, Fudan University; Shanghai Innovation Institute; Zhejiang University; Huawei Noah’s Ark Lab; Westlake University; School of Data Science and MOE Frontiers Center for Brain Science, Fudan University; Fudan ISTBI–ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University

AI summary: Proposes the VidCRAFT3 framework, which explicitly models cross-factor interactions among geometry, motion, and illumination to enable independent or joint control of camera motion, object motion, and lighting direction, achieving state-of-the-art control precision and visual coherence.

Comments: Accepted to TVCG 2026

英文摘要

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.


2510.21615 2026-06-18 cs.CV 版本更新

Epipolar Geometry Improves Video Generation Models

极线几何改进视频生成模型

Orest Kupyn, Théo Uscidda, Marta Tintore Gazulla, Fabian Manhardt, Federico Tombari, Christian Rupprecht

Affiliations: University of Oxford; Google Research; CREST-ENSAE, Institut Polytechnique de Paris; Technical University of Munich

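AI summary: To address geometric inconsistency and motion artifacts in video generation models, proposes preference optimization based on epipolar geometry constraints, reducing epipolar error by 31% and raising human-rated consistency from 54% to 72% while preserving visual quality.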

AI中文摘要

视频生成模型通过使用整流流技术训练的潜在扩散变换器取得了显著进展。然而，这些模型仍然存在几何不一致、运动不稳定以及破坏逼真3D场景错觉的视觉伪影。3D一致的视频生成可能对生成和重建任务中的众多下游应用产生重大影响。我们探索了极线几何约束如何改进现代视频扩散模型。尽管使用了大量训练数据，这些模型未能捕捉基本的几何原理。我们通过基于偏好的优化，利用成对极线几何约束对齐扩散模型，通过数学上合理的几何约束直接解决不稳定轨迹和几何伪影。我们的方法高效地强制执行几何原理，而不需要端到端的可微性。评估表明，经典的几何约束比现代学习度量提供了更稳定的优化信号。在静态场景和动态相机上的训练确保了度量质量，同时模型泛化到各种动态场景。通过将数据驱动学习与经典计算机视觉相结合，我们将极线误差降低了31%，并将人类评分一致性从54%提高到72%，且不损害视觉质量。

英文摘要

Video generation models have advanced significantly through latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite using massive training data, these models fail to capture fundamental geometric principles. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics. Training on static scenes with dynamic cameras ensures metric quality while the model generalizes to various dynamic scenes. By bridging data-driven learning with classical computer vision, we reduce epipolar error by 31% and improve human-rated consistency from 54% to 72% without compromising visual quality.
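As a rough illustration of what an epipolar error measures, the sketch below computes the classical Sampson approximation of the epipolar distance for matched points in two frames and uses it to rank two candidate clips. This is not the paper's pipeline: the fundamental matrix `F_TRANSLATION`, the point pairs, and the helper names are hypothetical values chosen for illustration, and the way a preference pair is formed here is an assumption.

```python
# Sketch only: F_TRANSLATION and the matches below are made-up illustration values.

def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def sampson_error(F, x1, x2):
    """Sampson (first-order) approximation of the squared epipolar distance.

    x1 and x2 are homogeneous pixel coordinates (x, y, 1) of the same scene
    point in frame 1 and frame 2; a perfect match satisfies x2^T F x1 = 0.
    """
    Fx1 = mat_vec(F, x1)
    F_t = [[F[j][i] for j in range(3)] for i in range(3)]
    Ftx2 = mat_vec(F_t, x2)
    residual = sum(x2[i] * Fx1[i] for i in range(3))  # x2^T F x1
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return residual ** 2 / denom

def mean_epipolar_error(F, matches):
    """Average Sampson error over a list of (x1, x2) point matches."""
    return sum(sampson_error(F, a, b) for a, b in matches) / len(matches)

# Fundamental matrix of a pure sideways camera translation:
# corresponding points must stay on the same image row.
F_TRANSLATION = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]

clip_a = [((10, 5, 1), (14, 5, 1)), ((30, 9, 1), (33, 9, 1))]   # rows preserved
clip_b = [((10, 5, 1), (14, 8, 1)), ((30, 9, 1), (33, 9, 1))]   # one point drifts

# A preference pair could label the clip with the lower error as "preferred".
preferred = "a" if mean_epipolar_error(F_TRANSLATION, clip_a) < mean_epipolar_error(F_TRANSLATION, clip_b) else "b"
```

Because the error is only used to rank two samples, nothing here needs to be differentiable, which is consistent with the abstract's remark about not requiring end-to-end differentiability.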


2604.03156 2026-06-18 cs.CV 版本更新

A Systematic Survey of Deep Learning Architectures for Point Cloud Classification and Segmentation

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

Affiliations: State University of New York at Albany

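AI summary: Systematically surveys deep learning architectures for point cloud classification and segmentation: analyzes the structural characteristics of point cloud data, categorizes works by architecture, evaluates their performance on popular benchmarks, and outlines open challenges and future directions.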

Comments We review a decade of advances in point cloud processing: we trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

DOI: 10.1145/3815180
Journal ref: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026

AI中文摘要

点云因其简洁性和几何保真度而成为表示3D形状和场景最广泛采用的格式。然而，其固有的无序和不规则性质，加剧了传感器噪声和遮挡的影响，给基于机器学习的方法带来了独特的挑战。为应对这些问题，已开发出多种策略，包括转换为有序格式、提取局部几何特征以及基于排列不变或自注意力的处理方法。在本文中，我们的重点是深度学习模型在3D视觉三个基本任务中的应用：点云分类、部分分割和语义分割。我们首先正式定义点云数据，然后深入讨论其结构特性。接着，我们根据其骨干结构对重要工作进行分类，并评估其在流行基准上的性能。除了经验比较外，我们还提供了架构创新和局限性的见解。我们还概述了3D点云理解中的开放挑战和有前途的未来方向。

英文摘要

The point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherently unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine-learning-based methods. To combat these issues, diverse strategies have been developed, including conversion to an ordered format, extraction of local geometry, and permutation-invariant or self-attention-based processing. In this paper, we focus on deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion of its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.


2510.10779 2026-06-18 cs.CV 版本更新

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

结构化谱图表示学习用于3D CT扫描的多标签异常分析

Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel

Affiliations: INSA Lyon, University of Lyon, CNRS, INSERM, CREATIS UMR 5220, U1294

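AI summary: Proposes a 2.5D framework based on spectral graph convolution that represents 3D CT volumes as structured graphs, with axial slice triplets as nodes to model inter-slice dependencies, for multi-label abnormality classification with strong cross-dataset generalization.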

Comments Accepted at MELBA Journal 2026

DOI: 10.59275/j.melba.2026-87e3

AI中文摘要

随着CT检查数量的增长，对器官分割、异常检测和报告生成等自动化工具的需求日益增加，以支持放射科医生管理临床工作负载。由于三维数据中固有的复杂空间关系和异常的广泛变异性，3D胸部CT扫描的多标签分类仍然是一个关键但具有挑战性的问题。基于3D卷积神经网络的现有方法难以捕捉长距离依赖，而视觉Transformer通常需要在大规模领域特定数据集上进行大量预训练才能获得竞争力。在这项工作中，我们提出了一种2.5D替代方案，引入了一个新的基于图的框架，将3D CT体积表示为结构化图，其中轴向切片三元组作为节点，通过谱图卷积处理，使模型能够推理层间依赖，同时保持与临床部署兼容的复杂度。我们的方法在来自独立机构的3个数据集上进行训练和评估，实现了强大的跨数据集泛化能力，并与最先进的视觉编码器相比表现出竞争性能。我们进一步进行了全面的消融研究，以评估各种聚合策略、边加权方案和图连接模式的影响。此外，我们通过自动放射学报告生成和腹部CT数据的迁移实验展示了我们方法的更广泛适用性。

英文摘要

With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.


2512.09185 2026-06-18 cs.CV cs.AI 版本更新

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

学习患者特异性疾病动态：基于潜在流匹配的纵向影像生成

Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li

Affiliations: University of Cambridge; Nanjing First Hospital; Nanjing Medical University; Johns Hopkins University; University of Dundee

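AI summary: Proposes the Δ-LFM framework, which uses flow matching to align patients' latent trajectories and patient-specific latent alignment to model monotonic disease progression; interpretability and performance are validated on three longitudinal MRI benchmarks.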

Comments Accepted at ICLR 2026

AI中文摘要

理解疾病进展是一项核心临床挑战，对早期诊断和个性化治疗具有直接意义。虽然最近的生成方法试图对进展进行建模，但关键不匹配仍然存在：疾病动态本质上是连续且单调的，然而潜在表示通常是分散的，缺乏语义结构，并且基于扩散的模型通过随机去噪过程破坏了连续性。在这项工作中，我们提出将疾病动态视为速度场，并利用流匹配（FM）来对齐患者数据的时间演变。与先前方法不同，它捕捉了疾病的内在动态，使进展更具可解释性。然而，一个关键挑战仍然存在：在潜在空间中，自动编码器（AE）不能保证跨患者的对齐或与临床严重性指标（例如年龄和疾病状况）的相关性。为了解决这个问题，我们提出学习患者特异性潜在对齐，这迫使患者轨迹沿着特定轴延伸，其幅度随疾病严重程度单调增加。这导致了一个一致且语义上有意义的潜在空间。总之，我们提出了Δ-LFM，一个用于通过流匹配建模患者特异性潜在进展的框架。在三个纵向MRI基准上，Δ-LFM展示了强大的实证性能，更重要的是，为解释和可视化疾病动态提供了一个新框架。

英文摘要

Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered and lack semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which constrains patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $\Delta$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $\Delta$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
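To make the "disease dynamic as a velocity field" idea concrete, here is a minimal sketch of the standard linear-path flow matching target, assuming `z0` and `z1` are latent codes of the same patient at an earlier and a later visit. The function names and toy latents are hypothetical, and the paper's patient-specific alignment constraint is not shown.

```python
# Sketch of a generic linear-path flow matching target; not the paper's exact objective.

def fm_training_pair(z0, z1, t):
    """Return the point on the straight path at time t and its target velocity.

    z0, z1: latent vectors (lists of floats) at the start and end of the path.
    t: interpolation time in [0, 1].
    """
    z_t = [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    v_target = [b - a for a, b in zip(z0, z1)]  # constant along a straight path
    return z_t, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)

# Toy example: progression moves only along the first latent axis.
z_t, v = fm_training_pair([0.0, 1.0], [2.0, 1.0], t=0.25)
```

A network trained to predict `v` from `(z_t, t)` can then be integrated forward in time to extrapolate a trajectory; when progression is confined to one axis, as in the toy example, the velocity has a single non-zero component.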


2512.10353 2026-06-18 cs.CV 版本更新

Hybrid Transformer-Mamba for Weakly Supervised Volumetric Medical Segmentation

混合Transformer-Mamba用于弱监督体积医学分割

Yiheng Lyu, Lian Xu, Coen Arrow, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi

Affiliations: University of Western Australia; Harry Perkins Institute of Medical Research; National Imaging Facility; Fiona Stanley Hospital; Victor Chang Cardiac Research Institute

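AI summary: Proposes TranSamba, a hybrid architecture that captures 3D context through cross-plane modeling for efficient weakly supervised volumetric segmentation, achieving state-of-the-art performance on three datasets.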

AI中文摘要

弱监督分割使得模型能够从平面级标签进行训练。现有方法通常依赖2D编码器，忽略了医学数据的体积特性。我们提出TranSamba，一种混合Transformer-Mamba架构，旨在通过跨平面建模捕获3D上下文。TranSamba在Vision Transformer骨干网络基础上增加跨平面Mamba块，利用线性时间建模实现相邻平面间的高效信息交换。这种交换改善了平面内自注意力以及后续用于目标定位的注意力图。TranSamba在输入体积深度上保持线性时间复杂度和恒定空间复杂度。在涵盖不同模态和病理的三个数据集上的大量实验表明，TranSamba达到了最先进的性能，展示了跨平面建模的泛化有效性。代码可在以下网址获取：https://github.com/YihengLyu/TranSamba。

英文摘要

Weakly supervised segmentation enables model training from plane-level labels. Existing methods often rely on 2D encoders, neglecting the volumetric nature of medical data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context via cross-plane modeling. TranSamba augments a Vision Transformer backbone with Cross-Plane Mamba blocks, leveraging linear-time modeling for efficient information exchange across neighboring planes. This exchange improves in-plane self-attention and subsequent attention maps for object localization. TranSamba maintains linear time complexity and constant space complexity with respect to the input volume depth. Extensive experiments on three datasets covering diverse modalities and pathologies show that TranSamba achieves state-of-the-art performance, demonstrating the generalizable efficacy of cross-plane modeling. Code is available at: https://github.com/YihengLyu/TranSamba.


2606.00491 2026-06-18 cs.CV cs.AI 版本更新

Enhancing Cross-Scale Reasoning in Pathology Vision-Language Models

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

Affiliations: Department of Electrical and Computer Engineering, National University of Singapore; PuzzleLogic Pte Ltd; Department of Pathology, Fujian Medical University Cancer Hospital & Fujian Cancer Hospital

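AI summary: Proposes the first cross-scale training and evaluation paradigm, which strengthens the cross-scale reasoning of pathology vision-language models through multi-magnification visual question answering, and builds the high-quality benchmark Scale-VQA and the model ScaleReasoner-R1, which achieves state-of-the-art performance.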

AI中文摘要

病理图像本质上是多尺度的，要求病理学家整合从低倍放大下的整体组织结构到高倍放大下的细胞形态的证据以进行准确诊断。虽然现有的视觉语言模型（VLM）病理数据集包含多种尺度，但它们通常缺乏明确的跨尺度推理目标。这一限制阻碍了VLM捕获关键的跨尺度表示和学习基于证据的推理。为弥补这一差距，我们引入了首个跨尺度训练和评估范式，将病理解释表述为多倍率推理。然而，创建这样的任务揭示了一个关键挑战：多图像视觉问答（VQA）容易受到仅文本捷径的影响，这使得模型能够利用与放大倍数相关的伪影而非视觉证据来猜测答案。为解决此问题，我们提出了一种泄漏感知的策展流程，结合了对抗性仅文本筛选和约束引导的问题设计。利用该流程，我们构建了Scale-VQA，一个高质量基准，包含4,685个多项选择题，基于2,537张跨多个放大级别的病理图像。最后，我们提出了ScaleReasoner-R1，一个通过强化学习训练的模型，以优化跨尺度VQA任务的性能。ScaleReasoner-R1在我们的跨尺度推理基准上达到了最先进的性能，并在已有的单尺度基准上泛化到最先进的性能。研究结果表明，即使是有限的跨尺度监督也能显著改善病理理解。代码和演示将开源。

英文摘要

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Our findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.


2508.11211 2026-06-18 eess.IV cs.CV 版本更新

Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

面向CT视野扩展的高效图像到图像薛定谔桥

Zhenhao Li, Song Ni, Long Yang, Xiaojie Yin, Haijun Yu, Jiazhou Wang, Hongbin Han, Weigang Hu, Yixing Huang

Affiliations: Institute of Medical Technology, Peking University Health Science Center; Shanghai Cancer Center, Fudan University; Department of Electrical and Computer Engineering, University of Massachusetts Lowell; Beijing Key Laboratory of Intelligent Neuromodulation and Brain Disorder Treatment

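AI summary: Proposes a CT field-of-view extension framework based on the image-to-image Schrödinger Bridge (I²SB) diffusion model, which directly learns a stochastic mapping between limited-FOV and extended-FOV images and supports fast one-step inference, surpassing existing diffusion models in both accuracy and speed.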

Comments 12 pages

Journal ref: IEEE Transactions on Radiation and Plasma Medical Sciences 2026

AI中文摘要

计算机断层扫描（CT）是一种用于无创、高分辨率可视化内部解剖结构的基石成像模态。然而，当扫描物体超出扫描仪的视野（FOV）时，投影数据被截断，导致重建不完整并在FOV边界附近出现明显伪影。传统重建算法难以从这类数据中恢复准确的解剖结构，限制了临床可靠性。深度学习方法已被探索用于FOV扩展，其中扩散生成模型代表了图像合成的最新进展。然而，传统扩散模型由于迭代采样过程，计算量大且推理速度慢。为解决这些限制，我们提出了一种基于图像到图像薛定谔桥（I$^2$SB）扩散模型的高效CT FOV扩展框架。与从纯高斯噪声合成图像的传统扩散模型不同，I$^2$SB学习配对的有限FOV和扩展FOV图像之间的直接随机映射。这种直接对应关系产生了更可解释和可追踪的生成过程，增强了重建中的解剖一致性和结构保真度。I$^2$SB实现了优越的定量性能，在模拟噪声数据上的均方根误差（RMSE）值为49.8 HU，在真实数据上为152.0 HU，优于最先进的扩散模型，如条件去噪扩散概率模型（cDDPM）和基于块的扩散方法。此外，其单步推理使得每2D切片的重建仅需0.19秒，相比cDDPM（135秒）实现了超过700倍的加速，并超过了第二快的DiffusionGAN（0.58秒）。这种准确性和效率的结合表明I$^2$SB具有实时或临床部署的潜力。

英文摘要

Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner's field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I$^2$SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I$^2$SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I$^2$SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19 s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135 s) and surpassing DiffusionGAN (0.58 s), the second fastest. This combination of accuracy and efficiency indicates that I$^2$SB has potential for real-time or clinical deployment.
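The speed claims above can be checked with simple arithmetic. The sketch below only reuses the per-slice timings quoted in the abstract; the dictionary and function names are illustrative.

```python
# Per-slice inference times in seconds, as quoted in the abstract above.
timings_s = {"cDDPM": 135.0, "DiffusionGAN": 0.58, "I2SB": 0.19}

def speedup(baseline_s: float, method_s: float) -> float:
    """How many times faster `method_s` is than `baseline_s`."""
    return baseline_s / method_s

for name in ("cDDPM", "DiffusionGAN"):
    ratio = speedup(timings_s[name], timings_s["I2SB"])
    print(f"I2SB vs {name}: {ratio:.1f}x")
# prints:
# I2SB vs cDDPM: 710.5x
# I2SB vs DiffusionGAN: 3.1x
```

The first ratio matches the "over 700-fold" figure, and 0.19 s per slice corresponds to roughly 5 slices per second.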


2408.01526 2026-06-18 cs.CV 版本更新

Recognizing and Reconstructing a Multi-Unit Floor Plan

识别与重建多单元楼层平面图

Lukas Kratochvila, Gijs de Jong, Monique Arkesteijn, Simon Bilik, Tomas Zemcik, Karel Horak, Jan S. Rellermeyer

Affiliations: Department of Control and Instrumentation, Brno University of Technology, Brno, Czech Republic; Department of Software Technology, Faculty of Electrical Engineering, Mathematics and Computer Science, TU Delft, Delft, Netherlands; Department of Management in the Built Environment, Faculty of Architecture and the Built Environment, TU Delft, Delft, Netherlands; Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Ostrava, Czech Republic, and Department of Informatics, Mendel University in Brno, Brno, Czech Republic; Department of Software Technology, Faculty of Electrical Engineering, Mathematics and Computer Science, TU Delft, Delft, Netherlands, and Dependable and Scalable Software Systems, Institute of Systems Engineering, Faculty of Electrical Engineering and Computer Science, Leibniz University Hannover, Hannover, Germany

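AI summary: Proposes pixel-wise segmentation methods based on MDA-Unet and MACU-Net with improved skip connections and an attention mechanism, reconstructs 3D models from 2D floor plans, and reaches a mean F1 score of 0.86 on the CubiCasa dataset.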

AI中文摘要

数字孪生在应急规划中具有巨大潜力，可更高效设计逃生路线、在异常情况下提供更好方向感并加快救援干预。然而，由于缺乏3D表示（仅部分新建筑有有限数量），创建数字孪生仍主要依赖手动工作。因此，本文旨在从常见的2D建筑平面图合成3D信息。我们提出两种基于MDA-Unet和MACU-Net架构的新型像素级分割方法，具有改进的跳跃连接、注意力机制以及训练目标，并结合流水线的重建部分，将分割后的平面图矢量化以创建3D模型。我们在多个基准数据集上将所提方法与另外两种最先进技术进行了比较。在常用的CubiCasa基准数据集上，我们的方法在五个检查类别上实现了平均F1分数0.86，优于其他测试的像素级方法。我们还公开了代码以支持该领域的研究。

英文摘要

Digital twins have major potential to form a significant part of urban management in emergency planning, as they allow more efficient design of escape routes, better orientation in exceptional situations, and faster rescue intervention. Nevertheless, creating the twins remains a largely manual effort, due to a lack of 3D representations, which are available only in limited amounts for some new buildings. Thus, in this paper we aim to synthesize 3D information from commonly available 2D architectural floor plans. We propose two novel pixel-wise segmentation methods based on the MDA-Unet and MACU-Net architectures with improved skip connections, an attention mechanism, and a training objective, together with a reconstruction part of the pipeline, which vectorizes the segmented plans to create a 3D model. The proposed methods are compared with two other state-of-the-art techniques on several benchmark datasets. On the commonly used CubiCasa benchmark dataset, our methods achieve a mean F1 score of 0.86 over five examined classes, outperforming the other pixel-wise approaches tested. We have also made our code publicly available to support research in the field.


2605.02089 2026-06-18 cs.CV 版本更新

Cross-Lingual Learning within Arabic Script for Low-Resource HTR

阿拉伯文字内低资源手写文本识别的跨语言学习

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

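AI summary: For low-resource Arabic-script handwritten text recognition, jointly trains CRNN and HTR-VT models across languages, lowering character error rate under low-resource conditions, with experiments on the KHATT, NUST-UHWR, and PHTD datasets.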

Comments Accepted at the DALL workshop, ICDAR 2026

AI中文摘要

有限标注数据下的手写文本识别（HTR）仍然是一个具有挑战性的问题，尤其是对于阿拉伯文字语言。尽管现代基于序列的识别器在高资源设置下表现良好，但随着训练数据的稀缺，其准确率急剧下降。阿拉伯文字语言共享一个书写系统，具有大量字符重叠，这促使跨语言学习成为缓解数据稀缺的一种策略。我们在低资源场景（样本数K=100、500、1000标注行）下，对阿拉伯语（KHATT）、乌尔都语（NUST-UHWR）和波斯语（PHTD）进行了受控的行级跨语言联合训练研究。基于CRNN和Vision Transformer的HTR-VT模型在多个相关阿拉伯文字数据集的联合集上进行训练以缓解数据稀缺，并在单个目标语言上进行评估。两种架构在低资源条件下均受益于跨语言训练。CRNN在目标语言数据极其有限时仍然更有效，而随着更多目标语言数据的可用，HTR-VT的跨语言训练收益变得不太一致。在波斯语（PHTD）上，联合训练实现了9.99的字符错误率（CER），尽管未使用全部可用训练数据，仍超越了先前报告的结果。在另一个乌尔都语数据集（UNHD）上，联合训练将CER从17.20降低到14.45。

英文摘要

Handwritten Text Recognition (HTR) with limited labeled data remains a challenging problem, particularly for Arabic-script languages. Although modern sequence-based recognizers perform well in high-resource settings, their accuracy degrades sharply as training data becomes scarce. Arabic-script languages share a common writing system with substantial character overlap, motivating cross-lingual learning as a strategy to mitigate data scarcity. We conduct a controlled line-level study of cross-lingual joint training for Arabic-script HTR under low-resource regimes (number of samples K = 100, 500, 1000 labeled lines) on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD). CRNN and Vision Transformer-based HTR-VT models are trained on the union of multiple related Arabic-script datasets to mitigate data scarcity and are evaluated on individual target languages. Both architectures benefit from cross-language training under low-resource conditions. CRNN remains more effective under extremely limited target-language data, whereas the benefits of cross-language training for HTR-VT become less consistent as larger amounts of target-language data become available. On Persian (PHTD), joint training achieves a Character Error Rate (CER) of 9.99, surpassing previously reported results despite not using the full available training data. On an additional Urdu dataset (UNHD), joint training reduces CER from 17.20 to 14.45.
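For readers unfamiliar with the metric, the sketch below shows one common way to compute a Character Error Rate: the edit distance between prediction and reference divided by the reference length. It assumes CER is reported in percent and counts Unicode code points; how the paper normalizes Arabic-script text (for example, combining marks) is not stated in the abstract, and the function names are illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)

def relative_reduction(before: float, after: float) -> float:
    """Relative improvement in percent when an error rate drops."""
    return 100.0 * (before - after) / before

# The UNHD figures quoted above: 17.20 -> 14.45 is about a 16% relative reduction.
unhd_gain = relative_reduction(17.20, 14.45)
```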


2204.14224 2026-06-18 cs.CV cs.LG eess.IV 版本更新

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information

不完全信息条件下纹理图像重建与分类的神经网络方法研究

Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Darkhan Kurmangaliyev, Daniyar Nurseitov, Tatyana Dedova, Larissa Balakay, Serik Nurakynov

Affiliations: Satbayev University; Institute of Ionosphere LLP; Information Technology Department; Assiut University

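AI summary: Proposes an end-to-end framework combining object detection, GAN-based (CRA) inpainting, and Transformer/CNN classification; finds that reconstruction quality is high (PSNR 28.7 dB) while classification accuracy reaches only 53%, raises MCA from 48% to 58% with a confidence-based hybrid ensemble, and shows that generative models can produce semantically ambiguous features.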

Comments IEEE Access

DOI: 10.1109/ACCESS.2026.3705029

AI中文摘要

异质自然纹理的自动化分析常因物理损伤和数据丢失而受阻，这对计算机视觉构成了重大挑战。虽然深度学习在受控环境中已显示出成功，但其在信息不完全条件下对复杂地质材料的应用仍未被充分探索。本研究提出了一个用于高分辨率岩心样本图像修复和分类的集成框架。我们设计了一个端到端流水线，利用目标检测进行样本分割，随后使用具有上下文残差聚合（CRA）的生成对抗网络（GAN）进行图像修复，以重建缺失的高频细节。接着，我们在重建数据上评估了现代基于Transformer（Swin、ViT）和CNN架构的性能。实验揭示了重建质量与下游效用之间的关键分歧：尽管结构保真度高（PSNR 28.7 dB，FID 74.01），分类准确率却停滞在53%。为了改善少数类检测，我们提出了一种基于置信度的混合集成方法，将MCA从48%提升至58%。这些结果凸显了当前最先进生成模型的局限性，它们可能产生视觉上合理但语义模糊的特征（“幻觉”），从而混淆分类器。本工作深入探讨了图像重建质量与分类性能之间的依赖关系，为无损检测和材料科学领域的未来研究提供了可复现的基线。鉴于井间准确率仍处于49-53%范围，我们将所得到的系统定位为岩相解释的决策支持和筛选工具，而非完全自主的分类器。代码可在以下网址获取：https://github.com/GalymzhanAbdimanap/Lithology_recognition

英文摘要

The automated analysis of heterogeneous natural textures is frequently hindered by physical damage and data loss, presenting a significant challenge to computer vision. While deep learning has shown success in controlled environments, its application to complex geological materials under conditions of incomplete information remains underexplored. This study presents an integrated framework for the inpainting and classification of high-resolution core sample images. We propose an end-to-end pipeline that utilizes object detection for sample segmentation, followed by image inpainting using Generative Adversarial Networks (GANs) with Contextual Residual Aggregation (CRA) to reconstruct missing high-frequency details. Subsequently, we evaluate the performance of modern Transformer-based (Swin, ViT) and CNN architectures on the reconstructed data. Our experiments revealed a critical divergence between reconstruction quality and downstream utility: despite high structural fidelity (PSNR 28.7 dB, FID 74.01), classification accuracy plateaued at 53%. To improve minority-class detection, we propose a confidence-based hybrid ensemble that raises MCA from 48% to 58%. These results highlight the limitations of current state-of-the-art generative models, which may produce visually plausible but semantically ambiguous features ("hallucinations") that confound classifiers. This work provides insights into the dependencies between image reconstruction quality and classification performance, offering a reproducible baseline for future research in non-destructive testing and material science. Given that cross-well accuracy remains in the 49-53% range, we position the resulting system as a decision-support and screening tool for lithofacies interpretation rather than as a fully autonomous classifier. The code is available at https://github.com/GalymzhanAbdimanap/Lithology_recognition
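The gap between reconstruction quality and classification quality is easier to see with the two kinds of metric side by side. The sketch below computes PSNR for a restored signal and a per-class-averaged accuracy. It assumes that "MCA" stands for mean class accuracy, which the abstract does not spell out, and all data and function names are toy illustrations.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class recall, so that rare classes count as much as common ones."""
    hits, totals = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (1 if t == p else 0)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy labels: the majority class is always right, the minority class never is.
y_true = ["sand", "sand", "sand", "shale"]
y_pred = ["sand", "sand", "sand", "sand"]
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.75
balanced = mean_class_accuracy(y_true, y_pred)                        # 0.5
```

PSNR only measures pixel-level agreement, so a high value says nothing about whether class-relevant texture was restored, and a per-class average exposes minority-class failures that overall accuracy hides.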


2601.01200 2026-06-18 cs.CV eess.IV 版本更新

Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

点云的多尺度隐式结构相似性客观质量评估

Zhang Chen, Shuai Wan, Yuezhe Zhang, Siyu Ren, Fuzheng Yang, Junhui Hou

Affiliations: School of Electronics and Information, Northwestern Polytechnical University; Department of Computer Science, City University of Hong Kong; School of Telecommunication Engineering, Xidian University

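AI summary: To address the difficulty of matching irregular data in point cloud quality assessment, proposes the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM), which represents local features continuously with radial basis functions and compares implicit function coefficients; combined with the ResGrouped-MLP network, it outperforms existing methods on multiple benchmarks.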

Comments Accepted to IEEE TMM

AI中文摘要

点云的无结构和不规则特性对精确的点云质量评估（PCQA）构成重大挑战，特别是在建立准确的感知特征对应关系方面。为了解决这一问题，我们提出了多尺度隐式结构相似性度量（MS-ISSM）。与传统的点对点匹配不同，MS-ISSM利用径向基函数（RBF）连续表示局部特征，将失真测量转化为隐式函数系数的比较。该方法有效避免了不规则数据中固有的匹配误差。此外，我们提出了ResGrouped-MLP质量评估网络，该网络能够鲁棒地将多尺度特征差异映射到感知分数。该网络架构摒弃了传统的平面多层感知器（MLP），采用分组编码策略，集成了残差块和通道注意力机制。这种分层设计使得模型能够保留亮度、色度和几何的独特物理语义，同时自适应地关注高、中、低尺度上最显著的失真特征。在多个基准上的实验结果表明，MS-ISSM在可靠性和泛化性方面均优于最先进的指标。源代码可在以下网址获取：https://github.com/ZhangChen2022/MS-ISSM。

英文摘要

The unstructured and irregular nature of point clouds poses a significant challenge for accurate point cloud quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM uses radial basis functions (RBFs) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents the matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from the traditional flat multi-layer perceptron (MLP) by adopting a grouped encoding strategy integrated with residual blocks and channel-wise attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across high, medium, and low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.


2602.00176 2026-06-18 cs.CV cs.AI 版本更新

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

基于噪声条件频率暴露的扩散逆问题后验延续

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

Affiliations: Shanghai Jiao Tong University

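AI summary: Proposes a posterior continuation framework that progressively exposes measurement frequencies according to the diffusion noise level and pairs it with a stabilized sampler, achieving competitive-to-state-of-the-art performance in super-resolution, inpainting, and deblurring.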

AI中文摘要

扩散后验采样通过将预训练的扩散先验与测量一致性指导相结合来解决逆问题。然而，在高噪声水平下，全频带指导可能不可靠，因为干净估计包含分数诱导误差，且高频测量方向弱可识别。我们认为后验指导应根据瞬时扩散噪声水平暴露测量频率。基于这一原则，我们提出一个后验延续框架，构建一系列中间后验，其似然强调当前可靠频带并逐渐恢复全频带一致性。我们通过一个稳定采样器实例化该框架，该采样器结合了扩散预测器、频率受限似然细化以及Haar域承诺规则，该规则提交可靠粗校正同时推迟弱可识别细节。在超分辨率、修复和去模糊任务中，我们的方法实现了具有竞争力乃至最先进的恢复性能，包括在FFHQ和ImageNet评估中，运动去模糊相比强基线PSNR提升高达5 dB。

英文摘要

Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance. However, full-band guidance can be unreliable at high noise levels, where clean estimates contain score-induced errors and high-frequency measurement directions are weakly identifiable. We argue that posterior guidance should expose measurement frequencies according to the instantaneous diffusion noise level. Based on this principle, we propose a posterior continuation framework that constructs a family of intermediate posteriors whose likelihood emphasizes currently reliable frequency bands and gradually returns to full-band consistency. We instantiate this framework with a stabilized sampler that combines a diffusion predictor, frequency-limited likelihood refinement, and a Haar-domain commitment rule that commits reliable coarse corrections while deferring weakly identifiable details. Across super-resolution, inpainting, and deblurring, our method achieves competitive-to-state-of-the-art restoration performance, including up to 5 dB PSNR improvement on motion deblurring over strong baselines in evaluations on FFHQ and ImageNet.

2603.05010 2026-06-18 cs.CV (version update)

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

Affiliations: Fudan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; Multimedia Laboratory, The Chinese University of Hong Kong; Shenzhen University of Advanced Technology

AI summary: Systematically compares diffusion-based, GAN-based, and other generative models with PSNR-oriented models using a multi-dimensional evaluation pipeline, reveals a paradigm shift from detail scarcity to detail quality and semantic control, and trains an IQA model that better aligns with human perception.

Comments: Accepted by CVPR 2026 Findings

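AI-generated Chinese abstract (English translation)

Generative image restoration (GIR) has made remarkable progress in perceptual realism, but how much have its practical capabilities actually improved over earlier methods? To answer this question, we conduct a large-scale study based on a new multi-dimensional evaluation pipeline that assesses models along four dimensions: detail, sharpness, semantic correctness, and overall quality. Our analysis covers a range of architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, and reveals key performance differences. The analysis also reveals an evolution in failure modes that marks a paradigm shift in perception-oriented low-level vision: the core challenge is moving from the earlier problem of detail scarcity (under-generation) to a new frontier of detail quality and semantic control (preventing over-generation). We also use our benchmark to train a new IQA model that aligns better with human perceptual judgments. Overall, this work presents a systematic study of modern generative image restoration models, offering key insights that redefine the understanding of their true state and point the way for future development.

English abstract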

Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

2605.12567 2026-06-18 cs.CV cs.AI (version update)

Pyramid Self-Contrastive Learning for Single-shot Test-time Ultrasound Image Denoising

Jiajing Zhang, Bingze Dai, Xi Zhang, Yue Xu, Wei-Ning Lee

Affiliations: Department of Electrical and Computer Engineering, The University of Hong Kong; Department of Biomedical Engineering, Duke University

AI summary: Proposes a purely test-time training framework for single-shot ultrasound image denoising, applied to synthetic aperture ultrasound, that uses self-contrastive learning to separate anatomical similarity from noise randomness, improving denoising quality and structural detail.

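AI-generated Chinese abstract (English translation)

Inherent electronic and speckle noise complicates the clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity weakens under composite noise conditions. Learning-based methods require large amounts of labeled data and many model parameters. These predefined and pretrained methods inevitably suffer domain shift in complex in vivo environments, so they are limited to specific noise types and often blur structural details. This paper proposes a purely test-time training framework for single-shot ultrasound image denoising and applies it to synthetic aperture ultrasound (SAU); the method uses self-contrastive learning to separate anatomical similarity from noise randomness in pyramid latent spaces. The clean image is then decoded from the anatomy space, and the noise space is discarded. A2A is trained at test time on the SAU signals of a single noisy sample, which fundamentally eliminates domain shift and pretraining cost. Simulation experiments, covering electronic noise levels from 0 to 30 dB and different inclusion geometries, show that A2A improves SNR and CNR by 69.3% and 34.4%, respectively. In vivo results show SNR and CNR gains of 84.8% and 25.7%, respectively, using only two-aperture data from six echocardiographic views of the heart, the liver, and the kidney. A2A produces clear images and signals across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization and functional assessment in ultrasound.

English abstract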

The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods are usually pretrained in a limited image domain using a labeled dataset, which implies inevitable domain shift in complex in vivo environments. This study proposes a Pyramid Self-Contrastive Learning (PSCL) framework for test-time ultrasound image denoising without pretraining. Given multiple noisy samples from only one-shot imaging, PSCL disentangles anatomical similarity and noise randomness into separate pyramid latent spaces. The clean image is then decoded from the anatomy space while discarding the noise space. We first apply PSCL to synthetic aperture ultrasound (SAU), where an Aperture-to-Aperture loop serves as a self-supervised proxy task to ensure denoising fidelity. Simulation experiments, including noise levels from 0 to 30 dB and inclusion geometries from simple to complex, demonstrated improvements of 69.3% in SNR and 34.4% in CNR. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. PSCL delivers clear images across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization without domain shift and pretraining costs.

2506.11139 2026-06-18 eess.IV cs.AI cs.CV (version update)

Grids Often Outperform Implicit Neural Representations at Compressing Dense Signals

Namhoon Kim, Sara Fridovich-Keil

Affiliations: Department of Electrical and Computer Engineering; Georgia Institute of Technology

AI summary: Finds that, for tasks on dense signals, a regularized grid with interpolation trains faster than implicit neural representations (INRs) with the same parameter count and reaches equal or better reconstruction quality, while INRs do better only when fitting binary signals such as shape contours.

Comments: Our analysis is available at https://github.com/voilalab/INR-benchmark

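AI-generated Chinese abstract (English translation)

Implicit neural representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We study the performance of different INRs on a range of real and synthetic 2D and 3D signals with varying effective bandwidth, and on overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance by model size, signal type, and bandwidth, our results reveal how different INR and grid representations allocate their capacity. We find that, for many tasks involving dense signals, a simple regularized grid with interpolation trains faster than any INR with the same number of parameters and reaches equal or better quality. We also identify a limited set of cases, namely fitting binary signals such as shape contours, in which INRs outperform grids, to guide the future development and use of INRs toward the applications where they are most advantageous.

English abstract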

Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for many tasks involving dense signals, a simple regularized grid with interpolation trains faster and to higher or comparable quality than any INR with the same number of parameters. We also find limited settings -- namely fitting binary signals such as shape contours -- where INRs outperform grids, to guide future development and use of INRs towards the most advantageous applications.

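2508.03483 2026-06-18 cs.CV cs.AI (version update)

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

Affiliations: AIM Intelligence; Yonsei University

AI summary: Proposes the SODA framework, which uses three metrics to systematically measure demographic bias in objects generated by text-to-image models; finds that neutral prompts implicitly skew toward middle-aged and White people and that demographic cues trigger highly skewed, stereotypical outputs.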

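AI-generated Chinese abstract (English translation)

While prior research on text-to-image generation has focused mainly on bias in depictions of people, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework that systematically measures these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images from five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs that are visually closest to those for middle-aged and White people, indicating that these groups are implicitly over-represented in model defaults. In addition, demographic cues trigger highly skewed, stereotypical outputs: in 26.6% of object-model-demographic combinations, all 20 generated images share exactly the same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces disparity between groups but paradoxically collapses diversity within groups, replacing one stereotype with another. SODA provides a practical pipeline that makes these implicit associations measurable, as a step toward more responsible AI development.

English abstract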

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.

2511.20002 2026-06-18 cs.CV cs.AI cs.CR (version update)

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Affiliations: The Chinese University of Hong Kong, Shenzhen, China; School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China

AI summary: Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions simultaneously; built on theoretical analysis and the SORT optimization strategy, it reaches a 66% attack success rate over five targets against Qwen.

Comments: Accepted to ICML 2026

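AI-generated Chinese abstract (English translation)

Multimodal large language models (MLLMs) are increasingly deployed in stateless systems such as autonomous driving and robotics. This paper studies a new threat: semantic-aware hijacking. We explore the feasibility of using a single universal perturbation to hijack multiple stateless decisions simultaneously. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router that "actively" perceives input semantics and routes them to distinct, attacker-defined targets. To achieve this, we analyze the geometric properties of the latent space both theoretically and empirically. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of the attack, which achieves a 66% attack success rate over five targets against Qwen using a single frame.

English abstract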

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

2606.11615 2026-06-18 cs.CV cs.CR cs.LG (version update)

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Omid Ahmadieh, Nima Karimian

Affiliations: University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing

AI summary: Proposes the Adv-TGD framework, which uses Stable Diffusion with LoRA fine-tuning to generate photorealistic adversarial faces, achieving high-success-rate identity impersonation attacks (85.90% average ASR) while preserving visual quality.

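AI-generated Chinese abstract (English translation)

The widespread adoption of face recognition (FR) technology raises serious privacy concerns, because facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces that impersonate a target identity and deceive face recognition systems. Built on Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise text prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity attack methods, our approach optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that combines masked epsilon-MSE reconstruction, thresholded identity divergence in the FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack strength and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic detail without reintroducing identity cues. Under a black-box evaluation protocol, Adv-TGD reaches an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, exceeding the semantic SOTA baseline Adv-CPG by 6.25 percentage points, the diffusion-based makeup method DiffAIM by 3 percentage points, and the noise-based P3-Mask by 16 percentage points. Despite its strong attack performance, Adv-TGD maintains high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). We further demonstrate the flexibility of the framework by successfully extending it to an in-the-wild dataset (LADN), general object classification (ImageNet), and a transformer-based diffusion model (FLUX.1).

English abstract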

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion v2.1, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a fixed-timestep denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by 6.25 points, the diffusion-based makeup method DiffAIM by 3 points, and the noise-based P3-Mask by 16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 28.18 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

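2504.14798 2026-06-18 cs.LG cs.CV (version update)

RUB: Evaluating Residual Knowledge in Unlearned Models

Hao Xuan, Xingyu Li

Affiliations: Electrical and Computer Engineering, University of Alberta

AI summary: Proposes the principle of Robust Unlearning and a unified benchmark, RUB, which uses the Unlearning Mapping Attack (UMA) to detect residual information and reveals that existing methods are vulnerable under adversarial evaluation.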

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2026, pages 8550-8559

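AI-generated Chinese abstract (English translation)

Machine unlearning (MUL) has become a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. Most existing work focuses on verifying that unlearning has been carried out, but overlooks the critical question of whether a model remains robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate the principle of Robust Unlearning, which requires a model to be both indistinguishable from a retrained model and resilient to diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), which systematically evaluates the robustness of unlearning algorithms in classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a general method for detecting residual information, and show how existing attack strategies can be adapted to it, provided that they conform to the generic UMA formulation. Our experiments on discriminative and generative tasks show that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when they pass standard verification metrics. By positioning robustness as a core criterion and providing a benchmark for adversarial evaluation, we hope RUB will pave the way toward more reliable and secure unlearning practice. The codebase and model checkpoints in RUB will be released publicly.

English abstract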

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

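2505.03646 2026-06-18 cs.LG cs.AI cs.CV (version update)

Revealing Hidden Vulnerabilities in Autoencoders through Gradient Signal Restoration

Chethan Krishnamurthy Ramanaik, Arjun Roy, Tobias Callies, Eirini Ntoutsi

Affiliations: University of the Bundeswehr Munich

AI summary: Addresses the overestimation of autoencoder robustness caused by vanishing gradients in adversarial attacks by proposing GRILL, a framework that restores the gradient signal, significantly increases attack effectiveness, and exposes hidden vulnerabilities.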

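AI-generated Chinese abstract (English translation)

The adversarial robustness of deep autoencoders (AEs) has received far less attention than that of discriminative models, even though their compressed latent representations induce ill-conditioned mappings that amplify small input perturbations and destabilize reconstructions. Existing white-box attacks on AEs, which optimize norm-bounded adversarial perturbations to maximize reconstruction loss, often converge to suboptimal perturbations and may therefore overestimate AE robustness. We show that this limitation is linked to adversarial loss gradients vanishing during backpropagation through ill-conditioned layers, whose intermediate weight matrices have near-zero singular values. To address this, we propose GRILL (Gradient Signal Restoration in Ill-Conditioned Layers), a framework designed to mitigate gradient degradation and improve the reliability of adversarial robustness evaluation in encoder-decoder architectures. GRILL mitigates adversarial gradient degradation during optimization, so that attacks can better approximate high-distortion perturbations under a fixed norm constraint. Through extensive experiments on multiple AE architectures, covering sample-specific and universal attacks as well as standard and adaptive attack settings, we show that GRILL significantly increases attack effectiveness and thereby exposes vulnerabilities hidden by the limitations of existing attacks. Beyond AEs, we provide preliminary evidence that modern multimodal encoder-decoder architectures have similar vulnerabilities.

English abstract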

Adversarial robustness of deep autoencoders (AEs) has received less attention than that of discriminative models, although their compressed latent representations induce ill-conditioned mappings that can amplify small input perturbations and destabilize reconstructions. Existing white-box attacks for AEs, which optimize norm-bounded adversarial perturbations to maximize reconstruction damage, often converge to suboptimal perturbations, thereby potentially overstating AE robustness. We show that this limitation is linked to vanishing adversarial loss gradients during backpropagation through ill-conditioned layers, associated with near-zero singular values in their intermediate weight matrices. To address this, we propose GRILL (Gradient Signal Restoration in Ill-Conditioned Layers), a framework designed to mitigate gradient degradation and improve the reliability of adversarial robustness evaluation in encoder-decoder architectures. GRILL is designed to mitigate adversarial gradient degradation during optimization, enabling attacks to better approximate high-distortion perturbations under fixed norm constraints. Through extensive experiments across multiple AE architectures, under both sample-specific and universal attacks, as well as standard and adaptive attack settings, we show that GRILL significantly increases attack effectiveness, thereby exposing vulnerabilities hidden by existing attack limitations. Beyond AEs, we provide preliminary evidence that modern multimodal encoder-decoder architectures exhibit similar vulnerabilities.

2606.09946 2026-06-18 cs.AR cs.CV (version update)

SPARX: Secure and Privacy-Aware Approximate CNN Acceleration with Edge RISC-V SoC

Sonu Kumar, Akash Sankhe, Mukul Lokhande, Santosh Kumar Vishvakarma

Affiliations: Dept of Science and Technology (DST), Govt of India; MeitY/SMDP-C2S

AI summary: Proposes SPARX, a framework that integrates a RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a differential-noise-based privacy engine, and an authentication mechanism, and selects the most suitable multiplier through an approximation-aware decision framework, enabling secure and efficient CNN inference at the edge.

Comments: Under review at the 12th International Symposium on Smart Electronic Systems (iSES) 2026

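AI-generated Chinese abstract (English translation)

Edge-AI systems increasingly require real-time CNN inference under strict energy, performance, security, and privacy constraints. Approximate computing improves hardware efficiency by exploiting the error tolerance of neural network workloads; however, most approximate CNN accelerators do not jointly consider secure, privacy-aware edge deployment. This paper presents SPARX, a secure and privacy-aware approximate CNN acceleration framework integrated within a heterogeneous RV32IMC RISC-V system-on-chip (SoC). SPARX combines a custom RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a lightweight differential-noise-based privacy engine, and a challenge-response authentication mechanism. To guide arithmetic selection, it introduces an approximation-aware decision framework that uses the Approximation Severity Index (ASI), Approximation Efficiency (AE), Quality of Approximation (QoA), Approximation Figure-of-Merit (AFOM), and Hardware Acceleration Efficiency (HAE). An evaluation of 11 state-of-the-art approximate MAC architectures shows that the Iterative Logarithmic Multiplier (ILM) is the most suitable design: compared with an accurate radix-4 Booth MAC, it reduces area by 51.7% and power by 81.5% and improves throughput by 2.13x, while lowering ResNet-20/CIFAR-10 accuracy by only 2.82 percentage points. An FPGA implementation on a Xilinx VC707 platform achieves an energy efficiency of 58.4 GOPS/W at 250 MHz, and a 28-nm CMOS physical implementation validates ASIC feasibility.

English abstract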

Edge-AI systems increasingly require real-time CNN inference under strict energy, performance, security, and privacy constraints. Approximate computing improves hardware efficiency by exploiting the error resilience of neural network workloads; however, most approximate CNN accelerators do not jointly consider secure, privacy-aware edge deployment. This paper presents SPARX, a Secure and Privacy-Aware Approximate CNN Acceleration framework integrated within a heterogeneous RV32IMC RISC-V System-on-Chip (SoC). SPARX combines a custom RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a lightweight differential-noise-based privacy engine, and a challenge-response authentication mechanism. To guide arithmetic selection, an approximation-aware decision framework is introduced that uses the Approximation Severity Index (ASI), Approximation Efficiency (AE), Quality of Approximation (QoA), Approximation Figure-of-Merit (AFOM), and Hardware Acceleration Efficiency (HAE). Evaluation across 11 state-of-the-art approximate MAC architectures identifies the Iterative Logarithmic Multiplier (ILM) as the most suitable design, achieving 51.7% area reduction, 81.5% power reduction, and 2.13x throughput improvement compared with an accurate radix-4 Booth MAC, while only reducing ResNet-20/CIFAR-10 accuracy by 2.82 percentage points. FPGA implementation on a Xilinx VC707 platform achieves 58.4 GOPS/W energy efficiency at 250 MHz, while 28-nm CMOS physical implementation validates ASIC feasibility.
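
For readers unfamiliar with logarithmic multipliers, here is a minimal software sketch of the general iterative logarithmic idea: write each operand as a power of two plus a remainder, keep the three cheap shift-and-add terms, and optionally refine by applying the same step to the dropped remainder product. This is an assumption-laden illustration of the textbook scheme, not the SPARX hardware design; the function names and the `corrections` parameter are hypothetical.

```python
def leading_one(n: int) -> int:
    """Index of the most significant set bit of a positive integer."""
    return n.bit_length() - 1


def basic_block(n1: int, n2: int):
    """One approximation step for positive integers.

    With n = 2**k + r, the exact product is
    2**(k1+k2) + r1*2**k2 + r2*2**k1 + r1*r2.
    The block returns the first three terms and the two remainders.
    """
    k1, k2 = leading_one(n1), leading_one(n2)
    r1, r2 = n1 - (1 << k1), n2 - (1 << k2)
    approx = (1 << (k1 + k2)) + (r1 << k2) + (r2 << k1)
    return approx, r1, r2


def iterative_log_multiply(n1: int, n2: int, corrections: int = 0) -> int:
    """Approximate n1 * n2 using only shifts and adds.

    Each correction re-applies the basic block to the remainders,
    recovering part of the dropped r1*r2 term.
    """
    if n1 == 0 or n2 == 0:
        return 0
    total, r1, r2 = basic_block(n1, n2)
    for _ in range(corrections):
        if r1 == 0 or r2 == 0:
            break  # nothing left to correct: the result is exact
        term, r1, r2 = basic_block(r1, r2)
        total += term
    return total


for c in range(3):
    print(c, iterative_log_multiply(13, 11, corrections=c))
```

Because the dropped term is never negative, the approximation never overshoots the exact product, and more correction steps trade extra adders for accuracy.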

2303.18031 2026-06-18 cs.CV cs.AI cs.LG (version update)

Simple Domain Generalization Methods are Strong Baselines for Open Domain Generalization

Masashi Noguchi, Shinichi Shirakawa

Affiliations: Graduate School of Environment and Information Sciences; Yokohama National University; Faculty of Environment

AI summary: Evaluates existing domain generalization methods in open domain generalization and finds that the simple methods CORAL and MMD are competitive with the more complex DAML; simple extensions with ensemble learning and Dirichlet mixup data augmentation perform comparably to DAML at lower computational cost.

Comments: Accepted at IJCNN 2024. The code used in the experiments is available at https://github.com/shiralab/OpenDG-Eval

DOI: 10.1109/IJCNN60899.2024.10650639

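AI-generated Chinese abstract (English translation)

In real-world applications, machine learning models must handle open-set recognition (OSR), in which unknown classes appear at inference time, as well as domain shift, in which the data distribution differs between the training and inference phases. Domain generalization (DG) aims to handle domain shift when the target domain of the inference phase is inaccessible during model training. Open domain generalization (ODG) considers DG and OSR together. Domain-augmented meta-learning (DAML) is a method that targets ODG, but its learning process is complicated. By contrast, although various DG methods have been proposed, they have not been evaluated in ODG settings. In this study, we comprehensively evaluate existing DG methods in ODG and show that two simple DG methods, correlation alignment (CORAL) and maximum mean discrepancy (MMD), are competitive with DAML in several cases. We also propose simple extensions of CORAL and MMD that introduce techniques used in DAML, such as ensemble learning and Dirichlet mixup data augmentation. The experimental evaluation shows that the extended CORAL and MMD achieve performance comparable to DAML at lower computational cost. This suggests that simple DG methods and their simple extensions are strong baselines for ODG.

English abstract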

In real-world applications, a machine learning model is required to handle an open-set recognition (OSR), where unknown classes appear during the inference, in addition to a domain shift, where the data distribution differs between the training and inference phases. Domain generalization (DG) aims to handle the domain shift situation where the target domain of the inference phase is inaccessible during the model training. Open domain generalization (ODG) considers DG and OSR. Domain-augmented meta-learning (DAML) is a method targeting ODG; however, it has a complicated learning process. By contrast, although various DG methods have been proposed, they have not been evaluated in ODG situations. In this study, we comprehensively evaluate the existing DG methods in ODG and show that the two simple DG methods, CORrelation ALignment (CORAL) and maximum mean discrepancy (MMD), are competitive with DAML in several cases. In addition, we propose simple extensions of CORAL and MMD by introducing the techniques used in DAML, such as ensemble learning and Dirichlet mixup data augmentation. The experimental evaluation demonstrates that the extended CORAL and MMD can perform comparably to DAML with lower computational costs. This suggests that the simple DG methods and their simple extensions are strong baselines for ODG.
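
As a rough illustration of why CORAL counts as a "simple" method, the sketch below computes a CORAL-style penalty between two batches of features: the squared Frobenius distance between their covariance matrices, scaled by 1/(4d²). It is a pure-Python sketch for small inputs under the usual Deep CORAL formulation; how the paper applies the penalty across source domains is not stated in the abstract, and the function names and toy data here are assumptions.

```python
def covariance(rows):
    """Unbiased sample covariance matrix of a list of feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(domain_a, domain_b):
    """Squared Frobenius distance between covariances, scaled by 1/(4 d^2)."""
    d = len(domain_a[0])
    cov_a, cov_b = covariance(domain_a), covariance(domain_b)
    squared = sum(
        (cov_a[i][j] - cov_b[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return squared / (4 * d * d)


# Toy features: domain_b is domain_a stretched by a factor of 2.
domain_a = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
domain_b = [[2 * x for x in row] for row in domain_a]

print(coral_loss(domain_a, domain_a))  # identical statistics give zero penalty
print(coral_loss(domain_a, domain_b))  # mismatched second-order statistics are penalized
```

In training, a penalty like this would be added to the classification loss so that features from different domains share second-order statistics.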

2406.18215 2026-06-18 cs.CV (version update)

Optimizing Incomplete, Large-Scale and Sparse Multi-Graph Matching in Bioimaging

Max Kahl, Sebastian Stricker, Lisa Hutschenreiter, Florian Bernard, Carsten Rother, Bogdan Savchynskyy

Affiliations: Heidelberg University; Max Planck Institute for Informatics; University of Bonn

AI summary: Targets large-scale, sparse multi-graph matching in bioimaging, proposing a sparse permutation synchronization paradigm and a general method, GREEDA, that outperforms existing methods in objective value and runtime.

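AI-generated Chinese abstract (English translation)

Multi-graph matching is a fundamental problem in computer vision. Our work is motivated by a challenging application in bioimaging, in which dozens or even hundreds of 3D microscopy images of worms must be brought into correspondence. Existing datasets do not cover this large-scale regime, and almost all existing methods are inapplicable because they assume a complete or dense problem setting. To support further research, our first contribution is a new large-scale dataset built from problem instances in bioimaging. Our second contribution is a comprehensive analysis of the two main multi-graph matching paradigms: direct formulations and permutation synchronization. We argue, in part by proof, that practical large-scale methods must explicitly handle the sparsity and incompleteness of the problem. Because standard permutation synchronization methods fail in this setting, we further introduce a sparse permutation synchronization paradigm. Our final contribution is GREEDA, a general method for sparse and incomplete problems that can be instantiated across cost orders and paradigms. Although this paper focuses on objective functions of up to quadratic order, GREEDA generalizes naturally to arbitrary orders. On larger, sparser instances, GREEDA outperforms competing methods in both objective value and runtime. For example, on a moderately sized problem based on 30 worm images, GREEDA produces a high-quality solution within 2 minutes, whereas competing methods need at least half an hour and produce far worse results. On smaller dense problems, GREEDA performs on par with leading methods while being an order of magnitude faster.

English abstract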

Multi-graph matching is a fundamental problem in computer vision. Our work is motivated by a challenging application in bioimaging, where dozens or even hundreds of 3D microscopy images of worms must be brought into correspondence. Existing datasets do not cover this large-scale regime, and virtually all existing methods are inapplicable because they assume a complete or dense problem setting. To support further research, our first contribution is a new large-scale dataset based on problem instances from bioimaging. Our second contribution is a comprehensive analysis of the two main multi-graph matching paradigms: direct and permutation synchronization-based formulations. We argue, in part by proof, that practical large-scale methods must explicitly address problem sparsity and incompleteness. Since standard permutation synchronization approaches fail in this setting, we further introduce a sparse permutation synchronization paradigm. Our final contribution is GREEDA, a general method for sparse and incomplete problems that can be instantiated across cost orders and paradigms. While our paper focuses on objective functions up to quadratic order, GREEDA is inherently generalizable to arbitrary orders. On larger, sparse instances, GREEDA outperforms competing methods in both objective value and runtime. For example, for moderately-sized problems based on 30 worm images GREEDA produces a high-quality solution within 2 minutes, whereas competitors require at least half an hour and yield far worse results. On smaller dense problems, GREEDA remains on par with leading methods while being an order of magnitude faster.
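
The constraint that ties pairwise matchings together in multi-graph matching is cycle consistency: matching graph i to j and then j to k must agree with the direct matching from i to k. The sketch below checks this for incomplete (partial) matchings stored as dictionaries. It only illustrates the constraint under the common transitivity reading and is not GREEDA or the paper's sparse permutation synchronization; the representation and names are assumptions.

```python
def compose(p_ij, p_jk):
    """Compose partial matchings i->j and j->k.

    Nodes whose partner in graph j is unmatched in p_jk drop out,
    which is what makes the setting incomplete.
    """
    return {a: p_jk[b] for a, b in p_ij.items() if b in p_jk}


def is_cycle_consistent(p_ij, p_jk, p_ik):
    """Check that every match implied via graph j also appears in p_ik."""
    return all(p_ik.get(a) == c for a, c in compose(p_ij, p_jk).items())


# Toy partial matchings between three graphs (node ids are arbitrary).
p_01 = {0: 1, 1: 0, 2: 3}
p_12 = {1: 2, 0: 0}          # node 3 of graph 1 has no partner in graph 2
p_02_good = {0: 2, 1: 0, 2: 5}
p_02_bad = {0: 2, 1: 4}

print(compose(p_01, p_12))
print(is_cycle_consistent(p_01, p_12, p_02_good))
print(is_cycle_consistent(p_01, p_12, p_02_bad))
```

With dozens or hundreds of graphs, storing only the matched pairs, as the dictionaries do here, is one simple way to exploit sparsity instead of keeping dense permutation matrices.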

2407.18245 2026-06-18 cs.CV cs.LG (version update)

VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset

Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht

Affiliations: University of Oxford; Piñata Farms; Ukrainian Catholic University

AI summary: Proposes VGGHeads, a large-scale synthetic dataset generated with diffusion models, used to train a model that performs head detection and 3D mesh reconstruction simultaneously in a single step and performs strongly on real images.

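AI-generated Chinese abstract (English translation)

Human head detection, keypoint estimation, and 3D head model fitting are fundamental tasks in many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical problems and were recorded in laboratory environments, which makes it hard for trained models to generalize. Here we introduce VGGHeads, a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. The dataset contains more than 1 million high-resolution images, each annotated with a detailed 3D head mesh, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture that performs head detection and head mesh reconstruction simultaneously from a single image in a single step. Through extensive experimental evaluation, we show that models trained on our synthetic data achieve strong performance on real images. In addition, the diversity of the dataset makes it applicable to a wide range of tasks, providing a general and comprehensive representation of human heads.

English abstract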

Human head detection, keypoint estimation, and 3D head model fitting are essential tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads, a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads.

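2504.01527 2026-06-18 cs.CV eess.IV (version update)

Beyond Nearest Neighbor Interpolation in Data Augmentation

Olivier Rukundo

Affiliations: Department of Electronic and Computer Engineering, University of Limerick

AI summary: Proposes a modified geometric transformation function and a mean-based class filtering mechanism to avoid the annotation errors associated with nearest neighbor interpolation and to assess the low-pass filtering effects of interpolation, and improves medical image segmentation performance through an offline data augmentation pipeline.

Comments: 10 pages, 11 figures, 14 tables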

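AI-generated Chinese abstract (English translation)

Using nearest neighbor interpolation to avoid the risk of undefined class labels overlooks the risk of exacerbating pixel-level annotation errors in augmented training data. In addition, the low-pass filtering effect inherent in interpolation algorithms increases the risk of degrading high-frequency structural details within annotated regions. To avoid these risks, the author modifies the data transformation functions of convolutional neural networks by introducing a modified geometric transformation function, removing the reliance on nearest neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined class labels. The author also implements an offline data augmentation pipeline that generates interpolation-specific augmented training data, enabling quantitative assessment of the low-pass filtering effect of interpolation on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT+ dataset shows performance gains on multiple quantitative metrics.

English abstract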

Avoiding the risk of undefined categorical labels using nearest neighbor interpolation overlooks the risk of exacerbating pixel level annotation errors in augmented training data. Additionally, the inherent low pass filtering effects of interpolation algorithms exacerbate the risk of degrading high frequency structural details within annotated regions of interest. To avoid these risks, the author modified convolutional neural networks data transformation functions by incorporating a modified geometric transformation function, removing reliance on nearest neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. The author also implemented an offline data augmentation pipeline to generate interpolation specific augmented training data, enabling quantitative assessment of interpolation specific low pass filtering effects on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT+ datasets demonstrated performance gains across multiple quantitative metrics.
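
The "undefined categorical labels" mentioned in the abstract are easy to reproduce on a toy label row: resampling a mask with linear interpolation blends neighbouring class ids into values that belong to no class, whereas nearest neighbor interpolation only copies existing ids. The sketch below shows just this effect in one dimension; it does not implement the author's modified transformation or the mean-based class filtering, and the function names and label values are assumptions.

```python
import math


def sample_positions(old_len, new_len):
    """Evenly spaced sample positions that keep both end points aligned."""
    return [i * (old_len - 1) / (new_len - 1) for i in range(new_len)]


def resize_nearest(labels, new_len):
    """Nearest neighbor resampling: every output is an existing label."""
    return [labels[math.floor(p + 0.5)] for p in sample_positions(len(labels), new_len)]


def resize_linear(labels, new_len):
    """Linear resampling: outputs can fall between two class ids."""
    out = []
    for p in sample_positions(len(labels), new_len):
        lo = math.floor(p)
        hi = min(lo + 1, len(labels) - 1)
        frac = p - lo
        out.append(labels[lo] * (1 - frac) + labels[hi] * frac)
    return out


row = [0, 0, 3, 3]            # hypothetical mask row with class ids 0 and 3
valid = set(row)

nearest = resize_nearest(row, 7)
linear = resize_linear(row, 7)

print(nearest)
print(linear)
print([v for v in linear if v not in valid])  # values that match no class
```

Any non-nearest interpolation applied to masks therefore needs a rule for mapping such in-between values back to valid classes, which is the role the abstract assigns to its class filtering mechanism.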

2505.21954 2026-06-18 cs.CV cs.AI (version update)

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Tuan Khai Nguyen, Soochahn Lee, Yong Jae Lee

Affiliations: University of Wisconsin - Madison; Oregon State University; University of Sydney; Kookmin University

AI summary: Proposes the UniTalk dataset, which covers challenging real-world conditions such as multiple languages, noisy backgrounds, and crowded scenes; evaluation shows that existing models fall short in in-the-wild settings, whereas models trained on UniTalk generalize better, establishing a new benchmark for active speaker detection.

Comments: Accepted to Interspeech 2026

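2510.21605 2026-06-18 cs.CV (version update)

S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht

Affiliations: University of Oxford, VGG

AI summary: Proposes S3OD, which combines large-scale synthetic data generation with an ambiguity-aware architecture to substantially improve cross-dataset generalization in salient object detection; training on synthetic data alone reduces error by 20-50%.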

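AI-generated Chinese abstract (English translation)

Salient object detection exemplifies data-bounded tasks, in which expensive pixel-precise annotation forces separate model training for related subtasks such as DIS and HR-SOD. We propose a method that substantially improves generalization through large-scale synthetic data generation and an ambiguity-aware architecture. We introduce S3OD, a dataset of more than 139,000 high-resolution images, created with our multi-modal diffusion pipeline, which extracts labels from diffusion and DINO-v3 features. An iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity of salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve a 20-50% error reduction in cross-dataset generalization, and fine-tuned versions reach state-of-the-art performance on the DIS and HR-SOD benchmarks.

English abstract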

Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.

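2602.08355 2026-06-18 cs.CV (version update)

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

Affiliations: Alimama Tech, Taobao & Tmall Group of Alibaba; Huazhong University of Science and Technology; Vin University

AI summary: Proposes E-VAds, a benchmark for e-commerce short video understanding; quantifies domain complexity with a multi-modal information density assessment framework, builds a Q&A dataset generated by a multi-agent system, and develops E-VAds-R1, an RL-based reasoning model that achieves a 109.2% performance gain in commercial intent reasoning.

Comments: Accepted by ICML 2026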

AI中文摘要

电商短视频代表了在线视频行业中高收入的细分领域，其特点是目标驱动的格式和密集的多模态信号。当前模型通常难以处理这些视频，因为现有基准主要关注通用任务，忽略了商业意图的推理。在这项工作中，我们首先提出了一个多模态信息密度评估框架，以量化该领域的复杂性。我们的评估显示，与主流数据集相比，电商内容在视觉、音频和文本模态上表现出显著更高的密度，为视频理解建立了更具挑战性的前沿。为了弥补这一差距，我们引入了电商视频广告基准（E-VAds），这是首个专门为电商短视频理解设计的基准。我们从淘宝精选了3,961个高质量视频，涵盖广泛的产品类别，并使用多智能体系统生成了19,785个开放式问答对。这些问题被组织成两个主要维度，即感知与认知和推理，包含五个不同的任务。最后，我们开发了E-VAds-R1，一个基于强化学习的推理模型，具有称为MG-GRPO的多粒度奖励设计。该策略为早期探索提供平滑指导，同时为专家级精度创造非线性激励。实验结果表明，E-VAds-R1在仅使用几百个训练样本的情况下，在商业意图推理上实现了109.2%的性能提升。

英文摘要

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

2603.21583 2026-06-18 cs.CV Version update

HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

Mei Li, Huayi Zhou, Suizhi Huang, Yuxiang Lu, Yue Ding, Hongtao Lu

Affiliations * Shanghai Jiao Tong University

AI summary: Proposes a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples and uses structured data augmentation to improve semi-supervised rotation regression when labeled data is scarce.

Comments This is an accepted manuscript of an article published in Computer Vision and Image Understanding

Details

DOI: 10.1016/j.cviu.2026.104742
Journal ref: Computer Vision and Image Understanding (2026)

Abstract

Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.

2604.20822 2026-06-18 cs.CV cs.LG Version update

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer

Affiliations * Earth Observation Center (EOC), German Aerospace Center (DLR); Institute for Geography and Geology, University of Wuerzburg

AI summary: Presents a global Sentinel-1 SAR time series dataset that uses object detection and a rule-based classifier to identify the deployment and operational phases of offshore wind infrastructure, supporting global-scale analysis of these dynamics.

Comments 29 pages, 18 figures

Details

Abstract

The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, with overall 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.
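
The macro F1 score quoted in the abstract above averages the per-class F1 with equal weight for every class, so rare event types count as much as frequent ones. A minimal sketch of that metric in plain Python follows; the function name `macro_f1` and the example labels are illustrative assumptions and are not taken from the paper or its dataset.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical event labels, for illustration only.
truth = ["construction", "construction", "operation", "operation"]
preds = ["construction", "operation", "operation", "operation"]
score = macro_f1(truth, preds)
```

Because each class contributes equally, a classifier that ignores a rare event type is penalized more heavily here than under plain accuracy.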

2605.05547 2026-06-18 cs.CV Version update

Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings

Alice Heiman

Affiliations * Department of Computer Science

AI summary: Using satellite embeddings from the AlphaEarth foundation model, this study defines a reference trajectory embedding via cosine similarity and evaluates early restoration outcomes at 1,729 restoration sites in São Paulo, Brazil. Different land use types form clusters in embedding space, but the signal is noisy.

Comments Presented as a workshop paper at ICLR 2026 Machine Learning for Remote Sensing (ML4RS)

Details

Abstract

The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restoration sites in São Paulo, using satellite embeddings from the AlphaEarth Foundation's model to evaluate their effectiveness in characterising early restoration success. We introduce the concept of a 'Reference Trajectory Embedding', defining a metric of restoration success based on cosine similarity to reference sites of mature secondary forest. We observe distinct clusters in embedding space according to different land use and land cover (LULC) types, and we can identify sites with clear change vectors. However, the signal can be noisy, and embeddings may require further fine-tuning to capture and predict site metadata beyond LULC.
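
The abstract above describes a restoration metric based on cosine similarity to reference sites. A minimal sketch of such a score follows. The abstract does not say how the reference embedding is built, so averaging the reference-site embeddings is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def mean_embedding(embeddings):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def restoration_score(site_embedding, reference_embeddings):
    # Assumption: the reference is the mean embedding of the mature-forest
    # reference sites; the paper may define it differently.
    reference = mean_embedding(reference_embeddings)
    return cosine_similarity(site_embedding, reference)
```

Scores near 1 would indicate a site whose embedding points in the same direction as the reference forest, and scores near 0 or below would indicate little resemblance.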

2606.05368 2026-06-18 cs.CV Version update

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

Affiliations * Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich; School of Engineering and Natural Sciences (SENS), University of Iceland; Global Land Monitoring Group, GFZ Helmholtz Centre for Geosciences

AI summary: To address the fact that existing methods do not learn forest vertical structure as an ordered profile, proposes Biomazon, a multimodal benchmark dataset that pairs GEDI RH and AGBD targets with multi-sensor predictors. Ablation studies with a shared encoder-decoder framework establish a reference benchmark for structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.

Comments 32 pages, 21 figures, 8 tables

Details

Abstract

Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks an ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
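
The abstract above refers to physically consistent ordering across RH percentiles, meaning a higher percentile should never have a lower height than a lower one. One common way to guarantee this by construction is to predict a base value plus non-negative increments and take a running sum. The sketch below illustrates that general idea only; it is not claimed to be the approach used by the Biomazon baselines, and all names are hypothetical.

```python
import math
from itertools import accumulate

def softplus(x):
    """Numerically stable log(1 + exp(x)); always >= 0."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def ordered_profile(raw_outputs):
    """Map unconstrained model outputs to a non-decreasing profile.

    The first value is used as the lowest percentile; every later value
    is turned into a non-negative step added to the running total.
    """
    base, *deltas = raw_outputs
    increments = [softplus(d) for d in deltas]
    return list(accumulate(increments, initial=base))

def count_order_violations(profile):
    """Number of adjacent pairs where a higher percentile is lower."""
    return sum(1 for lo, hi in zip(profile, profile[1:]) if hi < lo)

# Unconstrained outputs (illustrative values) become an ordered profile.
profile = ordered_profile([2.0, -3.0, 0.5, 4.0])
```

The `count_order_violations` helper can also be applied to profiles from models that predict each percentile independently, as a simple consistency check.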

2606.05883 2026-06-18 cs.CV Version update

Geometry-Aware Dataset Condensation for Diffusion Model Training

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

Affiliations * GitHub

AI summary: For diffusion model training, proposes a real-subset selection method based on geometry-aware distribution alignment. It uses one-sided partial optimal transport to preserve geometric structure, adds lightweight feature-statistics and semantic consistency regularization, and achieves efficient condensation through a two-stage discrete optimization.

Comments ICML 2026

Details

Abstract

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, ensuring the preserved geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Codes are available at https://github.com/2018cx/GADC.

2606.14702 2026-06-18 cs.CV Version update

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

Affiliations * Nanjing University; CASIA

AI summary: Proposes the OmniVideo-100K dataset, which uses entity-anchored video scripting and clue-guided QA generation to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA. Fine-tuned models show substantial gains on several benchmarks.

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

Details

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

2606.17188 2026-06-18 cs.CV cs.CL Version update

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Hong Yang, Basura Fernando

Affiliations * Centre for Frontier AI Research, Agency for Science, Technology and Research; College of Computing and Data Science, Nanyang Technological University

AI summary: Proposes the ERQA-Plus benchmark, comprising 1,766 question-answer instances grounded in robot-centric images and covering perceptual, action, social, navigation, and commonsense reasoning, to diagnose the reasoning abilities of embodied AI.

Details

Abstract

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.

2506.13506 2026-06-18 cs.CV q-bio.NC Version update

Stimulus Motion Perception Studies Imply Specific Neural Computations in Human Visual Stabilization

David W Arathorn, Josephine C. D'Angelo, Austin Roorda

Affiliations * Montana State University, Dept of Electrical and Computer Engineering; University of California, Berkeley, Herbert Wertheim School of Optometry and Vision Science

AI summary: By analyzing the small jitter of the human eye during fixation, finds that the visual stabilization mechanism is more complex than camera stabilization or a simple evolutionary solution, and proposes a functional model based on specific operations on retinal signals along with a possible neural-circuit implementation.

Details

Abstract

Even during fixation the human eye is constantly in low amplitude motion, jittering over small angles in random directions at up to 100Hz. This motion results in all features of the image on the retina constantly traversing a number of cones, yet objects which are stable in the world are perceived to be stable, and any object which is moving in the world is perceived to be moving. A series of experiments carried out over a dozen years revealed the psychophysics of visual stabilization to be more nuanced than might be assumed, say, from the mechanics of stabilization of camera images, or what might be assumed to be the simplest solution from an evolutionary perspective. The psychophysics revealed by the experiments strongly implies a specific set of operations on retinal signals resulting in the observed stabilization behavior. The presentation is in two levels. First is a functional description of the action of the mechanism that is very likely responsible for the experimentally observed behavior. Second is a more speculative proposal of circuit-level neural elements that might implement the functional behavior.

2605.16385 2026-06-18 cs.CV cs.AI cs.CL Version update

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

Affiliations * Xi’an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd

AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use a conditional description language and a theorem bank for rigorous reasoning on solid geometry problems, reaching SOTA performance on SolidFGeo2k and MathVerse-Solid.

Comments Computer Vision and Pattern Recognition (CVPR), 2026

Details

Abstract

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently; however, most works focus on plane geometry and usually fail on solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps: first parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2509.09631 2026-06-18 cs.SD cs.CL cs.CV Version update

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

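AI summary: Proposes the DiFlow-TTS framework, which uses discrete flow matching and a factorized discrete flow denoiser to balance high quality and low latency in zero-shot TTS.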

Comments Accepted at Interspeech 2026 (Long Paper Track)

2604.14837 2026-06-18 cs.CV Version update

Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration

Geonwoo Baek, David H. Salat, Ikbeom Jang

Affiliations * Department of Computer Science & Engineering, Hankuk University of Foreign Studies, Seoul, Republic of Korea; Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, USA; Department of Radiology, Harvard Medical School, Boston, MA, USA; Neuroimaging Research for Veterans (NeRVe) Center, VA Boston Healthcare System, Boston, MA, USA

Comments Submitted to Human Brain Mapping

Details

DOI: 10.1002/hbm.70548
Journal ref: Human Brain Mapping 47(8), e70548 (2026)

Abstract

Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.

2602.02370 2026-06-18 cs.CV Version update

Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes

Uma Meleti, Jeffrey J. Nirschl

Affiliations * Department of Pathology; Lab Medicine, University of Wisconsin-Madison

Comments Published at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

2411.16934 2026-06-18 cs.CV Version update

Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, Christian Micheloni

Affiliations * University of Udine; University of Catania; York University

Comments in IEEE/CVF Winter Conference on Application of Computer Vision (WACV) 2026

2510.13562 2026-06-18 physics.med-ph cs.CV cs.NA math.NA Version update

An efficient approach with theoretical guarantees to simultaneously reconstruct activity and attenuation sinogram for TOF-PET

Liyang Hu, Chong Chen

Affiliations * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China

Comments 32 pages, 11 figures, 4 tables

Details

DOI: 10.1109/TCI.2026.3697651
Journal ref: IEEE Transactions on Computational Imaging 2026

Abstract

In positron emission tomography (PET), it is indispensable to perform attenuation correction in order to obtain the quantitatively accurate activity map (tracer distribution) in the body. Generally, this is carried out based on the estimated attenuation map obtained from computed tomography or magnetic resonance imaging. However, except for errors in the attenuation correction factors obtained, the additional scan not only brings in new radiation doses and/or increases the scanning time but also leads to severe misalignment induced by various motions during and between the two sequential scans. To address these issues, based on maximum likelihood estimation, we propose a new mathematical model for simultaneously reconstructing the activity and attenuation sinogram from the time-of-flight (TOF)-PET emission data only. Particularly, we make full use of the exclusively exponential form for the attenuation correction factors, and consider the constraint of a total amount of the activity in some mask region in the proposed model. Furthermore, we prove its well-posedness, including the existence, uniqueness and stability of the solution. We propose an alternating update algorithm to solve the model, and also analyze its convergence. Finally, numerical experiments with various TOF-PET emission data demonstrate that the proposed method is of numerical convergence and robust to noise, and outperforms some state-of-the-art methods in terms of accuracy and efficiency, and has the capability of autonomous attenuation correction.

2507.05647 2026-06-18 eess.IV cs.CV Version update

Diffusion-Based Limited-Angle CT Reconstruction under Noisy Conditions

Jiaqi Guo, Santiago López-Tapia

Affiliations * Dept. of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA

Comments Accepted at the 2025 IEEE International Conference on Image Processing (ICIP), Workshop

2406.16439 2026-06-18 cs.CV Version update

Continual Test-Time Adaptation for Object Detection with Adaptive Monitoring and Randomized Restoration

Shilei Cao, Juepeng Zheng, Yan Liu, Baoquan Zhao, Ziqi Yuan, Weijia Li, Runmin Dong, Haohuan Fu

Affiliations * School of Artificial Intelligence, Sun Yat-Sen University; School of Information Science and Technology, University of Science and Technology of China; State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University; Tsinghua Shenzhen International Graduate School, Tsinghua University; National Supercomputing Center in Shenzhen; Ministry of Education Key Laboratory for Earth System Modeling and the Department of Earth System Science, Tsinghua University

1. Multimodal and Vision-Language Models (5 papers)

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

Cosmos 3: Omnimodal World Models for Physical AI

Would you still call this Dax? Novel Visual References in VLMs and Humans

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

2. Embodied AI, Robotics, and Autonomous Driving (3 papers)

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

3. Image Recognition, Retrieval, and Classification (2 papers)

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

4. Object Detection, Segmentation, and Localization (5 papers)

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

MUFASA: A Multi-Layer Framework for Slot Attention

Bidirectional Cross-Attention Fusion of High-Resolution RGB and Low-Resolution Hyperspectral Inputs for Multimodal Semantic Segmentation

Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

5. Video Understanding and Temporal Vision (3 papers)

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

Open-World Video Segmentation

6. Generative Vision and World Models (10 papers)

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Epipolar Geometry Improves Video Generation Models

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

7. 3D Vision, Point Clouds, and Spatial Intelligence (2 papers)

SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

8. Medical Imaging and Biological Vision (8 papers)

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

Hybrid Transformer-Mamba for Weakly Supervised Volumetric Medical Segmentation

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

RaLMPH: Reliability-aware Learning for Multi-Pathologist Harmonization in Whole-Slide Image Classification

Enhancing Pathological VLMs with Cross-scale Reasoning

Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

9. Document Images, OCR, and Chart Understanding (2 papers)

Recognizing and Reconstructing a Multi-Unit Floor Plan

Cross-Lingual Learning within Arabic Script for Low-Resource HTR

10. Low-Level Vision, Computational Imaging, and Image Enhancement (6 papers)

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information

Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

Pyramid Self-Contrastive Learning for Single-shot Test-time Ultrasound Image Denoising

Grids Often Outperform Implicit Neural Representations at Compressing Dense Signals

11. Robustness, Security, Privacy, and Trustworthy Vision (6 papers)

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

RUB: Evaluating Residual Knowledge in Unlearned Models

Revealing Hidden Vulnerabilities in Autoencoders through Gradient Signal Restoration

SPARX: Secure and Privacy-Aware Approximate CNN Acceleration with Edge RISC-V SoC

12. Datasets, Benchmarks, Evaluation, and Training Methods (17 papers)

Simple Domain Generalization Methods are Strong Baselines for Open Domain Generalization

Optimizing Incomplete, Large-Scale and Sparse Multi-Graph Matching in Bioimaging

VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset

Beyond Nearest Neighbor Interpolation in Data Augmentation

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

HACMatch Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Geometry-Aware Dataset Condensation for Diffusion Model Training

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting

Generalized Kullback-Leibler Divergence Loss