arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Multimodal and Vision-Language Models (15 papers)

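2606.09871 2026-06-10 cs.CV cs.AI cs.LG New submission

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park

AI summary: Proposes SD-GRPO, which decomposes long-form outputs into segments and computes per-segment advantages to address GRPO's coarse-grained credit assignment in vision-language tasks; experiments show it outperforms baselines across several long-form generation tasks.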

Abstract

Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.
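The contrast between a single scalar advantage and a vector of per-segment advantages can be sketched in a few lines of Python. This is a minimal illustration of the z-normalization described in the abstract, not the authors' implementation: the function names, the use of the population standard deviation, the `eps` stabilizer, and the toy reward values are all assumptions.

```python
from statistics import mean, pstdev


def segment_advantages(rewards, eps=1e-6):
    """Z-normalize each segment's reward across the rollout group.

    rewards[i][k] is the (assumed verifiable) reward of segment k in rollout i.
    Returns a list of per-rollout advantage vectors with the same shape.
    """
    num_segments = len(rewards[0])
    columns = [[row[k] for row in rewards] for k in range(num_segments)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(row[k] - stats[k][0]) / (stats[k][1] + eps) for k in range(num_segments)]
        for row in rewards
    ]


def scalar_advantages(rewards, eps=1e-6):
    """Baseline for comparison: one advantage per rollout from the summed reward."""
    totals = [sum(row) for row in rewards]
    mu, sigma = mean(totals), pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


# Hypothetical group of 4 rollouts, each with 2 segments (1.0 = segment verified).
toy_rewards = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(scalar_advantages(toy_rewards))
print(segment_advantages(toy_rewards))
```

In this toy group the first two rollouts each contain one correct and one incorrect segment, so their scalar advantage is zero and neither segment receives a learning signal. The per-segment version assigns roughly +1 to the correct segment and -1 to the incorrect one.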

2606.10468 2026-06-10 cs.CV New submission

Geometric Coastline Localization using Vision-Language Models

Rafia Malik, Bernhard Pfahringer, Karin Bryan, Mark Dickson, Eibe Frank

Affiliations: The University of Waikato; The University of Auckland

AI summary: Treats coastline extraction as a geometric boundary localization task and builds CoastlineVLM-7B on the GeoChat-7B/LLaVA-1.5 architecture to predict polylines directly instead of segmentation masks, outperforming conventional segmentation methods on geometric metrics.

Abstract

Coastline detection in remote sensing imagery is commonly formulated as a pixel-wise segmentation problem, where the final coastline is extracted from a predicted mask through post-processing. This formulation relegates coastline geometry, the primary representation used in coastal change analysis, to a secondary artifact rather than the learning objective. In practice, coastlines are defined by geomorphic proxies such as vegetation lines, dune toes, or cliff edges, rather than an instantaneous land-water boundary often used in pixel-based segmentation approaches. In this work, we revisit coastline extraction from a representation perspective and formulate the task as geometric boundary localization. We use the New Zealand Coastal Change Dataset (NZCCD) and high-resolution aerial imagery from Land Information New Zealand (LINZ) to develop CoastlineVLM-7B, a vision-language model (VLM) built on the GeoChat-7B/LLaVA-1.5 architecture that jointly performs coastline presence detection, proxy-type classification, and coastline grounding. The model directly predicts a coastline as a polyline rather than a dense segmentation mask. We evaluate CoastlineVLM-7B against segmentation baselines under strict one-pixel boundary supervision. Results show that geometry-based metrics are more suitable for assessing coastline localization quality than pixel-overlap metrics such as Intersection over Union (IoU). CoastlineVLM-7B improves global geometric alignment with reference coastlines, reducing Hausdorff distance from 37.74 m to 31.84 m and Earth Mover's Distance from 21.12 m to 17.32 m. These results indicate that output representation is a critical design choice in coastline extraction, and that geometry-oriented learning, combined with the semantic reasoning capabilities of vision-language models, aligns well with how coastlines are defined and evaluated in operational coastal monitoring.
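To make the geometry-based evaluation concrete, the sketch below computes a symmetric Hausdorff distance between two polylines represented by their vertices. It is an illustration under stated assumptions rather than the paper's evaluation code: the function names and coordinates are hypothetical, and measuring only vertex-to-vertex distances approximates the true polyline-to-polyline distance.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any vertex in `a` to its nearest vertex in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two vertex lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical coastlines in metres: a predicted polyline and a reference one.
predicted = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
reference = [(0.0, 3.0), (10.0, 4.0), (20.0, 3.0)]
print(hausdorff(predicted, reference))  # 4.0
```

Because the metric reports the worst-case offset between the two lines, a single badly placed vertex dominates the score. A pixel-overlap metric such as IoU on a one-pixel-wide boundary cannot express this kind of offset.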

2606.10522 2026-06-10 cs.CV New submission

GUI-AC: Enhancing Continual Learning in GUI Agents

Can Lin, Tao Feng, Hangjie Yuan, Dan Zhang, Yifan Zhu, Zhonghong Ou

Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; Zhejiang University; National University of Singapore

AI summary: Targets distribution shift and reinforcement fine-tuning instability in the continual learning of GUI agents; proposes GUI-AC, which improves performance through adaptive advantage and dynamic clipping mechanisms and surpasses existing baselines.

Abstract

Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to widen the exploration range. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at https://anonymous.4open.science/r/GUI-AC.
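The effect of a fixed clipping bound, and of the two mechanisms named above, can be illustrated with the standard PPO-style clipped surrogate for a single sample. This is a hedged sketch, not GUI-AC itself: the abstract does not say how grounding certainty is computed or how it maps to the weight and the bound, so `adv_weight` and `eps_high` are hypothetical knobs that stand in for those quantities.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2, adv_weight=1.0):
    """Per-sample clipped policy-gradient surrogate.

    ratio      -- new_policy_prob / old_policy_prob for the sampled action
    advantage  -- estimated advantage of that action
    adv_weight -- hypothetical factor in [0, 1] that down-weights noisy advantages
    eps_high   -- upper clipping bound; a larger value relaxes the clip
    """
    weighted = adv_weight * advantage
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * weighted, clipped_ratio * weighted)


# A sample whose probability must rise a lot (ratio 1.5) with positive advantage.
print(clipped_surrogate(1.5, 1.0))                  # fixed bound: capped at 1.2
print(clipped_surrogate(1.5, 1.0, eps_high=0.6))    # relaxed bound: 1.5, not capped
print(clipped_surrogate(1.5, 1.0, adv_weight=0.5))  # down-weighted advantage: 0.6
```

With the fixed bound the objective stops growing once the ratio passes 1.2, so the sample no longer pushes the policy toward the new distribution. Relaxing `eps_high` removes that cap for this sample, and a smaller `adv_weight` shrinks the contribution of an advantage estimate that is considered noisy.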

2606.10533 2026-06-10 cs.CV New submission

Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

Zihan Meng, Dexiang Hong, Weidong Chen, Ziyu Zhou, Bo Hu, Zhendong Mao

Affiliations: University of Science and Technology of China

AI summary: Proposes AVEX-Prune, an RL-based method that selects high-confidence tokens through a cross-modal token exchange strategy and preserves full-token quality at a 40% retention ratio.

Abstract

Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, where prefill self-attention scales quadratically. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet hard-threshold selection makes it difficult to retain the tokens that are truly valuable, especially highly confusable tokens near the decision boundary. To this end, we propose AVEX-Prune, an RL-based audio-visual dynamic token pruning method. AVEX-Prune uses an audio-visual token exchange strategy that selects truly valuable tokens by replacing low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality and measuring the resulting differences in caption generation. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).
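The quadratic scaling mentioned in the abstract gives a quick back-of-the-envelope estimate of what a 40% retention ratio buys during prefill. The sketch below counts only query-key pairs in self-attention. The token count of 1000 is a hypothetical example, and the estimate ignores terms that scale linearly with sequence length (projections, MLP layers), so actual speedups will differ.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def pruned_cost_ratio(num_tokens, retention_ratio):
    """Fraction of the original attention-pair count left after pruning."""
    kept = round(num_tokens * retention_ratio)
    return attention_pairs(kept) / attention_pairs(num_tokens)


# Hypothetical input of 1000 audio-visual tokens, 40% retained.
print(pruned_cost_ratio(1000, 0.4))  # 0.16
```

Keeping 40% of the tokens therefore leaves about 16% of the attention-pair work, which is why retention ratio has a larger effect on the quadratic term than its face value suggests.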

2606.10651 2026-06-10 cs.CV New submission

Kwai Keye-VL-2.0 Technical Report


Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang

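Affiliations: Kuaishou Group

AI summary: Presents Keye-VL-2.0, an open-source MoE multimodal foundation model that is the first to adapt DeepSeek Sparse Attention to a GQA-based architecture, supports lossless 256K context processing, and addresses catastrophic forgetting during multi-task alignment through cross-modal multi-teacher on-policy distillation together with Context-RL and Video-RL, reaching state-of-the-art results among similar-scale models on long-video understanding and agent tasks.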

Comments 31 pages, 11 figures

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

2606.10819 2026-06-10 cs.CV cs.AI New submission

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

Affiliations: National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing (SBIIP), Beijing Institute of Technology; Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences; Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology; School of Mechatronical Engineering, Beijing Institute of Technology; School of Earth and Space Sciences, Peking University; School of Electronics, Peking University; School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing

AI summary: Proposes Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities and nine task categories through full-granularity vision-language alignment, spatial-linguistic isomorphic serialization, and progressive cross-modality adaptation, matching or outperforming 4B-72B models on multiple benchmarks.

Abstract

Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
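The P@0.5 figure reported for visual grounding is commonly computed as the share of predicted boxes whose intersection over union (IoU) with the ground-truth box reaches 0.5. The sketch below shows that computation under assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples, the threshold comparison is inclusive, and the example boxes are made up. The benchmark's own evaluation script may differ in these details.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = box_area(overlap)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def precision_at(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)


# Two hypothetical grounding predictions: one close match, one poor match.
preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
truth = [(0.0, 0.0, 10.0, 8.0), (5.0, 5.0, 15.0, 15.0)]
print(precision_at(preds, truth))  # 0.5
```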

2606.10887 2026-06-10 cs.CV New submission

Listen, Look, and Learn: Learning Without Forgetting through SAM-Audio

Avi Gupta, Nilotpal Sinha, Vishnu Raj, Sambuddha Saha, Pratik Joshi, Koteswar Rao Jerripothula, Tammam Tillo

Affiliations: University of Washington

AI summary: Proposes a class-incremental learning method that uses SAM-Audio's multimodal priors, with a guided attention mechanism and a dual-level distillation strategy, to mitigate catastrophic forgetting in audio-visual settings, outperforming existing methods.

Abstract

Class-Incremental Learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. While recent CIL advances have spurred significant interest across various modalities, the audio-visual setting remains underexplored. Furthermore, although foundational multimodal models like SAM-Audio encapsulate rich static priors, our empirical analysis reveals that these representations struggle in incremental settings. This work bridges this gap by integrating SAM-Audio's audio-visual priors into the CIL setting. Specifically, we leverage its dense audio and visual representations and employ a novel guided attention strategy where the audio features contextually guide the visual representations. To further mitigate catastrophic forgetting, we introduce dual-level distillation objectives at both the feature and logit levels. Extensive evaluations on audio-visual CIL benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods.
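One common way to distill at both the feature and the logit level is to penalize drift between the previous-step model and the current model in each space. The sketch below uses a mean squared error on features and a temperature-softened KL divergence on logits. These specific loss forms, the temperature, the weights, and all function names are assumptions for illustration, since the abstract does not specify the objectives used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def feature_distillation(old_features, new_features):
    """Mean squared error between old-model and new-model features."""
    diffs = [(o - n) ** 2 for o, n in zip(old_features, new_features)]
    return sum(diffs) / len(diffs)


def logit_distillation(old_logits, new_logits, temperature=2.0):
    """KL(old || new) over temperature-softened class distributions."""
    p = softmax(old_logits, temperature)
    q = softmax(new_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def dual_level_loss(old_features, new_features, old_logits, new_logits,
                    w_feature=1.0, w_logit=1.0):
    """Weighted sum of the feature-level and logit-level terms (hypothetical weights)."""
    return (w_feature * feature_distillation(old_features, new_features)
            + w_logit * logit_distillation(old_logits, new_logits))


# Hypothetical old/new outputs for one sample, restricted to previously seen classes.
print(dual_level_loss([1.0, 2.0], [1.1, 1.8], [0.5, 1.5], [0.2, 1.9]))
```

In a class-incremental setup this term would be added to the classification loss on the new classes, so the model is rewarded for learning them while staying close to its earlier behavior on old ones.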

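2606.11176 2026-06-10 cs.CV cs.CL cs.CY cs.HC New submission

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Kevin Qinghong Lin, Batu El, Yuhong Shi, Pan Lu, Philip Torr, James Zou

Affiliations: University of Oxford; Stanford University

AI summary: Proposes Data2Story, a multi-agent framework that links every claim back to supporting evidence and automatically generates multimodal articles; evaluated on 18 articles paired with the original expert pieces, it shows particular strength in transparency and auditability, while human articles retain an edge in editorial angle and design.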

Comments Project page: https://data2story.github.io Github: https://github.com/QinghongLin/data2story-skill

Abstract

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.

2606.11188 2026-06-10 cs.CV New submission

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang

Affiliations: Shanghai Key Lab of Intelligent Information Processing, Fudan University; School of Computer Science, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Youtu Lab, Tencent; Meta AI; Shanghai AI Laboratory

AI summary: Proposes ARM, which maps images to compact token sequences with a discrete semantic visual tokenizer and combines autoregressive modeling with reinforcement learning to unify image understanding, generation, and editing, improving task performance and cross-task synergy.

Comments technical report

Abstract

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.

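2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD Cross-list

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

AI summary: Studies how audio and visual information is routed and integrated inside audio-visual large language models (AVLLMs), finds two routing patterns (a sequential pathway and parallel streams), and shows that audio-visual tokens can be discarded once their information has been transferred, improving efficiency.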

Comments 40 pages, 29 figures

Abstract

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets and enabling more efficient inference. These findings hold across multiple models and scales, namely Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2606.10400 2026-06-10 cs.CL cs.CV Cross-list

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

Affiliations: Lossfunk; Indian Institute of Technology Roorkee; Raeth AI

AI summary: Builds a 540-image benchmark with four phrasing variants per image to measure how much vision-language models rely on textual priors; every model degrades on the hardest variant, with open models dropping the most, and a no-image ablation along with further analyses confirms genuine image dependence.

Comments 17 pages, 7 figures, Submitted to EMNLP 2026

Abstract

Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

2606.10803 2026-06-10 cs.CL cs.AI cs.CV Cross-list

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

Affiliations: Singapore Management University; The Hong Kong Polytechnic University

AI summary: Proposes the PhysTool-Bench benchmark to evaluate the ability of MLLMs to identify physical tools in real-world scenes and plan their use; the strongest model completes only 21% of queries end to end, revealing deficits in both perception and planning.

Abstract

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite its importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

2606.11078 2026-06-10 cs.AI cs.CL cs.CV 交叉投稿

A History-Aware Visually Grounded Critic for Computer Use Agents

面向计算机使用代理的历史感知视觉基础批评家

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Capital One University of Texas at Austin(德克萨斯大学奥斯汀分校)

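AI summary: Proposes HiViG, a framework built around a history-aware, visually grounded multimodal critic that evaluates actions and intercepts errors at test time, improving success rates on multiple GUI benchmarks.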

Comments Code: https://github.com/G-JWLee/HiViG

详情
AI中文摘要

针对计算机使用代理(CUA)的各种测试时干预措施(包括批评模型)已被开发出来,通过在复杂图形用户界面(GUI)环境中执行前动作评估来提高性能。然而,现有的批评家存在两个关键限制:(1)主要关注短视决策循环(例如,遗忘早期动作);(2)缺乏检测有缺陷动作(例如,点击错误的UI元素)所需的视觉基础。为了解决这些问题,我们引入了HiViG,一个历史感知的视觉基础测试时框架,其核心是一个在真实GUI轨迹上训练的多模态批评家,用于将过去的交互抽象为紧凑记录,并基于视觉基础评估动作。在测试时,HiViG将批评家集成到策略决策循环中,以提供宏观动作历史(总结策略已完成成就)和视觉基础批评(根据当前截图验证原始执行坐标,在执行前拦截错误)。在网页、移动和桌面基准测试中,HiViG持续优于现有的标量和口头批评家,在Qwen3-VL-32B上比最强基线平均成功率提高5.8%,在Gemini-3-Flash上提高9.0%,并展示了强大的跨平台泛化能力。消融实验表明,宏观动作历史缓解了短视规划,视觉基础批评减少了执行错误,这两个组件对于长时域GUI任务中的测试时扩展至关重要。

英文摘要

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

2605.00809 2026-06-10 cs.CV 版本更新

Let ViT Speak: Generative Language-Image Pre-training

让ViT说话:生成式语言-图像预训练

Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei

发表机构 * Beijing Jiaotong University(北京交通大学) ByteDance(字节跳动) Nanyang Technological University(南洋理工大学)

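AI summary: Proposes GenLIP, which trains a ViT with a language modeling objective to predict language tokens directly from visual tokens, without contrastive learning or an additional text decoder, yielding a simple, scalable, and strong vision encoder.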

Comments 27 pages, 11 figures. Code and models are available at https://github.com/YanFangCS/GenLIP

详情
AI中文摘要

在本文中,我们提出了生成式语言-图像预训练(GenLIP),这是一个为多模态大语言模型(MLLMs)设计的Vision Transformers(ViTs)的极简生成式预训练框架。为了更好地将视觉编码器与LLMs的自回归特性对齐,GenLIP训练ViT直接从视觉token预测语言token,使用标准的语言建模目标,无需对比批次构建或额外的文本解码器。该设计具有三个关键优势:(1)简单性:单个transformer联合建模视觉和文本token;(2)可扩展性:随着数据和模型大小的增加而有效扩展;(3)性能:在多种多模态基准测试中达到竞争性或更优的结果。在使用Recap-DataComp-1B的8B样本训练后,尽管使用的预训练数据显著减少,GenLIP仍能匹配或超越强基线。在继续对原始宽高比的多分辨率图像进行预训练后,GenLIP进一步提高了对细节敏感的任务(如OCR和图表理解)的性能,使其成为MLLMs中视觉编码器的坚实基础。

英文摘要

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.

2510.04514 2026-06-10 cs.AI cs.CE cs.CL cs.CV stat.ME 版本更新

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

ChartAgent: 一种用于复杂图表问答中视觉基础推理的多模态智能体

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

发表机构 * J.P. Morgan AI Research(摩根大通人工智能研究)

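AI summary: Proposes ChartAgent, a framework that iteratively decomposes queries into visual subtasks and reasons in the chart's spatial domain with chart-specific vision tools (e.g., drawing annotations, cropping regions). It achieves state-of-the-art accuracy on ChartBench and ChartX, with especially large gains on unannotated charts.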

Comments Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/)

详情
AI中文摘要

最近的多模态大语言模型在基于图表的视觉问答中显示出潜力,但在无标注图表上——即那些需要精确视觉解释而非依赖文本捷径的图表——其性能急剧下降。为了解决这个问题,我们引入了ChartAgent,一种新颖的智能体框架,它直接在图表的空间域内显式执行视觉推理。与文本思维链推理不同,ChartAgent通过专门的行动(如绘制注释、裁剪区域(例如分割饼图切片、隔离条形图)和定位坐标轴)迭代地将查询分解为视觉子任务,并主动操作和交互图表图像,使用图表专用视觉工具库来完成每个子任务。这种迭代推理过程密切模仿了人类理解图表的认知策略。ChartAgent在ChartBench和ChartX基准测试上达到了最先进的准确率,整体上比先前方法绝对提升高达16.07%,在无标注、数值密集的查询上提升17.31%。此外,我们的分析表明,ChartAgent (a) 在多种图表类型上有效,(b) 在不同视觉和推理复杂度水平上均取得最高分数,(c) 作为一个即插即用的框架,提升了多种基础LLM的性能。我们的工作是首批使用工具增强的多模态智能体展示图表理解中视觉基础推理的工作之一。

英文摘要

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

2. Embodied Intelligence, Robotics, and Autonomous Driving (12 papers)

2606.10167 2026-06-10 cs.CV 新提交

FlexPath: Learned Semantic Path Priors for Image-Based Planning

FlexPath: 基于图像规划的学习语义路径先验

Taehyoung Kim, Tim Schoenbrod, David Eckel, Henri Meeß

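AI summary: Proposes FlexPath, a two-stage framework that decouples a feasibility prior from preference and adapts to tasks through differentiable Path Shape Objectives. For shortest-path planning it reduces search effort by 14.3%, and it supports zero-shot generalization and adaptation to multiple objectives.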

详情
AI中文摘要

最近基于学习的路径规划器使用神经网络处理视觉地图表示,并近似经典搜索算法的启发式,从而以更少的搜索代价获得接近最优的路径。然而,这些方法受限于其监督中隐含的最短路径目标,这限制了它们适应其他标准的灵活性。我们提出FlexPath,一个两阶段框架,将可行性与偏好解耦。在第一阶段,我们使用模仿学习从视觉地图输入中获取一个与任务无关的可行路径空间先验。在第二阶段,可微路径形状目标(PSOs)使该先验适应特定任务的标准,而无需重新学习路径结构,仅需高效的目标级适应。单个预训练模型可适应多个目标。对于最短路径规划,FlexPath在TMP上相比最先进的TransPath减少了14.3%的搜索代价,同时平均找到更低成本的路径,并在三个未见领域上展现出强大的零样本泛化能力。对于最小间隙距离为2的障碍物避让,它在保持低搜索代价的同时实现了96.8%的完全避障。该框架进一步通过目标级适应扩展到语义感知避让和航点引导,并在推理时与经典规划器兼容。数据和代码可在 https://github.com/FraunhoferIVI/FlexPath 获取。

英文摘要

Recent learning-based path planners use neural networks to process visual map representations and approximate heuristics for classical search algorithms, yielding near-optimal paths with reduced search effort. However, these methods are tied to the shortest-path objective implicit in their supervision, which limits their flexibility to accommodate alternative criteria. We introduce FlexPath, a two-stage framework that decouples feasibility from preference. In Stage 1, we use imitation learning to acquire a task-independent spatial prior over feasible paths from visual map inputs. In Stage 2, differentiable Path Shape Objectives (PSOs) adapt this prior toward task-specific criteria without relearning path structure, requiring only efficient objective-level adaptation. A single pretrained model can be adapted to multiple objectives. For shortest-path planning, FlexPath reduces search effort on TMP by 14.3% compared to the state-of-the-art TransPath, while also finding lower-cost paths on average and demonstrating strong zero-shot generalization across three unseen domains. For obstacle clearance with minimum clearance distance 2, it achieves 96.8% full obstacle avoidance while maintaining low search cost. The framework further extends to semantic-aware avoidance and waypoint guidance via objective-level adaptation, and remains compatible with classical planners at inference time. Data and code are available at https://github.com/FraunhoferIVI/FlexPath.

2606.10517 2026-06-10 cs.CV 新提交

LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

LAFP:通过流匹配在潜在策略学习中保留潜在动作结构

Jiexi Lyu, Xizhou Bu, Qingqiu Huang, Chufeng Tang, Xiaoshuai Hao, Hongbo Wang, Wei Li

发表机构 * Fudan University(复旦大学) Morphi

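AI summary: Proposes LAFP, which uses flow matching to learn latent policies and adds an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment. It improves success rates on imitation learning tasks by up to 10-15% with less than 1x additional inference overhead.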

详情
AI中文摘要

从大规模无标签视频中学习高质量潜在动作,并结合有限真实交互数据训练动作解码器,已成为可扩展潜在策略学习的一种有前景的范式。然而,现有方法通常依赖行为克隆,这倾向于将固有的多模态动作分布坍缩为单模态分布,从而破坏预训练的潜在动作结构。虽然流匹配提供了一种潜在的替代方案,但由于学习策略的随机性,直接应用它会导致动作解码器训练中潜在动作与物理动作之间的错位。为了解决这些问题,我们提出了潜在动作流策略(LAFP),它利用流匹配进行潜在策略学习,并引入推理时插值机制来缓解随机性引起的错位。实验结果表明,LAFP在下游模仿学习任务上持续优于先前方法,成功率提升高达10-15%,而推理开销增加不到1倍。

英文摘要

Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. However, existing approaches typically rely on behavior cloning, which tends to collapse inherently multimodal action distributions into unimodal ones, thereby degrading the pretrained latent action structure. While flow matching provides a potential alternative, directly applying it leads to a misalignment between latent actions and physical actions during action decoder training, due to the stochastic nature of the learned policy. To address these, we propose Latent Action Flow Policy (LAFP), which leverages flow matching for latent policy learning and introduces an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment. Experimental results demonstrate that LAFP consistently outperforms prior methods on downstream imitation learning tasks, achieving up to 10-15% improvement in success rate while incurring less than 1x additional inference overhead.
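
As a rough illustration of the flow-matching idea mentioned in this abstract, the sketch below uses the common linear-interpolation (rectified-flow style) formulation: a point on the path is x_t = (1 - t) x0 + t x1, the regression target is the constant velocity x1 - x0, and sampling integrates a velocity field with Euler steps. This is an assumption made for illustration, not LAFP's actual objective or its interpolation mechanism; the function names `interpolate`, `target_velocity`, and `euler_sample` are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field on the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real policy, `velocity_fn` would be a trained network conditioned on the observation; with a perfectly learned constant field, `euler_sample` transports x0 onto x1.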

2606.10645 2026-06-10 cs.CV 新提交

ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

ManiSplat: 基于解耦3D高斯泼溅的单目视频操作轨迹合成

Wenhao Hu, Haonan Zhou, Liu Liu, Yun Du, Xinjie Wang, Ziang Li, Zhizhong Su, Gaoang Wang

发表机构 * Zhejiang University(浙江大学) Horizon Robotics(地平线机器人)

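AI summary: Proposes ManiSplat, which reconstructs controllable 3D Gaussian digital twins from monocular video using a graph-structured disentangled representation and task-oriented spatio-temporal alignment, supporting robotic manipulation tasks and policy learning.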

详情
AI中文摘要

从真实世界观测中重建动态且可交互的3D场景仍然是计算机视觉和机器人学中的一个基本挑战。尽管3D高斯泼溅的最新进展实现了高保真静态重建,但由于复杂的接触交互和突变的姿态变化,将其扩展到具有关节机器人和可操作物体的交互环境仍然困难。为了解决这些挑战,我们引入了ManiSplat,一个统一的框架,直接从单目自我中心机器人视频重建可控且解耦的高斯数字孪生。我们的方法引入了一种图结构解耦表示,将机器人、物体和背景分离为独立可优化的高斯子场,并组织在场景图中。为了确保稳定性,我们提出了一个任务导向的时空对齐模块,利用操作任务的内在逻辑——在运动和技能阶段之间交替——来构建准确的伪真实轨迹。最后,联合光度-几何优化确保重建场景在时间上连贯、物理上一致且可用于仿真。大量实验表明,我们的方法以高保真度和可控性重建了交互驱动的动态场景,有效支持下游机器人任务和策略学习。

英文摘要

Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks, which alternate between Motion and Skill phases, to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.

2606.10656 2026-06-10 cs.CV 新提交

Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

Envision4D: 通过前馈4D高斯泼溅展望自动驾驶的视觉未来

Qi Song, Yifei He, Chi Zhang, Zheng Fu, Xuhe Zhao, Mengmeng Yang, Kun Jiang, Rui Huang, Diange Yang

发表机构 * Tsinghua University(清华大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

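AI summary: Proposes Envision4D, a fully self-supervised feed-forward framework that achieves pose-free future extrapolation through future pose prediction, in-layer temporal attention, and conditioned motion lifting, reaching state-of-the-art performance in dynamic scene forecasting for autonomous driving.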

Comments Project Page: https://maggiesong7.github.io/research/Envision4D/

详情
AI中文摘要

预测动态场景的未来演变在自动驾驶中至关重要。然而,现有的前馈范式主要设计用于插值。当扩展到未来外推时,它们在大位移下会出现重影伪影,并受限于简化的运动假设或严格的未来先验。为了克服这些挑战,我们提出了Envision4D,一种完全自监督的前馈框架,用于无位姿的未来外推。具体来说,我们引入了一个未来姿态预测模块,通过迭代去噪过程推断未来相机参数。此外,为了捕捉非线性动态,我们提出了层内时间注意力,并采用条件运动提升,将高度不确定的外推过程转化为稳健的关系映射。最后,利用渐进式训练策略来稳定无监督运动学习,防止误差累积。大量实验表明,Envision4D实现了最先进的性能,在未来的视图合成中显著优于现有方法。

英文摘要

Forecasting the future evolution of dynamic scenes is crucial in autonomous driving. However, existing feed-forward paradigms are primarily designed for interpolation. When extended to future extrapolation, they suffer from ghosting artifacts under large displacements and are constrained by simplified motion assumptions or strict future priors. To overcome these challenges, we propose Envision4D, a fully self-supervised feed-forward framework for pose-free future extrapolation. Specifically, we introduce a Future Pose Prediction module that infers future camera parameters via an iterative denoising process. Furthermore, to capture non-linear dynamics, we propose In-layer Temporal Attention and employ Conditioned Motion Lifting, which transforms the highly uncertain extrapolation process into robust relational mappings. Finally, a Progressive Training Strategy is utilized to stabilize unsupervised motion learning against error accumulation. Extensive experiments demonstrate that Envision4D achieves state-of-the-art performance, significantly outperforming existing methods in future view synthesis.

2605.29662 2026-06-10 cs.CV cs.RO 新提交

SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

SAFE-Pruner: 语义注意力引导的未来感知令牌剪枝用于高效视觉-语言-动作操控

Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学)

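AI summary: Existing pruning methods for accelerating vision-language-action inference overlook visual information needed by deep layers. SAFE-Pruner addresses this with forward-looking token pruning based on future-layer attention cues and semantic attention consistency, achieving up to 1.89x speedup with less than 1.7% degradation in success rate in simulation and real-world experiments.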

详情
AI中文摘要

视觉-语言-动作(VLA)模型的实时推理对于机器人控制至关重要。虽然视觉令牌剪枝在加速推理方面显示出巨大潜力,但现有方法主要基于浅层线索进行剪枝决策,并存在丢弃深层所需视觉信息的风险。为解决此问题,我们提出SAFE-Pruner,一种即插即用的剪枝框架,将未来层的注意力线索融入剪枝决策。具体而言,我们识别出语义注意力一致性,即VLA模型在执行步骤中倾向于将其注意力概率质量集中在同一语义实体上。基于这一观察,我们设计了一种前瞻性策略来预测深层令牌的显著性,从而防止关键令牌过早移除并实现更稳定的加速。我们进一步引入自适应子任务划分策略来检测注意力突变,从而提高预测准确性和剪枝可靠性。在仿真和真实环境中的大量实验表明,我们的方法实现了高达1.89倍的加速,成功率下降最小(低于1.7%),同时比最先进的方法高出1.9%。

英文摘要

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.
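
The forward-looking pruning idea can be pictured with a small sketch. It assumes, purely for illustration, that deep-layer saliency at the current step is forecast from the deep-layer attention observed at the previous execution step (the "semantic attention consistency" observation) and blended linearly with the current shallow-layer attention. The blend weight `alpha`, the function name `select_tokens`, and the linear form are hypothetical and not taken from the paper.

```python
def select_tokens(shallow_attn, prev_deep_attn, keep, alpha=0.5):
    """Return sorted indices of the `keep` visual tokens with the highest score.

    shallow_attn:   attention mass per token from a shallow layer (current step)
    prev_deep_attn: attention mass per token from a deep layer (previous step),
                    used here as a stand-in forecast of deep-layer saliency
    alpha:          hypothetical blend weight; alpha=1.0 gives shallow-only pruning
    """
    if len(shallow_attn) != len(prev_deep_attn):
        raise ValueError("attention vectors must have equal length")
    scores = [alpha * s + (1 - alpha) * d
              for s, d in zip(shallow_attn, prev_deep_attn)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

With `shallow_attn = [0.6, 0.3, 0.05, 0.05]` and `prev_deep_attn = [0.05, 0.05, 0.8, 0.1]`, shallow-only pruning (`alpha=1.0`, `keep=2`) drops token 2, while the blended score keeps it. That is the premature-removal failure the abstract describes.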

2606.10025 2026-06-10 cs.RO cs.CV cs.LG 交叉投稿

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

GHOST: 用于泛化机器人操作的层次化子目标策略

Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan, Shubham Tulsiani, David Held

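AI summary: Proposes GHOST, which factorizes control into high-level sub-goal prediction and a low-level goal-conditioned controller so that visuomotor manipulation policies generalize, and uses human demonstrations to adapt to novel objects and task variations.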

Comments Accepted at RSS 2026

详情
AI中文摘要

我们提出了GHOST,一个学习视觉运动操作策略的框架,该策略能够泛化到训练分布之外。GHOST将控制分解为:(i) 高层策略,从多视角RGB-D观测中预测下一个子目标作为3D末端执行器位姿的分布,以及(ii) 低层目标条件控制器,执行特定于具体体的动作。为了将基于图像的策略条件化于3D目标,我们引入了一个简单的空间接口,将预测的目标投影到图像平面,并将其表示为末端执行器热图。在一系列操作任务中,与平坦的扩散策略相比,这种层次化分解持续提高了性能和鲁棒性。此外,我们展示了这种层次化接口也使得整合人类演示变得容易,而无需依赖(嘈杂的)动作重定向。由于子目标在很大程度上与具体体无关,我们在人类视频上训练高层策略,以指定如何应用和组合学到的技能,同时保持低层策略仅在机器人数据上训练。这种层次结构使得能够使用少量人类演示适应新物体和任务变化。

英文摘要

We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
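
The spatial interface described above (projecting a predicted 3D goal into the image plane and representing it as a heatmap) can be sketched as follows. This minimal version assumes a pinhole camera with known intrinsics, a goal position already expressed in the camera frame, and an isotropic Gaussian heatmap. The function names and the `sigma` parameter are hypothetical, and the paper's actual rendering may differ.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def gaussian_heatmap(u, v, width, height, sigma=2.0):
    """Row-major heatmap (height x width) with a Gaussian bump centred at (u, v)."""
    return [[math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2 * sigma ** 2))
             for col in range(width)]
            for row in range(height)]


# Example: a goal 0.5 m in front of the camera, slightly right of and above centre.
u, v = project_point((0.1, -0.05, 0.5), fx=100.0, fy=100.0, cx=32.0, cy=24.0)
heatmap = gaussian_heatmap(u, v, width=64, height=48)
```

The heatmap can then be stacked with the RGB channels as an extra input to an image-based low-level policy, which is one simple way to condition it on a 3D goal.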

2606.10299 2026-06-10 cs.AI cs.CV cs.MA 交叉投稿

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

空间记忆必须存储什么:遮挡作为语言-智能体记忆的测试

Doeon Kwon, Junho Bang

发表机构 * Space Zero, Inc.(Space Zero公司)

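AI summary: Shows experimentally that geometry must lead memory recall when queries are spatial and that visibility must be determined separately from recall, and proposes computing the visibility predicate with a ray-versus-voxel DDA.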

Comments 23 pages, 6 figures

详情
AI中文摘要

语言智能体的“记忆宫殿”系统将每条记忆锚定到世界坐标,其直觉是几何提供了文本无法提供的信息。我们使这一直觉可测试,并报告三个结果。首先,记忆宫殿默认将空间邻近性折叠成与近期性和重要性线性混合的做法没有帮助甚至有害:在一个预注册的召回实验中,现有的混合在其自身冻结测试中失败(平均Delta-Hit@5 -0.0375,Wilcoxon p=0.306),处于位置盲基线水平,而几何主导的加权则取得决定性胜利(+0.3208,p<10^-15):当查询模式是空间时,几何必须主导召回。其次,记忆召回和可见性必须分离:召回在设计上对遮挡不敏感(你能正确记住墙后下一个房间),而可见性是对存储几何的感知谓词,实时系统从未计算过。一行射线与体素的数字微分分析器(DDA),从智能体已经投射的视线射线重新指向,提供了这一点:文本和实时视锥在849个墙后目标上得分均为0.000,而锥体加DDA达到0.982(精确McNemar p<10^-6);坐标召回分别解决了余弦空值无法解决的近重复位置(1.000 vs 0.533,n=150)。第三,可见性谓词在git提交的预注册下得到实时确认(SPMEM-OCC-LIVE-v1:八个脚本化世界,自动oracle评分,96个墙后目标,假可见从1.000降至0.000,合并精确McNemar p=2.5x10^-29),该运行发现并修复了一个真实的中继锚点缺陷。我们承认遮挡需要几何几乎是同义反复;贡献在于测量和隔离,将空间记忆必须存储的内容与其读取方式分开。这些试验为一个冻结的确认性研究(SPMEM-ZERO-REAL-PREREG-v1)提供动力;完整的人类作者多世界研究(含盲评者)仍是未来工作。

英文摘要

Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
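
The ray-versus-voxel DDA visibility test mentioned above can be sketched as below. For brevity this illustration uses a 2D occupancy grid of unit cells rather than 3D voxels, and follows the standard grid-traversal scheme of stepping to whichever cell boundary the ray crosses next. The function name `is_visible` and the grid layout are hypothetical and are not the paper's implementation. Both endpoints are assumed to lie inside the grid and off cell boundaries.

```python
import math


def is_visible(grid, start, target):
    """Return True if no occupied cell lies between start and target.

    grid[y][x] is truthy for an occupied (wall) cell. start and target are
    (x, y) points in continuous grid coordinates, assumed inside the grid.
    """
    x0, y0 = start
    x1, y1 = target
    cx, cy = math.floor(x0), math.floor(y0)   # current cell
    tx, ty = math.floor(x1), math.floor(y1)   # cell containing the target
    dx, dy = x1 - x0, y1 - y0
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # t runs from 0 (start) to 1 (target). t_max_* is where the ray next
    # crosses a cell boundary on that axis; t_delta_* is the gap between crossings.
    t_max_x = ((cx + (step_x > 0)) - x0) / dx if dx else math.inf
    t_max_y = ((cy + (step_y > 0)) - y0) / dy if dy else math.inf
    t_delta_x = abs(1 / dx) if dx else math.inf
    t_delta_y = abs(1 / dy) if dy else math.inf
    # Exactly this many boundary crossings separate the two cells.
    for _ in range(abs(tx - cx) + abs(ty - cy)):
        if t_max_x < t_max_y:
            cx += step_x
            t_max_x += t_delta_x
        else:
            cy += step_y
            t_max_y += t_delta_y
        if (cx, cy) == (tx, ty):
            break
        if grid[cy][cx]:
            return False  # an occupied cell blocks the line of sight
    return True
```

A remembered location behind a wall is still returned by recall, but `is_visible` reports it as occluded. Keeping the two as separate functions mirrors the separation of recall and visibility argued for in the abstract.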

2606.10614 2026-06-10 cs.RO cs.CV cs.LG 交叉投稿

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

灵巧点策略:从人类演示中学习基于点的灵巧手策略

Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

发表机构 * KAIST(韩国科学技术院)

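AI summary: Proposes Dexterous Point Policy, a framework that learns dexterous manipulation policies from human videos through a unified 3D keypoint representation, requires no robot demonstrations, and reaches 75.0% success on real-robot tasks.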

详情
AI中文摘要

基于人类演示视频预训练的机器人基础模型显示出潜力,但当策略部署到真实机器人时仍存在显著的具身差距。常见的补救措施是在机器人特定演示上微调这些模型。然而,机器人数据收集可能过于昂贵和耗时,这在灵巧操作中尤为突出,例如,即使是单个原子任务,遥操作多指手也可能需要数天。为了解决这个问题,我们引入了Dexterous Point Policy,一个直接从人类视频学习灵巧操作策略且无需机器人演示的框架。我们的核心见解是,统一的3D关键点表示在用于观察和动作时,可以桥接人类和机器人的具身。具体来说,我们从原始视频中提取任务相关物体和人类手的3D关键点,并训练一个自回归变换器来处理这些关键点。我们观察到,在关键点层面,特别是手腕和指尖,人类和机器人的行为紧密对齐,从而实现直接策略迁移。在一套包括拾取放置和工具使用的真实机器人任务中,Dexterous Point Policy达到了75.0%的成功率,而最先进的VLA基线仅达到1.0%。此外,我们的方法对未见过的场景具有很强的泛化能力,包括多物体环境和新型物体类别。

英文摘要

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

2606.10818 2026-06-10 cs.RO cs.CV 交叉投稿

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

IMPACT:面向强力机器人操控的内部模型预测控制学习

Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen, Yilun Du

发表机构 * Harvard University(哈佛大学) Stanford University(斯坦福大学)

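AI summary: Proposes IMPACT, a framework that decouples forceful manipulation into task planning and internal-model-based predictive control; simulation and real-world experiments show gains in success rate, generalization, safety, and energy efficiency.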

Comments Project website: https://gao-jiawei.com/IMPACT/

详情
AI中文摘要

现实世界中的机器人操控任务通常涉及与环境的有力交互,例如使用不同重量的工具、运输不同质量的物体以及执行接触密集任务(如擦桌子)。先前的基于学习方法通常采用模仿学习策略,输出由低级阻抗控制器跟踪的目标末端执行器姿态。在这些系统中,有力交互要么通过稳态跟踪误差隐式实现,要么使用腕部力/扭矩或触觉传感器显式命令。然而,隐式方法在不同物体重量下泛化能力差,而显式方法需要专用硬件并增加系统复杂性。在这项工作中,我们提出了IMPACT,一个将这些有力任务解耦为任务规划和基于内部模型的预测控制的框架。广泛的仿真和真实世界实验表明,所提出的框架实现了更高的成功率、对未见物体重量的更好泛化性,以及更好的安全性和能效。

英文摘要

Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.

2510.14836 2026-06-10 cs.CV cs.RO 版本更新

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

QDepth-VLA:量化深度预测作为视觉-语言-动作模型的辅助监督

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Zhongke Huiling Robot Technology Co.(北京中科慧灵机器人技术有限公司)

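AI summary: Proposes QDepth-VLA, which strengthens the spatial perception and reasoning of VLA models through an auxiliary depth prediction task, improving manipulation performance in simulation and real-world tasks.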

详情
AI中文摘要

空间感知和推理对于视觉-语言-动作(VLA)模型完成精细操作任务至关重要。然而,现有方法往往缺乏理解和推理精确控制所需的基本3D结构的能力。为解决这一局限,我们提出QDepth-VLA,一种通过辅助深度预测任务增强VLA模型的通用框架。设计了一个专门的深度专家,用于预测从VQ-VAE编码器获得的深度图的量化潜在令牌,使模型能够学习捕捉关键几何线索的深度感知表示。在仿真基准和真实世界任务上的实验结果表明,QDepth-VLA在操作任务上展现出强大的空间推理能力和竞争性能。

英文摘要

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
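
To make "quantized latent tokens" concrete, the sketch below shows the standard vector-quantization step used in VQ-VAE-style encoders: each latent vector is replaced by the index of its nearest codebook entry, and those indices can serve as discrete prediction targets. The tiny codebook, the function name `quantize`, and the use of squared Euclidean distance are illustrative assumptions. The paper's encoder, codebook size, and training loss are not specified here.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in latents]


# Hypothetical 2-D codebook with three entries, and three depth-map latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [3.0, 3.5]]
tokens = quantize(latents, codebook)
```

Predicting such token indices turns depth supervision into a classification problem over the codebook, which is one plausible reason to prefer quantized targets over raw per-pixel depth regression.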

2603.20850 2026-06-10 cs.CV cs.RO 版本更新

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Glove2Hand:从多模态传感手套合成自然的手-物体交互

Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan

发表机构 * Meta Reality Labs(Meta现实实验室) Rutgers University(罗格斯大学)

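AI summary: Proposes Glove2Hand, which translates multi-modal sensing glove videos into photorealistic bare hands while preserving physical interaction dynamics. It introduces a 3D Gaussian hand model and a diffusion-based hand restorer, and builds the HandSense dataset, which improves downstream task performance.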

Comments CVPR 2026 Highlight. This version includes the motion retarget process in the appendix

详情
AI中文摘要

理解手-物体交互(HOI)是计算机视觉、机器人和AR/VR的基础。然而,传统手部视频通常缺乏接触力和运动信号等关键物理信息,并且容易频繁遮挡。为了解决这些挑战,我们提出了Glove2Hand,一个将多模态传感手套HOI视频转化为逼真裸手的框架,同时忠实保留底层物理交互动态。我们引入了一种新颖的3D高斯手模型,确保时间渲染一致性。使用基于扩散的手部恢复器将渲染的手无缝集成到场景中,该恢复器有效处理复杂的手-物体交互和非刚性变形。利用Glove2Hand,我们创建了HandSense,这是第一个多模态HOI数据集,包含手套到手的视频以及同步的触觉和IMU信号。我们证明HandSense显著增强了下游裸手应用,包括基于视频的接触估计和严重遮挡下的手部跟踪。

英文摘要

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

2512.06628 2026-06-10 cs.RO cs.CV 版本更新

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

MIND-V:基于强化学习物理对齐的长期机器人操作分层世界模型

Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li

发表机构 * Tsinghua University(清华大学) X Square Robot(X Square机器人) Sun Yat-sen University(中山大学) HKUST(香港科技大学)

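AI summary: Proposes MIND-V, a hierarchical world model that combines semantic reasoning, a behavioral semantic bridge, and motor video generation with RL-based physical alignment to synthesize physically plausible long-horizon robotic manipulation videos.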

详情
AI中文摘要

可扩展的具身智能受到多样化、长期机器人操作数据稀缺的限制。现有视频世界模型仅能合成简单动作的短视频,且常依赖手动定义轨迹。为此,我们提出MIND-V,一种认知分层世界模型,旨在合成物理合理且逻辑连贯的长期机器人操作视频。受认知科学启发,MIND-V通过三个核心组件桥接高层推理与像素级合成:语义推理中心(SRH)利用预训练视觉语言模型进行任务规划;行为语义桥(BSB)将抽象指令转换为域不变表示;运动视频生成器(MVG)用于条件视频渲染。MIND-V采用分阶段视觉未来展开(Staged Visual Future Rollouts)这一测试时优化策略以增强长期鲁棒性。为强制遵循物理定律,我们引入GRPO强化学习后训练阶段,由新颖的物理预见一致性(PFC)奖励引导。PFC利用V-JEPA2世界模型作为物理裁判,在潜在特征空间中惩罚不合理动态。实验证实MIND-V在长期模拟中的SOTA性能及其对策略学习的重要价值,为具身数据合成引入了可扩展且完全自主的框架。

英文摘要

Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.

3. 图像识别、检索与分类 4 篇

2606.10166 2026-06-10 cs.CV 新提交

Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization

融合卫星图像与平面地图的跨视角定位

Quang Long Ho Ngo, Zimin Xia, Alexandre Alahi

AI总结 提出一种融合卫星图像与平面地图的模块,通过跨模态条件化和补丁级融合规则,将定位误差降低30.13%。

详情
AI中文摘要

当前的跨视角定位方法主要依赖卫星图像作为空中模态。尽管近期工作探索了平面地图(如OpenStreetMap瓦片),但这些方法性能往往滞后。然而,两种模态都广泛可用且具有互补特性。卫星图像更接近地面相机图像,提供更精细的细节,而平面地图包含标注对象(如路灯),并在地面被遮挡(如树叶)的区域仍能提供信息。尽管如此,只有一项先前工作提供了融合这两种模态的端到端方法,且未展示其在最先进方法中的潜力。为结合两种模态的优势,我们提出一种新的融合模块,增强标准编码器,并证明将卫星图像与平面地图集成可改进最先进的单模态方法。该模块包括(i)跨模态条件化,处理每种模态编码时考虑另一种模态的信息,以及(ii)控制信息交换粒度的补丁级融合规则。我们取得了最先进的结果,将平均定位误差降低了30.13%。定性上,融合自适应地选择信息更丰富的模态,提高了整体准确性。

英文摘要

Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders and demonstrates that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality's encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.

2606.10876 2026-06-10 cs.CV 新提交

Advancing Wood Identification in the Philippines: Utilizing the Xylorix Platform for Efficient AI Model Development and Deployment for Five Key Species

推进菲律宾木材识别:利用Xylorix平台高效开发和部署五种关键树种的AI模型

Rosalie C. Mendoza, Vivian C. Daracan, Arlene D. Romano, Ronniel D. Manalo, Xin Jie Tang, Yi Hong Wong, Yong Haur Tay

发表机构 * College of Forestry and Natural Resources, University of the Philippines Los Banos(菲律宾大学洛桑分校林业与自然资源学院) Agritix

AI总结 本研究利用Xylorix平台,让无编程经验的木材科学家为五种菲律宾硬木开发并部署宏观木材识别AI模型,AUC达0.969-1.000,四种达AA级,证明非程序员可构建适合现场部署的可靠模型。

详情
AI中文摘要

非法采伐和木材贸易在菲律宾持续构成重大挑战,准确的木材物种识别对执法至关重要,但受限于专业设备和专业知识。本研究旨在评估木材科学家能否在没有编程专业知识的情况下,利用Xylorix平台开发和部署宏观木材识别的AI模型,聚焦五种菲律宾硬木:Mangium (Acacia mangium Willd.)、Rain Tree [Samanea saman (Jacq.) Merr.]、Banuyo (Wallaceodendron celebicum Koord.)、Tindalo [Afzelia rhomboidea (Blanco) Vidal] 和 Ipil [Intsia bijuga (Colebr.) O. Kuntze]。二元分类器使用来自260个标本的10,663张经过验证的横截面图像进行训练,并通过标本级平均评分进行评估,以模拟操作现场条件。ROC曲线下面积(AUC)值范围为0.969(Ipil)到1.000(Mangium),平均精度(AP)值范围为0.589(Samanea)到1.000(Mangium)。五个物种中有四个达到AA级(AUC和AP均≥0.90);Rain Tree获得AE级(AUC≥0.90,AP<0.60),原因是其正测试集较小(3个标本)导致AP压缩。所有五个分类器以近乎完美的保真度将目标标本排在非目标标本之上。标本级错误分析显示,Ipil有9个假阴性,主要源于局部图像伪影;Rain Tree有3个假阳性,Tindalo有1个假阳性,由共享的族级解剖特征引起。这些发现表明,非程序员可以利用Xylorix平台构建操作可靠的木材识别模型,适用于供应链检查点的现场部署。

英文摘要

Illegal logging and timber trade continue to pose significant challenges in the Philippines, where accurate wood species identification is essential for enforcement but limited by the need for specialised equipment and expertise. This study aims to evaluate whether AI models for macroscopic wood identification can be developed and deployed by wood scientists without programming expertise using the Xylorix platform, focusing on five Philippine hardwood species: Mangium (Acacia mangium Willd.), Rain Tree [Samanea saman (Jacq.) Merr.], Banuyo (Wallaceodendron celebicum Koord.), Tindalo [Afzelia rhomboidea (Blanco) Vidal], and Ipil [Intsia bijuga (Colebr.) O. Kuntze]. Binary classifiers were trained on 10,663 verified cross-section images from 260 specimens and evaluated using specimen-level mean scoring to mirror operational field conditions. Area Under the ROC Curve (AUC) values ranged from 0.969 (Ipil) to 1.000 (Mangium), and Average Precision (AP) values ranged from 0.589 (Samanea) to 1.000 (Mangium). Four of five species achieved AA grade (AUC and AP both ≥ 0.90); Rain Tree received AE (AUC ≥ 0.90, AP < 0.60) due to AP compression from its small positive test set (3 specimens). All five classifiers rank their target specimens above non-target specimens with near-perfect fidelity. Specimen-level error analysis revealed 9 false negatives for Ipil, primarily stemming from localized image artifacts, as well as 3 false positives for Rain Tree and 1 false positive for Tindalo, caused by shared tribal-level anatomical traits. These findings demonstrate that non-programmers can leverage the Xylorix platform to construct operationally reliable wood identification models suitable for field deployment at supply chain checkpoints.
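To make the evaluation protocol above concrete, here is a minimal sketch of specimen-level mean scoring followed by AUC and AP. The function names and the toy scores are illustrative assumptions, not the Xylorix platform's API or the study's data. The example also shows how a small positive set can compress AP while AUC stays high, which is the effect the abstract reports for Rain Tree.

```python
from collections import defaultdict

def specimen_mean_scores(image_scores, image_specimens):
    """Average per-image classifier scores within each specimen."""
    sums, counts = defaultdict(float), defaultdict(int)
    for score, specimen in zip(image_scores, image_specimens):
        sums[specimen] += score
        counts[specimen] += 1
    return {s: sums[s] / counts[s] for s in sums}

def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values taken at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical specimen-level scores: 3 target specimens, 20 non-targets,
# with a single non-target specimen scoring above every target.
scores = [1.0, 0.9, 0.8, 0.7] + [0.1] * 19
labels = [0, 1, 1, 1] + [0] * 19
print(f"AUC={roc_auc(scores, labels):.3f} AP={average_precision(scores, labels):.3f}")
# prints "AUC=0.950 AP=0.639"
```

With only three positives, a single high-scoring non-target specimen lowers AP to about 0.64 even though the ranking is otherwise perfect and AUC remains 0.95.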

2509.19936 2026-06-10 cs.CV 版本更新

CapStARE: Capsule-based Sequential Architecture for Robust and Efficient Gaze Estimation

CapStARE: 基于胶囊的序列架构实现鲁棒高效的目光估计

Miren Samaniego, Igor Rodriguez, Elena Lazkano

发表机构 * University of the Basque Country(巴斯克大学)

AI总结 提出CapStARE,结合冻结ConvNeXt骨干、注意力路由胶囊和双GRU解码器,在ETH-XGaze等数据集上实现实时高精度目光估计,兼顾空间鲁棒性与计算效率。

Comments Preprint for Pattern Recognition Journal

详情
AI中文摘要

人类目光估计对于人机交互、社交机器人和辅助系统等应用至关重要。然而,在非约束环境中实现准确、可解释且实时的性能仍然具有挑战性。现有的基于外观的方法通常在空间鲁棒性、计算效率和上下文信息的有效利用之间面临权衡。为了解决这一问题,我们引入了CapStARE,一种基于胶囊的架构,它结合了用于高效特征提取的冻结ConvNeXt骨干网络、用于结构化面部推理的基于注意力路由的胶囊形成,以及用于短时域观测窗口上轻量级序列建模的双GRU解码器。这种设计保留了可解释的部分-整体面部关系,同时通过局部上下文一致性提高了预测稳定性。实验结果表明,该方法在ETH-XGaze(3.36)和MPIIFaceGaze(2.65)上表现强劲,同时在Gaze360(9.06)上也具有竞争力的泛化能力,且所有测试均实现实时推理(<10毫秒)。这些发现表明,所提出的方法为现实交互环境中基于外观的目光估计提供了一个实用且鲁棒的框架。相关代码和实验结果公开于:https://github.com/toukapy/capsStare

英文摘要

Human gaze estimation is essential for applications such as human-computer interaction, social robotics, and assistive systems. However, achieving accurate, interpretable, and real-time performance in unconstrained environments remains challenging. Existing appearance-based methods often face trade-offs between spatial robustness, computational efficiency, and effective use of contextual information. To address this, we introduce CapStARE, a capsule-based architecture that combines a frozen ConvNeXt backbone for efficient feature extraction, capsule formation with attention-based routing for structured facial reasoning, and dual GRU decoders for lightweight sequential modeling over short-horizon observation windows. This design preserves interpretable part-whole facial relationships while improving prediction stability through local contextual consistency. Experimental results demonstrate strong performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65), while also generalizing competitively on Gaze360 (9.06), all with real-time inference (<10 ms). These findings suggest that the proposed method provides a practical and robust framework for appearance-based gaze estimation in real-world interactive environments. The related code and experimental results are publicly available at: https://github.com/toukapy/capsStare

2511.19706 2026-06-10 eess.IV cs.CV 版本更新

Selective Disk Bispectrum: A Complete and Rotation Invariant Image Descriptor

选择性圆盘双谱:一种完备且旋转不变的图像描述符

Adele Myers Lantow, Nina Miolane

发表机构 * Department of Physics(物理系) Department of Electrical and Computer Engineering(电气与计算机工程系) University of California, Santa Barbara(加州大学圣芭芭拉分校)

AI总结 提出选择性圆盘双谱(SDB),一种复值旋转不变向量,在保持图像除方向外所有信息的同时,降低了计算复杂度,并验证了其在噪声分类和多参考对齐中的鲁棒性。

详情
AI中文摘要

旋转不变性是许多计算机视觉任务的基本要求。历史上,这种归纳偏置通过手工设计的旋转不变表示来编码。这些表示紧凑、可解释且计算快速,但以描述能力为代价。最近,架构通过学习表示来实现归纳偏置。这些表示高度描述性,实现了强大的经验性能,但以效率和可解释性为代价。在这项工作中,我们提出了两种范式交叉点上的替代方案。我们引入了选择性圆盘双谱(SDB),一种复值旋转不变向量,它保留了图像除方向外的所有信息。我们的关键理论贡献是选择性圆盘双谱、其逆变换、其(降低的)空间和计算复杂度(与完整圆盘双谱相比),以及其在噪声下的期望和方差。此外,我们提出了数值SDB近似,并为其准确性和旋转不变性提供了理论保证。在经验上,我们验证了SDB在噪声分类任务中的不变性和鲁棒性。我们在旋转图像的多参考对齐上测试了我们的重建算法。

英文摘要

Rotation invariance is a fundamental requirement across many computer vision tasks. Historically, this inductive bias has been encoded through hand-crafted rotation-invariant representations. These are compact, interpretable, and fast to compute, but they come at the cost of descriptive power. More recently, architectures achieve inductive bias through learned representations. These are highly descriptive and achieve strong empirical performance, at the cost of efficiency and interpretability. In this work, we propose an alternative at the intersection of both paradigms. We introduce the selective disk bispectrum (SDB), a complex-valued rotation-invariant vector that preserves all information about the image except its orientation. Our key theoretical contributions are the selective disk bispectrum, its inversion, its (reduced) spatial and computational complexities (compared to the full disk bispectrum), and its expectation and variance under noise. Furthermore, we propose a numerical SDB approximation and provide theoretical guarantees for its accuracy and rotation invariance. Empirically, we validate SDB's invariance and robustness to noise on classification tasks. We test our reconstruction algorithm on multi-reference alignment of rotated images.
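The invariance idea behind a bispectrum can be illustrated in one dimension, where rotating an image corresponds to a cyclic shift along the angular coordinate. The sketch below uses the classical cyclic bispectrum B(k1, k2) = F(k1) F(k2) conj(F(k1 + k2)). It is not the selective disk bispectrum from the abstract, only the simpler analogue, and the sample signal is arbitrary.

```python
import cmath

def dft(signal):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def bispectrum(signal):
    """B[k1][k2] = F(k1) * F(k2) * conj(F(k1 + k2 mod n))."""
    f = dft(signal)
    n = len(f)
    return [
        [f[k1] * f[k2] * f[(k1 + k2) % n].conjugate() for k2 in range(n)]
        for k1 in range(n)
    ]

def max_abs_difference(a, b):
    """Largest entrywise distance between two bispectra."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
shifted = signal[2:] + signal[:2]  # cyclic shift, the 1D stand-in for a rotation
print(max_abs_difference(bispectrum(signal), bispectrum(shifted)))
```

A shift multiplies F(k) by a phase that is linear in k, so the phases of the three factors cancel in the product. Unlike the power spectrum, the bispectrum keeps relative phase information, which is why bispectral descriptors can be complete up to the group action.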

4. 目标检测、分割与定位 8 篇

2606.10328 2026-06-10 cs.CV cs.AI 新提交

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

内容诱导的空间-光谱聚合网络用于遥感图像变化检测

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学)

AI总结 提出内容引导的空间-光谱集成网络(CSI-Net),通过空间推理、光谱差异和内容引导集成模块融合全局空间细节与光谱差异信息,有效抑制未变化区域差异,在三个数据集上取得最优性能。

详情
AI中文摘要

空间和光谱信息的整合有利于提高变化检测性能。然而,现有方法无法有效抑制未变化区域中空间和光谱差异的影响。为了解决这些问题,本文提出了一种内容引导的空间-光谱集成网络(CSI-Net),用于融合全局空间细节和光谱差异信息。具体而言,所提出的CSI-Net由空间推理(SR)模块、光谱差异(SD)模块和内容引导集成(CGI)模块组成。在SR模块中,通过级联图卷积块学习空间信息以进行全局建模。SD模块负责提取光谱特征,通过计算特征的均值和方差来减少未变化区域中光谱差异的影响。此外,为了有效集成空间-光谱特征,我们设计了CGI模块以进一步利用它们的互补信息。在该模块中,引入高层内容信息作为引导,以实现适当的交互。由于高效的空间-光谱融合,所提出的CSI-Net能够更好地学习变化特征,同时实现对光谱差异的抑制。在LEVIR-CD、WHU-CD和CLCD数据集上的实验结果表明,与最先进方法相比,所提出的CSI-Net产生了更好的性能,并且适用于不同场景。

英文摘要

The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while achieving a suppression of spectral differences. Experimental results on LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net produces better performance compared to state-of-the-art methods, and is applicable to different scenarios.

2606.10329 2026-06-10 cs.CV cs.AI 新提交

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

地震中的建筑变化检测:一种多尺度交互网络和一个变化检测数据集

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学)

AI总结 针对地震后短期成像间隔导致的变化检测难题,构建了土耳其地震变化检测数据集(TUE-CD),并提出多尺度特征交互网络(MSI-Net),通过联合交叉注意力和多尺度偏移校准模块,有效缓解侧视问题,提升变化检测精度。

详情
AI中文摘要

作为最具破坏性的自然灾害之一,近年来地震袭击了世界许多国家,造成了严重的经济损失。变化检测(CD)可应用于震后损伤评估,因为它能从多时相遥感图像中推断出被破坏的变化区域。此外,短成像间隔的变化检测将更好地满足地震后紧急救援的需求。然而,由于缺乏短成像间隔的数据集,当前基于深度神经网络的方法的能力受到限制。为了满足灾后即时救援的需求,我们创建了一个变化检测数据集——土耳其地震变化检测数据集(TUE-CD),用于评估地震后短期内的建筑损坏情况。由于后事件图像的采集间隔短,不同时相图像的成像角度不同,导致了一些侧视问题。为了应对这些挑战,我们提出了一种多尺度特征交互网络(MSI-Net),用于双时相特征之间的高效交互,并减轻侧视问题的影响。具体来说,所提出的MSI-Net由联合交叉注意力(JCA)模块、多尺度偏移校准(MOC)模块和特征集成(FeI)模块组成。JCA模块统一了通道交叉注意力和空间联合注意力,以实现充分的特征交互。MOC模块进一步估计偏移量,以将双时相图像与多尺度特征对齐。最后,通过FeI模块融合校准后的特征和多尺度特征,用于变化区域的预测。在WHU-CD、CLCD和构建的TUE-CD数据集上的实验表明,所提出的MSI-Net比考虑的最先进的变化检测方法提供了更好的结果。

英文摘要

As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment as it can infer destroyed change regions from multi-temporal remote sensing images. Furthermore, the CD with short imaging interval will better satisfy the needs of the emergency rescues after earthquakes. However, the capability of current methods built on deep neural networks is limited because the dataset with short imaging interval is absent. To meet post-disaster immediate relief, we create a CD dataset, Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle is different for different temporal images, which leads to some side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features, as well as mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD dataset indicate that the proposed MSI-Net provides better results than considered state-of-the-art CD methods.

2606.10696 2026-06-10 cs.CV 新提交

Don't waste SAM

不要浪费 SAM

Nermeen Abou Baker, Uwe Handmann

发表机构 * Ruhr West University of Applied Sciences - Dept of Computer Science(鲁尔西部应用科学大学计算机科学系)

AI总结 本文评估了SAM在垃圾分割任务中的泛化能力,通过微调SAM-ViT-H模型,在三个数据集上显著提升IoU,表明微调SAM作为基础模型对下游任务至关重要。

Comments Published at European Symposium on Artificial Neural Networks (ESANN2023), Computational Intelligence and Machine Learning. Bruges (Belgium)

详情
AI中文摘要

Meta AI 最近发布了 Segment Anything Model (SAM),该模型在各种任务中展示了卓越的零样本图像分割性能,具有显著的准确性。尽管 SAM 无法在多个研究领域提供精确的分割,但它仍然是支持分割流程的宝贵起点,特别是对于需要大量高级技能标注的任务。本研究旨在使用三个垃圾分割数据集评估 SAM 和微调 SAM 模型的泛化能力。尽管这些数据集是从真实场景中捕获的(与 SAM 预训练的数据相同),但它们带来了若干挑战,包括遮挡、可变形物体、透明物体以及易与背景混淆的物体。我们的发现表明,微调的 SAM-ViT-H 模型在 Zerowaste 和 TACO 数据集上优于最先进的方法,IoU 显著提高了 +30,并且非常接近 TrashCan 1.0 的性能水平,仅相差 -1.44。在评估这些流行的垃圾数据集后,很明显,微调 SAM 作为基础模型是为下游垃圾分割任务提供更好泛化能力的关键步骤。因此,SAM 不应被忽视或浪费。

英文摘要

Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline process, particularly for tasks that require extensive and senior skills annotations. This study aims to evaluate the generalization of SAM and fine-tuning SAM models using three waste segmentation datasets. Although they are captured from real scenes as SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state of the art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.

2606.10699 2026-06-10 cs.CV cs.AI 新提交

Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

使用YOLOv12模型验证生产线上网线(跳线)中导线的正确颜色顺序

Amin Doroodchi, Danial Soleimany

发表机构 * Computer Department, Islamic Azad University, Beyza Branch(伊斯兰 Azad 大学计算机系,贝兹分校) R&D at Nedaye Sabz Company, Isfahan Branch(Nedaye Sabz 公司研发部,伊斯法罕分部)

AI总结 针对网线生产中导线颜色顺序检测问题,提出基于YOLOv12的目标检测模型,实现高精度实时验证,减少人工错误。

详情
AI中文摘要

在网络电缆的生产过程中,确保标准连接器内部线对的正确颜色顺序对电缆的最终性能至关重要,因为任何错位或颜色顺序错误都可能导致缺陷产品并造成巨大成本。基于数字显微镜目视检查的传统检测方法通常耗时、繁琐且容易出错。在本研究中,开发了一种基于第十二版YOLO目标检测模型的智能系统,用于识别跳线中导线的位置并验证其正确的颜色顺序。使用的数据集包括从网络连接器显微视图中捕获的2500张图像,其中70%用于训练,15%用于验证,15%用于测试。所提出的模型利用单阶段架构和学习过程中的注意力机制,实现了约98%精度的导线检测。此外,总体平均准确率、分类精度和召回率分别约为95%、99%和98%。结果表明,该系统能够在生产线上可靠地实时验证导线颜色顺序的正确性,无需人工干预,从而减少人为错误并提高制造效率。

英文摘要

In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process.
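One plausible way to turn per-wire detections into a pass/fail decision is to sort the detected boxes left to right and compare the resulting labels against the expected pinout. The sketch below assumes the T568B order and a hypothetical detection format (`label`, `conf`, `box` as x1, y1, x2, y2); the abstract does not state which wiring standard or post-processing the authors use.

```python
# T568B pin order, pins 1 to 8 (assumed target; the line could equally use T568A).
T568B = [
    "white-orange", "orange", "white-green", "blue",
    "white-blue", "green", "white-brown", "brown",
]

def wire_sequence(detections):
    """Order detected wires left to right by box centre and return their labels."""
    ordered = sorted(detections, key=lambda d: (d["box"][0] + d["box"][2]) / 2)
    return [d["label"] for d in ordered]

def is_correct(detections, expected=T568B, min_conf=0.5):
    """Pass only if the confident detections match the expected colour order exactly."""
    kept = [d for d in detections if d["conf"] >= min_conf]
    return wire_sequence(kept) == expected

# Hypothetical detector output for one connector image, listed in arbitrary order.
detections = [
    {"label": label, "conf": 0.95, "box": (10 * i, 0, 10 * i + 8, 40)}
    for i, label in enumerate(T568B)
]
detections.reverse()
print(is_correct(detections))
```

Requiring an exact match also rejects connectors with a missing or duplicated wire, since the sequence length then differs from eight.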

2606.10769 2026-06-10 cs.CV 新提交

ZODS-RS -- Zero-training Oriented Detection & Segmentation for Remote Sensing

ZODS-RS -- 面向遥感的零训练目标检测与分割

Zuan Gu, Tianhan Gao, Langxu Zhao

发表机构 * Northeastern University, China(东北大学)

AI总结 提出一种无需训练的封闭式管道ZODS-RS,通过原型纯化、旋转尺度等变匹配和不确定性感知像素合并,统一了遥感图像的水平框检测与实例分割,在多个数据集上取得优异性能。

详情
AI中文摘要

遥感与无人机应用需要模型能够跨平台和视角泛化,而无需特定任务训练。然而,无训练管道在处理有向几何、尺度/旋转变化以及拥挤的港口或机场时常常失败,并且很少统一检测与分割。我们提出ZODS-RS,一种无训练、封闭式的管道,输出水平框(HBB)和实例掩码。基于DINOv3密集特征和SAM风格的提议,ZODS-RS链式包含:PP(通过Tyler协方差进行原型纯化)、R-SEM(使用可分离核和全局匈牙利分配的旋转尺度等变匹配)以及UAM(具有自适应先验和可选负原型的不确定性感知逐像素合并)。一个轻量级的CWLA融合多个DINOv3层。在FAIR1M(HBB)上,我们获得mAP@0.50:0.95 = 13.06和AP_S = 2.93(船舶/飞机类别平均);在xView(HBB)上,我们报告mAP = 16.69。在我们的无人机数据集上,ZODS-RS实现了掩码mIoU = 31.10,并在单张5090上将小目标AP相对于Grounded-SAM提升了+30.70。这项工作为航空影像中的水平框检测加实例分割提供了统一的、无需训练的解决方案;提供了与DINOv3紧密耦合的PP/R-SEM/UAM的显式封闭形式公式;并在小目标和拥挤目标以及跨域迁移下展示了一致的增益,同时保持部署简单。

英文摘要

Remote-sensing and UAV applications need models that generalize across platforms and viewpoints without task-specific training. Yet training-free pipelines often falter on oriented geometry, scale/rotation variation, and crowded ports or airfields, and rarely unify detection and segmentation. We introduce ZODS-RS, a training-free, closed-form pipeline that outputs horizontal boxes (HBB) and instance masks. Built on DINOv3 dense features and SAM-style proposals, ZODS-RS chains: PP (prototype purification via Tyler covariance), R-SEM (rotation-scale equivariant matching with separable kernels and global Hungarian assignment), and UAM (uncertainty-aware pixelwise merging with adaptive priors and optional negative prototypes). A lightweight CWLA fuses multiple DINOv3 layers. On FAIR1M (HBB) we obtain mAP@0.50:0.95 = 13.06 and AP_S = 2.93 (class-averaged over ship/airplane); on xView (HBB) we report mAP = 16.69. On our UAV dataset, ZODS-RS achieves mask mIoU = 31.10 and improves small-object AP by +30.70 over Grounded-SAM on a single 5090. This work offers a unified, no-training solution for horizontal-box detection plus instance segmentation in aerial imagery; provides explicit closed-form formulations for PP/R-SEM/UAM tightly coupled with DINOv3; and demonstrates consistent gains on small and crowded targets and under cross-domain shifts while keeping deployment simple.

2606.10940 2026-06-10 cs.CV cs.AI cs.LG 新提交

Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals

民主化相机陷阱AI:用于检测英国哺乳动物的开源模型

Paul Fergus, Philip Stephens, Russell A. Hill, Lee Oliver, Katie Appleby, Sarah Beatham, Naomi Davies Walsh, Stuart Nixon, Naomi Matthews, Chris Sutherland, Kelly Hitchcock

发表机构 * Liverpool John Moores University(利物浦约翰摩尔斯大学) Durham University(杜伦大学) MammalWeb(哺乳动物网) Game & Wildlife Conservation Trust(狩猎与野生动物保护信托) National Trust(国家信托) Animal and Plant Health Agency(动物和植物卫生局) Chester Zoo(切斯特动物园) University of St Andrews(圣安德鲁斯大学) Nottingham Trent University(诺丁汉特伦特大学)

AI总结 发布一个针对31类(28种英国常见哺乳动物和鸟类)的开源目标检测模型,基于YOLO26x在48,165个标注实例上训练,mAP@0.5达0.984,旨在降低生态学家使用AI的门槛。

Comments 15 Pages, 4 Figures

详情
AI中文摘要

相机陷阱已成为生物多样性监测的基石,但将大量图像转化为可用生态数据的人工智能通常被锁定在商业平台之后,或针对与不列颠群岛不相符的动物群进行训练。为了消除障碍并提高采用率,我们发布了一个针对31类(28种英国常见哺乳动物和鸟类,以及人类、校准杆和车辆等实用类)的开源目标检测模型,该模型基于从多个地点经过十年运营部署(通过Conservation AI及其后续项目Trap Tracker)收集的48,165个标注实例的精选数据集。该模型是YOLO26x检测器,在80/10/10的类别分层划分上进行训练和测试,在保留的验证集上,IoU为0.5时平均精度为0.984(IoU 0.5-0.95时为0.956),精确率为0.988,召回率为0.965。在未见过的保留测试集上,31个类别的平均物种置信度范围为0.96至0.99,假阴性率为0.17%,主要集中在困难的夜间、远处或遮挡图像中。这些指标来自与训练相同站点和相机池的数据,因此在新站点的性能留待未来工作。我们以非商业许可发布ONNX格式的训练权重,支持本地桌面和实时相机,明确面向没有机器学习经验的生态学家。此发布是对过去十年中开发的多个付费模型的有意制衡。

英文摘要

Camera traps have become a cornerstone of biodiversity monitoring, but the artificial intelligence that turns vast quantities of images into usable ecological data is often locked behind commercial platforms or trained on fauna that does not match that of the British Isles. In an attempt to remove barriers and increase uptake, we release an open-source object detection model for 31 classes, 28 common UK mammal and bird species, plus utility classes for humans, calibration poles, and vehicles, drawn from a curated dataset of 48,165 labelled instances assembled from multiple sites over a decade of operational deployment through Conservation AI and its successor, Trap Tracker. The model, a YOLO26x detector trained and tested on an 80/10/10 class-stratified split, achieves a mean Average Precision of 0.984 at Intersection over Union (IoU) of 0.5 (0.956 at IoU 0.5-0.95) on the held-out validation set, with precision 0.988 and recall 0.965. On an unseen held-out test split, mean per-species confidence ranged from 0.96 to 0.99 across the 31 classes, with a 0.17% false-negative rate concentrated in difficult night-time, distant, or occluded images. These metrics come from data drawn from the same pool of sites and cameras as training, so performance at entirely new sites is left to future work. We release the trained weights in ONNX format under a non-commercial licence, with local desktop and real-time camera support, aimed explicitly at ecologists with no machine-learning experience. This release is a deliberate counterweight to the multiple paid-for models that have been developed over the last decade.
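The mAP figures above depend on Intersection over Union between predicted and labelled boxes: at "IoU of 0.5", a detection counts as correct only if its IoU with a ground-truth box of the same class is at least 0.5. A minimal sketch of the IoU computation follows, assuming axis-aligned boxes given as (x1, y1, x2, y2); the sample boxes are made up.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

prediction = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
iou = box_iou(prediction, ground_truth)
print(f"IoU={iou:.3f} counted as correct at 0.5: {iou >= 0.5}")
```

The stricter 0.5-0.95 figure averages precision over a range of IoU thresholds, so it rewards tightly fitting boxes rather than loose overlaps.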

2507.02513 2026-06-10 cs.CV 版本更新

Automatic Labelling for Low-Light Pedestrian Detection

低光照行人检测的自动标注

Dimitrios Bouzoulas, Eerik Alamikkotervo, Risto Ojala

发表机构 * Energy and Mechanical Engineering, Aalto University(阿尔托大学能源与机械工程系)

AI总结 提出一种自动红外-RGB流水线,利用红外检测生成标签训练低光照行人检测模型,在KAIST数据集上优于真实标签。

详情
AI中文摘要

RGB图像中的行人检测是行人安全的关键任务,因为自动驾驶车辆和高级驾驶辅助系统中最常见的传感器是RGB相机。低光照行人检测缺乏大型公共数据集和自动标注流水线。本研究提出一种自动红外-RGB流水线作为解决方案。该流水线包括:1) 红外检测,使用微调的红外行人检测模型;2) 标签转移过程,将红外检测结果转移到对应的RGB图像;3) 使用生成的标签训练低光照RGB行人检测的目标检测模型。研究使用KAIST数据集进行。评估中,三个目标检测模型DETR、YOLO和RCNN在生成的标签和真实标签上分别训练。在未见过的图像上比较时,结果显示,在mAP@50和LAMR指标上,基于生成标签训练的模型在6个案例中的5个优于基于真实标签训练的模型,并且在所有案例中mAP@50-95指标均优于真实标签。获得的结果表明,所提出的自动标注流水线可用于低光照行人检测数据集的可扩展标注。本研究的源代码可在GitHub上获取:https://github.com/BouzoulasDimitrios/IR-RGB-autoamed-low-light-pedestrian-labelling

英文摘要

Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. Low-light pedestrian detection lacks large public datasets and auto-labelling pipelines. This research proposes a solution in the form of an automated infrared-RGB pipeline. The pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) a label transfer process from the infrared detections to their RGB counterparts; and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For evaluation, three object detection models, DETR, YOLO, and RCNN, were trained on generated and ground-truth labels. When compared on previously unseen images, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 5 out of 6 cases for the mAP@50 and LAMR metrics, and outperformed ground truth on mAP@50-95 in all cases. Acquired results indicate that the proposed auto-labelling pipeline could be used for scalable annotation of low-light datasets for pedestrian detection. The source code for this research is available on GitHub: https://github.com/BouzoulasDimitrios/IR-RGB-autoamed-low-light-pedestrian-labelling

2603.11917 2026-06-10 cs.CV 版本更新

PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

PicoSAM3:实时传感器内感兴趣区域分割

Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno

发表机构 * ETH Zürich(苏黎世联邦理工学院) IBM Research(IBM研究院)

AI总结 PicoSAM3是一款轻量级、可在传感器内实时运行的可提示分割模型,结合密集CNN架构、感兴趣区域提示编码和知识蒸馏,实现低延迟高精度分割。

详情
AI中文摘要

实时、在设备上的分割对于延迟敏感且隐私保护的应用至关重要,如智能眼镜和物联网设备。我们介绍了PicoSAM3,一个针对边缘和传感器内执行优化的轻量级可提示视觉分割模型,包括在索尼IMX500视觉传感器上的部署。PicoSAM3拥有1.3M参数,结合密集CNN架构、感兴趣区域提示编码、高效通道注意力机制以及从SAM2和SAM3的知识蒸馏。在COCO和LVIS数据集上,PicoSAM3分别达到65.45%和64.01%的mIoU,在复杂度相近或更低的条件下优于现有基于SAM和面向边缘的基线模型。INT8量化模型在精度上几乎没有下降,同时在IMX500上实现了11.82 ms延迟的实时传感器内推理,完全符合其内存和算子限制。消融研究显示,与监督训练相比,从大型SAM模型进行知识蒸馏可使mIoU提升高达14.5%,证明了高质量、空间灵活的可提示分割可以直接在传感器层面实现。

英文摘要

Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3M parameters and combines a dense CNN architecture with region-of-interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.

5. 生成式视觉与世界模型 26 篇

2606.09967 2026-06-10 cs.CV 新提交

ABot-Earth 0.5: Generative 3D Earth Model

ABot-Earth 0.5:生成式3D地球模型

Ming Qian, Tianjian Ouyang, Mingchao Sun, Zijian Wang, Jincheng Xiong, Jiarong Han, Yongchang Zhang, Jiawei Zhang, Xu Wang, Yu Liu, Luyang Tang, Fei Yu, Zengye Ge, Mengmeng Du, Yuan Liu, Nianfei Fan, Song Wang, Yingliang Peng, Chunxue Jia, Yang Liu, Shiying Zeng, Haozhe Shi, Junnan Lai, Hongyu Pan, Zheng Wu, Ning Guo, Mu Xu, Hang Zhang

AI总结 提出ABot-Earth 0.5框架,利用3D高斯泼溅从卫星图像生成大规模无缝3D环境,每平方公里合成时间低于10分钟,支持实时交互可视化,降低3D重建成本。

Comments From Amap-cvlab, Alibaba. Official page: https://abot-earth.amap.com/

详情
AI中文摘要

我们提出ABot-Earth 0.5,一个生成式3D框架,旨在从普遍存在的、地理参考的卫星图像中合成大规模无缝3D环境。为此,我们提出了一种新颖的生成模型,直接使用3D高斯泼溅(3DGS)表示。该模型在多样化的真实世界城市重建语料库上进行训练,学习生成逼真的几何和纹理。在推理时,它仅以卫星图像为条件合成新颖的3D场景,可扩展速率低于每平方公里10分钟,同时表现出卓越的真实感。该框架设计为易于访问,集成了分层细节级别(LOD)结构,允许在基于Web的地图引擎上进行实时交互式可视化。这种高保真模拟沙箱有效缓解了模拟到现实的领域差距,支持关键的具身人工智能下游应用,如闭环无人机导航。通过提供超低成本和高效的解决方案,ABot-Earth 0.5显著降低了大规模3D重建的技术和财务障碍,并推动了全球数字地球可视化的未来。

英文摘要

We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions, learning to generate realistic geometry and textures. At inference, it synthesizes novel 3D scenes conditioned solely on satellite imagery at a scalable rate of under 10 minutes per square kilometer, while demonstrating exceptional realism. The framework is designed for accessibility, with integrated hierarchical level-of-detail (LOD) structures that permit real-time, interactive visualization on web-based map engines. This high-fidelity simulation sandbox effectively mitigates the sim-to-real domain gap, enabling critical downstream Embodied AI applications like closed-loop UAV navigation. By providing an ultra-low-cost and high-efficiency solution, ABot-Earth 0.5 significantly lowers the technical and financial barriers to large-scale 3D reconstruction and empowers the future of global digital earth visualization.

2606.10183 2026-06-10 cs.CV cs.AI cs.MM 新提交

Making Time Editable in Video Diffusion Transformers

在视频扩散Transformer中实现时间可编辑性

Konstantin Kuklev, Viacheslav Vasilev, Alexander Kunitsyn, Andrei Ivaniuta, Denis Dimitrov

AI总结 提出一种时间控制方法,通过轻量级时间模块扩展预训练DiT,实现运动速度和时序结构的编辑,无需重新设计骨干网络。

详情
AI中文摘要

现代用于视频生成的扩散Transformer对时间进程的控制和时序动态的编辑能力有限。我们提出一种时间控制方法,通过显式时间编辑扩展预训练DiT,允许控制运动速度和时序结构,而无需重新设计骨干网络。其核心实现通过一个轻量级时间模块增强预训练模型,保留原始生成先验的同时扩展其可控动态范围。

英文摘要

Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamics. We propose a temporal-control methodology that extends a pretrained DiT with explicit time editing, allowing control over motion speed and temporal structure without redesigning the backbone. Its core implementation augments the pretrained model with a lightweight temporal module, preserving the original generative prior while expanding its controllable dynamic range.

2606.10450 2026-06-10 cs.CV cs.LG 新提交

Few-step Generative Models as Lossy Compression

少步生成模型作为有损压缩

Fuma Kimishima, Jinjia Zhou

发表机构 * University of Tokyo(东京大学)

AI总结 研究将少步生成模型(Rectified Flow、CTM、MeanFlow)用于反向信道编码框架进行有损压缩,通过参数化等效和局部高斯近似实现无需重训练的编解码,在低分辨率基准上减少编解码时间并提升低比特率下的真实性。

详情
AI中文摘要

DiffC 提供了一种重用预训练扩散模型进行有损压缩的原则性方法,但其编码和解码过程仍然缓慢,因为它们需要许多离散化的前向和反向步骤。我们研究少步生成模型——Rectified Flow、一致性轨迹模型(CTM)和 MeanFlow——是否可以在相同的反向信道编码(RCC)框架中作为编解码器使用。主要挑战在于 RCC 需要后验和共享分布参数,而这些模型并未显式参数化中间条件分布。对于 Rectified Flow 和 MeanFlow,我们利用速度参数化与扩散式去噪参数化之间的等价性来推导 RCC 所需的量。对于从 EDM 蒸馏得到的 CTM,我们采用 EDM 噪声参数化以及中间状态下发送方和共享分布的局部高斯近似。这产生了一个概念验证的概率公式,使得无需重新训练即可使用预训练的少步生成模型进行压缩。在低分辨率基准上,由此产生的编解码器减少了编码和解码时间,并在低比特率范围内提高了真实性。

英文摘要

DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. We study whether few-step generative models -- Rectified Flow, Consistency Trajectory Models (CTM), and MeanFlow -- can be cast as codecs within the same reverse channel coding (RCC) framework. The main challenge is that RCC requires posterior and shared distribution parameters, whereas these models do not explicitly parameterize intermediate conditional distributions. For Rectified Flow and MeanFlow, we use the equivalence between velocity parameterization and diffusion-style denoising parameterization to derive the quantities required by RCC. For CTM, which is distilled from EDM, we adopt the EDM noise parameterization together with local Gaussian approximations of the sender and shared distributions at intermediate states. This yields a proof-of-concept probabilistic formulation that enables compression with pre-trained few-step generative models without retraining. On low-resolution benchmarks, the resulting codecs reduce encoding and decoding time and improve realism in the low-bit-rate regime.

2606.10492 2026-06-10 cs.CV New submission

PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

Haodong Lei, Hongsong Wang, Bingxuan Dai, Pan Zhou

Affiliations * College of Software Engineering, Southeast University; School of Computer Science and Engineering, Southeast University; School of Cyber Science and Engineering, Southeast University; School of Computing and Information Systems, Singapore Management University

AI summary Addresses slow inference caused by long token sequences in autoregressive text-to-image models by proposing a parallel-path cross-relaxed speculative Jacobi decoding framework; a multi-sequence draft tree expands the search space and cross-path semantic similarity raises the acceptance rate, yielding 3.95x to 4.18x speedups.

Comments 10 pages, 5 figures

Abstract

The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (PathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (PathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (PathRelax) that exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves a speedup ratio of 4.14×, 3.95×, and 4.18×, respectively. Remarkably, PathExplore, without any relaxed sampling, outperforms relaxed sampling methods in the speedup ratio, such as GSD and LANTERN. Moreover, PathRelax's relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at https://github.com/Haodong-Lei-Ray/PathSpec.

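2606.10617 2026-06-10 cs.CV New submission

SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models

Zhengxuan Wei, Yi Dong, Zonghui Li, Xianhui Lin, Xing Liu, Hong Gu, Shaofeng Zhang, Wenbin Li, Qi Fan

Affiliations * Stanford University

AI summary Proposes Subspace Signal Routing (SSR), which concatenates LoRAs along the rank dimension to build a unified subspace, decorrelates signals with an inverse correlation matrix, and separates them with a directional guide matrix to resolve interference in parameter merging; the method is shown theoretically to align with the OLS solution, and a streaming algorithm reduces overhead.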

Comments Accepted at ICML 2026

Abstract

Low-Rank Adaptation (LoRA) merging can efficiently combine diverse generative capabilities from multiple trained LoRAs for a diffusion model. However, existing LoRA merging techniques often suffer from severe parameter interference, causing destructive collisions in the shared parameter space. To address this, we propose Subspace Signal Routing (SSR), which resolves interference by routing internal signals instead of performing parameter-space merge. Specifically, SSR first constructs a unified subspace by concatenating candidate LoRAs along the rank dimension. Next, SSR employs an inverse correlation matrix to decorrelate mixed signals within this space. Finally, a directional guide matrix steers these purified signals into their respective task-specific subspaces. We provide a rigorous theoretical analysis proving that SSR aligns with the Ordinary Least Squares (OLS) solution, thereby ensuring mathematical optimality. We utilize the additivity of sufficient statistics to design a streaming algorithm. This enables on-the-fly updates that significantly reduce memory overhead and computation time. Extensive experiments validate that SSR significantly outperforms state-of-the-art methods while maintaining comparable efficiency. Code is available at https://github.com/nagara214/SSR-Merge.
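
The streaming idea rests on a standard fact: for least squares, the sufficient statistics X^T X and X^T Y are sums over samples, so they can be accumulated batch by batch. The sketch below shows that generic mechanism with NumPy. It is not the authors' SSR algorithm, and the class name `StreamingOLS` and the small ridge term are illustrative assumptions.

```python
import numpy as np

class StreamingOLS:
    """Accumulates X^T X and X^T Y across batches, then solves for W."""

    def __init__(self, in_dim, out_dim, ridge=1e-8):
        self.xtx = np.zeros((in_dim, in_dim))
        self.xty = np.zeros((in_dim, out_dim))
        self.ridge = ridge  # tiny regularizer for numerical stability (assumption)

    def update(self, x, y):
        # Sufficient statistics are additive, so each batch is just summed in.
        self.xtx += x.T @ x
        self.xty += x.T @ y

    def solve(self):
        eye = np.eye(self.xtx.shape[0])
        return np.linalg.solve(self.xtx + self.ridge * eye, self.xty)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
W_true = rng.normal(size=(6, 3))
Y = X @ W_true

ols = StreamingOLS(in_dim=6, out_dim=3)
for start in range(0, 200, 50):  # four batches of 50 rows each
    ols.update(X[start:start + 50], Y[start:start + 50])

W_stream = ols.solve()
W_batch = np.linalg.lstsq(X, Y, rcond=None)[0]  # all-at-once reference
```

Only the two accumulator matrices are kept in memory, so the footprint is independent of how many samples have been seen.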

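2606.10653 2026-06-10 cs.CV New submission

STEDiff: Strengthening Text Embedding for Text-to-Image Alignment in Diffusion Model

Hailan Zhang, Haipeng Liu, Bo Fu, Yang Wang

Affiliations * Hailan Zhang, Haipeng Liu, Yang Wang (institution not specified); Bo Fu (institution not specified)

AI summary Proposes STEDiff, a training-free method that uses the [EOT] token to strengthen sub-sentence semantics and adds a semantic enhancement loss, improving the semantic alignment of diffusion models with complex prompts in the text-embedding space without fine-tuning or layout priors.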

Comments 8 pages, 8 figures, to appear at IJCNN 2026

Abstract

Although pretrained text-to-image (T2I) generation models can produce high-quality images, they often fail to faithfully reflect the semantic intent of complex prompts due to stochastic noise and inherent model limitations. This issue frequently manifests as the model overlooking specific objects or failing to correctly bind attributes to their corresponding entities, a challenge referred to as semantic alignment. Unlike existing approaches that rely on computationally expensive fine-tuning or labor-intensive layout priors, we propose STEDiff, a training-free method designed to enhance semantic representations directly within the text-embedding space. Specifically, we introduce a method that primarily leverages the [EOT] token to strengthen the relevant semantics of sub-sentences and then replaces the corresponding tokens in the original prompt. Furthermore, a novel semantic enhancement loss is incorporated to enforce spatial constraints, ensuring that the semantics of each entity are precisely mapped to their respective image regions. Extensive quantitative and qualitative evaluations on the T2I-CompBench demonstrate that our method notably improves semantic consistency and generation integrity in complex scenarios.

2606.10671 2026-06-10 cs.CV New submission

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Yu Lu, Junjie Yang, Piotr Koniusz, YuXin Song, Yi Yang

Affiliations * Zhejiang University; University of New South Wales (UNSW); Data61/CSIRO; Baidu Inc

AI summary Proposes FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget; a power-law allocation yields dense-near, sparse-far memory and improves subject consistency and temporal coherence in long video generation.

Comments 11 pages, 4 figures

Abstract

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
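
The dense-near, sparse-far behavior can be sketched with a toy cache. The merge rule below is a hypothetical power-law heuristic chosen for illustration (merge the adjacent pair whose combined span is smallest relative to `(1 + distance) ** alpha`); the abstract does not specify the paper's actual allocation schedule or how KV blocks are merged, and the count-weighted mean used here is likewise an assumption.

```python
import numpy as np

def insert_and_consolidate(entries, kv, budget, alpha=1.0):
    """Append one fine-grained entry, then merge older neighbours until within budget.

    entries: list of (vector, count) pairs ordered oldest -> newest.
    Hypothetical heuristic for illustration only, not the paper's schedule.
    """
    entries = entries + [(np.asarray(kv, dtype=float), 1)]
    while len(entries) > budget:
        # dist[i] = number of raw steps that are newer than entry i
        dist, newer = [0] * len(entries), 0
        for i in range(len(entries) - 1, -1, -1):
            dist[i] = newer
            newer += entries[i][1]
        # Pick the adjacent pair whose merged span is smallest relative to
        # the span tolerated at its distance from the present.
        i = min(
            range(len(entries) - 1),
            key=lambda j: (entries[j][1] + entries[j + 1][1]) / (1 + dist[j + 1]) ** alpha,
        )
        (va, ca), (vb, cb) = entries[i], entries[i + 1]
        merged = ((ca * va + cb * vb) / (ca + cb), ca + cb)  # count-weighted mean
        entries = entries[:i] + [merged] + entries[i + 2:]
    return entries

cache = []
for step in range(64):
    cache = insert_and_consolidate(cache, [float(step), 1.0], budget=8)
counts = [c for _, c in cache]  # raw steps summarized by each entry, oldest first
```

The cache never exceeds the budget, the newest step stays unmerged, and older entries summarize progressively longer spans.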

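2606.10839 2026-06-10 cs.CV New submission

HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

Cong Wang, Zhentao Yu, Hongmei Wang, Weicong Liang, Zixiang Zhou, Zilin Yang, Jiarong Ou, Rui Chen, Yuan Zhou, Qinglin Lu

Affiliations * Tencent Hunyuan

AI summary Proposes HarmoView, which combines architectural refinements (multi-level feature injection, learnable proxy tokens, and Jump-RoPE) with a progressive view curriculum to preserve appearance fidelity in identity-consistent video generation under large viewpoint changes, achieving state-of-the-art results on the authors' multi-view benchmark.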

Comments Project Page: https://conallwang.github.io/HarmoView_Pages

Abstract

Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection (MFI) to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-/multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.

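2606.10892 2026-06-10 cs.CV cs.AI New submission

Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding

Yihao Zhao, Xuan Han, Bin He, Mingyu You

AI summary Targets the artifacts produced by text-driven foreground-conditioned outpainting by proposing the Customized Concept Embedding Diffusion framework, which customizes concept embeddings using an instance-aware loss and a semantic-preserving prompt template, significantly reducing artifacts and improving image quality.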

Abstract

To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object's prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module's optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.

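2606.10902 2026-06-10 cs.CV cs.AI New submission

Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

Xuan Han, Yihao Zhao, Mingyu You

AI summary Proposes Pose-ICL, a tuning-free framework for pose-controllable subject customization built on 3D-aware in-context learning and Surface-Anchored Position Embedding (SAPE), significantly improving pose accuracy and identity consistency.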

Abstract

Subject Customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can generate images of a specific object in any desired scene. However, existing methods still struggle to achieve effective pose control for customized subjects. In practice, they often exhibit inaccurate poses or inconsistent cross-pose appearances. These limitations suggest that understanding objects in a volumetric manner remains a significant challenge for 2D-native backbones. To address this challenge, we propose Pose-ICL, a tuning-free framework that leverages 3D-aware In-Context Learning (ICL) to directly adapt to new subjects through multiple paired image-pose references. Its core mechanism, Surface-Anchored Position Embedding (SAPE), equips the model with explicit 3D awareness by anchoring image tokens to the surface coordinates of a volumetric bounding box. Dedicated refinements ensure its seamless compatibility with existing DiT models. Extensive evaluations on both 3D assets and real-world subjects demonstrate that Pose-ICL significantly outperforms current methods in both pose accuracy and identity consistency.

2606.11096 2026-06-10 cs.CV New submission

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong, Yixuan Ren, Bo He, Yu-Gang Jiang, Zuxuan Wu

Affiliations * Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; University of Maryland, College Park

AI summary Proposes IDEAL, which jointly aligns quantized tokens with shallow and deep VFM features to improve the reconstruction quality of discrete representation autoencoders, reaching 0.61 rFID on ImageNet and a new state of the art for autoregressive image generation (gFID 1.89).

Comments Code is available at https://github.com/Row11n/IDEAL

Abstract

Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.

2606.11148 2026-06-10 cs.CV New submission

MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

Xiaoyu Han, Chenyang Wang, Jing Wang, Shunyuan Zheng, Quanling Meng, Shengping Zhang

Affiliations * Harbin Institute of Technology; HiDream.ai Inc.; Harbin Institute of Technology (Weihai) Qingdao Research Institute

AI summary Proposes MOFA-VTON, which lets users make fine-grained adjustments to garment layout in virtual try-on by drawing simple sketches; using a mask construction strategy and layout adjustment blocks, it outperforms existing methods on VITON-HD and DressCode.

Comments Accepted to CVPR 2026 (Highlight)

Abstract

Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.

2606.11155 2026-06-10 cs.CV New submission

Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models

An Zhao, Shengyuan Zhang, Zhongjian Sun, Yixiang Zhou, Zejian Li, Ling Yang, Tianrun Chen, Lingyun Sun

Affiliations * National University of Singapore; University of Science and Technology of China

AI summary Proposes Mean Flow Distillation (MFD), which acts as a temporal low-pass filter that suppresses optimization noise while ensuring trajectory consistency, enabling high-fidelity single-step generation for flow matching models.

Abstract

Flow Matching models have demonstrated strong performance across a wide range of generative tasks. However, their reliance on ODE-based iterative sampling incurs substantial computational overhead in inference, which limits their applicability in real-time scenes. While distillation is a promising solution, existing approaches largely borrow from diffusion-based score matching, often failing to exploit the intrinsic geometric structure of flows and suffering from training instability, high variance, and degraded generation quality. In this paper, we propose Mean Flow Distillation (MFD), a novel distillation framework tailored for flow matching models. We theoretically demonstrate that MFD acts as a temporal low-pass filter, effectively suppressing the high-frequency optimization noise inherent in variational score distillation (VSD) while ensuring global trajectory consistency. We further prove the Mean Flow Matching Theorem, establishing that matching expected average velocities is sufficient for strict distribution alignment. Empirically, on challenging tasks of high-dimensional manifolds including 4D occupancy forecasting and text-to-image generation, MFD achieves state-of-the-art performance, enabling high-fidelity single-step generation.

2606.11180 2026-06-10 cs.CV New submission

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee, Siyoon Jin, Heeseong Shin, Jung Yi, Yunjin Park, Chulmin Park, Seungryong Kim

Affiliations * KAIST AI; AIPARK

AI summary Proposes Lip Forcing, described as the first autoregressive diffusion method for video-to-video lip synchronization; it distills a 14B teacher into causal students that achieve real-time synchronization in only two denoising steps, and introduces Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward.

Comments Project Page: https://cvlab-kaist.github.io/LipForcing/

Abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, 17.6× faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8× faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.

2606.11187 2026-06-10 cs.CV New submission

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu

Affiliations * Robbyant; HUST; HKUST; HKUST (GZ)

AI summary Proposes Next Forcing, whose multi-chunk prediction training objective speeds up convergence of video generation models, improves accuracy, and accelerates inference, with state-of-the-art results on RoboTwin and gains on PhyWorld.

Comments Project page: https://gangweix.github.io/next-forcing/

Abstract

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.

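2606.09901 2026-06-10 cs.GR cs.CV cs.HC cs.LG cs.MM Cross-list

On the Controllability-Fidelity Frontier in Diffusion Editing

Yi Hu, Leying Yi, Emily Davis, Finn Carter

Affiliations * Xidian University

AI summary Combines theory and experiments to study the trade-offs among adherence to user intent, preservation of non-target content, and output quality in diffusion-based image editing; proposes algorithmic frameworks, identifies key failure modes, and discusses ethical considerations.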

Comments Preprint

Abstract

Diffusion-based generative models enable powerful image editing capabilities, but achieving precise control while maintaining fidelity and safety remains challenging. We present a comprehensive theoretical and empirical study of controllable diffusion-based image editing, analyzing the trade-offs between adherence to user intent, preservation of non-target content, and output quality. Our work spans text- and mask-guided edits, point/drag manipulation, and inversion-based pipelines. We derive mathematical formulations of editing objectives and analyze dynamics of noise injection, score guidance, and inversion error. We provide theoretical bounds on reconstruction error, stability under repeated edits, and locality of changes. We propose algorithmic frameworks (with pseudocode) for mask-localized and instruction-guided editing, and present extensive experiments comparing state-of-the-art methods (e.g., TF-ICON, DragFlow, InstructPix2Pix, UltraEdit) on multiple tasks and metrics (FID, identity similarity, CLIP alignment, artifact scores, etc.). Our results reveal key failure modes, such as identity drift, prompt sensitivity, and compositional errors. We also discuss ethical considerations in image editing, including misuse risks, bias, consent, and concept erasure techniques (e.g., MACE, ANT, EraseAnything) as safeguards. We conclude with best practices and future directions for responsible, high-fidelity diffusion-based editing.

2506.14753 2026-06-10 cs.CV cs.LG Updated version

Cost-Aware Routing for Efficient Text-To-Image Generation

Qinchan Li, Kenneth Chen, Changyue Su, Wittawat Jitkrittum, Qi Sun, Patsorn Sangkloy

Affiliations * Tandon School of Engineering, New York University; Google; Eigen 4D Inc.

AI summary Proposes a cost-aware routing framework that automatically selects a number of denoising steps or a model according to prompt complexity, lowering compute cost while maintaining quality and outperforming any single model.

Comments Accepted by TMLR

Abstract

Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone. Code is available at https://github.com/winglicopy/CATImage.
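
The routing idea can be illustrated with a toy rule. The abstract does not state the router's objective, so the sketch below assumes a simple Lagrangian-style utility (predicted quality minus a cost penalty); the quality scores, costs, and the function name `route` are all made up for illustration.

```python
import numpy as np

def route(pred_quality, costs, lam):
    """Pick, for each prompt, the model maximizing predicted quality - lam * cost.

    pred_quality: (num_prompts, num_models) predicted quality scores.
    costs: (num_models,) relative compute cost of each model.
    """
    utility = pred_quality - lam * costs[None, :]
    return utility.argmax(axis=1)

costs = np.array([1.0, 4.0, 10.0])  # hypothetical relative compute per model
pred_quality = np.array([
    [0.80, 0.82, 0.83],  # simple prompt: the cheap model is nearly as good
    [0.40, 0.70, 0.90],  # complex prompt: the expensive model helps a lot
])

choice = route(pred_quality, costs, lam=0.02)
avg_cost = costs[choice].mean()
```

Here the simple prompt is sent to the cheapest model and the complex prompt to the most expensive one, so the average cost sits well below always using the largest model. Raising `lam` shifts more prompts toward cheaper choices.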

2509.16518 2026-06-10 cs.CV cs.AR 版本更新

FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

FG-Attn:在视频扩散模型中利用细粒度稀疏注意力

Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对视频扩散模型中注意力层计算开销大的问题,提出FG-Attn,一种低开销的细粒度稀疏注意力机制,在MxN块粒度上跳过分数计算,实现最高2.45倍加速。

详情
AI中文摘要

使用扩散Transformer进行媒体生成可能需要评估极长序列上的注意力,其中注意力层占生成延迟的大部分。利用注意力图中的稀疏性为降低这一成本提供了有前景的机会。在这项工作中,我们展示了视频生成模型中扩散Transformer的注意力图表现出显著的细粒度稀疏性。然而,现有的稀疏注意力方法要么过于粗粒度,留下大量未处理的冗余计算,要么在更细粒度上产生高开销。我们提出FG-Attn,一种新颖的低开销细粒度稀疏注意力机制,它在MxN块的粒度上跳过分数计算,其中N>=1且M>=16,每个块是M个查询和N个键之间查询-键点积的结果。FG-Attn解决了GPU上稀疏注意力内核中硬件利用率不足的关键挑战,同时避免了不规则内存访问和冗余操作的开销。FG-Attn可以完全取代现有的稀疏注意力方法,并将块稀疏注意力方法扩展到现代GPU上的更细粒度。在70%稀疏度下,FG-Attn比最先进的FlashInfer最高快2.45倍,注意力内核时间平均减少14.7%。与Flash Attention 3相比,FG-Attn将端到端视频生成最高加速1.40倍(平均1.18倍)。

英文摘要

Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps offers a promising opportunity to reduce this cost. In this work, we show that attention maps in diffusion transformers exhibit significant fine-grained sparsity in video generation models. Existing sparse attention methods, however, are either too coarse-grained, leaving a large fraction of redundant computation unaddressed, or incur high overheads at finer granularity. We propose FG-Attn, a novel, low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of an MxN tile, where N>=1 and M>=16, and where each tile is the result of query-key dot products between M queries and N keys. FG-Attn addresses the key challenge of hardware underutilization in sparse attention kernels on GPUs, without incurring the overheads of irregular memory access and redundant operations. FG-Attn can fully supersede existing sparse attention methods and extend block sparse attention methods to finer granularities on modern GPUs. At 70% sparsity, FG-Attn is up to 2.45X faster than the state-of-the-art FlashInfer, and reduces attention kernel time by 14.7% on average. FG-Attn speeds up end-to-end video generation times by up to 1.40X (1.18X on average) over Flash Attention 3.
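To make the MxN tile granularity concrete, the following is a minimal NumPy reference for the semantics of tile-skipping attention. It is not the FG-Attn GPU kernel: the function name `tile_sparse_attention`, the boolean `keep` map, and the way tiles are chosen are assumptions for illustration only. Skipped tiles contribute zero attention weight, so the result equals dense attention restricted to the kept keys.

```python
import numpy as np

def tile_sparse_attention(q, k, v, keep, M, N):
    """Reference attention that skips whole M x N score tiles.

    q: (Lq, d) queries; k: (Lk, d) keys; v: (Lk, dv) values.
    keep: (Lq // M, Lk // N) boolean map; False means the tile is skipped.
    Assumes Lq % M == 0, Lk % N == 0, and at least one kept tile per query row.
    """
    Lq, d = q.shape
    Lk = k.shape[0]
    scores = np.full((Lq, Lk), -np.inf)  # skipped tiles stay at -inf
    for i in range(Lq // M):
        rows = slice(i * M, (i + 1) * M)
        for j in range(Lk // N):
            if keep[i, j]:  # only kept tiles pay for the dot products
                cols = slice(j * N, (j + 1) * N)
                scores[rows, cols] = q[rows] @ k[cols].T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(32, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 4))
keep = np.ones((2, 6), dtype=bool)
keep[0, 3:] = False  # the first 16 queries ignore keys 3..5
out = tile_sparse_attention(q, k, v, keep, M=16, N=1)
```

With `M=16` and `N=1`, each tile covers 16 queries and a single key, which is the finest setting the abstract allows.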

2512.08180 2026-06-10 cs.CV 版本更新

GeoLoom: High-quality Geometric Diagram Generation from Textual Input

GeoLoom:从文本输入生成高质量几何图形

Xiaojing Wei, Ting Zhang, Wei He, Jingdong Wang, Hua Huang

AI总结 提出GeoLoom框架,通过自动形式化模块和坐标求解器,将自然语言几何描述转化为高质量图形,并引入约束评估指标,显著优于现有方法。

详情
AI中文摘要

高质量几何图形生成既带来挑战也带来机遇:它要求严格的空间准确性,同时提供明确的约束来指导生成。受近期在几何问题求解中使用形式语言和符号求解器以增强正确性和可解释性的进展启发,我们提出了GeoLoom,一个用于几何领域文本到图形生成的新颖框架。GeoLoom包含两个核心组件:一个自动形式化模块,将自然语言翻译成专门设计的面向生成的形式语言GeoLingua;以及一个坐标求解器,利用高效的蒙特卡洛优化将形式约束映射到精确坐标。为支持该框架,我们引入了GeoNF,一个将自然语言几何描述与形式化GeoLingua描述对齐的数据集。我们进一步提出了一种基于约束的评估指标,量化结构偏差,为迭代细化提供数学上有依据的监督。实验结果表明,GeoLoom在结构保真度上显著优于最先进的基线,为可解释和可扩展的图形生成提供了原则性基础。

英文摘要

High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into GeoLingua, a specifically designed generation-oriented formal language, and a coordinate solver that maps formal constraints to precise coordinates using efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.

2512.12675 2026-06-10 cs.CV cs.AI 版本更新

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Scone:通过统一理解-生成建模弥合主体驱动图像生成中的组合与区分

Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang

发表机构 * Peking University(北京大学) Kling Team, Kuaishou Technology(快手科技 Kling 团队) Zhongguancun Academy(中关村学院) HKUST(香港科技大学) Beijing Key Laboratory of Data Intelligence and Security (Peking University)(北京数据智能与安全重点实验室(北京大学))

AI总结 提出Scone方法,通过统一理解-生成模型结合组合与区分能力,采用两阶段训练实现主体身份保持与干扰最小化,在双基准上优于现有开源模型。

Comments CVPR 2026 Highlight. Code: https://github.com/Ryann-Ran/Scone

详情
AI中文摘要

主体驱动图像生成已从单主体发展到多主体组合,但忽略了区分能力——即当输入包含多个候选主体时,区分并生成正确主体的能力。这一限制制约了其在复杂、真实视觉场景中的有效性。我们提出Scone,一种统一理解-生成方法,整合了组合与区分。Scone使理解专家充当语义桥梁,传递语义信息并引导生成专家在最小化干扰的同时保持主体身份。两阶段训练方案首先学习组合,然后通过语义对齐和基于注意力的掩码增强区分。我们还引入了SconeEval,一个用于评估多种场景下组合与区分的基准。实验表明,Scone在两个基准上的组合与区分任务中均优于现有开源模型。我们的模型、基准和训练数据可在以下网址获取:https://github.com/Ryann-Ran/Scone。

英文摘要

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

2512.14614 2026-06-10 cs.CV cs.GR 版本更新

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

WorldPlay:面向实时交互式世界建模的长期几何一致性

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出WorldPlay流式视频扩散模型,通过双重动作表示、重构上下文记忆和上下文强制蒸馏方法,实现实时交互式世界建模并保持长期几何一致性,生成24 FPS的720p长视频。

Comments project page: https://3d-models.hunyuan.tencent.com/world/, demo: https://3d.hunyuan.tencent.com/sceneTo3D, code: https://github.com/Tencent-Hunyuan/HY-WorldPlay

详情
AI中文摘要

本文提出WorldPlay,一种流式视频扩散模型,能够实现实时、交互式的世界建模,并保持长期几何一致性,解决了当前方法在速度与内存之间的权衡。WorldPlay的威力来自三个关键要素。1)我们使用双重动作表示(Dual Action Representation),以响应用户的键盘和鼠标输入实现鲁棒的动作控制。2)为了强制长期一致性,我们的重构上下文记忆(Reconstituted Context Memory)从过去帧动态重建上下文,并使用时间重构使几何上重要但久远的帧保持可访问,有效缓解记忆衰减。3)我们还提出上下文强制(Context Forcing),一种针对记忆感知模型的新型蒸馏方法。对齐教师和学生之间的记忆上下文,保留了学生使用长程信息的能力,在实现实时速度的同时防止误差漂移。综合来看,WorldPlay以24 FPS生成具有优越一致性的长时域流式720p视频,与现有技术相比表现更优,并在多种场景中展现出强大的泛化能力。项目页面和在线演示可访问:https://3d-models.hunyuan.tencent.com/world/ 和 https://3d.hunyuan.tencent.com/sceneTo3D。

英文摘要

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key ingredients. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

2602.06886 2026-06-10 cs.CV 版本更新

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

提示重注入:缓解多模态扩散变换器中的提示遗忘

Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本研究针对多模态扩散变换器中提示遗忘问题,提出一种无需训练的提示重注入方法,通过将早期层的提示表示重新注入到后期层,以提升指令遵循能力和生成质量。

Comments 19 pages

详情
AI中文摘要

多模态扩散变换器(MMDiTs)用于文本到图像生成时,维持文本和图像分支的分离,并在整个去噪过程中实现文本标记与视觉潜在变量之间的双向信息流。在此设置中,我们观察到一种提示遗忘现象:随着深度增加,文本分支中提示表示的语义逐渐被遗忘。我们进一步通过探测三个代表性MMDiTs(SD3、SD3.5和FLUX.1)中文本分支各层表示的语言属性,验证了这一现象。受这些发现启发,我们引入了一种无需训练的方法,即提示重注入,通过将早期层的提示表示重新注入到后期层来缓解这种遗忘。在GenEval、DPG和T2I-CompBench++上的实验表明,这种方法在指令遵循能力方面取得了一致的提升,并在捕获偏好、美学和整体文本-图像生成质量的指标上也有所改进。

英文摘要

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.

2605.07800 2026-06-10 cs.CV 版本更新

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

SARA:面向视频扩散模型的语义自适应关系对齐

Jiesong Lian, Zixiang Zhou, Ruizhe Zhong, Yuan Zhou, Qinglin Lu, Rui Wang, Long Hu, Yixue Hao, Baoru Huang

发表机构 * Tencent Hunyuan(腾讯混元)

AI总结 提出SARA方法,通过文本条件显著性引导令牌对监督,提升视频扩散模型的文本对齐和运动质量。

详情
AI中文摘要

最近的视频扩散模型(VDMs)能够合成视觉上令人信服的片段,但仍会丢失实体、错误绑定属性并削弱提示中指定的交互。表示对齐目标如VideoREPA和MoAlign通过从冻结的视觉基础模型中蒸馏时空令牌关系来改进细粒度文本跟随,但其成对监督预算由视觉或运动线索分配,而非根据每对与提示的相关性。我们提出SARA,语义自适应关系对齐,它保持对冻结VFM目标的令牌关系蒸馏(TRD),并添加一个文本条件显著性来决定哪些令牌对携带监督。一个轻量级的Stage 1对齐器使用每个实体的SAM 3.1掩码监督和InfoNCE正则化器进行训练,其连续显著性通过一个对路由算子融合到TRD中,该算子为每个令牌对分配权重,只要其两个端点中有一个是显著的,从而将监督导向主体-主体和主体-背景对,远离背景-背景对。在Wan2.2持续训练设置中,SARA在13维VLM评估标准、公共VBench基准和盲用户研究中,在文本对齐和运动质量上均优于SFT、VideoREPA和MoAlign。项目页面:https://saradit.github.io/。

英文摘要

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study. Project page: https://saradit.github.io/.

2606.08674 2026-06-10 cs.CV cs.AI 版本更新

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

BioVid: 具有生物行为语义理解的自回归视频生成

Tsung-Wei Pan, Jung-Hua Wang

发表机构 * Department of Electrical Engineering, National Taiwan Ocean University(国立台湾海洋大学电子工程系) AI research center, National Taiwan Ocean University(国立台湾海洋大学人工智能研究中心)

AI总结 提出BioVid,一种数据驱动的自回归视频生成框架,通过FSQ-R3GAN分词器和因果Transformer学习生物行为的自然时长分布,无需预设长度约束。

详情
AI中文摘要

现有的视频生成框架将序列时长视为外部指定参数(固定的帧数或文本提示),生成的片段在时间边界上与真实行为数据的统计结构脱节。这一假设与生物行为根本不一致,因为动作时长在个体和实例之间自然变化,并编码在数据本身中。我们提出BioVid,一种数据驱动的自回归视频生成框架,直接从训练数据中学习生物行为的时序结构,包括其自然长度分布。在第一阶段,有限标量量化GAN(FSQ-R3GAN)分词器将每个视频帧编码为紧凑的离散表示,结合R3GAN的稳定相对论式训练目标和FSQ有保证的码本利用率,实现高保真空间重建且不会出现码本崩溃。在第二阶段,因果Transformer自回归地对生成的令牌序列建模,并学习在行为事件达到语义闭合时发出序列结束(EOS)令牌,终止分布自然地从训练数据中涌现,而非来自任何人为指定的约束。在人类饮水行为数据集(NTU RGB+D, A001, n=94)上的实验表明,BioVid生成的长度分布与保留测试数据的分布紧密匹配,与真实分布的Wasserstein-1距离为1.24(相比之下,固定长度基线为6.05,VideoGPT为15.48),同时保持有竞争力的空间保真度。

英文摘要

Existing video generation frameworks treat sequence duration as an externally prescribed parameter (fixed frame counts or text prompts), producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, including their natural length distributions. In the first stage, a Finite Scalar Quantization GAN (FSQ-R3GAN) tokenizer encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ's guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, with the termination distribution emerging naturally from the training data rather than any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) demonstrate that BioVid's generated length distribution closely matches that of held-out test data, achieving a Wasserstein-1 distance of 1.24 against the ground truth (compared to 6.05 for a fixed-length baseline and 15.48 for VideoGPT) while maintaining competitive spatial fidelity.
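The Wasserstein-1 distance quoted in this abstract compares two one-dimensional length distributions. As a rough illustration, it can be computed from the area between the two empirical CDFs. The sketch below assumes uniformly weighted samples; the sequence lengths are made-up numbers, not data from the paper, and the paper's exact evaluation protocol is not stated in the abstract.

```python
import numpy as np

def wasserstein1(a, b):
    """W1 distance between two 1-D empirical distributions (uniform weights).

    Integrates |CDF_a(x) - CDF_b(x)| over x, which equals W1 in one dimension.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.sort(np.concatenate([a, b]))   # every point where a CDF jumps
    widths = np.diff(grid)                   # lengths of the intervals between jumps
    # Value of each empirical CDF on the interval starting at grid[i].
    cdf_a = np.searchsorted(a, grid[:-1], side="right") / a.size
    cdf_b = np.searchsorted(b, grid[:-1], side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * widths))

# Hypothetical sequence lengths (in frames) for generated and held-out clips.
generated_lengths = [0, 1, 3]
test_lengths = [5, 6, 8]
distance = wasserstein1(generated_lengths, test_lengths)  # every sample is shifted by 5
```

A fixed-length generator puts all of its mass on one value, so its distance to a spread-out reference distribution cannot fall below the reference's mean absolute deviation around that value.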

2310.05264 2026-06-10 cs.LG cs.CV 版本更新

The Emergence of Reproducibility and Generalizability in Diffusion Models

扩散模型中可重复性与泛化性的出现

Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, Qing Qu

AI总结 研究发现扩散模型在相同初始噪声和确定性采样器下,不同模型输出高度相似,且这种可重复性在记忆和泛化两种训练模式下均存在,对训练效率、模型隐私等有重要启示。

Comments NeurIPS Diffusion Model Workshop 2023 (best paper award), the Forty-first International Conference on Machine Learning (ICML 2024)

详情
AI中文摘要

在这项工作中,我们研究了扩散模型的一个有趣且普遍的现象,我们称之为“一致模型可重复性”:给定相同的起始噪声输入和确定性采样器,不同的扩散模型通常会产生非常相似的输出。我们通过全面的实验证实了这一现象,这意味着不同的扩散模型一致地达到相同的数据分布和评分函数,无论扩散模型框架、模型架构或训练过程如何。更引人注目的是,我们的进一步研究表明,扩散模型学习到的不同分布受到训练数据大小的影响。这一点得到了以下事实的支持:模型可重复性表现在两种不同的训练机制中:(i)“记忆机制”,其中扩散模型过拟合到训练数据分布,以及(ii)“泛化机制”,其中模型学习底层数据分布。我们的研究还发现,这一有价值的特性推广到许多扩散模型的变体,包括用于条件使用、解决逆问题和模型微调的变体。最后,我们的工作提出了许多有趣的理论问题供未来研究,并强调了关于训练效率、模型隐私和扩散模型受控生成的实际意义。

英文摘要

In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently reach the same data distribution and scoring function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models are learning distinct distributions affected by the training data size. This is supported by the fact that the model reproducibility manifests in two distinct training regimes: (i) "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional use, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications regarding training efficiency, model privacy, and the controlled generation of diffusion models.

2601.08379 2026-06-10 cs.LG cs.AI cs.CV 版本更新

MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

MMD Guidance:基于最大均值差异引导的扩散模型免训练分布适应

Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MMD Guidance,一种无训练方法,通过最大均值差异梯度引导扩散模型采样,实现与参考数据分布对齐,无需重新训练。

详情
AI中文摘要

预训练扩散模型已成为无条件及条件样本生成的有力先验,但其输出常偏离用户特定目标数据的特征。这种不匹配在领域适应任务中尤为突出,此时仅有少量参考样本可用且重新训练扩散模型不可行。现有推理时引导方法可调整采样轨迹,但通常优化替代目标(如分类器似然)而非直接对齐目标分布。我们提出MMD Guidance,一种无训练机制,通过生成样本与参考数据集之间的最大均值差异(MMD)梯度增强反向扩散过程。MMD能从有限数据中提供可靠分布估计,实践中方差低,且可高效微分,特别适合引导任务。我们的框架通过乘积核自然扩展到条件生成模型中的提示感知适应。此外,由于引导在潜在扩散模型(LDM)的潜在空间中进行,因此可高效计算。在合成及真实世界基准上的实验表明,MMD Guidance能在保持样本保真度的同时实现分布对齐。项目代码见 github.com/matinamehdizadeh/MMD-Guidance。

英文摘要

Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity. The project code is available at github.com/matinamehdizadeh/MMD-Guidance.
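As a rough illustration of the quantity this abstract builds on, the sketch below computes a biased estimate of the squared MMD between generated samples and a reference set, plus its gradient with respect to the generated samples. The RBF kernel, the bandwidth, the step size, and the function names are assumptions made for the example; the abstract does not specify the kernel or how the gradient enters the sampler.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

def mmd2_grad_x(x, y, bandwidth=1.0):
    """Analytic gradient of mmd2 with respect to every row of x."""
    n, m = len(x), len(y)
    kxx = rbf_kernel(x, x, bandwidth)[:, :, None]
    kxy = rbf_kernel(x, y, bandwidth)[:, :, None]
    dxx = x[:, None, :] - x[None, :, :]   # pairwise differences x_a - x_j
    dxy = x[:, None, :] - y[None, :, :]   # pairwise differences x_a - y_j
    grad = (-(2.0 / n ** 2) * (kxx * dxx).sum(axis=1)
            + (2.0 / (n * m)) * (kxy * dxy).sum(axis=1))
    return grad / bandwidth ** 2

rng = np.random.default_rng(0)
generated = rng.normal(size=(5, 2))            # stand-in for generated latents
reference = rng.normal(loc=2.0, size=(7, 2))   # stand-in for reference latents
# One illustrative guidance step: move generated samples down the MMD gradient.
guided = generated - 0.1 * mmd2_grad_x(generated, reference)
```

In a guided sampler, a step like the last line would be interleaved with the usual denoising updates; here it stands alone to show the direction of the correction.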

6. 3D视觉、点云与空间智能 13 篇

2606.10019 2026-06-10 cs.CV cs.AI cs.RO 新提交

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

广义CVO:基于二阶黎曼优化的快速无对应局部点云配准

Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits

AI总结 提出一种基于几何表面结构和再生核希尔伯特空间嵌入的无对应局部点云配准方法,采用二阶流形优化实现高达10倍加速,在LiDAR和RGB-D跟踪及物体配准中显著降低漂移并提升鲁棒性。

Comments 16 pages, 12 figures

详情
AI中文摘要

我们提出了一种快速且无需对应关系的局部点云配准方法,该方法利用了几何表面结构和再生核希尔伯特空间(RKHS)嵌入。该方法将点云表示为具有逐点各向异性核的连续函数,这些核编码了局部几何信息。这种公式化在沿表面法线方向改善对齐的同时,放松了沿切线方向的对齐。为了解决由此产生的配准问题,我们提出了一种具有近似黎曼海森矩阵的二阶流形优化方案,与先前基于无对应RKHS方法中使用的一阶求解器相比,实现了高达10倍的加速。我们展示了在多种室内外数据集上改进的帧到帧LiDAR和RGB-D跟踪精度。在驾驶领域的LiDAR跟踪配准任务中,我们在具有挑战性的特征稀疏环境下实现了平移和旋转漂移均减少超过55%。在物体配准基准测试中,我们展示了相比基于ICP的方法更强的鲁棒性,并且在优化全局初始化时(尤其是在中等错位情况下)获得了进一步的提升。

英文摘要

We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of more than 55% in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.

2606.10364 2026-06-10 cs.CV 新提交

Benchmarking stereo reconstruction for 3D printable Martian terrain models

用于3D打印火星地形模型的立体重建基准测试

Josephine Wang

发表机构 * MIT Cambridge, MA, USA(麻省理工学院,马萨诸塞州剑桥,美国)

AI总结 针对火星图像低纹理、不规则和部分观测的特点,评估从NASA好奇号图像估计立体深度、补全几何并导出可打印网格的流程,发现基准精度不直接迁移到火星地形重建,几何补全存在局部保真度与全局连通性的权衡。

Comments 9 pages, 7 figures, CVPR End-to-End 3D Workshop 2026

详情
AI中文摘要

从火星车图像重建可打印的3D模型具有挑战性,因为火星地形纹理低、不规则且部分被观测。我们评估了一个流程,该流程从NASA好奇号图像估计立体深度,补全几何,并导出水密OBJ网格。在Middlebury数据集上,RAFT-Stereo优于半全局块匹配(SGBM),将视差MAE从3.22像素降低到0.73像素,并将有效预测覆盖率从76.3%提高到100.0%。然而,在好奇号图像上,RAFT更密集的视差显示出较弱的边缘对齐和更高的光度重投影误差,表明基准精度不能直接迁移到火星地形重建。几何补全展示了局部保真度与全局连通性之间的权衡。我们发现,alpha形状保留了准确但碎片化的结构,泊松重建产生更连贯的网格但增加了无支撑表面,而确定性扩散填充基线介于两者之间但对立体质量敏感。总体而言,标准立体和补全方法可以产生火星地形的可打印近似,但可靠的重建需要更强的领域特定验证。

英文摘要

Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT's denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.
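The two Middlebury numbers in this abstract, disparity MAE and valid prediction coverage, can be sketched as follows. The definitions here are assumptions: invalid pixels are marked with NaN, MAE is averaged over pixels where both prediction and ground truth are valid, and coverage is the fraction of ground-truth-valid pixels that received a prediction. The paper may define them differently, and the arrays are toy values.

```python
import numpy as np

def disparity_metrics(pred, gt):
    """Return (MAE in pixels, coverage) for a predicted disparity map.

    pred, gt: arrays of the same shape; NaN marks an invalid pixel (assumed convention).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    gt_valid = np.isfinite(gt)
    pred_valid = np.isfinite(pred)
    both = gt_valid & pred_valid
    mae = float(np.abs(pred[both] - gt[both]).mean())   # error where both are valid
    coverage = float(pred_valid[gt_valid].mean())       # share of GT pixels predicted
    return mae, coverage

# Toy 2x2 maps: one pixel lacks a prediction, one lacks ground truth.
pred = [[1.0, np.nan],
        [3.0, 4.0]]
gt = [[1.5, 2.0],
      [3.0, np.nan]]
mae, coverage = disparity_metrics(pred, gt)
```

Reporting both numbers together matters because a sparse matcher can reach a low MAE simply by declining to predict in difficult regions.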

2606.10395 2026-06-10 cs.CV 新提交

Efficient RWKV-based Representation Learning for 3D Point Clouds

基于RWKV的高效三维点云表示学习

Yun Liu, Xuefeng Yan, Liangliang Nan, Xianzhi Li, Peng Li, Zhe Zhu, Honghua Chen, Mingqiang Wei

发表机构 * School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) Shenzhen Institute of Research, Nanjing University of Aeronautics and Astronautics(南京航空航天大学深圳研究院) Collaborative Innovation Center of Novel Software Technology and Industrialization(新型软件技术与产业化协同创新中心) Urban Data Science section, Delft University of Technology(代尔夫特理工大学城市数据科学部) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出P-RWKV模块,通过局部感知扩展和空间上下文增强,将RWKV从序列建模适配到3D点云,实现线性复杂度的全局依赖建模,在多项任务中以更低计算成本取得竞争性能。

详情
AI中文摘要

最近提出的接收加权键值(RWKV)模型结合了RNN风格的循环,为建模全局依赖提供了Transformer二次自注意力的线性复杂度替代方案。然而,当直接应用于点云时,原本为序列文本开发的RWKV难以有效捕捉局部几何结构和建模空间依赖。为了解决这个问题,我们提出了P-RWKV模块,它在保持RWKV效率优势的同时,弥合了序列建模与不规则3D几何之间的差距。它包含一个局部感知扩展(LPE)组件,用于沿时空序列扩展上下文感知,以及一个空间上下文增强(SCE)组件,用于增强空间意识。为了验证P-RWKV在点云理解中的有效性,我们构建了PointER,一个单模态自监督表示学习框架,其编码器由堆叠的P-RWKV模块组成。此外,我们将P-RWKV扩展到跨模态设置,并将所提出的核心子模块集成到多种架构中,展示了强大的即插即用灵活性和架构通用性。大量实验表明,P-RWKV模块及其关键子模块在各种任务中以较低的计算成本和推理延迟取得了竞争性能。代码将在接收后发布。

英文摘要

The recent receptance weighted key value (RWKV) model combines RNN-style recurrence, offering a linear-complexity alternative to Transformers' quadratic self-attention for modeling global dependencies. However, when directly applied to point clouds, RWKV, originally developed for sequential text, struggles to capture local geometric structures and model spatial dependencies effectively. To address this, we propose the P-RWKV block, which bridges the gap between sequence modeling and irregular 3D geometry while preserving the efficiency advantages of RWKV. It consists of a Local Perception Expansion (LPE) component to expand contextual perception along the spatio-temporal sequence and a Spatial Context Enhancement (SCE) component to strengthen spatial awareness. To validate the effectiveness of P-RWKV for point cloud understanding, we construct PointER, a single-modality self-supervised representation learning framework whose encoder is composed of stacked P-RWKV blocks. Furthermore, we extend P-RWKV to a cross-modality setting and integrate the proposed core sub-modules into multiple architectures, demonstrating strong plug-and-play flexibility and architectural generality. Extensive experiments show that the P-RWKV block and its key sub-modules achieve competitive performance across various tasks with lower computational cost and inference latency. Code will be released upon acceptance.

2606.10478 2026-06-10 cs.CV 新提交

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

3D-CoS:基于VLM代码合成的新型3D重建范式

Yuhao Wang, Puyi Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yu Cheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学) Microsoft(微软) University of Oxford(牛津大学)

AI总结 提出3D代码合成(3D-CoS)范式,将3D资产表示为可执行的Blender代码,利用VLM进行程序化重建,实现高可控性和局部编辑能力。

Comments Preprint. 24 pages, 11 figures

详情
AI中文摘要

最近的3D重建和编辑系统大多基于隐式或显式表示,如NeRF、点云或网格。尽管这些表示能够实现高保真渲染,但它们本质上是低层次的,难以通过编程控制。相比之下,我们提出并系统评估了一种新的3D重建范式——3D代码合成(3D-CoS),其中3D资产被构建为可执行的Blender代码,这是一种可编程且可解释的媒介。为了评估当前VLM使用代码表示3D对象的能力,我们在统一协议下评估了代表性的开源和闭源VLM在基于代码的重建中的表现。我们进一步引入了一套结构化的代码合成工作流,包括基于蓝图的规划、Blender API文档的检索增强生成(RAG)、少样本几何演示以及用于逐部分代码生成的组件级Agent工作流。为了展示这种表示的独特优势,我们进一步评估了局部文本驱动的修改,并将我们的基于代码的编辑与基于点云的3D编辑基线进行了比较。我们的研究表明,代码作为3D表示提供了强大的可控性和局部性,在目标编辑评估中产生了更强的编辑保真度和更好的未编辑区域保持。我们的工作还分析了这种范式的潜力,描绘了当前VLM在程序化3D建模中的能力边界,并强调了代码合成作为可编辑3D重建的一个有前景的方向。

英文摘要

Most recent 3D reconstruction and editing systems operate on implicit and explicit representations such as NeRF, point clouds, or meshes. While these representations enable high-fidelity rendering, they are fundamentally low-level and hard to control programmatically. In contrast, we propose and systematically evaluate a new 3D reconstruction paradigm, 3D Code Synthesis (3D-CoS), where 3D assets are constructed as executable Blender code, a programmatic and interpretable medium. To assess how well current VLMs can use code to represent 3D objects, we evaluate representative open-source and closed-source VLMs in code-based reconstruction under a unified protocol. We further introduce a suite of structured code-synthesis workflows, including blueprint-based planning, Retrieval-Augmented Generation (RAG) over Blender API documentation, few-shot geometric demonstrations, and a component-level Agent workflow for part-wise code generation. To demonstrate the unique advantages of this representation, we further evaluate localized text-driven modifications and compare our code-based edits with a point-cloud-based 3D editing baseline. Our study shows that code as a 3D representation offers strong controllability and locality, yielding stronger edit fidelity and better preservation of unedited regions in our targeted editing evaluation. Our work also analyzes the potential of this paradigm, delineates the current capability frontier of VLMs for programmatic 3D modeling, and highlights code synthesis as a promising direction for editable 3D reconstruction.

2606.10541 2026-06-10 cs.CV 新提交

GRAR: Glass-induced Reflection Artifact Removal in LiDAR Point Clouds

GRAR:LiDAR点云中玻璃引起的反射伪影去除

Wanpeng Shao, Zeyi Guo, Bo Zhang, Yifei Xue, Tie Ji, Yizhen Lao

发表机构 * College of Computer Science and Electronic Engineering, Hunan University(湖南大学计算机科学与电子工程学院) School of Design, Hunan University(湖南大学设计学院) School of Finance and Statistics, Hunan University(湖南大学金融与统计学院)

AI总结 提出两阶段框架,先利用多模态视觉基础模型和几何线索精确分割玻璃区域,再基于物理驱动的反射感知局部-全局几何相似性描述符去除反射伪影,在多个公开数据集上优于现有方法。

详情
AI中文摘要

在城市环境中采集的地面激光扫描(TLS)点云经常受到玻璃引起的反射伪影的影响,严重降低了后续应用的质量。现有的反射伪影去除方法通常依赖于理想的反射对称性假设,但其性能受限于不准确的玻璃估计和不足的几何表示。为了解决这些问题,我们提出了一种新颖的统一框架,旨在实现鲁棒的反射伪影去除:在第一阶段,我们利用多模态视觉基础模型生成初始玻璃掩膜,然后使用几何线索进行细化以获得高精度的玻璃区域,随后进行玻璃补全以恢复透明表面上由于无回波测量导致的缺失区域;在第二阶段,我们提出了一种物理驱动的描述符,称为反射感知局部-全局几何相似性(RE-LGGS),该描述符基于实际的激光反射几何,并使用基于PCA的局部形状表示联合编码多尺度几何结构和方向一致性,从而显著提高了对不完美观测的鲁棒性。在多个公开TLS数据集上的大量实验表明,我们的框架在反射伪影去除方面始终优于最先进的方法。

英文摘要

Terrestrial Laser Scanning (TLS) point clouds captured in urban environments frequently suffer from glass-induced reflection artifacts, severely degrading downstream applications. Existing reflection artifact removal methods generally rely on ideal reflection symmetry assumptions, yet their performance is limited by inaccurate glass estimation and insufficient geometric representations. To address these issues, we propose a novel unified framework aimed at robust reflection artifact removal. In the first stage, we leverage a multi-modal vision foundation model to produce initial glass masks, which are then refined using geometric cues to achieve high-precision glass regions, followed by glass completion to recover missing regions caused by no-return measurements on transparent surfaces. In the second stage, we propose a physics-driven descriptor, termed Reflection-aware Local-Global Geometric Similarity (RE-LGGS), which is grounded in actual laser reflection geometry and jointly encodes multi-scale geometric structures and orientation consistency using PCA-based local shape representations, thereby significantly improving robustness against imperfect observations. Extensive experiments on multiple public TLS datasets demonstrate that our framework consistently outperforms state-of-the-art methods in reflection artifact removal.

2606.10594 2026-06-10 cs.CV 新提交

Segment and Select: Vision-Language Segmentation in 3D Scenarios

Segment and Select:3D场景中的视觉-语言分割

Yulin Chen, Zhihang Zhong, Yuenan Hou

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学) Shanghai Jiaotong University(上海交通大学)

AI总结 提出SEGA3D范式,通过掩码候选生成器、大语言模型和语义空间选择器实现3D场景中基于语言指令的细粒度分割,在ScanNet和Matterport3D上分别提升8.3和5.3 mIoU。

Comments The core idea is to reformulate 3D vision-language segmentation as the segment-and-select paradigm (free from the superpoint dependency)

详情
AI中文摘要

3D视觉-语言分割旨在根据语言指令和视觉观察在3D场景中分割目标对象。现有技术严重依赖粗糙的超点表示来降低计算复杂度,这导致分割质量差和对象边界混乱。本文提出用于3D视觉-语言分割的SEGment-And-Select(SEGA3D)范式,该范式直接操作于细粒度视觉信息,无需依赖超点。具体而言,我们首先利用掩码候选生成器提供细粒度的类别掩码候选,显著提高候选掩码相对于超点对应物的质量。然后,利用大语言模型(LLM)基于语言描述和视觉特征生成语义和空间信息。LLM输出和视觉特征被输入到语义-空间选择器(SSS)以产生排名最高的掩码候选。最后,设计循环验证模块(LVM)从选定的候选掩码中产生分割掩码。我们的SEGA3D在ScanRefer、ScanNet和Matterport3D基准测试中取得了有竞争力的性能。值得注意的是,我们的SEGA3D在ScanNet和Matterport3D上分别超过最佳性能对手8.3 mIoU和5.3 mIoU。代码将在发表后提供。

英文摘要

3D vision-language segmentation aims to segment target objects in 3D scenarios according to linguistic instructions and visual observations. Prior art relies heavily on the coarse superpoint representation to reduce computational complexity, which leads to poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation, which operates directly on fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over their superpoint counterparts. Then, a Large Language Model (LLM) is used to generate semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Finally, the Loopback Verification Module (LVM) yields the segmentation mask from the selected candidate masks. Our SEGA3D attains competitive performance on the ScanRefer, ScanNet, and Matterport3D benchmarks. Notably, SEGA3D surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Code will be available upon publication.

2606.10602 2026-06-10 cs.CV 新提交

Globally Localizing Lunar Rover in Pixels via Graph Alignment

通过图对齐在像素级全局定位月球车

Mao Chen, Xu Yang, Chuankai Liu, Xiangkai Zhang, Xiaoxue Wang, Zheng Bo, Zuoyu Zhang, Zhiyong Liu

发表机构 * The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Beijing Aerospace Control Center(北京航天飞行控制中心) Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences(中国科学院空间应用工程与技术中心)

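AI summary Proposes the WARG framework, which uses unified graph learning and reprojected graph matching to address inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift in cross-view lunar rover localization, achieving a 1.68 m localization error on real YuTu-2 data.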

详情
AI中文摘要

精确的月球车定位是自主月球探测的前提,然而全球导航卫星系统(GNSS)信号的缺失以及局部定位方法的累积漂移严重限制了远程任务。跨视角定位通过匹配月球车视角和卫星视角图像提供了一种有前景的无漂移全局解决方案。然而,月球环境为对应点对齐带来了独特挑战,包括实体间纠缠、视角间差异以及仿真到真实的域偏移。为了解决这些挑战,我们提出了重投影图扭曲对齐(WARG),一个利用统一图学习和重投影图匹配实现鲁棒跨视角对齐的框架。在合成的LuSNAR数据集上预训练后,WARG的平均测试误差为0.32米,并在合成月球南极区域展现出鲁棒的零样本泛化能力,误差为3.63米。更重要的是,在玉兔二号月球车的真实数据上验证时,WARG在100米×100米的搜索区域内实现了1.68米的定位误差,相当于在空间分辨率为1.40米/像素的低分辨率卫星图像中达到近像素级精度。除了精度,WARG计算高效,仅含1.56M参数,是之前轻量级模型的16.12%,在NVIDIA RTX A6000 GPU上运行频率为5.49 Hz,接近GNSS级更新频率。最后,我们观察到WARG通过跨视角定位学习自然发展出低级空间感知能力,包括语义分割和结构推理,突显其作为以最小标注成本实现空间智能的有前景范式的潜力。源代码见:此 https URL。

英文摘要

Precise rover localization is a prerequisite for autonomous lunar exploration, yet the absence of Global Navigation Satellite System (GNSS) signals and the cumulative drift of local localization methods severely constrain long-range missions. Cross-view localization provides a promising drift-free global solution by matching rover-view and satellite-view imagery. However, the lunar environment poses unique challenges for correspondence alignment, including inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift. To address these challenges, we propose Warped Alignment of Reprojected Graphs (WARG), a framework that leverages unified graph learning and reprojected graph matching for robust cross-view alignment. Pretrained on the synthetic LuSNAR dataset, WARG achieves an average test error of 0.32 m and demonstrates robust zero-shot generalization to the synthetic lunar south pole region with an error of 3.63 m. More importantly, when validated on real-world data from the YuTu-2 rover, WARG achieves a localization error of 1.68 m within a 100 m x 100 m search area, corresponding to nearly one-pixel precision in low-resolution satellite imagery with a spatial resolution of 1.40 m/pixel. Beyond accuracy, WARG is computationally efficient, containing only 1.56M parameters, corresponding to 16.12% of previous lightweight models, and operating at 5.49 Hz on an NVIDIA RTX A6000 GPU, approaching GNSS-level update frequency. Finally, we observe that WARG naturally develops low-level spatial awareness, including semantic segmentation and structural reasoning, through cross-view localization learning, highlighting its potential as a promising paradigm for spatial intelligence with minimal annotation cost. The source code is available at https://github.com/maochen-casia/warg.
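
The headline numbers in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only restates figures quoted above; the variable names are illustrative, and the size of the earlier lightweight model is derived from the quoted ratio, not stated in the abstract.

```python
# Figures quoted in the WARG abstract.
error_m = 1.68                 # localization error on YuTu-2 data, in metres
resolution_m_per_px = 1.40     # satellite image resolution, in metres per pixel
params_millions = 1.56         # WARG parameter count, in millions
fraction_of_previous = 0.1612  # "16.12% of previous lightweight models"
update_rate_hz = 5.49          # reported rate on an NVIDIA RTX A6000

# Localization error expressed in satellite-image pixels.
error_px = error_m / resolution_m_per_px

# Parameter count of the earlier lightweight model implied by the ratio.
implied_previous_params_millions = params_millions / fraction_of_previous

# Time available per localization estimate at the reported rate.
seconds_per_estimate = 1.0 / update_rate_hz

print(f"error: {error_px:.2f} px")
print(f"implied earlier model size: {implied_previous_params_millions:.2f} M parameters")
print(f"time per estimate: {seconds_per_estimate:.3f} s")
```

An error of about 1.2 pixels is what the abstract describes as nearly one-pixel precision at this image resolution.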

2606.10988 2026-06-10 cs.CV cs.GR 新提交

AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects

AnimaSpark: 一种用于任意3D对象动画的前馈方法

Yiming Zhao, Haoyu Sun, Aoyu Wang

发表机构 * Bytedance(字节跳动)

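AI summary Proposes the AnimaSpark pipeline, which renders a rigged 3D model into multi-layered image representations, feeds them to a video generation model, then extracts 2D keypoint motion and lifts it into 3D space for category-agnostic 3D animation generation, outperforming existing methods in text-motion alignment, motion quality, and computational efficiency.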

详情
AI中文摘要

尽管生成式AI的最新进展显著加速了静态3D模型创建流程,但类别无关的3D动画合成仍然是3D资产生产中的主要瓶颈。当前的类别无关动画生成方法在推理速度、运动质量和文本提示遵循方面存在关键限制,使得该过程仍依赖于劳动密集型的手工艺术。为解决这些挑战,本文介绍了AnimaSpark,一种用于类别无关3D动画生成的新型管道。我们的方法受以下关键洞察驱动:对于3D世界中的许多基本运动,相应的关节变换通常可以在二维子空间内有效建模。该管道首先将带骨骼的静态3D模型渲染为其网格和骨架的多层图像表示,随后将其输入视频生成模型。然后,我们在生成的视频上采用关键点跟踪算法,捕获投影到相机观察平面上的骨骼关节运动。在最后阶段,我们从这些跟踪的关键点中提取平面平移和旋转,并将其从2D域提升到3D空间以驱动角色动画。全面评估表明,我们的方法在关键指标(包括文本-运动对齐、运动质量和计算效率)上优于现有最先进技术。

英文摘要

While recent advancements in generative AI have substantially accelerated static 3D model creation workflows, the synthesis of category-agnostic 3D animations remains a significant bottleneck in 3D asset production. Current methods for category-agnostic animation generation exhibit critical limitations in inference speed, motion quality, and adherence to textual prompts, thereby leaving the process dependent on labor-intensive manual artistry. To address these challenges, this paper introduces AnimaSpark, a novel pipeline for category-agnostic 3D animation generation. Our approach is motivated by the key insight that for many fundamental motions in the 3D world, the corresponding joint transformations can often be effectively modeled within a two-dimensional subspace. The pipeline begins by rendering a rigged static 3D model into multi-layered image representations of its mesh and skeleton, which are subsequently fed into a video generation model. We then employ a keypoint tracking algorithm on the generated video to capture the motion of the skeletal joints projected onto the camera's viewing plane. In the final stage, we distill the planar translations and rotations from these tracked keypoints and lift them from the 2D domain into 3D space to animate the character. Comprehensive evaluations reveal that our method achieves superior performance over existing state-of-the-art techniques across key metrics, including text-motion alignment, quality of motion, and computational efficiency.
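
The final stage described above, recovering a planar rotation from tracked keypoints and lifting it into 3D, might look roughly like the sketch below. This is not the authors' implementation: the function names, the choice of the world z-axis as the camera viewing axis, and the sign convention for image coordinates are all assumptions made for illustration.

```python
import math

def planar_angle(parent_a, child_a, parent_b, child_b):
    """Signed in-plane rotation (radians) of a bone between two video frames.

    Each argument is a tracked 2D keypoint (x, y) in image coordinates.
    """
    ax, ay = child_a[0] - parent_a[0], child_a[1] - parent_a[1]
    bx, by = child_b[0] - parent_b[0], child_b[1] - parent_b[1]
    # atan2(cross, dot) gives the signed angle from bone A to bone B.
    return math.atan2(ax * by - ay * bx, ax * bx + ay * by)

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: 3x3 rotation matrix about `axis` by `angle`."""
    norm = math.sqrt(sum(v * v for v in axis))
    x, y, z = (v / norm for v in axis)
    c, s, t = math.cos(angle), math.sin(angle), 1.0 - math.cos(angle)
    return [
        [c + x * x * t,     x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t,     y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]

def apply(matrix, vector):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical tracked keypoints: the bone swings from +x to +y in the image.
angle = planar_angle((0, 0), (1, 0), (0, 0), (0, 1))
# Assumed camera viewing axis in world coordinates (here the world z-axis).
view_axis = (0.0, 0.0, 1.0)
rotation = rotation_about_axis(view_axis, angle)
rotated_bone = apply(rotation, [1.0, 0.0, 0.0])
```

Because the rotation axis is the viewing direction, the 3D joint rotation reproduces exactly the motion seen in the image plane, which is the two-dimensional subspace the abstract refers to.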

2504.18424 2026-06-10 cs.CV 版本更新

LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

LaRI: 用于单视图3D几何推理的分层射线交点

Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka

发表机构 * ETH Zurich(苏黎世联邦理工学院) Adobe Research(Adobe研究)

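AI summary Proposes LaRI, which uses layered point maps to predict the intersections of camera rays with multiple surfaces, achieving complete scene reconstruction in a single feed-forward pass and supporting both object-level and scene-level tasks.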

Comments Project page: https://ruili3.github.io/lari

详情
Journal ref
ICML 2026
AI中文摘要

我们提出了分层射线交点(LaRI),一种用于从单张图像进行遮挡几何推理的全监督方法。与仅限于可见表面的传统深度估计不同,LaRI使用分层点图预测相机射线相交的多个表面。与现有利用神经隐式表示或迭代优化的方法相比,LaRI在一次前馈传递中完成完整的场景重建,实现了高效且视图对齐的几何推理,以支持物体级和场景级任务。我们进一步提出预测射线停止索引,该索引从LaRI的输出中识别有效的相交像素和层。为了更好地支持和评估这一任务,我们使用渲染引擎构建了一个注释流水线,为五个公共数据集(包括覆盖3D物体和场景的合成数据和真实世界数据)构建了注释。作为一种通用方法,LaRI的性能在物体级和场景级重建任务中得到了验证。

英文摘要

We present Layered Ray Intersections (LaRI), a fully supervised method for occluded geometry reasoning from a single image. Unlike conventional depth estimation, which is limited to visible surfaces, LaRI predicts multiple surfaces intersected by the camera rays using layered point maps. Compared with existing approaches that rely on neural implicit representations or iterative refinement, LaRI achieves complete scene reconstruction in one feed-forward pass, enabling efficient and view-aligned geometric reasoning that underpins both object-level and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. To better support and evaluate this task, we build an annotation pipeline using rendering engines and construct annotations for five public datasets, including synthetic and real-world data covering 3D objects and scenes. As a generic method, LaRI's performance is validated on object-level and scene-level reconstruction tasks.
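
To make the role of the ray stopping index concrete, here is a minimal sketch of how a layered point map could be filtered into a point cloud. The data layout and the meaning of the index (number of valid layers per pixel, with 0 for rays that hit nothing) are assumptions for illustration; the abstract does not specify them.

```python
def valid_intersections(layered_points, stop_index):
    """Collect the 3D points that the stopping index marks as valid.

    layered_points[r][c] is a list of L points (x, y, z), ordered along the ray.
    stop_index[r][c] is assumed to be the number of valid layers for that
    pixel, with 0 meaning the ray hits nothing.
    """
    points = []
    for row_points, row_stops in zip(layered_points, stop_index):
        for layers, stop in zip(row_points, row_stops):
            points.extend(layers[:stop])
    return points

# Hypothetical 1 x 3 image with up to 3 layers per ray.
layered_points = [[
    [(0, 0, 1.0), (0, 0, 2.0), (0, 0, 3.0)],   # ray crosses three surfaces
    [(1, 0, 1.5), (1, 0, 0.0), (1, 0, 0.0)],   # only the first layer is real
    [(2, 0, 0.0), (2, 0, 0.0), (2, 0, 0.0)],   # background ray
]]
stop_index = [[3, 1, 0]]
cloud = valid_intersections(layered_points, stop_index)
```

A conventional depth map corresponds to keeping only the first layer of each ray; the extra layers are what carry the occluded geometry.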

2507.13595 2026-06-10 cs.CV 版本更新

NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision

NoiseSDF2NoiseSDF: 从含噪监督中学习干净的神经场

Tengkai Wang, Weihao Li, Ruikai Cui, Shi Qiu, Nick Barnes

发表机构 * University of Cambridge(剑桥大学)

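AI summary Proposes NoiseSDF2NoiseSDF, which learns clean neural SDFs from noisy point clouds by minimizing the MSE loss between noisy SDF representations, implicitly denoising and refining the surface.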

Comments 16 pages, 7 figures

详情
AI中文摘要

从点云重建准确的隐式表面表示仍然是一项具有挑战性的任务,特别是当数据使用低质量扫描设备捕获时。这些点云通常包含大量噪声,导致表面重建不准确。受2D图像中Noise2Noise范式的启发,我们引入了NoiseSDF2NoiseSDF,一种旨在将此概念扩展到3D神经场的新方法。我们的方法通过最小化含噪SDF表示之间的MSE损失,从含噪点云中通过含噪监督学习干净的神经SDF,使网络能够隐式去噪并细化表面估计。我们在ShapeNet、ABC、Famous和Real数据集等基准上评估了NoiseSDF2NoiseSDF的有效性。实验结果表明,我们的框架显著提高了从含噪输入重建的表面质量。

英文摘要

Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.
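
The Noise2Noise principle the method builds on can be illustrated with a toy example: minimizing MSE against many noisy targets drives the estimate to their mean, which approaches the clean value when the noise is zero-mean. The sketch below fits a single scalar SDF value by gradient descent; it is not the paper's network or training loss, and the noise level and learning rate are arbitrary choices.

```python
import random

rng = random.Random(0)

clean_sdf = 0.25  # hypothetical true signed distance at one query point
# Noisy observations of that distance, corrupted by zero-mean Gaussian noise.
noisy_targets = [clean_sdf + rng.gauss(0.0, 0.1) for _ in range(2000)]
target_mean = sum(noisy_targets) / len(noisy_targets)

estimate = 0.0
learning_rate = 0.05
for _ in range(200):
    # Gradient of the mean squared error with respect to the estimate.
    grad = sum(2.0 * (estimate - t) for t in noisy_targets) / len(noisy_targets)
    estimate -= learning_rate * grad

print(f"estimate {estimate:.4f}, clean value {clean_sdf}")
```

No clean target is ever shown to the optimizer, yet the estimate ends up close to the clean distance; the same argument is what lets noisy supervision stand in for clean supervision.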

2601.04776 2026-06-10 cs.CV 版本更新

Segmentation-Driven Monocular Shape from Polarization based on Physical Model

基于物理模型的分割驱动单目光学偏振形状恢复

Jinyu Zhang, Xu Ma, Weili Chen

发表机构 * Key Laboratory of Photoelectronic Imaging Technology and System of Ministry of Education of China, School of Optics and Photonics, Beijing Institute of Technology(中国教育部光电成像技术与系统重点实验室,光学与光子学学院,北京理工大学) National Key Laboratory of Scattering and Radiation, Beijing Institute of Environmental Features(散射与辐射国家重点实验室,北京环境特征研究院)

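AI summary Proposes a segmentation-driven monocular shape-from-polarization framework that segments convex sub-regions with polarization-aided adaptive region growing and adds a multi-scale fusion convexity prior constraint, effectively addressing azimuth ambiguity and improving reconstruction accuracy and geometric fidelity.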

Comments 23 pages, 10 figures, submitted to Elsevier Pattern Recognition

详情
AI中文摘要

单目光学偏振形状恢复(SfP)利用光偏振特性与表面几何之间的内在关系,从单视角偏振图像中恢复表面法线,为三维(3D)重建提供了一种紧凑且稳健的方法。尽管具有潜力,现有的单目SfP方法受到方位角歧义(偏振分析的固有限制)的影响,严重损害了重建的准确性和稳定性。本文提出了一种新颖的分割驱动单目SfP(SMSfP)框架,将全局形状恢复重新表述为在自适应分割的凸子区域上的一组局部重建。具体而言,提出了一种偏振辅助自适应区域生长(PARG)分割策略,将全局凸性假设分解为局部凸区域,有效抑制方位角歧义并保持表面连续性。此外,开发了一种多尺度融合凸性先验(MFCP)约束,以确保局部表面一致性并增强精细纹理和结构细节的恢复。在合成和真实世界数据集上的大量实验验证了所提出的方法,与现有的基于物理的单目SfP技术相比,在消歧准确性和几何保真度方面显示出显著改进。

英文摘要

Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.
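
The azimuth ambiguity mentioned above follows from the standard sinusoidal model of intensity behind a rotating linear polarizer, which depends on the azimuth only through twice the angle. The sketch below uses that textbook model with illustrative values for mean intensity and degree of linear polarization; it is not taken from the paper.

```python
import math

def intensity(polarizer_angle, azimuth, mean_intensity=1.0, dolp=0.3):
    """Intensity seen through a linear polarizer at `polarizer_angle`.

    `dolp` is the degree of linear polarization; `azimuth` is the phase
    angle associated with the surface normal's azimuth.
    """
    return mean_intensity * (1.0 + dolp * math.cos(2.0 * (polarizer_angle - azimuth)))

polarizer_angles = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]
azimuth = 0.6  # hypothetical azimuth, in radians

measured = [intensity(a, azimuth) for a in polarizer_angles]
flipped = [intensity(a, azimuth + math.pi) for a in polarizer_angles]

# Largest difference between the two sets of measurements.
max_difference = max(abs(m - f) for m, f in zip(measured, flipped))
print(f"max difference between azimuth and azimuth + pi: {max_difference:.2e}")
```

Since the measurements for an azimuth and for that azimuth plus pi coincide, polarization alone cannot choose between them, which is why an additional cue such as the local convexity prior is needed.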

2606.05399 2026-06-10 cs.CV 版本更新

UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching

UniPixie: 基于流匹配的统一概率三维物理学习

Qilin Huang, Quynh Anh Huynh, Long Le, Chen Wang, Chuhao Chen, Ryan Lucas, Eric Eaton, Lingjie Liu

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Southern University of Science and Technology(南方科技大学) MIT(麻省理工学院)

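AI summary Proposes the UniPixie framework, which uses flow matching to learn a mapping from a single visual input to a continuous, controllable distribution of material properties, generating diverse physical fields and reducing Young's modulus prediction error by over 50%.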

Comments Published at CVPR 2026 as a Highlight. Project page: https://unipixie.github.io/

详情
AI中文摘要

现有的前馈网络擅长从视觉外观预测单一物理属性集,但这种点估计范式从根本上无法捕捉现实世界固有的物理模糊性。我们通过将物理预测重构为学习可控、连续的材料属性分布任务来解决这一问题。我们引入UNIPIXIE框架,该框架训练用于从单张视觉输入预测一条连续且参数化的物理合理材料属性路径。通过在我们的PIXIEMULTIVERSE数据集上学习沿物体从最软到最硬谱的直接映射,UNIPIXIE允许通过单个直观参数可控地生成多样、物理有效的材料场。关键的是,UNIPIXIE引入了一种新颖的统一架构,为多种物理求解器生成可模拟的参数,包括基于连续介质的物质点法(MPM)、基于线性混合蒙皮(LBS)的降阶变形以及基于锚点的弹簧-质量系统,解决了先前工作中的关键可移植性问题。实验表明,我们的方法不仅生成丰富多样的合理动力学,而且相比最强的确定性基线,将杨氏模量预测误差降低了50%以上,弥合了静态点估计与物理现实连续性之间的差距。项目页面:https://unipixie.github.io/

英文摘要

Existing feed-forward networks excel at predicting a single set of physical properties from visual appearance, but this point-estimate paradigm fundamentally fails to capture the real world's inherent physical ambiguity. We address this by reframing physics prediction as a task of learning a controllable, continuous distribution of material properties. We introduce UNIPIXIE, a framework trained to predict a continuous and parameterized path of physically plausible material properties from a single visual input. By learning a direct mapping along an object's softest-to-stiffest spectrum on our PIXIEMULTIVERSE dataset, UNIPIXIE allows for controllable generation of diverse, physically valid material fields via a single intuitive parameter. Crucially, UNIPIXIE introduces a novel unified architecture to produce simulation-ready parameters for diverse physics solvers, including continuum-based Material Point Method (MPM), reduced-order deformation based on Linear Blend Skinning (LBS), and anchor-based Spring-Mass systems, addressing a key portability issue in prior work. Experiments show our approach not only generates a rich variety of plausible dynamics but also reduces Young's Modulus prediction error by over 50% against the strongest deterministic baseline, bridging the gap between static point estimates and the continuous nature of physical reality. Project page: https://unipixie.github.io/

2601.06997 2026-06-10 cs.RO cs.CV 版本更新

ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction

ObjSplat: 几何感知的高斯面元用于主动物体重建

Yuetao Li, Zhizhou Jia, Yu Zhang, Qun Hao, Shaohui Zhang

发表机构 * School of Optics and Photonics, Beijing Institute of Technology(光学与光子学学院,北京理工大学) School of Optoelectronic Engineering, Changchun University of Science and Technology(光电工程学院,长春理工大学)

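AI summary Proposes the ObjSplat framework, which uses Gaussian surfels as a unified representation and combines geometry-aware viewpoint evaluation with a next-best-path planner for efficient, high-fidelity active object reconstruction.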

Comments Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat , Project Page: https://li-yuetao.github.io/ObjSplat-page/

详情
AI中文摘要

自主高保真物体重建是创建数字资产和弥合机器人模拟与现实差距的基础。我们提出ObjSplat,一个主动重建框架,利用高斯面元作为统一表示,逐步重建未知物体,同时具有逼真的外观和准确的几何。针对传统基于不透明度或深度线索的局限性,我们引入了几何感知视点评估管线,明确建模背面可见性和遮挡感知的多视图共视性,即使在几何复杂的物体上也能可靠地识别未重建区域。此外,为了克服贪婪规划策略的局限性,ObjSplat采用下一最佳路径(NBP)规划器,在动态构建的空间图上执行多步前瞻。通过联合优化信息增益和移动成本,该规划器生成全局高效的轨迹。在仿真和真实世界文化遗物上的大量实验表明,ObjSplat在几分钟内生成物理一致的模型,与最先进方法相比,实现了卓越的重建保真度和表面完整性,同时显著减少了扫描时间和路径长度。项目页面:此https URL。

英文摘要

Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/ .
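
The contrast between greedy next-best-view selection and multi-step lookahead can be sketched on a small graph. The planner below exhaustively scores simple paths up to a fixed horizon by summed information gain minus weighted movement cost. The graph, the gain values, and the scoring rule are hypothetical; the abstract does not give the authors' actual objective or search procedure.

```python
def best_path(graph, gain, start, horizon, cost_weight):
    """Return (score, path) for the best simple path of up to `horizon` moves.

    graph: {node: {neighbor: movement_cost}}
    gain:  {node: information gain collected when the node is visited}
    """
    best = (float("-inf"), [start])

    def extend(path, score):
        nonlocal best
        if len(path) > 1 and score > best[0]:
            best = (score, list(path))
        if len(path) - 1 == horizon:
            return
        for neighbor, cost in graph[path[-1]].items():
            if neighbor in path:
                continue  # do not revisit a viewpoint
            extend(path + [neighbor], score + gain[neighbor] - cost_weight * cost)

    extend([start], 0.0)
    return best

# Hypothetical viewpoint graph: D is informative but cheap to reach only via B.
graph = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "D": 1.0},
    "C": {"A": 1.0, "D": 5.0},
    "D": {"B": 1.0, "C": 5.0},
}
gain = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 10.0}

greedy_score, greedy_path = best_path(graph, gain, "A", horizon=1, cost_weight=1.0)
lookahead_score, lookahead_path = best_path(graph, gain, "A", horizon=2, cost_weight=1.0)
print(greedy_path, lookahead_path)
```

With a horizon of one the planner moves to C, the locally best view, while a horizon of two prefers the route through B because it reaches the high-gain viewpoint D at lower total cost.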

7. Medical Imaging and Biological Vision 19 papers

2606.10021 2026-06-10 cs.CV 新提交

SpineReport: Automated 3D Quantification and Reporting of Lumbar Spine Degeneration on MRI

SpineReport: MRI上腰椎退变的自动化3D量化与报告

Nathan Molinier, Adrian A. Marth, Reto Sutter, Christoph Germann, Jacob A. Connolly, Mathieu Guay-Paquet, Nathan D. Schilaty, Kenneth A. Weber, Julien Cohen-Adad

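AI summary Proposes SpineReport, an open-source framework that uses robust anatomical segmentation to extract 3D morphological and signal features from lumbar spine MRI and generates subject-specific reports, reaching an AUC of 0.95 for central canal stenosis.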

Comments Submitted to Medical Image Analysis

详情
AI中文摘要

腰椎疾病是全球致残的主要原因,但MRI上退变的可靠量化仍具挑战。临床实践中,分析主要在二维(2D)中进行,因为手动三维(3D)评估耗时。然而,2D测量重复性有限,尤其当解剖结构不与成像平面对齐时。现有自动化方法通常局限于2D、依赖离散分级或缺乏鲁棒性和可解释性。我们介绍SpineReport,一个用于腰椎MRI全面3D形态测量的开源全自动框架。利用鲁棒解剖分割,该方法从关键结构中提取定量指标,包括椎管、脊髓、椎骨、椎间盘和椎间孔。这些指标包括形态和信号特征,支持跨受试者和纵向评估。SpineReport进一步生成个体化报告,允许与队列分布比较,提高脊柱形态的可解释性和客观表征。临床相关性根据放射科医生报告的中央管、侧隐窝和椎间孔狭窄严重程度分级进行评估。指标与中央管狭窄严重程度强相关,T2加权脑脊液信号表现最佳(AUC = 0.95)。椎管前后径和面积比也显示出强相关性和高区分能力(AUC > 0.80)。对于侧隐窝狭窄,相关性中等,侧方脑脊液信号最具信息量(AUC = 0.73)。尽管感兴趣区域提取鲁棒,但未观察到与椎间孔狭窄的显著关联。SpineReport作为开放获取工具发布:此https URL

英文摘要

Lumbar spine conditions are a leading cause of disability worldwide, yet reliable quantification of degeneration from MRI remains challenging. In clinical practice, analysis is predominantly performed in two dimensions (2D), as manual three-dimensional (3D) assessment is time-consuming. However, 2D measurements suffer from limited reproducibility, particularly when anatomical structures are not aligned with the imaging plane. Existing automated approaches are often restricted to 2D, rely on discrete grading, or lack robustness and interpretability. We introduce SpineReport, an open-source, fully automated framework for comprehensive 3D morphometric analysis of lumbar spine MRI. Leveraging robust anatomical segmentations, the method extracts quantitative metrics from key structures, including the spinal canal, spinal cord, vertebrae, intervertebral discs, and foramina. These include both morphological and signal-based features, enabling cross-subject and longitudinal assessment. SpineReport further generates subject-specific reports that allow comparison with cohort distributions, improving interpretability and objective characterization of spinal morphology. Clinical relevance was evaluated against radiologist-reported severity grades for central canal, lateral recess, and foraminal stenosis. Metrics showed strong associations with central canal stenosis severity, with T2-weighted CSF signal providing the highest performance (AUC = 0.95). Canal AP diameter and area ratios also demonstrated strong correlations and high discriminative ability (AUC > 0.80). For lateral recess stenosis, associations were moderate, with lateral CSF signal being the most informative (AUC = 0.73). No significant associations were observed for foraminal stenosis despite robust region-of-interest extraction. SpineReport is released as an open-access tool: https://ivadomed.github.io/SpineReport/
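
For readers less familiar with the AUC values reported above, the metric can be computed directly as the fraction of (stenosis, no stenosis) pairs that a measurement ranks correctly. The sketch below uses made-up scores that are assumed to be oriented so that higher means more severe; it does not reproduce any data from the study.

```python
def auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical metric values for subjects with and without stenosis.
with_stenosis = [0.9, 0.8, 0.4]
without_stenosis = [0.7, 0.3, 0.2, 0.1]

value = auc(with_stenosis, without_stenosis)
print(f"AUC = {value:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported 0.95 and 0.73 in context.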

2606.10088 2026-06-10 cs.CV 新提交

Interpretable Temporal Facial-Region Motion Analysis for In-the-Wild Parkinson's Disease Video Classification

可解释的时序面部区域运动分析用于野外帕金森病视频分类

Riyadh Almushrafy

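AI summary Proposes temporal motion descriptors based on facial-region keypoints for lightweight, interpretable PD video classification on the YouTubePD benchmark, reaching a balanced accuracy of 0.826.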

Comments 22 pages, 6 figures. Submitted to Biomedical Signal Processing and Control

详情
AI中文摘要

面部表情减少是帕金森病(PD)常见的运动表现,通常描述为面部运动减退或面部运动迟缓。本文研究从面部区域关键点提取的时序运动描述符是否能够支持野外PD相关视频分类,并在YouTubePD基准上进行评估。每个视频使用来自14个预定义面部区域的几何描述符表示。在相同的二分类协议下,比较了静态几何、归一化几何、基于速度的描述符、相对速度描述符以及GRU序列基线。为了评估稳定性和可解释性,研究包括种子鲁棒性分析、区域级消融和排列重要性。最佳结果使用归一化速度描述符和随机森林分类器获得,在保留测试集上达到平衡准确率0.826和AUROC 0.855。在10个随机种子下,该表示保持稳定,平衡准确率为0.810 ± 0.018,AUROC为0.855 ± 0.005。总体而言,结果表明归一化的面部区域运动是YouTubePD视频分类的一种轻量级且可解释的表示。该研究作为基准级分析,不声称临床严重程度评估或MDS-UPDRS面部表情评分。

英文摘要

Reduced facial expressivity is a common motor manifestation of Parkinson's disease (PD), often described as hypomimia or facial bradykinesia. This paper examines whether temporal motion descriptors extracted from facial-region keypoints can support in-the-wild PD-related video classification on the YouTubePD benchmark. Each video is represented using geometric descriptors from 14 predefined facial regions. Static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU sequence baseline are compared under the same binary classification protocol. To assess stability and interpretability, the study includes seed-robustness analysis, region-level ablation, and permutation importance. The best result is obtained with normalized velocity descriptors and a Random Forest classifier, reaching a balanced accuracy of 0.826 and an AUROC of 0.855 on the held-out test split. Across 10 random seeds, this representation remains stable, with balanced accuracy of 0.810 +/- 0.018 and AUROC of 0.855 +/- 0.005. Overall, the results suggest that normalized facial-region motion is a lightweight and interpretable representation for YouTubePD video classification. The study is framed as a benchmark-level analysis and does not claim clinical severity assessment or MDS-UPDRS facial-expression scoring.
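
Balanced accuracy, the headline metric here, is the mean of per-class recall, which makes it more informative than plain accuracy when classes are imbalanced. A minimal sketch with made-up labels (not data from the benchmark) shows the difference.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced test split: 2 positive videos, 8 negative videos.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"accuracy {plain_accuracy:.2f}, balanced accuracy {balanced:.2f}")
```

The always-negative classifier looks strong on plain accuracy but scores only 0.5 on balanced accuracy, the same as chance for a binary task.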

2606.10115 2026-06-10 cs.CV 新提交

Improving PET/CT-Based Whole-Body Lesion Segmentation Using Prediction Uncertainty-Augmented Models

利用预测不确定性增强模型改进PET/CT全身病灶分割

Bashirul Azam Biswas, Biratal Raj Wagle, Zhihan Yang, Marc A. Seltzer, Matthew E. Maeder, James B. Yu, Indrani Bhattacharya

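AI summary Proposes an uncertainty-aware framework that combines Bayesian ensembling, voxel-wise uncertainty quantification, and uncertainty-augmented training to improve the robustness and lesion detection of whole-body PET/CT lesion segmentation, validated on the AutoPET-III and Deep-PSMA datasets.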

Comments 32 pages, 10 figures, 5 tables

详情
AI中文摘要

准确的全身正电子发射断层扫描(PET)/计算机断层扫描(CT)病灶分割对于癌症分期和治疗计划至关重要。PET提供不同放射性示踪剂的功能代谢信息,而CT提供解剖定位。由于细微的影像特征、混杂因素和读者间差异,从PET/CT影像中勾画病灶在临床上具有挑战性。现有的深度学习方法存在训练随机性、预测不一致、高肿瘤负荷病例中病灶遗漏以及缺乏不确定性量化等问题,限制了其临床可靠性。以nnU-Net为基线,我们提出了一种用于全身PET/CT病灶分割的不确定性感知框架,该框架整合了(1)贝叶斯集成以减少训练随机性,(2)具有认知和偶然分解的体素级不确定性量化,以及(3)认知不确定性增强训练以提高病灶检测。使用两个公开数据集AutoPET-III(1,611次扫描)和Deep-PSMA(200次扫描),包含多种癌症类型的FDG和PSMA研究,进行训练和评估。在未见过的AutoPET-III测试集上,贝叶斯集成相比确定性nnU-Net模型提高了鲁棒性和性能。不确定性图突出了模型不一致的区域,并与错误分类(尤其是假阳性)相关。不确定性增强训练以增加假阳性体积为代价提高了病灶恢复,反映了精确率-召回率的权衡。一种病例自适应路由策略通过在基础模型和增强模型之间进行选择,进一步提高了Dice系数。据我们所知,这是第一项在多示踪剂、泛癌种PET/CT分割中系统研究不确定性量化,并将贝叶斯集成与不确定性感知建模相结合的工作。

英文摘要

Accurate lesion segmentation from whole-body Positron Emission Tomography (PET)/Computed Tomography (CT) scans is essential for cancer staging and treatment planning. PET provides functional metabolic information with different radiotracers, while CT offers anatomical localization. Lesion delineation from PET/CT imaging is clinically challenging due to subtle imaging features, confounders, and inter-reader variability. Existing deep learning approaches suffer from training-related stochasticity, inconsistent predictions, and missed lesions in high tumor-burden cases, and they lack uncertainty quantification, which limits their clinical reliability. Using nnU-Net as a baseline, we propose an uncertainty-aware framework for whole-body PET/CT lesion segmentation that integrates (1) Bayesian ensembling to reduce training stochasticity, (2) voxel-wise uncertainty quantification with epistemic and aleatoric decomposition, and (3) epistemic uncertainty-augmented training to improve lesion detection. Two public datasets, AutoPET-III (1,611 scans) and Deep-PSMA (200 scans), comprising FDG and PSMA studies across multiple cancer types, are used for training and evaluation. Bayesian ensembling improves robustness and performance over deterministic nnU-Net models on the unseen AutoPET-III test set. Uncertainty maps highlight regions of model disagreement and correlate with misclassifications, particularly false positives. Uncertainty-augmented training improves lesion recovery at the cost of increased false-positive volume (FPVol), reflecting a precision-recall trade-off. A case-adaptive routing strategy further improves Dice by selecting between the base and augmented models. To our knowledge, this is the first study to systematically investigate uncertainty quantification in multi-tracer, pan-cancer PET/CT segmentation and to combine Bayesian ensembling with uncertainty-aware modeling for this task.
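
One common way to split ensemble uncertainty into epistemic and aleatoric parts is entropy-based: total uncertainty is the entropy of the mean prediction, aleatoric uncertainty is the mean of the members' entropies, and the remainder is epistemic. The abstract does not say which decomposition the authors use, so the sketch below is an assumption, shown for a single voxel with hypothetical foreground probabilities.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def decompose(member_probs):
    """Return (total, aleatoric, epistemic) uncertainty for one voxel."""
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    return total, aleatoric, total - aleatoric

# Members agree that the voxel is ambiguous: uncertainty is all aleatoric.
agree = decompose([0.5, 0.5, 0.5])
# Members are individually confident but contradict each other: epistemic.
disagree = decompose([0.01, 0.99])
print(agree, disagree)
```

The second case is the kind of model disagreement that the uncertainty maps in the abstract are described as highlighting.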

2606.10372 2026-06-10 cs.CV 新提交

ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment

ClinReadNet: 一种受临床阅读启发的低剂量腹部CT图像质量评估网络

Xianye Xiao, Yulong Zou, Yujie Luo, Taihui Yu, Cun-Jing Zheng, Yuan-ming Geng, Shuihua Wang, Yudong Zhang, Jin Hong

发表机构 * School of Mathematics and Computer Sciences, Nanchang University(南昌大学数学与计算机科学学院) School of Information Engineering, Nanchang University(南昌大学信息工程学院) Department of Radiology, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University(中山纪念医院放射科,中山大学) Department of Stomatology, Zhujiang Hospital, Southern Medical University(南方医科大学珠江医院口腔科) Department of Biological Sciences, School of Science, Xi'an Jiaotong-Liverpool University(西交利物浦大学科学学院生物科学系) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院)

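AI summary Proposes the ClinReadNet framework, which mimics radiologists' reading habits by combining a Sobel ordinal quality network with a window multi-scale temperature multi-head self-attention module and a hierarchical ranked probability score loss, achieving state-of-the-art performance on the LDCTIQAG2023 dataset.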

详情
AI中文摘要

在腹部CT成像中,开发一种模拟医生阅读习惯的低剂量无参考图像质量评估(No-reference IQA)模型具有重要的实际价值。本文提出了一种新颖的基于深度学习的框架ClinReadNet,其设计与放射科医生的临床阅读逻辑一致:首先,引入Sobel序数质量网络(SOQN)模块,该模块能同时关注与图像质量高度相关的边缘细节和整个图像的质量分布模式,准确匹配“兼顾局部细节与整体上下文”的临床阅片判断习惯;其次,该框架集成了(移位)窗口多尺度温度多头自注意力((S)W-MTMSA)模块,进一步复制了放射科医生从整体扫描到局部聚焦的阅片过程,并通过多锐度注意力精确锁定感兴趣区域;第三,设计了分层排序概率分数(HRPS)损失函数,该函数结合了粗分类和细分类的双重逻辑,同时关注分级标签之间的距离信息,有效提升了图像质量评估的性能。在LDCTIQAG2023数据集上进行的实验表明,所提方法达到了当前最先进(SOTA)性能:皮尔逊线性相关系数(PLCC)、斯皮尔曼秩相关系数(SROCC)和肯德尔秩相关系数(KROCC)的值分别达到0.9507、0.9554和0.8629,其绝对值之和(Score)为2.7690,优于现有方法。

英文摘要

In abdominal CT imaging, developing a low-dose, no-reference image quality assessment (No-reference IQA) model that mimics doctors' reading habits for evaluating CT image quality has significant practical value. This paper proposes a novel deep learning-based framework, ClinReadNet, whose design aligns with the clinical reading logic of radiologists: first, it introduces the Sobel ordinal quality network (SOQN) module, which can simultaneously focus on edge details highly relevant to image quality and the quality distribution pattern of the entire image, accurately matching the clinical image-reading judgment habit of "considering both local details and overall context"; second, the framework integrates the (shifted) window multi-scale temperature multi-head self-attention ((S)W-MTMSA) module, which further replicates the radiologists' image-reading process of shifting from overall scanning to local focusing, and accurately locks in regions of interest through multi-sharpness attention; third, it designs the hierarchical ranked probability score (HRPS) loss function, which combines the dual logics of coarse classification and fine classification, while paying attention to the distance information between grading labels, effectively improving the performance of image quality assessment. Experiments conducted on the LDCTIQAG2023 dataset show that the proposed method achieves the current state-of-the-art (SOTA) performance: the values of Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC) reach 0.9507, 0.9554, and 0.8629 respectively, with the sum of their absolute values (Score) being 2.7690, outperforming existing methods.
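
Two quantities in this abstract are easy to make concrete: the aggregate Score, and the idea of a loss that accounts for the distance between grading labels. The sketch below checks the reported sum and computes the standard (non-hierarchical) ranked probability score for five hypothetical quality grades. The hierarchical variant proposed in the paper is not specified in the abstract and is not reproduced here.

```python
def ranked_probability_score(probs, true_class):
    """Standard RPS: squared distance between predicted and true CDFs."""
    cumulative_pred = 0.0
    rps = 0.0
    for k, p in enumerate(probs):
        cumulative_pred += p
        cumulative_true = 1.0 if k >= true_class else 0.0
        rps += (cumulative_pred - cumulative_true) ** 2
    return rps

# Two wrong predictions for an image whose true grade is 4 (of grades 0..4).
near_miss = ranked_probability_score([0.0, 0.0, 0.0, 1.0, 0.0], true_class=4)
far_miss = ranked_probability_score([1.0, 0.0, 0.0, 0.0, 0.0], true_class=4)

# Aggregate Score as defined in the abstract: |PLCC| + |SROCC| + |KROCC|.
score = abs(0.9507) + abs(0.9554) + abs(0.8629)
print(near_miss, far_miss, round(score, 4))
```

Unlike cross-entropy, which would penalize both wrong predictions equally, the ranked probability score grows with the ordinal distance from the true grade.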

2606.10378 2026-06-10 cs.CV 新提交

FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation

FSS-Net:用于颈动脉超声分割的频率-空间协同网络与小波注意力

Jiawei Liu, Zhijiang Wan, Junhua Hu, Rongli Zhang, Zhongbiao Xu, Yankun Cao, Yuan Chen, Jin Hong

发表机构 * Jiluan Academy, Nanchang University(际銮书院,南昌大学) School of Information Engineering, Nanchang University(信息工程学院,南昌大学) State Key Laboratory of Water Cycle and Water Security, China Institute of Water Resources and Hydropower Research(水循环与水安全国家重点实验室,中国水利水电科学研究院) Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong(诊断放射科,李嘉诚医学部,香港大学) Department of Radiotherapy, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Southern Medical University(放疗科,广东省人民医院,广东省医学科学院,南方医科大学) Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University(SDU-NTU人工智能研究联合中心(C-FAIR),山东大学) Department of Pediatrics, Shandong Provincial Hospital Affiliated to Shandong First Medical University(儿科,山东省立医院(附属山东第一医科大学))

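AI summary Proposes the Frequency-Spatial Synergy Network (FSS-Net), which integrates wavelet transform, multi-domain attention, and edge enhancement, achieving a Dice score of 96.46% on carotid ultrasound datasets while effectively segmenting the carotid artery and identifying plaque.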

详情
AI中文摘要

超声成像中颈动脉的精确分割对于中风风险评估至关重要。然而,散斑噪声、低对比度和模糊边界仍然是主要挑战。在本文中,我们提出了一种频率-空间协同网络(FSS-Net),以实现噪声鲁棒且高精度的颈动脉分割。该网络将小波变换、多域注意力和边缘增强集成到一个统一的编码器-解码器架构中。具体来说,设计了一个通道-空间-小波注意力(CSWA)模块,以抑制频率域中的噪声并净化语义特征。引入了一个小波增强瓶颈(WEB)模块,以高效捕获长距离全局依赖关系。此外,一个拉普拉斯引导的自适应边缘融合(LAEF)模块补偿高频细节并保持边界连续性。在颈动脉超声数据集上的大量实验表明,FSS-Net在低信噪比条件下达到了96.46%的Dice分数(DSC)和强鲁棒性,优于几种最先进的方法。该方法实现了超声成像中颈动脉的精确分割,有效识别颈动脉粥样硬化斑块,并通过其他任务(如乳腺癌分割)验证,表明其在超声图像中识别异常组织肿块具有良好的临床应用潜力。

英文摘要

Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates for lost high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low-SNR conditions, outperforming several state-of-the-art methods. The method accurately segments the carotid artery in ultrasound images, effectively identifies carotid atherosclerotic plaque, and is further validated on other tasks (such as breast cancer segmentation), suggesting good clinical potential for identifying abnormal tissue masses in ultrasound images.
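
The Dice score reported above measures the overlap between a predicted mask and the reference mask. A minimal sketch on flattened binary masks, using made-up pixels and not data from the paper:

```python
def dice(pred, target):
    """Dice coefficient between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 6-pixel masks that overlap on two pixels.
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]

value = dice(predicted, reference)
print(f"Dice = {value:.3f}")
```

Because the denominator counts foreground pixels only, Dice is insensitive to the large background area typical of ultrasound frames, which is one reason it is preferred over pixel accuracy for segmentation.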

2606.10735 2026-06-10 cs.CV physics.med-ph 新提交

Patient-Level Diagnosis of Acute Myeloid Leukemia via Deep Learning Analysis of Bone Marrow Smear

基于深度学习分析骨髓涂片的急性髓系白血病患者级诊断

Yuqi Ma, Tianyi Wang, Weihua Meng, Hongru Chen, Fajin Tao, Qunxian Lu, Lin An, Xiaodong Mo, Gen Yang

发表机构 * State Key Laboratory of Nuclear Physics and Technology, School of Physics, Peking University(北京大学核物理与核技术国家重点实验室,物理学院) Peking University People’s Hospital, Peking University Institute of Hematology, National Clinical Research Center for Hematologic Disease, Beijing Key Laboratory of Hematopoietic Stem Cell Transplantation(北京大学人民医院,北京大学血液病研究所,国家血液病临床医学研究中心,北京造血干细胞移植重点实验室) Shanghai Dishuo Beiken Biotechnology Co., Ltd.(上海迪朔生物科技有限公司)

AI总结 提出从细胞到患者的深度学习流程,通过YOLO检测细胞、EfficientNet-B0分类复合母细胞样细胞(CBLC),聚合细胞级预测为患者级CBLC比率,实现AML辅助诊断,在三个外部验证中心上的集成加权F1分别为0.9076、0.8696和0.9124。

Comments 4 figures

详情
AI中文摘要

骨髓涂片检查对于急性髓系白血病(AML)评估仍然重要,但手动单细胞解释劳动强度大,患者级诊断需要聚合大量细胞观察结果。我们提出了一种从细胞到患者的深度学习流程,用于从骨髓涂片图像进行AML辅助诊断。该研究包括来自六个匿名中心的258名患者,其中主要队列来自中心1-3的169名患者,外部验证队列来自中心4-6的89名患者。使用16类细胞注释词汇描述全局细胞组成,包括粒细胞、单核细胞、红系、淋巴、嗜酸性粒细胞和其他细胞。该模型不识别严格的AML母细胞或白血病母细胞,而是针对专家定义的复合类别——复合母细胞样细胞(CBLC),根据项目范围内的形态学标准,包括N、N1、M、M1、R、R1、J和J1。基于YOLO的固定分割模块检测细胞,预测轮廓通过轮廓IoU与专家多边形注释匹配,并生成标准化的单细胞裁剪。通过两阶段GT-to-YOLO和YOLO-to-YOLO策略训练EfficientNet-B0分类器,包括类别不平衡校正、中心-边界正则化和形态学辅助监督。将细胞级预测聚合为患者级CBLC比率,用于AML导向的诊断支持。该流程实现了稳定的内部验证并保持了外部泛化能力,在中心4、5和6上的集成加权F1分数分别为0.9076、0.8696和0.9124。

英文摘要

Bone marrow smear review remains important for acute myeloid leukemia (AML) assessment, but manual single-cell interpretation is labor-intensive and patient-level diagnosis requires aggregation of many cellular observations. We present a cell-to-patient deep learning pipeline for assisted AML diagnosis from bone marrow smear images. The study included 258 patients from six anonymized centers: a main cohort of 169 patients from Centers 1-3 and an external validation cohort of 89 patients from Centers 4-6. A 16-category cell annotation vocabulary was used to describe the global cellular composition, including granulocytic, monocytic, erythroid, lymphoid, eosinophilic, and other cells. Rather than identifying strict AML blasts or leukemic blasts, the model targets an expert-defined composite category termed Composite Blast-like Cells (CBLC), comprising N, N1, M, M1, R, R1, J, and J1 according to the project-wide morphological standard. A fixed YOLO-based segmentation module detected cells, predicted contours were matched to expert polygon annotations by contour IoU, and standardized single-cell crops were generated. An EfficientNet-B0 classifier was trained through a two-stage GT-to-YOLO and YOLO-to-YOLO strategy with class-imbalance correction, center-border regularization, and morphology-assisted supervision. Cell-level predictions were aggregated into patient-level CBLC ratios for AML-oriented diagnostic support. The pipeline achieved stable internal validation and maintained external generalization, with ensemble weighted F1-scores of 0.9076, 0.8696, and 0.9124 on Centers 4, 5, and 6, respectively.
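
The aggregation step in this abstract (cell-level predictions to a patient-level CBLC ratio) can be illustrated with a minimal sketch. The CBLC member labels come from the abstract; the function name, the example labels "L" and "E", and the plain count-based ratio are assumptions for illustration, and the abstract does not say how the ratio is turned into a diagnostic decision, so no threshold is shown.

```python
from collections import Counter

# CBLC member classes as listed in the abstract.
CBLC_CLASSES = {"N", "N1", "M", "M1", "R", "R1", "J", "J1"}


def patient_cblc_ratio(cell_labels):
    """Return the fraction of one patient's classified cells that fall in CBLC.

    cell_labels: list of predicted class names, one per detected cell.
    This is a hypothetical helper, not the authors' implementation.
    """
    if not cell_labels:
        raise ValueError("patient has no classified cells")
    counts = Counter(cell_labels)
    cblc = sum(n for label, n in counts.items() if label in CBLC_CLASSES)
    return cblc / len(cell_labels)


# Made-up predictions for one patient; "L" and "E" stand in for non-CBLC classes.
example_labels = ["N", "N1", "L", "E", "M", "L", "L", "R1", "E", "L"]
ratio = patient_cblc_ratio(example_labels)
print(f"CBLC ratio: {ratio:.2f}")  # CBLC ratio: 0.40
```

Counting over predicted labels keeps the sketch independent of the detector and classifier; a probability-weighted ratio would be an equally plausible reading of the abstract.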

2606.10756 2026-06-10 cs.CV physics.med-ph 新提交

DD-INR: Dynamics-Driven Implicit Neural Representation for Accelerated Whole-Brain Functional MRI Reconstruction

DD-INR: 用于加速全脑功能磁共振成像重建的动力学驱动隐式神经表示

Qiaoxin Li, Caini Pan, Pierre-Antoine Comby, Chaithya Giliyar, Philippe Ciuciu

发表机构 * MIND, Inria, Palaiseau, France(MIND、Inria、法国帕莱索) Neurospin, CEA Paris Saclay, France(Neurospin、CEA巴黎萨克雷、法国) CEA NeuroSpin, Paris-Saclay University, CNRS BAOBAB, Gif-sur-Yvette, France(CEA NeuroSpin、巴黎萨克雷大学、CNRS BAOBAB、法国伊维特河畔吉夫)

AI总结 提出DD-INR框架,通过将fMRI数据分解为静态背景和动态成分并用隐式神经表示建模动态,实现加速fMRI重建,提升图像质量和激活模式恢复。

详情
Journal ref
MICCAI 2026 - 29th International Conference on Medical Image Computing and Computer Assisted Intervention, Sep 2026, Strasbourg, France
AI中文摘要

fMRI的加速采集能够增强对大脑神经血管(BOLD)活动的检测,但高k空间欠采样使图像重建变得具有挑战性:任务诱发的BOLD信号幅度较小,传统的解剖MRI重建方法倾向于空间准确性而非时间保真度,因此无法恢复这些信号。我们提出了DD-INR,一个专为加速fMRI设计的动力学驱动隐式神经表示框架,它利用非相干时变采样和定制的时空先验,在模拟和体内采集中均优于传统方法,无论是在图像质量还是激活模式恢复方面。DD-INR通过将fMRI数据分解为静态背景和时变动态成分,仅用专门的INR表示动态部分,从而将模型能力集中在与激活相关的变化上,同时保持紧凑。总的来说,DD-INR为加速fMRI重建提供了一个有前景的框架,有潜力在实际扫描时间限制内提高fMRI研究的灵敏度和鲁棒性。源代码可在 https://github.com/JoosenLi/DD-INR 获取。

英文摘要

Accelerated fMRI acquisition improves detection of neurovascular (BOLD) activity in the brain, but image reconstruction becomes challenging under high k-space undersampling: task-evoked BOLD signals are small in magnitude, and traditional anatomical MRI reconstruction methods fail to recover them because they favor spatial accuracy over temporal fidelity. We present DD-INR, a Dynamics-Driven Implicit Neural Representation framework tailored for accelerated fMRI that benefits from incoherent time-varying sampling and a tailored spatiotemporal prior. In both simulation and in-vivo acquisitions, it outperforms traditional methods in image quality and in the retrieval of activation patterns. DD-INR achieves this by splitting the fMRI data into a static background and a temporally varying dynamic component, representing only the dynamics with a dedicated INR, thereby focusing the model's capacity on activation-relevant changes while remaining compact. Overall, DD-INR provides a promising framework for accelerated fMRI reconstruction, with the potential to improve the sensitivity and robustness of fMRI studies within practical scan time limits. The source code is available at https://github.com/JoosenLi/DD-INR.
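
The static-plus-dynamic split named in this abstract can be pictured with a toy sketch. It assumes fully sampled image frames and uses the temporal mean as the static background, which is a stand-in chosen for illustration. The paper fits the dynamic part with an INR from undersampled k-space data, and that is not reproduced here. The function and variable names are hypothetical.

```python
def split_static_dynamic(frames):
    """Split a time series of flattened images into a static part and residual dynamics.

    frames: list of T frames, each a list of V voxel intensities.
    Returns (static, dynamics): static has length V, dynamics has shape T x V.
    Assumption: the static background is taken as the per-voxel temporal mean.
    """
    n_frames = len(frames)
    n_voxels = len(frames[0])
    static = [sum(f[v] for f in frames) / n_frames for v in range(n_voxels)]
    dynamics = [[f[v] - static[v] for v in range(n_voxels)] for f in frames]
    return static, dynamics


# Toy series: voxel 0 is constant, voxel 1 carries a small task-like fluctuation.
frames = [[100.0, 50.0], [100.0, 52.0], [100.0, 50.0], [100.0, 52.0]]
static, dynamics = split_static_dynamic(frames)
print(static)    # [100.0, 51.0]
print(dynamics)  # [[0.0, -1.0], [0.0, 1.0], [0.0, -1.0], [0.0, 1.0]]
```

The residual is tiny next to the static intensity, which is the reason the abstract gives for spending model capacity on the dynamics alone.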

2606.10778 2026-06-10 cs.CV 新提交

From Patches to Patients: A study of the tile-to-slide performance transferability in Digital Pathology

从图块到患者:数字病理学中图块级到切片级性能可迁移性的研究

Sofiène Boutaj, Leo Fillioux, Maria Vakalopoulou, Stergios Christodoulidis, Pierre Marza

发表机构 * Université Paris-Saclay, CentraleSupélec, Gustave Roussy, INSERM, IHU PRISM, Cancer Data Science Unit(巴黎-萨克雷大学、中央理工-高等电力学院、古斯塔夫·鲁西研究所、法国国家健康与医学研究院、IHU PRISM、癌症数据科学单元) Université Paris-Saclay, CentraleSupélec, MICS Laboratory(巴黎-萨克雷大学、中央理工-高等电力学院、MICS实验室)

AI总结 研究图块级线性探测能否作为全切片级性能的可靠代理,通过19个基础模型在42个切片级和16个图块级任务上的基准测试,发现图块与切片性能高度相关,图块级基准测试可有效筛选候选模型。

Comments Accepted to MICCAI 2026

详情
AI中文摘要

基础模型最近通过为全切片图像分析提供稳健表示,重新定义了组织病理学中的最先进技术。然而,为特定临床队列选择最优基础模型目前需要多个预处理步骤,随后对每个模型进行计算昂贵的特征提取和训练多实例学习聚合器。在这项工作中,我们研究高效的图块级线性探测能否作为切片级性能的可靠代理,从而减少对每个候选编码器运行完整切片级管道的需求。我们在42个切片级和16个图块级任务上对19个最先进的基础模型进行基准测试,使用ABMIL和均值池化聚合器比较图块探测指标与切片级结果。我们观察到在不同任务难度下,图块与切片性能之间存在高度相关性,表明编码器表示质量是WSI成功的主要决定因素。敏感性分析显示,可迁移性在不同模型间稳定,且受队列规模和每张切片图块数量的影响大于平均任务难度。我们还测量了图块级和切片级任务中最佳表现模型的一致性,表明图块基准测试可靠地筛选出强候选模型。总体而言,我们的研究表明,图块级基准测试为缩小候选模型范围提供了高效且实用的第一步,而切片级评估对于临床任务的最终验证仍然必不可少。

英文摘要

Foundation Models (FMs) have recently redefined the state of the art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal FM for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run full slide-level pipelines for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile probing metrics against slide-level outcomes using ABMIL and Mean Pooling aggregations. We observe a high correlation between tile and slide performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of WSI success. Sensitivity analyses show that transferability is stable across models and is more influenced by cohort sizes and numbers of tiles per slide than by average task difficulty. We also measure the agreement in best-performing models between tile- and slide-level tasks, showing that tile benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.
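
One way to quantify the tile-to-slide agreement discussed in this abstract is a rank correlation over models plus a top-k overlap check. The sketch below is a hedged illustration: the abstract does not name the correlation coefficient used, so Spearman's rho is an assumption, and all scores are invented for five hypothetical encoders.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def top_k_overlap(x, y, k):
    """Fraction of the top-k models under x that are also top-k under y."""
    def top(v):
        return set(sorted(range(len(v)), key=lambda i: -v[i])[:k])
    return len(top(x) & top(y)) / k


# Invented scores for five hypothetical encoders (not results from the paper).
tile_scores = [0.81, 0.77, 0.90, 0.69, 0.85]
slide_scores = [0.74, 0.75, 0.88, 0.61, 0.80]
print(round(spearman(tile_scores, slide_scores), 3))  # 0.9
print(top_k_overlap(tile_scores, slide_scores, k=2))  # 1.0
```

A rank-based measure fits the shortlisting use case, because only the ordering of candidate encoders matters when deciding which ones go on to slide-level evaluation.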

2606.11001 2026-06-10 cs.CV 新提交

IPSM-Bench: A New Intermediate Phase Segmentation Benchmark in Microstructure Images of Zinc-Based Absorbable Biomaterials

IPSM-Bench:锌基可吸收生物材料显微图像中的新中间相分割基准

Jinglin Xu, Shangyan Zhao, Jiabo Wang, Xinghong Mu, Yulong Lei, Jiacheng Zhang, Hongbo Sun, Yageng Li

发表机构 * School of Artificial Intelligence, University of Science and Technology Beijing(北京科技大学人工智能学院) School of Advanced Materials Innovation, University of Science and Technology Beijing(北京科技大学前沿材料创新学院) China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd(中国电信人工智能技术(北京)有限公司) School of Materials Science and Engineering, University of Science and Technology Beijing(北京科技大学材料科学与工程学院) Institute of Materials Intelligent Technology, Liaoning Academy of Materials(辽宁材料实验室材料智能技术研究所) School of Big Data and Software Engineering, Chongqing University(重庆大学大数据与软件工程学院)

AI总结 针对锌合金中间相分割面临的数据稀缺、低对比度等挑战,构建最大高质量数据集IPSM-Bench,并提出空间上下文先验引导的SAM方法SCoP-SAM,实现最优分割性能。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

锌基合金是不可或缺的新兴可吸收金属生物材料,其宏观性能受微观结构特征控制。中间相——关键的微观结构成分——在调节机械和功能性能中起关键作用。然而,锌合金显微组织中的中间相分割面临严峻挑战:标注数据集稀缺、对比度低、小目标检测困难以及形态异质性。为此,我们构建了IPSM-Bench,这是用于锌合金中间相分割的最大高质量数据集。此外,我们提出了SCoP-SAM,一种新的空间上下文先验引导的SAM方法,利用中间相的梯度结构和灰度属性捕获空间上下文先验,并将其融入整个SAM编码-解码过程,从而提升分割性能。基于提出的IPSM-Bench,我们建立了中间相分割的新基准,以系统评估最先进方法并推动锌合金微观结构分析研究。在IPSM-Bench和额外的公共合金基准上的大量实验表明,我们的SCoP-SAM不仅在锌合金中间相分割上实现了最先进性能,而且对其他合金场景也具有显著的泛化能力。

英文摘要

Zinc-based alloys are indispensable emerging absorbable metallic biomaterials, and their macroscopic performance is governed by microstructural characteristics. Intermediate phases, which are key microstructural constituents, are pivotal in regulating mechanical and functional properties. However, intermediate phase segmentation in zinc alloy microstructures faces formidable challenges: scarce annotated datasets, low contrast, difficulty detecting small targets, and heterogeneous morphologies. To this end, we construct IPSM-Bench, the largest high-quality dataset for zinc-alloy intermediate phase segmentation. Furthermore, we propose SCoP-SAM, a new Spatial Context Prior-guided SAM method that leverages the gradient structure and grayscale properties of intermediate phases to capture spatial context priors and incorporates them into the entire SAM encoding-decoding process, improving segmentation performance. Based on the proposed IPSM-Bench, we establish a new benchmark for intermediate phase segmentation to systematically evaluate state-of-the-art (SOTA) methods and advance research on zinc alloy microstructure analysis. Extensive experiments on IPSM-Bench and additional public alloy benchmarks demonstrate that our SCoP-SAM not only achieves SOTA performance for zinc-alloy intermediate phase segmentation but also generalizes remarkably well to other alloy scenarios.

2606.11012 2026-06-10 cs.CV 新提交

An Uncertainty Estimation Framework for Dose Accumulation in Adaptive Radiotherapy: Application to CBCT-Guided Radiotherapy for Cervical Cancer

自适应放疗中剂量累积的不确定性估计框架:应用于宫颈癌CBCT引导放疗

Cedric Hemon, Delphine Lebret, Jean-Claude Nunes, Valentin Boussot, Karine Peignaux, Nathalie Mesgouez-Nebout, Chantal Hanzen, Antoine Simon, Anaïs Barateau, Renaud de Crevoisier, Caroline Lafond

发表机构 * Univ. Rennes, CLCC Eugène Marquis, INSERM, LTSI - UMR 1099(雷恩大学,尤金·马奎斯中心,法国国家健康与医学研究院,LTSI - UMR 1099) Department of Radiation Oncology, Centre Georges-Francois Leclerc(乔治-弗朗索瓦·勒克莱尔中心放射肿瘤科) Institut de Cancérologie de l’Ouest–Site Paul Papin(西部癌症研究所-保罗·帕潘院区) CLCC Henri Becquerel(亨利·贝克勒尔中心)

AI总结 提出IMPACT-DoseAcc框架,通过贝叶斯分割引导和集成分割模型两种策略量化DIR不确定性,并传播至累积剂量指标,应用于宫颈癌CBCT引导oART,验证了不确定性校准和几何误差相关性。

Comments Under revision

详情
AI中文摘要

背景与目的:oART能够每日根据分次间解剖变化调整计划,但累积剂量估计仍受限于DIR、分割和解剖不确定性。我们在IMPACT中引入IMPACT-DoseAcc,一个不确定性感知的剂量累积框架,用于语义特征驱动的图像分析。该框架具有模态和疾病无关性,并应用于宫颈癌(LACC)的CBCT引导oART。材料与方法:回顾性分析9例LACC患者,使用每日CBCT衍生的虚拟CT进行剂量重新计算。IMPACT-DoseAcc关注DIR引起的不确定性,不建模vCT生成的不确定性。在IMPACT-Reg中测试了两种DIR不确定性策略:一种贝叶斯分割引导方法,使用一个概率模型量化解剖不确定性;以及一个针对结构的分割模型集成,以捕获认知变异性。体素级不确定性图通过剂量变形和累积传播,生成概率剂量体积直方图。集成不确定性通过变形场上的体素级标准差量化,几何误差通过变形轮廓与验证轮廓之间的表面距离评估。解剖变异性加权优化了聚合。结果:集成DIR不确定性与几何误差相关,CTVt和膀胱的Pearson系数分别为0.63和0.66。对于CTVt,pDVH达到96.3±3.9%的覆盖率,显示传播不确定性的校准。加权稳定了各分次和器官的估计。结论:IMPACT-DoseAcc将配准驱动的不确定性传播至累积剂量指标,改进了解剖变化下累积剂量的解释。其3DSlicer集成支持可重复、不确定性知情的ART工作流程。

英文摘要

Background and purpose: Online adaptive radiotherapy (oART) enables daily plan adaptation to interfraction anatomical variations, but cumulative dose estimation remains limited by deformable image registration (DIR), segmentation, and anatomical uncertainties. We introduce IMPACT-DoseAcc, an uncertainty-aware dose accumulation framework, within IMPACT for semantic feature-driven image analysis. The framework is modality- and disease-agnostic and is applied to CBCT-guided oART for locally advanced cervical cancer (LACC). Materials and methods: Nine LACC patients were retrospectively analyzed using daily CBCT-derived virtual CTs (vCTs) for dose recalculation. IMPACT-DoseAcc focuses on uncertainty from DIR, without modeling vCT-generation uncertainty. Two DIR uncertainty strategies were tested within IMPACT-Reg: a Bayesian segmentation-guided approach using one probabilistic model to quantify anatomical uncertainty, and an ensemble of segmentation models targeting structures to capture epistemic variability. Voxel-wise uncertainty maps were propagated through dose warping and accumulation to generate probabilistic dose-volume histograms (pDVHs). Ensemble uncertainty was quantified from voxel-wise standard deviation across deformation fields, and geometric error was assessed using surface distance between warped and validated contours. Anatomical-variability weighting refined aggregation. Results: Ensemble DIR uncertainty correlated with geometric error, with Pearson coefficients of 0.63 for CTVt and 0.66 for bladder. For CTVt, pDVHs achieved 96.3 ± 3.9% coverage, showing calibration of propagated uncertainty. Weighting stabilized estimates across fractions and organs. Conclusions: IMPACT-DoseAcc propagates registration-driven uncertainty to cumulative dose metrics, improving interpretation of accumulated dose under anatomical variations. Its 3D Slicer integration supports reproducible, uncertainty-informed ART workflows.
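
The ensemble uncertainty described in this abstract, a voxel-wise standard deviation across deformation fields, can be sketched as follows. The abstract does not say how the three displacement components are combined, so summing the per-axis population variances before taking the square root is an assumption, as are the function name and the toy fields.

```python
import math


def voxelwise_std(fields):
    """Per-voxel spread of displacement vectors across an ensemble.

    fields: K deformation fields, each a list of (dx, dy, dz) tuples, one per voxel.
    Returns one scalar per voxel: sqrt of the summed per-axis population variances
    (an assumed way of collapsing the three axes into one uncertainty value).
    """
    k = len(fields)
    n_voxels = len(fields[0])
    out = []
    for v in range(n_voxels):
        total_var = 0.0
        for axis in range(3):
            vals = [f[v][axis] for f in fields]
            mean = sum(vals) / k
            total_var += sum((x - mean) ** 2 for x in vals) / k
        out.append(math.sqrt(total_var))
    return out


# Two hypothetical ensemble members over two voxels (displacements in mm).
field_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
field_b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(voxelwise_std([field_a, field_b]))  # [0.0, 1.0]
```

Voxels where the ensemble members agree get zero uncertainty, while disagreement of 2 mm along one axis yields a spread of 1 mm in this toy case.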

2606.11106 2026-06-10 cs.CV cs.AI 新提交

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

FADA: 可访问的胎儿超声解读与标注——基于选择性蒸馏的统一视觉-语言模型

Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) HMC(哈马德医疗公司) Advanced AlRazi Diagnostic Center(高级阿尔拉齐诊断中心) Sidra Medicine(锡德拉医学)

AI总结 提出统一视觉-语言模型FADA,通过选择性蒸馏从四个领域基础模型提取知识,实现胎儿超声的解读、分类、检测和分割,在单个消费级GPU上训练,无需外部标签,可在智能手机上离线运行。

详情
AI中文摘要

全球范围内受过训练的超声技师短缺限制了低收入和中等收入国家的产前超声筛查,这些国家超过一半的孕妇未接受专业超声检查。当前的深度学习方法分别处理检测、分割或分类,每个任务都需要单独的模型和推理时的专家指定标签。我们提出FADA,一个基于Qwen3.5-VL构建的统一视觉-语言模型,通过单一解读优先的流程执行临床解读、分类、检测和分割,无需外部标签。FADA通过离线预计算特征缓存,从四个领域基础模型(FetalCLIP、UltraSAM、USF-MAE、UltraFedFM)中蒸馏知识。选择性蒸馏仅对标注任务应用特征对齐,而解读任务依赖标准微调,在大多数评估指标上持续优于完全蒸馏。推荐变体FADA-SKD在分割上达到0.8820平均Dice,检测上达到0.7671 mAP@0.50,结构化解读合规性达到100%。专家超声技师对237张图像的验证确认了在自主和人机协同模式下输出临床可接受,其中73.5%的解读在临床医生指导下获得完美评分。该系统可在单个消费级GPU上训练,无需云连接即可部署。我们通过在商用智能手机(高通骁龙7 Gen 1,12 GB RAM)上使用llama.cpp和GGUF量化运行压缩的0.8B模型,验证了边缘部署,完全离线完成全部5阶段流程约需60秒。这为将AI辅助胎儿评估与便携式超声设备集成提供了实用途径,直接解决了资源受限环境中的诊断可及性差距。代码、模型和数据可在 https://github.com/mahmoodphd/FADA 获取。

英文摘要

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.
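
The selective distillation rule in this abstract (feature alignment only for annotation tasks, plain fine-tuning for interpretation) can be sketched as a gated loss. Several parts are assumptions made for illustration: treating detection and segmentation as the annotation tasks, using mean squared error as the alignment term, and the `weight` parameter. None of these details are confirmed by the abstract.

```python
# Assumption: "annotation tasks" means detection and segmentation.
ANNOTATION_TASKS = {"detection", "segmentation"}


def selective_loss(task, task_loss, student_feat, teacher_feat, weight=1.0):
    """Add a feature-alignment penalty only when the task is an annotation task.

    task_loss: the ordinary fine-tuning loss for this sample (a float).
    student_feat / teacher_feat: equal-length feature vectors; the teacher vector
    would come from a cached foundation-model feature in the described setup.
    The MSE alignment term is a placeholder choice, not the paper's stated loss.
    """
    if task not in ANNOTATION_TASKS:
        return task_loss  # interpretation-style tasks: standard fine-tuning only
    align = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))
    align /= len(student_feat)
    return task_loss + weight * align


student = [1.0, 2.0]
teacher = [0.0, 0.0]
print(selective_loss("interpretation", 0.5, student, teacher))  # 0.5
print(round(selective_loss("segmentation", 0.5, student, teacher, weight=0.1), 4))  # 0.75
```

Gating by task keeps teacher features from constraining the language-heavy interpretation output, which is one plausible reading of why the selective variant is reported to beat full distillation.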

2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG 交叉投稿

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

++nnU-Net: 基于前缀数据增强的nnU-Net扩展

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

发表机构 * Center Algoritmi / LASI, University of Minho, Braga, Portugal(葡萄牙布拉加,米尼奥大学,Algoritmi中心/LASI) Institute for Artificial Intelligence in Medicine, University Medicine Essen, Essen, Germany(德国埃森,埃森大学医学中心,医学人工智能研究所) Institute of Medical Informatics / Dept. of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Germany(德国亚琛工业大学附属医院,医学信息学研究所/口腔颌面外科) Faculty of Computer Science, University of Duisburg-Essen, Essen, Germany(德国埃森,杜伊斯堡-埃森大学,计算机科学学院)

AI总结 提出++nnU-Net,通过图像配准进行数据增强,在预处理和训练前生成变形图像,在5个2D数据集上提升Dice系数最高约22%。

Comments 7 pages, 1 figure, 2 tables

详情
AI中文摘要

nnU-Net在医学分割任务中持续展现出成功,这严重依赖于标注生物医学数据的可用性和多样性。然而,由于隐私法规和标注成本等因素,收集医学影像队列仍然具有挑战性。因此,数据增强在增加数据可用性的同时保持解剖学可行性方面起着关键作用。为此,我们提出了++nnU-Net,一种基于图像配准的新型数据增强模块,在预处理和训练之前运行。我们的框架在五个不同的2D数据集上进行了评估。在该工作流中,图像数据经过两阶段配准过程,生成新的变形图像。然后将变换应用于相应的分割。此外,该管道计算可用磁盘空间,生成补充的二进制合成掩码并生成检查点。我们证明++nnU-Net优于nnU-Net基线,在Dice相似系数得分上有所提升。在最显著的情况下,我们观察到性能提升约22%。这些发现强调了基于配准的数据增强的有效性,特别是对于2D医学影像数据集,并表明++nnU-Net为在数据有限的情况下提高分割性能提供了一种实用且可扩展的方法。++nnU-Net的源代码可在以下网址获取:https://github.com/sofia-adelie/plusplusnnunet.git

英文摘要

The nnU-Net has demonstrated continuous success in medical segmentation tasks, which rely heavily on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that runs before preprocessing and training. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentations. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks, and creates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git
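
The step "the transformations are then applied to the respective segmentation" can be illustrated with a small sketch that warps an image and its mask with one shared displacement field. The two-stage registration that would produce such a field is not reproduced, and the field used here is a hypothetical uniform shift. Nearest-neighbour sampling is an assumed choice that keeps mask labels discrete.

```python
def warp_nearest(grid, disp_rows, disp_cols, fill=0):
    """Backward-warp a 2D grid with a per-pixel displacement field.

    For each output pixel (r, c), the value is read from
    (r + disp_rows[r][c], c + disp_cols[r][c]) using nearest-neighbour rounding.
    Pixels whose source falls outside the grid receive `fill`.
    """
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            src_r = r + round(disp_rows[r][c])
            src_c = c + round(disp_cols[r][c])
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = grid[src_r][src_c]
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mask = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]

# Hypothetical field: every pixel samples one column to its left (content moves right).
zero = [[0] * 3 for _ in range(3)]
left = [[-1] * 3 for _ in range(3)]

warped_image = warp_nearest(image, zero, left)
warped_mask = warp_nearest(mask, zero, left)
print(warped_image)  # [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
print(warped_mask)   # [[0, 0, 1], [0, 0, 1], [0, 0, 0]]
```

Reusing the same field for both arrays is what keeps the augmented image and its label aligned; intensity images would normally use smoother interpolation than the mask.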

2505.23341 2026-06-10 cs.CV 版本更新

Dual-stream attention-guided learning for weakly supervised whole slide image classification

双流注意力引导学习用于弱监督全切片图像分类

Daoxi Cao, Hangbei Cheng, Yijin Li, Ruolin Zhou, Xuehan Zhang, Xinyi Li, Binwei Li, Xuancheng Gu, Jianan Zhang, Xueyu Liu, Yongfei Wu

发表机构 * College of Computer Science and Technology, College of Data Science, Taiyuan University of Technology(太原理工大学计算机科学与技术学院、数据科学学院) College of Humanities, Law and Foreign Languages, Taiyuan University of Technology(太原理工大学人文、法律与外语学院) College of Artificial Intelligence, Taiyuan University of Technology(太原理工大学人工智能学院) School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络空间安全学院) School of Mathematics, Taiyuan University of Technology(太原理工大学数学学院)

AI总结 提出双流注意力引导学习框架,通过师生双流架构和注意力引导伪标签,解决弱监督下全切片图像中关键区域识别和实例关系建模问题,在合成和真实病理数据集上优于现有方法。

详情
AI中文摘要

全切片图像(WSIs)因其超高分辨率和丰富的形态学信息在癌症诊断中发挥关键作用,多实例学习(MIL)已成为解决WSIs巨大尺寸和实例细粒度标注稀缺的主流范式。然而,现有大多数MIL方法难以仅使用切片级标签准确识别诊断关键局部区域(实例),并且在高效建模实例间关系方面存在不足。为解决这些问题,我们提出了一种双流注意力引导学习(DSAGL)框架。DSAGL通过师生双流架构桥接切片级监督和实例级学习,并通过生成注意力引导伪标签缓解实例歧义。该框架采用共享轻量级编码器高效建模长距离依赖,并利用基于注意力的融合机制增强对稀疏信息区域的敏感性。在合成基准和真实病理WSI数据集上的大量实验表明,DSAGL在弱监督下始终优于最先进的MIL方法,实现了卓越的判别性能和鲁棒性。

英文摘要

Whole slide images (WSIs) play a crucial role in cancer diagnosis due to their ultra-high resolution and rich morphological information, and multiple instance learning (MIL) has become a prevalent paradigm for coping with the massive size of WSIs and the scarcity of fine-grained instance-level annotations. However, most existing MIL methods struggle to accurately identify diagnostically critical local regions (instances) using only slide-level labels, and they fail to model the relationships among instances efficiently. To address these shortcomings, we propose a Dual-Stream Attention-Guided Learning (DSAGL) framework. DSAGL bridges slide-level supervision and instance-level learning through a teacher-student dual-stream architecture, and mitigates instance ambiguity by generating attention-guided pseudo labels. The framework employs a shared lightweight encoder to efficiently model long-range dependencies and an attention-based fusion mechanism to enhance sensitivity to sparse, informative regions. Extensive experiments on synthetic benchmarks and real-world pathological WSI datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL methods, achieving superior discriminative performance and robustness under weak supervision.
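
The idea of attention-guided pseudo labels from this abstract can be pictured with a minimal sketch: attention scores over the instances of a slide are normalised with a softmax, and the most attended instances of a positive slide are marked positive. The top-k rule, the function names, and the toy scores are assumptions. The teacher-student training loop of DSAGL is not reproduced.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pseudo_labels(scores, slide_label, top_k):
    """Turn per-instance attention scores into instance pseudo labels.

    Negative slide (label 0): every instance is labelled 0.
    Positive slide (label 1): the top_k most attended instances are labelled 1.
    The top-k rule is an illustrative assumption, not the paper's exact scheme.
    """
    weights = softmax(scores)
    if slide_label == 0:
        return weights, [0] * len(scores)
    top = set(sorted(range(len(weights)), key=lambda i: -weights[i])[:top_k])
    return weights, [1 if i in top else 0 for i in range(len(weights))]


# Raw attention scores for four hypothetical tiles of one positive slide.
weights, labels = attention_pseudo_labels([0.1, 2.0, -1.0, 1.5], slide_label=1, top_k=2)
print(labels)  # [0, 1, 0, 1]
```

Labelling all instances of a negative slide as 0 follows the standard MIL assumption that a negative bag contains no positive instances.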

2509.05913 2026-06-10 cs.CV 版本更新

A fine-grained attention and geometric correspondence model for musculoskeletal risk classification in athletes using multimodal visual and skeletal features

基于多模态视觉和骨骼特征的运动员肌肉骨骼风险分类的细粒度注意力与几何对应模型

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Tamanna Shermin, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam

发表机构 * Department of Computer Science and Engineering, United International University(联合国际大学计算机科学与工程系) Department of Data Science and Artificial Intelligence, Monash University(莫纳什大学数据科学与人工智能系) Faculty of Science and Technology, Charles Darwin University(查尔斯·达尔文大学科学与技术学院) Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory, Dhaka(应用人工智能与智能系统实验室,达卡)

AI总结 提出ViSK-GAT多模态框架,融合图像和骨骼坐标特征,通过细粒度注意力模块和几何对应模块实现运动员肌肉骨骼风险八级分类,关键指标超93%。

Comments Published in Computers and Electrical Engineering

详情
Journal ref
Computers and Electrical Engineering, Vol. 138, 111281, 2026
AI中文摘要

肌肉骨骼疾病对运动员构成重大风险,早期风险评估对于预防至关重要。然而,现有方法大多针对受控环境设计,由于依赖单一数据类型,无法在复杂环境中可靠地评估风险。本研究引入了ViSK-GAT(视觉-骨骼几何注意力变换器),一种新颖的多模态深度学习框架,利用视觉和基于骨骼坐标的特征对肌肉骨骼风险进行分类。通过结合图像和骨骼坐标创建了自定义多模态数据集(MusDis-Sports),每个样本根据快速全身评估(REBA)系统标记为八个风险类别。ViSK-GAT集成了两个创新模块:细粒度注意力模块(FGAM),在融合前通过自注意力细化模态内特征;以及多模态几何对应模块(MGCM),增强图像特征与坐标之间的跨模态对齐。该模型取得了稳健的性能,所有关键指标均超过93%。概率分布误差指标也显示出较低的均方根误差(RMSE)为0.1205和平均绝对误差(MAE)为0.0156。ViSK-GAT持续优于最先进的深度学习骨干网络,展示了其在推动人工智能驱动的肌肉骨骼风险评估和实现运动领域及时干预方面的潜力。

英文摘要

Musculoskeletal disorders pose significant risks to athletes, and early risk assessment is essential for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research introduces ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework that classifies musculoskeletal risk using both visual and skeletal coordinate-based features. A custom multimodal dataset (MusDis-Sports) was created by combining images and skeletal coordinates, with each sample labeled into eight risk categories based on the Rapid Entire Body Assessment (REBA) system. ViSK-GAT integrates two innovative modules: the Fine-Grained Attention Module (FGAM), which refines intra-modal features through self-attention before fusion, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal alignment between image features and coordinates. The model achieved robust performance, with all key metrics exceeding 93%. Probability distribution error metrics also showed a low Root Mean Squared Error (RMSE) of 0.1205 and a Mean Absolute Error (MAE) of 0.0156. ViSK-GAT consistently outperformed state-of-the-art (SOTA) deep learning backbones and showed its potential to advance artificial intelligence-driven musculoskeletal risk assessment and enable timely interventions in sports.

2602.01951 2026-06-10 cs.CV 版本更新

Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network

利用多尺度金字塔网络实现渐进式全切片图像分析

Shuyang Wu, Yifu Qiu, Ines P Nearchou, Sandrine Prost, Jonathan A Fallowfield, Hakan Bilen, Timothy J Kendall

发表机构 * Institute for Regeneration and Repair, University of Edinburgh(爱丁堡大学再生与修复研究所) School of Informatics, University of Edinburgh(爱丁堡大学信息学院) Indica Labs, 8700 Education Pl NW, Bldg. B Albuquerque, US(Indica Labs,美国阿尔伯克基) Medical School, University of St Andrews(圣安德鲁斯大学医学院)

AI总结 提出多尺度金字塔网络(MSPN),一种即插即用模块,仅使用单一高倍输入实现渐进式多尺度全切片图像分析,通过网格重映射和粗引导网络学习粗粒度上下文,在多个任务和框架上一致提升MIL性能。

详情
AI中文摘要

多实例学习(MIL)常用于计算病理学(CPath),其中多尺度特征对于捕捉精细细胞细节和广泛组织结构至关重要。然而,现有的多尺度MIL方法通常依赖于不灵活的多倍率输入或计算成本高昂的架构。随着预训练基础模型(FMs)成为特征提取的趋势并推动轻量级模型的发展,我们重新思考并探索更高效的多尺度MIL方法。在本文中,我们提出了多尺度金字塔网络(MSPN),一种用于基于注意力的MIL的即插即用模块。MSPN仅使用单一高倍输入实现渐进式多尺度全切片图像分析。它由(1)基于网格的重映射组成,该重映射聚合高倍特征以导出空间感知的粗粒度特征图,以及(2)粗引导网络(CGN),该网络学习粗粒度上下文。我们将MSPN作为附加模块在4个基于注意力的框架上,针对5个临床相关任务,使用2个基础模型和一个预训练的MIL框架进行基准测试。我们的结果表明,MSPN在比较的配置和任务上一致地提高了MIL性能,同时保持轻量且易于使用。

英文摘要

Multiple-instance Learning (MIL) is commonly used for computational pathology (CPath), where multi-scale features are essential for capturing both fine cellular details and broad tissue architecture. However, existing multi-scale MIL approaches typically rely on inflexible multi-magnification inputs or computationally expensive architectures. As pre-trained foundation models (FMs) become the standard for feature extraction and make lightweight models more effective, we rethink and explore a more efficient multi-scale MIL method. In this paper, we propose the Multi-scale Pyramidal Network (MSPN), a plug-and-play module for attention-based MIL. MSPN introduces progressive multi-scale whole-slide image analysis using only a single high-magnification input. It consists of (1) grid-based remapping that aggregates high-magnification features to derive spatially-aware coarse feature maps, and (2) the Coarse Guidance Network (CGN) that learns coarse contexts. We benchmark MSPN as an add-on module to 4 attention-based frameworks on 5 clinically relevant tasks with 2 foundation models and a pre-trained MIL framework. Our results demonstrate that MSPN consistently improves MIL across the compared configurations and tasks, while being lightweight and easy to use.
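
The grid-based remapping in item (1) of this abstract can be sketched as pooling tile features by their position on the tile grid. Mean pooling, the cell size, and the function name are assumptions made for illustration; the abstract only states that high-magnification features are aggregated into spatially aware coarse feature maps.

```python
from collections import defaultdict


def grid_remap(coords, feats, cell):
    """Average tile features inside each cell x cell block of the tile grid.

    coords: list of (row, col) tile indices at high magnification.
    feats: list of equal-length feature vectors, one per tile.
    Returns a dict mapping coarse (row, col) to the mean feature vector.
    Mean pooling is an assumed aggregation, chosen for simplicity.
    """
    buckets = defaultdict(list)
    for (row, col), f in zip(coords, feats):
        buckets[(row // cell, col // cell)].append(f)
    coarse = {}
    for key, group in buckets.items():
        dim = len(group[0])
        coarse[key] = [sum(f[d] for f in group) / len(group) for d in range(dim)]
    return coarse


# Four hypothetical tiles with 2-dimensional features.
coords = [(0, 0), (0, 1), (1, 0), (2, 3)]
feats = [[1.0, 0.0], [3.0, 0.0], [2.0, 6.0], [4.0, 4.0]]
coarse = grid_remap(coords, feats, cell=2)
print(coarse)  # {(0, 0): [2.0, 2.0], (1, 1): [4.0, 4.0]}
```

Because the coarse map is derived from features that were already extracted, no second pass over the slide at a lower magnification is needed, which matches the single-input claim in the abstract.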

2604.28095 2026-06-10 cs.CV 版本更新

UHR-Net: An Uncertainty-Aware Hypergraph Refinement Network for Medical Image Segmentation

UHR-Net:一种用于医学图像分割的不确定性感知超图精炼网络

Shuokun Cheng, Jinghao Shi, Kun Sun

发表机构 * School of Computer Sciences, China University of Geosciences (Wuhan)(中国地质大学(武汉)计算机科学学院)

AI总结 针对病灶边界模糊和小病灶分割困难,提出UHR-Net,采用不确定性导向实例对比预训练和不确定性引导超图精炼模块,在五个公开数据集上取得一致提升。

Comments 12 pages, 4 figures, 4 tables

详情
AI中文摘要

准确的病灶分割对于临床诊断和治疗规划至关重要。然而,病灶通常与周围组织相似且边界不清,导致边界/过渡区域的预测不稳定。此外,小病灶的线索可能被多尺度特征提取稀释,导致欠分割或过分割。为了解决这些挑战,我们提出了一种不确定性感知超图精炼网络(UHR-Net)。首先,我们引入了一种不确定性导向实例对比(UO-IC)预训练策略,该策略将几何感知的复制-粘贴增强与病灶样背景区域的难负样本挖掘相结合,以提高对小型和视觉模糊病灶的实例级判别能力。其次,我们设计了一个不确定性引导超图精炼(UGHR)模块,该模块从粗概率图中导出基于熵的不确定性图,以指导超图精炼。通过将超边原型分为前景和背景组,UGHR解耦了高阶交互并改善了模糊区域的精炼。在五个公开基准上的实验表明,与强基线相比取得了持续改进。代码可在以下网址获取:https://github.com/CUGfreshman/UHR-Net。

英文摘要

Accurate lesion segmentation is crucial for clinical diagnosis and treatment planning. However, lesions often resemble surrounding tissues and exhibit ill-defined boundaries, leading to unstable predictions in boundary/transition regions. Moreover, small-lesion cues can be diluted by multi-scale feature extraction, causing under- or over-segmentation. To address these challenges, we propose an Uncertainty-Aware Hypergraph Refinement Network (UHR-Net). First, we introduce an Uncertainty-Oriented Instance Contrastive (UO-IC) pretraining strategy that couples geometry-aware copy-paste augmentation with hard-negative mining of lesion-like background regions to improve instance-level discrimination for small and visually ambiguous lesions. Second, we design an Uncertainty-Guided Hypergraph Refinement (UGHR) block, which derives an entropy-based uncertainty map from a coarse probability map to guide hypergraph refinement. By splitting hyperedge prototypes into foreground and background groups, UGHR decouples higher-order interactions and improves refinement in ambiguous regions. Experiments on five public benchmarks demonstrate consistent gains over strong baselines. Code is available at: https://github.com/CUGfreshman/UHR-Net.
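
The entropy-based uncertainty map that the UGHR block derives from a coarse probability map can be illustrated for the binary case. The sketch assumes a single foreground probability per pixel and base-2 logarithms, so values lie in [0, 1]. The paper may use a different entropy form, and the function name is hypothetical.

```python
import math


def binary_entropy_map(prob_map, eps=1e-7):
    """Per-pixel binary entropy of a foreground probability map.

    prob_map: 2D list of probabilities in [0, 1].
    Values are clipped to [eps, 1 - eps] to avoid log(0).
    Entropy peaks at 1.0 for p = 0.5 and approaches 0 for confident pixels.
    """
    out = []
    for row in prob_map:
        out_row = []
        for p in row:
            p = min(max(p, eps), 1 - eps)
            h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
            out_row.append(h)
        out.append(out_row)
    return out


# Toy coarse prediction: ambiguous pixels on the diagonal, confident ones elsewhere.
coarse_probs = [[0.5, 0.99], [0.01, 0.5]]
uncertainty = binary_entropy_map(coarse_probs)
print([[round(u, 3) for u in row] for row in uncertainty])
```

High-entropy pixels mark the boundary and transition regions that the abstract identifies as unstable, so they are natural candidates for extra refinement.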

2606.09681 2026-06-10 cs.CV 版本更新

GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development

GenEyePose:用于数字神经生理学生物标志物开发的无患者、基于知识的扫视眼动建模

Tianyu Lin, Jooyoung Ryu, Puvada Sreevarsha, Rahul Srinivasaragavan, Riya Satavlekar, Susan Kim, Nidhi Soley, Yujie Yan, Ishan Vatsaraj, Carl Harris, Aimon Rahman, Vishal Patel, Joseph Greenstein, Casey Taylor, Kemar E. Green

发表机构 * Whiting School of Engineering, Johns Hopkins University(约翰霍普金斯大学惠廷工程学院) Department of Neurology, Johns Hopkins Medicine(约翰霍普金斯医学院神经内科)

AI总结 提出首个全合成、无患者的多模态眼动生成流水线,用于泛化扫视分析;基于合成数据训练的深度学习分类器在真实临床数据上区分正常与异常扫视精度,AUROC达0.76。

详情
AI中文摘要

眼动(包括扫视)被广泛认为是神经生理状态的高度敏感和客观生物标志物。检测神经系统疾病中的扫视特征提供了一种快速、便携的脑成像替代方案,避免了可及性和成本方面的障碍。目前,由于隐私问题和数据集稀缺,缺乏稳健的AI视频眼动图解决方案(例如数字生物标志物)用于筛查、分诊或定位脑异常。在这项工作中,我们提出了第一个完全合成、无患者的多模态眼动生成流水线,用于泛化扫视分析。使用该合成数据集,我们训练了一个深度学习分类器,以区分正常和异常(扫视不足hypometria与扫视过度hypermetria)的扫视准确度,并在真实临床数据上评估其性能。该模型实现了0.76的AUROC和0.71的灵敏度,表明合成数据在临床应用中具有强大的泛化潜力,包括作为家庭和急诊室环境中的筛查工具或精确神经解剖定位工具。

英文摘要

Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.

2412.16758 2026-06-10 physics.med-ph cs.CV 版本更新

Training Set Augmentation and Biology-Aware Harmonization Improve Radiomic Models for Lung Cancer Prediction in Indeterminate Nodules

训练集增强与生物学感知的谐波化改善不确定肺结节中肺癌预测的影像组学模型

Claire Huchthausen, Menglin Shi, Gabriel L. A. de Sousa, James Larner, Einsley Janowski, Jonathan Colen, Krishni Wijesooriya

发表机构 * Department of Radiation Oncology, University of Virginia School of Medicine(弗吉尼亚大学医学院放射肿瘤学系) Department of Physics, University of Virginia(弗吉尼亚大学物理系) Department of Physics, Massachusetts Institute of Technology(麻省理工学院物理系) Department of Biomedical Engineering, Northwestern University(西北大学生物医学工程系) Department of Radiation Oncology, University of Virginia(弗吉尼亚大学放射肿瘤学系) Old Dominion University(欧道明大学)

AI总结 针对早期肺结节恶性率低和图像采集差异问题,通过加入后期结节扩充训练集,并采用生物学感知的协调化方法校正采集效应,提升了影像组学模型的预测性能(ROC-AUC 0.74)。

Comments 22 pages, 5 figures, plus supplemental material; updated with the accepted version of the manuscript

详情
AI中文摘要

基于CT影像组学的机器学习有潜力比标准诊疗方法更早预测肺结节(PNs)中的肺癌。早期发展阶段PNs的低恶性率和多变的图像采集方式阻碍了用于诊断这些PNs的影像组学模型的开发。为应对这些挑战,我们利用后期发展阶段的PNs扩充训练集,并对采集效应进行协调化处理。我们研究了低于标准诊疗诊断灵敏度的早期发展阶段良性及恶性PNs(n=106)。当仅使用早期发展阶段PNs经ComBat协调化的影像组学特征训练时,预测恶性的分类器表现接近随机水平。随后,我们用后期发展阶段的良性及恶性PNs(n=225)扩充训练集。我们评估了协调化是否必须纳入那些影响新增训练数据中采集效应的生物学因素。为校正来自四种采集协议的变异性,我们比较了:1)不考虑生物学因素的协调化,2)使用区分早期发展阶段、后期发展阶段良性、后期发展阶段恶性数据集的协变量进行协调化,3)分别对每个数据集进行协调化。使用扩充训练集但采用不考虑生物学因素的协调化的模型未能取得一致的改进。采用协变量协调化(ROC-AUC 0.74 [0.69-0.79])或分别协调化(ROC-AUC 0.71 [0.66-0.77])的扩充训练数据获得了更高的测试ROC-AUC(DeLong检验,p<=0.05)和PR-AUC(Wilcoxon检验,p<=0.05)。在这项原理验证性的方法学研究中,我们通过一个小型单中心数据集证明,结合来自后期发展阶段良性及恶性PNs的影像组学特征需要生物学感知的协调化。

英文摘要

CT radiomics-based machine learning has potential to predict lung cancer in pulmonary nodules (PNs) earlier than standard-of-care methods. Low malignancy rates in early-development PNs and variable image acquisition hinder development of radiomic models for diagnosing these PNs. To address these challenges, we augmented training using later-development PNs and harmonized for acquisition effects. We examine early-development benign and malignant PNs (n=106) below the sensitivity of standard-of-care diagnosis. Classifiers predicting malignancy performed near chance when trained on ComBat-harmonized radiomic features from only early-development PNs. We then augmented training with later-development benign and malignant PNs (n=225). We evaluated whether harmonization must incorporate biology that impacts acquisition effects in added training data. To correct variability from four acquisition protocols, we compared: 1) biology-unaware harmonization, 2) harmonizing with a covariate distinguishing early-development, later-development benign, later-development malignant datasets, 3) harmonizing each dataset separately. Models trained using augmentation, but biology-unaware harmonization, failed to improve consistently. Augmented training data harmonized with a covariate (ROC-AUC 0.74 [0.69-0.79]) or separately (ROC-AUC 0.71 [0.66-0.77]) yielded higher test ROC-AUC (Delong, p<=0.05) and PR-AUC (Wilcoxon, p<=0.05). In a proof-of-principle methodological study, we demonstrate with a small single-center dataset that combining radiomic features from later-development benign and malignant PNs requires biology-aware harmonization.

2507.22017 2026-06-10 eess.IV cs.CV 版本更新

Cyst-X: A Multi-Center MRI Benchmark and Federated Learning Framework for Malignancy-Risk Stratification of Pancreatic Cystic Neoplasm

Cyst-X:用于胰腺囊性肿瘤恶性风险分层的多中心MRI基准与联邦学习框架

Hongyi Pan, Gorkem Durak, Elif Keles, Ziliang Hong, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchan Huang, Candice W. Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci

发表机构 * Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University(西北大学放射科机器与混合智能实验室) Istanbul Faculty of Medicine, Istanbul University(伊斯坦布尔大学伊斯坦布尔医学院) Department of Biomedical Engineering and Radiology, University of Wisconsin-Madison(威斯康星大学麦迪逊分校生物医学工程与放射科) Department of Preventive Medicine, Northwestern University(西北大学预防医学系) Division of Gastroenterology and Hepatology, New York University(纽约大学消化与肝病科) Department of Electrical, Electronic and Computer Engineering, University of Catania(卡塔尼亚大学电气、电子与计算机工程系) NVIDIA(英伟达) Department of Radiology, Columbia University(哥伦比亚大学放射科) Department of Radiology and Nuclear Medicine, Erasmus Medical Center(伊拉斯姆斯医学中心放射与核医学科) Department of Gastroenterology and Hepatology, Erasmus Medical Center(伊拉斯姆斯医学中心消化与肝病科) Department of Radiology, New York University(纽约大学放射科) Division of Gastroenterology and Hepatology, Mayo Clinic Florida(梅奥诊所佛罗里达院区消化与肝病科) Department of Gastroenterology and Hepatology, Northwestern University(西北大学消化与肝病科)

AI总结 提出Cyst-X,一个多中心MRI基准和联邦学习框架,用于IPMN恶性风险分层,结合PanSegNet分割器和3D DenseNet-121分类器,在内部交叉验证中达到0.85的AUC,性能与放射科医生相当。

详情
AI中文摘要

预计到2030年,胰腺癌将成为第二大致命癌症,因此早期检测至关重要。导管内乳头状黏液性肿瘤(IPMN)是关键的癌前病变,目前指南在恶性风险分层方面存在困难,导致不必要的手术或漏诊。在此,我们介绍Cyst-X,一个用于IPMN恶性风险分层的多中心MRI基准和联邦学习框架。该数据集包含来自七个国际中心764名患者的1,461次腹部MRI扫描,具有基于组织病理学或三年影像随访的三级恶性标签和专家胰腺分割。该流程将PanSegNet胰腺分割器与3D DenseNet-121分类器以及并行放射组学预测器相结合。在内部交叉验证中,深度学习分类器在T2加权MRI上对高风险与低风险或无风险鉴别达到了平均受试者工作特征曲线下面积(AUC)0.85(95%置信区间0.84-0.86),平均精确度从患病率基线0.23提高到0.64。当训练分布在多个机构之间且不交换原始患者图像时,该性能得以保持(AUC 0.85,FedProx)。在仅基于影像条件下评估的629例读者子集上,与三位盲法放射科医生相比,该分类器在特异性相当的情况下达到或超过了敏感性。为了加速早期胰腺癌检测研究,我们公开发布Cyst-X数据集、分割掩膜和训练模型,作为首个用于胰腺囊性肿瘤分析的大规模多中心MRI资源。

英文摘要

Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we introduce Cyst-X, a multi-center MRI benchmark and a federated learning framework for IPMN malignancy-risk stratification. The dataset comprises 1,461 abdominal MRI scans from 764 patients at seven international centers, with three-tier malignancy labels anchored in histopathology or three-year imaging follow-up and expert pancreas segmentations. The pipeline couples the PanSegNet pancreas segmenter with a 3D DenseNet-121 classifier and a parallel radiomics predictor. On internal cross-validation, the deep learning classifier reached a mean area under the receiver operating characteristic curve (AUC) of 0.85 (95% confidence interval 0.84-0.86) on T2-weighted MRI for high-risk versus low- or no-risk discrimination, with the average precision rising from a prevalence baseline of 0.23 to 0.64. This performance was preserved (AUC 0.85, FedProx) when training was distributed across institutions without exchange of raw patient images. Benchmarked against three blinded radiologists on a 629-case reader subset evaluated under imaging-only conditions, the classifier matched or exceeded sensitivity at comparable specificity. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset, segmentation masks, and trained models as the first large-scale, multi-centre MRI resource for pancreatic cystic neoplasm analysis.
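
The abstract reports that the AUC was preserved when training was distributed with FedProx. As a rough illustration only, the sketch below shows the standard FedProx idea: each site minimizes its local loss plus a proximal term (mu/2)·||w − w_global||², and the server averages the site weights by sample count. The function names, `mu`, `lr`, and the plain-list weight representation are assumptions made for this example, not details taken from Cyst-X.

```python
from typing import List, Sequence


def fedprox_local_step(weights: Sequence[float],
                       global_weights: Sequence[float],
                       grad: Sequence[float],
                       mu: float = 0.01,
                       lr: float = 0.1) -> List[float]:
    """One gradient step on local_loss + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the local loss at `weights`; the proximal
    term adds mu * (w - w_global), pulling the site model toward the
    current global model.
    """
    return [w - lr * (g + mu * (w - wg))
            for w, wg, g in zip(weights, global_weights, grad)]


def weighted_average(client_weights: Sequence[Sequence[float]],
                     client_sizes: Sequence[int]) -> List[float]:
    """Server aggregation: average site weights, weighted by sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(n * w[i] for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]
```

Only model weights cross site boundaries in this scheme, which matches the abstract's statement that raw patient images are not exchanged.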

8. 文档图像、OCR与图表理解 6 篇

2606.10640 2026-06-10 cs.CV 新提交

ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

ChartLens:用于图表数据校正和事实性摘要精炼的双分支框架

Hao Liu, Ruping Cao, Kun Wang, Zhiran Li, Fan Liu, Yupeng Hu, Liqiang Nie

发表机构 * Shandong University(山东大学) Southeast University(东南大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 提出ChartLens双分支框架,通过结构感知CSV验证校正和文本保留引导的摘要精炼,提升图表数据恢复与摘要事实性,在DataMFM挑战赛Track 2中获第一。

详情
AI中文摘要

在本报告中,我们展示了针对DataMFM挑战赛Track 2:图表理解(Chart Understanding)的冠军解决方案。该赛道要求模型从图表图像中恢复结构化图表数据并生成忠实于事实的自然语言摘要。为了满足准确数据提取和事实性叙述的互补需求,我们提出了ChartLens,一个用于图表数据校正和摘要精炼的双分支框架。ChartLens由两个关键模块组成:结构感知CSV验证与校正(SAVC)和文本保留引导的摘要精炼(TRSR)。SAVC通过验证和校正提高结构化数据提取的可靠性,而TRSR通过保留图表中的关键文本和数值证据来增强摘要生成。通过结合模型自适应、基于校正的生成和OCR辅助的证据依据,ChartLens改善了结构化数据恢复和摘要事实性。在测试集上,我们的最终系统获得了69.10的总分,并在Track 2中排名第一,证明了其在准确图表理解方面的有效性。我们的代码将发布于 https://github.com/iLearn-Lab/CVPRW26-ChartLens 。

英文摘要

In this report, we present our champion solution for the DataMFM Challenge Track 2: Chart Understanding. This track requires models to recover structured chart data and generate faithful natural-language summaries from chart images. To address the complementary requirements of accurate data extraction and factual narration, we propose ChartLens, a dual-branch framework for chart data correction and summary refinement. ChartLens consists of two key modules: Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR). SAVC improves the reliability of structured data extraction through verification and correction, while TRSR enhances summary generation by preserving critical textual and numerical evidence from charts. By combining model adaptation, correction-based generation, and OCR-assisted evidence grounding, ChartLens improves both structured data recovery and summary factuality. On the test set, our final system achieves an overall score of 69.10 and ranks first in Track 2, demonstrating its effectiveness for accurate chart understanding. Our code will be released at: https://github.com/iLearn-Lab/CVPRW26-ChartLens.

2606.10701 2026-06-10 cs.CV 新提交

Vector Map as Language: Toward Unified Remote Sensing Vector Mapping

矢量地图即语言:迈向统一的遥感矢量制图

Yinglong Yan, Yunkai Yang, Haoyi Wang, Wei Fu, Linshan Wu, Honghu Pan, Shaobo Xia, Shanghang Zhang, Hao Chen, Leyuan Fang

发表机构 * School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Geomatics Engineering, Changsha University of Science and Technology(长沙理工大学测绘工程系) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室)

AI总结 提出VecLang范式,将多类矢量制图重构为结构化文本生成,通过类GeoJSON语言统一表达不同地理实体,并设计渐进式视觉-语言映射框架和层次化矢量语言优化方法,实现跨类别、跨数据集和开放词汇的矢量地图生成。

详情
AI中文摘要

遥感矢量制图旨在从遥感图像中生成地理实体的结构化地图,如建筑物、道路和水体。实践中,矢量地图通常包含多个类别图层和异构的实体结构,需要统一模型满足多样化的制图需求。然而,现有方法通常将矢量对象表示为多边形或图,使其仅适用于特定类别:多边形难以捕捉拓扑关系,而图往往模糊实例边界。我们观察到,语言作为人类交流的自然媒介,提供了一种灵活且富有表现力的表示,能够容纳异构地图元素,包括几何、语义和拓扑。受此启发,我们提出矢量地图即语言(VecLang),一种统一范式,将多类矢量制图重构为结构化文本生成。VecLang将不同地理实体的共同元素编码为类GeoJSON的矢量语言,从而在共享文本格式内实现跨类别建模。为了可靠地生成这种语言,我们设计了一个渐进式视觉-语言映射框架,首先定位矢量化单元,然后生成结构化地图元素。我们进一步引入层次化矢量语言优化,利用强化学习提高语法有效性、内容保真度和地图可执行性。我们还构建了包含54K图像和800K实例的VecMap-Bench,支持标准和泛化设置下的训练与评估。大量实验表明,VecLang能够处理单类和多类矢量制图,同时实现强大的跨数据集和开放词汇泛化。模型和数据集已公开于 https://github.com/yyyyll0ss/VecLang 。

英文摘要

Remote sensing vector mapping aims to generate structured maps of geospatial entities, such as buildings, roads, and water bodies, from remote sensing imagery. In practice, vector maps usually contain multiple category layers and heterogeneous entity structures, requiring a unified model for diverse mapping needs. However, existing methods typically represent vector objects as polygons or graphs, making them suitable only for specific categories: polygons poorly capture topological relations, while graphs often blur instance boundaries. We observe that language, as a natural medium for human communication, offers a flexible and expressive representation that can accommodate heterogeneous map elements, including geometry, semantics, and topology. Motivated by this insight, we propose Vector Map as Language (VecLang), a unified paradigm that reformulates multiclass vector mapping as structured text generation. VecLang encodes the common elements of different geospatial entities into a GeoJSON-like vector language, enabling cross-category modeling within a shared textual format. To generate this language reliably, we design a progressive vision-language mapping framework that first localizes vectorization units and then generates structured map elements. We further introduce Hierarchical Vector Language Optimization, which uses reinforcement learning to improve syntax validity, content fidelity, and map executability. We also build VecMap-Bench with 54K images and 800K instances, supporting training and evaluation across standard and generalization settings. Extensive experiments demonstrate that VecLang handles both single-class and multiclass vector mapping while achieving strong cross-dataset and open-vocabulary generalization. The model and dataset are publicly available at https://github.com/yyyyll0ss/VecLang.

2606.10953 2026-06-10 cs.AI cs.CV 交叉投稿

Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

Architect-Ant: 可编辑的建筑平面图自动家具布置

Fedor Rodionov, Aleksandar Cvejic, Michael Birsak, John Femiani, Peter Wonka

发表机构 * King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学) Miami University(迈阿密大学)

AI总结 提出基于微调视觉语言模型的可编辑自动家具布置框架Architect-Ant,通过领域特定语言编码布局并优化,生成符合建筑约束的合理布局。

Comments 17 pages, 10 figures

详情
AI中文摘要

带家具的平面图是房地产可视化、室内设计和建筑工作流程的基础。然而,由于缺乏带有对象级家具标注的真实专业设计平面图数据集,自动家具布置的进展受到限制。为解决这一差距,我们引入了AntPlan-270,这是一个包含270个建筑平面图的精选数据集,每个房间都有家具边界框标注,涵盖十个住宅房间类别。基于该数据集,我们提出了Architect-Ant,一个由微调视觉语言模型驱动的可编辑自动家具布置框架。家具布局使用一种紧凑的、基于坐标的领域特定语言(DSL)表示,该语言编码对象类别和相对于房间几何形状的位置。为了提高空间推理能力,我们生成了程序化推理轨迹,捕捉建筑约束,如墙壁对齐、门窗间隙、流通、固定装置兼容性和房间特定家具清单,并使用它们来监督模型的微调。然后,我们对候选对象位置应用偏好优化,以进一步提高布局质量。生成的DSL可以栅格化为语义掩码,并用于条件化基于Flux的LoRA渲染器,生成逼真的蓝图风格带家具平面图图像,同时保留可编辑的符号布局。布局布置实验表明,Architect-Ant能生成几何上有效且功能上合理的布局,并为更大的仅结构平面图数据集的家具布置提供了一条可扩展的路径。

英文摘要

Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor-plan datasets with object-level furniture annotations. To address this gap, we introduce AntPlan-270, a curated dataset of 270 architectural floor plans with per-room furniture bounding box annotations across ten residential room categories. Building on this dataset, we present Architect-Ant, an editable automatic furnishing framework powered by a fine-tuned vision-language model. Furniture layouts are represented using a compact, coordinate-based domain-specific language (DSL) that encodes object categories and placements relative to the room geometry. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room-specific furniture inventories, and use them to supervise fine-tuning of the model. We then apply preference optimization over candidate object placements to further refine layout quality. The generated DSL can be rasterized into semantic masks and used to condition a Flux-based LoRA renderer, producing realistic blueprint-style furnished floor-plan images while preserving the editable symbolic layout. Experiments on layout furnishing show that Architect-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure-only floor-plan datasets.

2602.19086 2026-06-10 cs.CV 版本更新

Seal-Robust KCR: A Robust Kuzushiji Character Recognition Framework under Seal Interference

Seal-Robust KCR:印章干扰下的鲁棒崩字(Kuzushiji)字符识别框架

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori

发表机构 * Kyoto University(京都大学)

AI总结 针对印章干扰导致崩字(Kuzushiji)字符识别性能下降的问题,提出一种结合文档修复和合成数据增强的抗印章干扰框架,相对常规基线,在真实和合成测试集上分别将字符错误率相对降低39.7%和50.1%。

Comments Supplementary material is available at https://ruiyangju.github.io/Seal-Robust-KCR

详情
AI中文摘要

崩字(Kuzushiji,くずし字)是前现代日本使用最广泛的草书书写体系之一。由于其高度草书化的字形和大量的字形变体,大多数现代日本读者无法阅读崩字。因此,近年来的研究集中于开发自动化的崩字字符识别(KCR)方法,这些方法在相对干净的日本历史文档图像上取得了强劲性能。尽管印章经常出现在日本历史文档中,现有方法在印章干扰下,特别是当印章与字符重叠时,往往无法保持识别精度。为应对这一挑战,我们提出了一种抗印章干扰的KCR框架。在字符检测、分类和排序的基础上,所提出的框架额外引入了文档修复以减轻印章干扰,从而提升整体识别性能。此外,我们引入了一种新颖的合成数据增强策略,以提升字符检测模型的性能。我们还纠正了标注错误,重新构建了数据集,并创建了一个合成测试集以模拟严重的印章干扰。实验结果表明,所提出的框架能有效减轻印章干扰对KCR的影响。与常规基线和NDLkotenOCR相比,它在真实测试集上分别实现了39.7%和5.9%的相对字符错误率(CER)降低,在合成测试集上分别实现了50.1%和41.7%的降低。

英文摘要

Kuzushiji was one of the most widely used cursive writing systems in pre-modern Japan. Due to its highly cursive forms and extensive glyph variations, most modern Japanese readers are unable to read Kuzushiji characters. Consequently, recent studies have focused on developing automated Kuzushiji character recognition (KCR) methods, which have achieved strong performance on relatively clean Japanese historical document images. Although seals frequently appear in Japanese historical documents, existing methods often fail to maintain recognition accuracy under seal interference, particularly when seals overlap with characters. To address this challenge, we propose a seal-robust KCR framework. Based on character detection, classification, and ordering, the proposed framework additionally incorporates document restoration to mitigate seal interference, thereby improving overall recognition performance. In addition, we introduce a novel synthetic data augmentation strategy to enhance the performance of character detection models. We further correct annotation errors, reconstruct the dataset, and create a synthetic test set to simulate severe seal interference. Experimental results demonstrate the effectiveness of the proposed framework in mitigating the impact of seal interference on KCR. Compared with a conventional baseline and NDLkotenOCR, it achieves relative character error rate (CER) reductions of 39.7% and 5.9%, respectively, on the real test set, and 50.1% and 41.7%, respectively, on the synthetic test set.
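
The results above are stated as relative CER reductions. As a reading aid, the sketch below uses the common definitions: CER as character-level edit distance divided by reference length, and relative reduction as (baseline − new) / baseline. The abstract does not spell out its exact evaluation protocol, so these definitions and all function names are assumptions for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    return (baseline_cer - new_cer) / baseline_cer
```

Under this definition, a drop from a CER of 0.20 to 0.10 is a 50% relative reduction, even though the absolute change is only 10 points.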

2604.22192 2026-06-10 cs.CV 版本更新

CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

CharTide:通过三视角微调与查询驱动演化实现以数据为中心的图表到代码生成

Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan, Yuan Yao, Peng Hou, Anxiang Zeng, Alex Jinpeng Wang

发表机构 * Nanjing University(南京大学) LLM Team, Shopee Pte. Ltd.(Shopee 大语言模型团队) East China Normal University(华东师范大学) Nanjing University of Science and Technology(南京理工大学) Central South University(中南大学)

AI总结 提出CharTide框架,通过三视角微调解耦视觉感知与代码逻辑,并引入基于信息不变性的查询驱动强化学习进行数据验证,在多个基准上超越GPT-4o。

Comments Accepted to ACL 2026 Main

详情
AI中文摘要

图表到代码生成要求视觉语言模型(VLM)同时具备严格的视觉精度和语法正确性。然而,现有方法从根本上受限于数据层面的不足:尽管可用的图表到代码数据集不断增长,但简单地扩充同质的图表-代码对会将视觉感知与程序逻辑混为一谈,使模型无法充分利用多模态监督的丰富性。我们提出CharTide,一种新颖的以数据为中心的框架,系统性地重新设计图表到代码生成的训练数据和对齐数据。首先,我们通过三视角微调策略构建了一个200万样本的数据集,明确将训练解耦为视觉感知、纯文本代码逻辑和模态融合三条数据流,使7B模型仅使用监督数据就能超越专门的基线。其次,我们将对齐重新表述为一个数据验证问题,而非启发式评分任务。为此,我们引入了一种基于信息不变性原理的查询驱动强化学习框架:下游模型对原始图表和生成图表上的相同视觉查询应给出一致的答案。不同于刚性的规则匹配或VLM评分,我们使用一个冻结的检查器(Inspector)通过原子问答任务客观地验证生成的图表,并基于答案准确性提供可验证的奖励信号。在ChartMimic、Plot2Code和ChartX上的实验表明,CharTide-7B/8B显著优于开源基线,超越GPT-4o,并可与GPT-5相竞争。

英文摘要

Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
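
The information-invariance principle described above can be pictured as a simple agreement score. The sketch below is only a toy illustration: it scores a generated chart by the fraction of atomic questions for which a fixed answering function gives the same answer on the original and the generated chart. `answer_fn`, the dictionary stand-ins for charts, and the exact-match comparison are all hypothetical; the abstract does not specify how the Inspector or its reward is actually implemented.

```python
from typing import Any, Callable, Sequence


def invariance_reward(questions: Sequence[str],
                      answer_fn: Callable[[Any, str], Any],
                      original_chart: Any,
                      generated_chart: Any) -> float:
    """Fraction of questions answered identically on both charts.

    `answer_fn(chart, question)` stands in for a frozen question-answering
    model; here it is any deterministic callable.
    """
    if not questions:
        return 0.0
    matches = sum(
        answer_fn(original_chart, q) == answer_fn(generated_chart, q)
        for q in questions
    )
    return matches / len(questions)


# Toy usage: charts are dicts of facts and the "inspector" is a lookup.
lookup = lambda chart, question: chart.get(question)
original = {"tallest bar": "A", "number of bars": 3}
generated = {"tallest bar": "A", "number of bars": 4}
score = invariance_reward(["tallest bar", "number of bars"],
                          lookup, original, generated)
```

A score of 1.0 would mean the generated chart preserves every queried fact; lower scores indicate information lost or altered during chart-to-code generation.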

2605.07415 2026-06-10 cs.CV cs.CL 版本更新

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

ChartREG++:面向多样化指代线索和多目标指代的图表指代表达式定位基准与改进

Tianhao Niu, Ziyu Han, Xuan Dong, Qingfu Zhu, Wanxiang Che

发表机构 * Research Center for Social Computing and Interactive Robotics(社会计算与交互机器人研究中心)

AI总结 针对现有图表指代表达式定位基准的局限,提出支持多种定位形式、多目标指代、多样化线索和图表类型的基准,并利用代码驱动合成流水线生成像素级实例掩码,训练实例分割模型集成到多模态定位框架,显著提升性能。

详情
AI中文摘要

指代表达式定位是视觉定位的核心问题,广泛用于视觉与语言模型的空间定位与推理诊断,但以往工作多聚焦于自然图像。相比之下,现有的图表指代表达式定位基准存在局限:(1) 大多采用边界框,限制了精细图表元素的定位精度;(2) 大多假设单个或两个指代目标实例,无法处理多实例目标指代;(3) 语言表达过度依赖文本线索或数据排名线索;(4) 仅覆盖狭窄的图表类型范围。为解决这些问题,我们引入了一个图表指代表达式定位基准,系统性地支持多种定位形式、多个指代目标、多样化定位线索和多种图表类型。在代表性多模态大模型上的结果揭示了显著的性能差距。我们进一步引入了一个代码驱动的合成流水线,利用绘图程序与渲染图表基元之间的固有对齐,跨图表元素类型和粒度生成像素级精确的实例掩码。我们使用合成掩码训练了一个实例分割模型,并将其集成到一个通用的多模态定位框架中。最终系统在我们的基准上持续优于基线,并很好地泛化到从ChartQA导出的真实图表定位基准。

英文摘要

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

9. 低层视觉、计算成像与图像增强 10 篇

2606.10275 2026-06-10 cs.CV 新提交

FoA-SR: Faithful or Aesthetic? Profile-Aware Preference Optimization for Real-World Image Super-Resolution

FoA-SR:忠实还是美观?面向真实世界图像超分辨率的配置感知偏好优化

Amjad Mahdi Alqarni, Peizhong Ju

发表机构 * Department of Computer Science(计算机科学系) University of Kentucky(肯塔基大学)

AI总结 提出FoA-SR,基于偏好优化实现真实世界图像超分辨率,针对忠实和美观两种恢复配置(profile)分别优化适配器,在RealSR和DIV2K上验证了可分离的恢复策略。

Comments 17 pages, 6 figures, 9 tables. Preprint

详情
AI中文摘要

真实世界图像超分辨率(SR)通常按单一恢复目标设计,尽管当前的生成模型能够为同一输入产生多个高质量重建。本文认为,最佳恢复策略取决于具体的恢复配置(profile):忠实恢复优先考虑参考一致性、结构保持和幻觉抑制,而美观恢复优先考虑视觉上悦目且自然的细节。我们提出FoA-SR,一种基于配置的新型真实世界SR偏好优化方法。为实现此目标,FoA-SR从我们基于FLUX.2的监督式SR适配器(Flux2SR)出发,该适配器通过LR潜在条件、流匹配和图像空间重建损失进行配对LR到HR图像超分辨率训练。在得到共享的监督式超分辨率适配器后,FoA-SR为每个输入图像生成共享的随机候选池,并使用配置特定的忠实奖励和美观奖励对相同候选进行排序,以挖掘胜者-败者对。这些配对用于微调各自独立的LoRA适配器,同时保持基础模型冻结。在RealSR和DIV2K上的实验表明,FoA-SR可以将同一SR适配器导向不同的恢复目标:忠实适配器改善参考一致性指标,而美观适配器提升无参考感知质量指标。我们的候选池分析显示,忠实奖励和美观奖励经常选择不同的胜者,而Hybrid-LoRA消融表明,将两种配置合并为一个奖励会产生隐式折衷,而非显式的配置控制。

英文摘要

Real-world image super-resolution (SR) is often designed with a single restoration objective, despite the current capacity of generative models to produce multiple high-quality reconstructions for the same input. In this paper, we argue that the best restoration strategy is subject to the specific restoration profile: a Faithful restoration prioritizes reference consistency, structure preservation, and hallucination suppression, whereas an Aesthetic restoration prioritizes visually pleasing and natural-looking details. We propose FoA-SR, a novel preference optimization approach to real-world SR based on profiles. To achieve this goal, FoA-SR starts with our supervised FLUX.2-based SR adapter (Flux2SR) trained with LR latent conditioning, flow matching, and image-space reconstruction losses for paired LR-to-HR image super-resolution. Following the development of the shared supervised super-resolution adapter, FoA-SR generates a shared stochastic candidate pool for each input image and ranks the same candidates using profile-specific Faithful and Aesthetic rewards to mine winner-loser pairs. These pairs are used to fine-tune separate LoRA adapters while keeping the base model frozen. Experiments on RealSR and DIV2K show that FoA-SR can steer the same SR adapter towards distinct restoration objectives: a Faithful adapter improves reference-consistent metrics while an Aesthetic adapter boosts metrics that measure perceptual quality without reference. Our candidate-pool analysis shows that Faithful and Aesthetic rewards frequently select different winners, and a Hybrid-LoRA ablation shows that collapsing both profiles into one reward yields an implicit compromise rather than explicit profile control.

2606.10350 2026-06-10 cs.CV 新提交

Multi-Angular Reflectance Anisotropy Observed from UAV Multispectral Imagery

无人机多光谱影像观测的多角度反射率各向异性

Zhenqiang Qin, Chenguang Dai, Min Wang, Xian Li

发表机构 * University of Information Engineering(信息工程大学)

AI总结 提出一种几何感知的多角度观测提取流程,从BRDF角度量化观测几何效应,通过SFM精化相机参数并重投影同质区域,联合提取多波段反射率和观测几何参数,发现红边和近红外波段反射率变化达119-137%。

详情
AI中文摘要

由于低空飞行和宽视场成像,无人机多光谱影像自然包含多角度观测,这可能引入几何驱动的辐射变异性。本研究提出一种几何感知的多角度观测提取流程,从BRDF角度量化观测几何效应。具体地,通过运动恢复结构(SFM)精化相机内参和外参,并将正射影像上标注的同质区域重投影到从不同视角获取的多个原始子图像上。这使得能够在不同观测方向下联合提取同一地面目标的多波段反射率和观测几何参数。进一步利用(VZA,RAA)域中的波段极坐标可视化分析提取的观测值。草地目标的结果显示,十个波段均存在明显的反射率各向异性,其中红边和近红外波段的最大与最小反射率变化达119-137%,表明观测几何效应对辐射一致性有不可忽视的影响。

英文摘要

UAV multispectral imagery naturally contains multi-angular observations due to low flight altitude and wide field-of-view imaging, which may introduce geometry-driven radiometric variability. This study proposes a geometry-aware multi-angular observation extraction workflow to quantify observation-geometry effects from a BRDF perspective. Specifically, camera intrinsics and extrinsics are refined via structure-from-motion (SFM), and homogeneous regions annotated on an orthomosaic are reprojected onto multiple raw sub-images acquired from different viewpoints. This enables joint extraction of multi-band reflectance and observation geometry parameters for the same ground targets under varying viewing directions. The extracted observations are further analyzed using band-wise polar visualization in the (VZA, RAA) domain. Results on a grassland target show clear reflectance anisotropy across ten bands, with red-edge and near-infrared bands exhibiting 119-137% variability between maximum and minimum reflectance, indicating non-negligible observation-geometry effects on radiometric consistency.

2606.10373 2026-06-10 cs.CV 新提交

PF-Trans: Physics-Embedded Frequency-Aware Transformer for Spectral Reconstruction

PF-Trans:物理嵌入的频率感知Transformer用于光谱重建

Yuzhe Gui, Tianzhu Liu, Yanfeng Gu, Xian Li

发表机构 * National Natural Science Foundation of China(国家自然科学基金委员会)

AI总结 针对快照宽带滤光片阵列成像中的光谱混叠问题,提出物理嵌入的频率感知Transformer(PF-Trans),通过掩膜注入和灰度一致性损失保证物理保真度,并引入双域块并行FFT分支抑制频域伪影,在GF-5上海数据集上PSNR达48.50 dB。

详情
AI中文摘要

快照宽带滤光片阵列(BFA)成像为光谱重建提供了高光通量,但由于复杂调制引入了严重的光谱混叠。当前的深度学习方法局限于空间去噪,往往无法解决由掩膜结构引起的全局频率特定退化。为了解决这个问题,我们提出了一种物理嵌入的频率感知Transformer(PF-Trans),用于高保真遥感光谱重建。我们的方法通过掩膜注入和灰度一致性损失显式集成物理传感模型,以确保物理保真度。此外,我们引入了一个带有并行快速傅里叶变换(FFT)分支的双域块,使网络能够感知并抑制频域中的混叠伪影。在多个数据集上的大量实验表明,PF-Trans实现了最先进的性能,在GF-5上海数据集上峰值信噪比(PSNR)高达48.50 dB,显著优于对比方法。

英文摘要

Snapshot Broadband Filter Array (BFA) imaging provides high light throughput for spectral reconstruction but introduces severe spectral aliasing due to complex modulation. Current deep learning approaches, limited to spatial denoising, often fail to address the global frequency-specific degradations caused by the mask structure. To address this, we propose a Physics-embedded Frequency-aware Transformer (PF-Trans) for high-fidelity remote sensing spectral reconstruction. Our method explicitly integrates the physical sensing model through mask injection and a gray-scale consistency loss to ensure physical fidelity. Furthermore, we introduce a Dual-domain Block with a parallel Fast Fourier Transform (FFT) branch, enabling the network to perceive and suppress aliasing artifacts in the frequency domain. Extensive experiments on multiple datasets demonstrate that PF-Trans achieves state-of-the-art performance, achieving a Peak Signal-to-Noise Ratio (PSNR) of up to 48.50 dB on the GF-5 Shanghai dataset, significantly outperforming comparison methods.
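
For readers less familiar with the headline metric, the sketch below computes PSNR in its usual form, 10·log10(MAX² / MSE). The flat-list input and the `max_value=1.0` default are assumptions for this example; the abstract does not state the data range or averaging scheme used for the reported 48.50 dB.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         estimate: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples (pixels and bands flattened).
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error.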

2606.10628 2026-06-10 cs.CV 新提交

Leveraging Metric Depth for Relative Depth Prediction

利用度量深度进行相对深度预测

Xiaoyang Bi, Shuaikun Liu, Zhaohong Liu, Yuxin Yang, Zhe Zhao, Mengshi Qi, Liang Liu, Huadong Ma

发表机构 * Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia(智能电信软件与多媒体北京市重点实验室) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对足球场景中相对深度预测训练样本少的问题,提出利用预训练模型的零样本能力学习度量深度,在挑战集上取得2.68×10^{-3}的分数。

详情
AI中文摘要

我们展示了针对2025年SoccerNet单目深度估计竞赛挑战的解决方案。在足球场景中预测相对深度具有挑战性,尤其是仅有数千个训练样本可用。为解决这一问题,我们的方法利用了在大规模数据集上预训练的模型的强大零样本能力来学习度量深度,从而有效进行相对深度预测,在挑战集上取得了$2.68 \times 10^{-3}$的分数。

英文摘要

We present our solution to the 2025 SoccerNet Monocular Depth Estimation Competition Challenge. Predicting the relative depth in football scenarios is challenging, especially with only thousands of training samples available. To address this issue, our method leverages the powerful zero-shot capabilities of models pretrained on large-scale datasets to learn metric depth for effective relative depth prediction, achieving a score of $2.68 \times 10^{-3}$ on the challenge set.

2606.11032 2026-06-10 cs.CV 新提交

U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training

U-TTT:通过测试时训练实现可泛化的PET图像去噪

Zhiwen Yang, Jiayin Li, Hao Lu, Hui Zhang, Zihua Wang, Bingzheng Wei, Yan Xu

发表机构 * School of Biological Science and Medical Engineering, Beihang University(北京航空航天大学生物科学与医学工程学院) Department of Biomedical Engineering, Tsinghua University(清华大学生物医学工程系) School of Aerospace Engineering, Tsinghua University(清华大学航天航空学院) ByteDance Inc.(字节跳动有限公司)

AI总结 针对PET图像去噪模型在分布偏移下性能退化的问题,提出U-TTT模型,集成测试时训练(TTT)层,通过自监督动态调整参数,并设计双域自适应机制(空间和频率TTT层),在未见剂量水平和扫描仪下实现最优去噪和泛化。

详情
AI中文摘要

现有的用于正电子发射断层扫描(PET)图像去噪的深度学习模型在分布偏移下常常遭受严重的性能退化,从根本上限制了其稳健的临床部署。这种泛化能力的缺乏源于固定参数模型的传统范式,该范式在训练后无法适应测试数据的变化(例如,剂量水平或扫描仪类型)。为了克服这一限制并实现稳健的泛化,我们引入了U-TTT,一种新颖的U形模型,它集成了测试时训练(TTT)层,通过自监督在推理过程中动态调整模型参数,从而适应每个测试实例的特定特征。此外,为了全面捕捉3D PET数据的复杂退化,U-TTT具有双域自适应机制,包括空间测试时训练(S-TTT)层和频率测试时训练(F-TTT)层。S-TTT层捕捉并校正空间结构退化,而F-TTT层抑制全局噪声谱并恢复精细的高频细节。大量实验表明,U-TTT在PET去噪性能上达到了最先进水平,并在具有挑战性的分布偏移下(包括未见剂量水平和未见扫描仪)展现出优越的泛化能力。我们的代码将发布于 https://github.com/Yaziwel/U-TTT 。

英文摘要

Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of generalization stems from the conventional paradigm of fixed-parameter models that cannot adapt to variations in test data (e.g., dose levels or scanner types) after training. To overcome this limitation and achieve robust generalization, we introduce U-TTT, a novel U-shaped model that integrates Test-Time Training (TTT) layers to dynamically adjust model parameters during inference through self-supervision, thereby adapting to the specific characteristics of each test instance. Furthermore, to comprehensively capture the complex degradations of 3D PET data, U-TTT features a dual-domain adaptation mechanism comprising a Spatial Test-Time Training (S-TTT) layer and a Frequency Test-Time Training (F-TTT) layer. The S-TTT layer captures and corrects spatial structural degradations, while the F-TTT layer suppresses global noise spectra and restores delicate high-frequency details. Extensive experiments demonstrate that U-TTT achieves state-of-the-art PET denoising performance and exhibits superior generalization under challenging distribution shifts, including both unseen dose levels and unseen scanners. Our code will be available at https://github.com/Yaziwel/U-TTT.

2606.11131 2026-06-10 cs.CV 新提交

UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors

UniPET:一种适用于不同剂量减少因子的高质量PET图像去噪通用网络

Zhiwen Yang, Yang Zhou, Haowei Chen, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu

AI总结 针对现有PET去噪方法在剂量减少因子变化时性能下降的问题,提出UniPET网络,通过风格对齐网络和区域感知学习策略实现跨DRF的高质量去噪,性能达到最先进水平。

详情
AI中文摘要

大多数现有的基于深度学习的PET图像去噪方法假设低剂量PET图像具有固定且已知的剂量减少因子(DRF)。然而,当DRF在实际应用中超出假设范围时,这些方法会遇到显著的性能下降。为了应对不同DRF带来的挑战,一些初步研究聚焦于通用PET图像去噪任务,旨在训练一个覆盖不同DRF低剂量数据的通用模型。尽管如此,这些通用模型常常难以处理不同DRF数据中存在的风格不匹配问题,导致出现显著的过度平滑效应,即“风格消除问题”。为了解决这个问题,我们创新性地将域泛化引入PET图像去噪,并提出了一种通用PET图像去噪网络(UniPET),以实现跨不同DRF的高质量PET图像去噪。UniPET包含两个主要创新:风格对齐网络(SAN)和区域感知学习策略(RALS)。具体而言,SAN利用源自域泛化的风格对齐技术来对齐和恢复不同DRF下的风格,确保模型在各种DRF下的泛化能力,同时有效保留风格。此外,为了增强风格恢复,RALS区分平坦区域和风格化区域,仅在后者上进行对抗学习,从而更有效地引导模型关注学习风格化区域。实验证明,我们提出的UniPET能够自适应地恢复不同DRF风格,并实现跨DRF的高质量PET图像去噪。全面的实验表明,UniPET在特定DRF下表现出与专用DRF模型相当的性能,并在定量、感知和临床评估中实现了通用PET图像去噪的最先进性能。

英文摘要

Most existing deep learning-based PET image denoising methods assume a fixed and known dose reduction factor (DRF) for low-dose PET images. However, these methods encounter significant performance degradation when the DRF varies beyond the assumed one in practical applications. To address the challenge posed by varied DRFs, several preliminary studies focus on the task of universal PET image denoising, aiming to train a universal model over low-dose data across DRFs. Nonetheless, these vanilla universal models often struggle with misaligned styles present in different DRF data, leading to the "style elimination issue" with a significant over-smoothing effect. To deal with this issue, we innovatively introduce domain generalization to PET image denoising and propose a universal PET image denoising network (UniPET) to achieve high-quality PET image denoising across diverse DRFs. UniPET comprises two primary innovations: a style alignment network (SAN) and a region-aware learning strategy (RALS). Specifically, SAN utilizes style alignment techniques derived from domain generalization to align and recover styles across different DRFs, ensuring the model's generalizability across various DRFs while effectively preserving styles. Furthermore, to enhance style recovery, RALS distinguishes between flat and stylized regions, exclusively conducting adversarial learning on the latter, thereby more effectively guiding the model's focus towards learning stylized regions. It is demonstrated that our proposed UniPET can adaptively recover different DRF styles and achieve high-quality PET image denoising across DRFs. Comprehensive experiments show that UniPET exhibits comparable performance to individual DRF-specific models at specific DRFs and realizes state-of-the-art performance in universal PET image denoising quantitatively, perceptually, and clinically.

2606.11186 2026-06-10 cs.CV New submission

AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu, Wenqi Shao, Ying Fu

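Affiliations * University of Science and Technology of China

AI summary Proposes AMNet, a unified multimodal framework that uses a Spatial-Spectral Dual-Gated Translator to learn the correspondence between auxiliary modalities and RGB inputs, supporting arbitrary modality combinations at inference and addressing missing auxiliary modalities in low-light video enhancement.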

Comments Accepted at ICML 2026; Project page and code: https://lhfgghc.github.io/LLVE-AMNet

Abstract

Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume that these modalities are available at inference, which is often not feasible in real-world scenarios. To solve this problem, we propose AMNet, a unified multimodal framework for LLVE that supports flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support robust enhancement. Additionally, to facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet can handle arbitrary inference-time modality combinations and exhibits superior LLVE performance when modalities are absent. Code and models are available on the project page.

2606.10280 2026-06-10 eess.IV cs.CV Cross-list

Overlapped Wavelet Diffusion for Low-Light Image Enhancement

Fen Peng, Taizo Suzuki, Seisuke Kyochi

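AI summary Proposes OWDiff, an overlapped wavelet diffusion framework that eliminates blocking artifacts with an overlapped wavelet transform and adds a low-frequency-guided high-frequency enhancement block to recover details, outperforming existing methods on the LOLv1 and LOLv2-real datasets.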

Comments Advance published in IEICE Transactions on Information and Systems. DOI: 10.1587/transinf.2026PCP0006. Code: https://github.com/FinnPeg/Overlapped-Wavelet-Diffusion

Journal ref IEICE Transactions on Information and Systems, Advance online publication, 2026

Abstract

In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.
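
The blocking artifacts that the abstract attributes to the Haar WT follow from a basic property of that transform: each coefficient is computed from a disjoint pair of samples, so nothing couples neighboring blocks. The sketch below is a minimal one-level 1D Haar transform in plain Python, not the OWDiff code. The function names are illustrative, and the overlapped transform (OWT) itself is not reproduced because the abstract does not specify it.

```python
import math


def haar_1d(signal):
    """One-level orthonormal Haar transform of an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    # Each coefficient pair depends only on samples (2k, 2k+1): blocks never overlap.
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def inverse_haar_1d(low, high):
    """Invert haar_1d and recover the original samples."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(low, high):
        signal.append((a + d) * s)
        signal.append((a - d) * s)
    return signal
```

Changing one sample alters only the coefficient pair of its own block, which is why errors introduced in the wavelet domain can surface as visible block boundaries after reconstruction.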

2503.13358 2026-06-10 cs.CV Replacement

One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin

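Affiliations * Kandinsky Lab; Mohamed bin Zayed University of Artificial Intelligence; Luzin Research Center; Moscow Independent Research Institute of Artificial Intelligence; Applied AI Institute

AI summary Proposes RSD, a distillation method that trains a student network so that a fake ResShift model trained on its generated images coincides with the teacher, enabling single-step super-resolution that surpasses the teacher and SinSR on perceptual metrics with fewer parameters and lower computational cost.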

Comments ICML-2026

Abstract

Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a noticeable margin in various perceptual metrics (LPIPS, CLIPIQA, MUSIQ). We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion SR distillation methods with limited computational costs in terms of perceptual quality. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality and requires fewer parameters, GPU memory, and training cost. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K. We provide the code at https://github.com/Daniil-Selikhanovych/RSD.

2501.01481 2026-06-10 eess.IV cs.CV Replacement

Unleashing Correlation and Continuity for Hyperspectral Reconstruction from RGB Images

Fuxiang Feng, Runmin Cong, Shoushui Wei, Yipeng Zhang, Jun Li, Sam Kwong, Wei Zhang

Affiliations * School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Ministry of Education; University of California, Los Angeles; Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences; Lingnan University

AI summary Proposes the Correlation and Continuity Network (CCNet), which combines local spectral correlation modeling (GrSCM), global spectral continuity modeling (NeSCM), and patch-wise adaptive fusion (PAF) to achieve SOTA reconstruction of hyperspectral images from RGB images.

Abstract

Reconstructing Hyperspectral Images (HSI) from RGB images can yield high spatial resolution HSI at a lower cost, demonstrating significant application potential. This paper reveals that local correlation and global continuity of the spectral characteristics are crucial for HSI reconstruction tasks. Therefore, we fully explore these inter-spectral relationships and propose a Correlation and Continuity Network (CCNet) for HSI reconstruction from RGB images. For the correlation of local spectrum, we introduce the Group-wise Spectral Correlation Modeling (GrSCM) module, which efficiently establishes spectral band similarity within a localized range. For the continuity of global spectrum, we design the Neighborhood-wise Spectral Continuity Modeling (NeSCM) module, which employs memory units to recursively model the progressive variation characteristics at the global level. In order to explore the inherent complementarity of these two modules, we design the Patch-wise Adaptive Fusion (PAF) module to efficiently integrate global continuity features into the spectral features in a patch-wise adaptive manner. These innovations enhance the quality of reconstructed HSI. We perform comprehensive comparison and ablation experiments on the mainstream datasets NTIRE2022 and NTIRE2020 for the spectral reconstruction task. Compared to the current advanced spectral reconstruction algorithms, our designed algorithm achieves State-Of-The-Art (SOTA) performance.

10. Robustness, Security, Privacy, and Trustworthy Vision 12 papers

2606.10309 2026-06-10 cs.CV New submission

Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection

Dahye Kim, Jaehyun Choi, Hyun Seok Seong, Seongho Kim, Donghun Lee, Sungwon Yi, Jang-Ho Choi

Affiliations * Korea AI Safety Institute (AISI), ETRI, Seongnam, South Korea; Department of Artificial Intelligence, Sungkyunkwan University, Suwon, South Korea

AI summary To address the bias of AI-generated image detectors toward predicting the real class, proposes DEAR, which uses inpainted images to identify and prune interfering features, improving robustness to unseen generators and post-processing.

Comments 25 pages, 9 figures, 9 tables, Accepted to ICML 2026; includes appendix

Abstract

While existing AI-generated image detectors report high performance, we identify that this is largely driven by a critical prediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model's reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned to either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry. Our code is available at https://github.com/dahyedahye/dear.
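
The pruning rule in the abstract (score each channel by its alignment with the inpaint mask, then drop both extremes) can be sketched in a few lines. This is an illustration under stated assumptions, not the DEAR implementation: the alignment score used here (mean activation inside the mask minus mean activation outside) and the names `alignment_score`, `channels_to_keep`, and `prune_each_side` are hypothetical, and activation maps are flattened to plain lists.

```python
def alignment_score(activation, mask):
    """Assumed score: mean activation inside the inpainted region minus mean outside.

    `activation` and `mask` are flattened maps of equal length; mask values are 0 or 1.
    """
    inside = [a for a, m in zip(activation, mask) if m == 1]
    outside = [a for a, m in zip(activation, mask) if m == 0]
    if not inside or not outside:
        raise ValueError("mask must contain both inpainted and non-inpainted positions")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


def channels_to_keep(activations, mask, prune_each_side=1):
    """Drop the most mask-aligned and most anti-aligned channels, keep the middle."""
    scores = [alignment_score(channel, mask) for channel in activations]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    kept = order[prune_each_side:len(order) - prune_each_side]
    return sorted(kept)
```

With four channels and `prune_each_side=1`, the channel that fires only inside the mask and the channel that fires only outside it are both removed, leaving the two channels whose response does not track the mask.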

2606.10571 2026-06-10 cs.CV cs.AI cs.CR New submission

Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu

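Affiliations * School of Cyber Science and Engineering, Southeast University; Purple Mountain Laboratories; School of Computer Science and Engineering, Southeast University

AI summary Proposes DeBias-Attack, which removes surrogate-specific bias through gradient correction to improve the transferability of adversarial examples across VLP models; experiments confirm its effectiveness across multiple models and tasks.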

Comments 17 pages, 7 figures, 10 tables

Abstract

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.
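
The gradient-correction step described above amounts to a vector projection. The sketch below shows one plausible reading in plain Python: it subtracts the component of the main gradient that lies along the reference gradient, and only when the two gradients point in the same direction (our interpretation of "aligned projection", which the abstract does not define). The function name and the flattened-list representation are assumptions for illustration, not the authors' code.

```python
def remove_aligned_projection(main_grad, ref_grad, eps=1e-12):
    """Return main_grad minus its projection onto ref_grad.

    The projection is removed only when the gradients are positively aligned
    (dot product > 0); this gating is an assumption, not stated in the abstract.
    """
    dot = sum(g * r for g, r in zip(main_grad, ref_grad))
    ref_sq = sum(r * r for r in ref_grad)
    if dot <= 0.0 or ref_sq < eps:
        return list(main_grad)
    scale = dot / ref_sq
    return [g - scale * r for g, r in zip(main_grad, ref_grad)]
```

After the correction, the returned vector is orthogonal to the reference gradient whenever the projection was removed, so the update no longer moves along the direction estimated as surrogate-specific.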

2606.10612 2026-06-10 cs.CV New submission

GaussTrace: Provenance Analysis of 3D Gaussian Splatting Models with Evidence-based LLM Reasoning

Haoliang Han, Ziyuan Luo, Renjie Wan

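Affiliations * University of California, Berkeley; Stanford University

AI summary Proposes GaussTrace, a framework that combines attribute-wise statistical profiling and hypothesis-driven editing simulations with LLM chain-of-thought reasoning to construct directed provenance graphs for 3D Gaussian Splatting models, without requiring training or editing histories.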

Comments Accepted by ICML2026

Abstract

3D Gaussian Splatting (3DGS) is a powerful technique for creating high-fidelity 3D assets. However, the widespread sharing and iterative modification of 3DGS models across digital platforms create pressing challenges for intellectual property protection and forensic traceability. To address this, we propose GaussTrace, a novel framework for constructing directed provenance graphs for 3DGS models. GaussTrace formulates provenance analysis as an evidence-based reasoning problem. It builds upon attribute-wise statistical profiling of 3DGS parameters to capture intrinsic properties. Moreover, we introduce hypothesis-driven editing simulations of common operations to provide auxiliary evidence for plausible transformation pathways. These statistical and simulated cues jointly enable a Large Language Model (LLM) to perform structured Chain-of-Thought (CoT) reasoning, yielding directional provenance inferences and explainable edge reasons. Experimental results demonstrate that GaussTrace effectively constructs evolutionary relationships among diverse 3DGS models, delivering accurate, interpretable, and robust provenance graphs without requiring model training or access to editing histories. Project page: https://haolianghan.github.io/GaussTrace.

2606.10939 2026-06-10 cs.CV New submission

PENet+: A Lightweight Residual Transformer Framework for Efficient Image Steganalysis

Jincheol AN, Dongsu Kim, Haneol Jang, YoungJoon Yoo

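Affiliations * Chung-Ang University; Hanbat National University; SNUAILAB

AI summary Proposes PENet+, which retains the self-attention topology, streamlines the classifier channels, and adopts an activation-aware HPF stem and a MobileNetV2-style backbone, substantially reducing parameters and FLOPs while maintaining detection accuracy.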

Comments IEEE ACCESS

Abstract

Image steganalysis, the detection of hidden information embedded in digital images, is a core component of modern cybersecurity and digital forensics. Recent residual Transformer architectures, such as the Pixel-Difference-Convolution and Enhanced-Transformer-Network (PENet) [1], achieve strong detection accuracy, but their computational and memory demands hinder deployment in resource-constrained settings. We present PENet+, a lightweight steganalysis framework that preserves PENet's discriminative structure while substantially improving efficiency. Rather than redesigning or compressing the attention blocks, we retain PENet's self-attention topology for reproducibility and add a classifier-streamlining stage that progressively narrows the SPP-to-FC1 input channels (SPP: spatial pyramid pooling; FC1: first fully connected layer), yielding large reductions in parameters and FLOPs with negligible accuracy loss. We further refine the high-pass-filter (HPF) stem with an activation-aware mechanism that aggregates HPF responses early and selects a balanced SRM-Gabor top-K subset, and we replace PENet's backbone with a MobileNetV2-style inverted residual network. A balanced configuration with K=31 filters (16 Gabor + 15 SRM) matches or surpasses heavier settings at lower compute. Finally, we motivate PReLU from a steganalysis standpoint, arguing that preserving negative responses helps capture weak stego cues that ReLU suppresses. On a disjoint ALASKA2 JPEG QF90 protocol at 512x512 resolution (5,000 cover images for training, validation, and internal testing; a separate 19,000-cover evaluation set), PENet+ achieves up to 45.5% fewer parameters and about 97% fewer FLOPs than the re-evaluated PENet baseline, offering a computationally efficient direction for resource-constrained steganalysis. Device-level latency and power measurements remain future work.

2606.09881 2026-06-10 cs.LG cs.CR cs.CV Cross-list

Toward Calibrated, Fair, and accurate Deepfake Detection

Ryan Brown, Chris Russell

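Affiliations * University of Oxford

AI summary Proposes the Face-Fairness framework, which uses Face-Feature Tuning to achieve fairness in deepfake detection without demographic labels while maintaining or improving overall accuracy.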

Abstract

Deepfake detectors show large performance gaps across demographic groups. Existing fairness approaches require demographic labels, retraining, or sacrifice accuracy. We introduce Face-Fairness (FF), a plug-and-play framework for bias mitigation. Our primary contribution, Face-Feature Tuning (FFT), is the first demographic label-free fairness method demonstrated for deepfake detection: a lightweight calibrator that performs a logit remapping conditioned on frozen face embeddings. We complement FFT with two variants: FF-Max, which maximizes worst-group accuracy when demographics are available, and FF-Discover, which does the same with embedding-discovered groups. Across in-domain and cross-dataset test settings, FF consistently reduces FPR/TPR gaps and improves minimum group accuracy while maintaining (often improving) overall accuracy. The approach is detector-agnostic, adds negligible runtime overhead, and requires no access to identity attributes.
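
The fairness quantities named in the abstract (minimum group accuracy and the FPR gap across groups) are straightforward to compute once predictions are grouped. The sketch below is a generic metric calculation, not code from the paper. It assumes label 1 means fake and 0 means real, and the function names are illustrative.

```python
from collections import defaultdict


def per_group_metrics(labels, preds, groups):
    """Accuracy and false positive rate per group (label 1 = fake, 0 = real)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, p, g in zip(labels, preds, groups):
        if p == 1:
            counts[g]["tp" if y == 1 else "fp"] += 1
        else:
            counts[g]["tn" if y == 0 else "fn"] += 1
    metrics = {}
    for g, c in counts.items():
        total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        metrics[g] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            # FPR is undefined without real samples; report 0.0 in that case.
            "fpr": c["fp"] / negatives if negatives else 0.0,
        }
    return metrics


def fairness_summary(metrics):
    """Worst-group accuracy and the spread of FPR across groups."""
    accs = [m["accuracy"] for m in metrics.values()]
    fprs = [m["fpr"] for m in metrics.values()]
    return {"min_group_accuracy": min(accs), "fpr_gap": max(fprs) - min(fprs)}
```

A detector can score well on overall accuracy while one group carries most of the false positives, which is the situation these two summary numbers are meant to expose.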

2606.09909 2026-06-10 cs.CR cs.AI cs.CV Cross-list

Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization

Ziang Xu, Wenbo Yu, Hongyao Yu, Hao Fang, Jiawei Kong, Bin Chen, Hao Wu, Shu-Tao Xia, Zhiyong Wu

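Affiliations * Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School

AI summary Proposes Two-Stage Latent Feature Optimization (TS-LFO), an attack that restores the mapping broken by defenses through latent denoising and latent reconstruction stages, effectively bypassing copyright protection in diffusion-model customization.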

Comments accepted by KDD 2026

Abstract

With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent defense strategy to prevent malicious content forgery in personalized image generation. However, current defenses typically introduce persistent perturbations in the latent space of Latent Diffusion Models (LDMs), which remain susceptible to adaptive bypasses by adversaries. In this paper, we introduce Two-Stage Latent Feature Optimization (TS-LFO), an efficient and effective copyright-stealing attack against protected diffusion-based customization. We begin by observing that existing defenses primarily disrupt the mapping between input images and their latent representations, thereby degrading the model's ability to produce personalized outputs. To counteract this, TS-LFO restores the broken mapping through a two-stage optimization process. In the Latent Denoising Stage, we enhance semantic consistency between latent codes and input images by jointly minimizing a Latent-Image Alignment Loss and a Latent Diffusion Loss with timestep-dependent weights, effectively suppressing the high-frequency noise introduced by defenses. In the Latent Reconstruction Stage, we recover low-frequency semantic information using pixel-level constraints to refine the latent features. Extensive experiments show that TS-LFO consistently bypasses state-of-the-art (SOTA) copyright defenses and outperforms SOTA copyright attacks such as DiffPure, GrIDPure and IMPRESS across diverse settings.

2606.10877 2026-06-10 cs.LG cs.CV Cross-list

XtrAIn: Training-Guided Occlusion for Feature Attribution

Thodoris Lymperopoulos, Ioannis Kakogeorgiou, Denia Kanellopoulou

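Affiliations * NCSR Demokritos

AI summary Proposes XtrAIn, which moves the occlusion operation from input space to parameter space and follows the model's training trajectory to measure how feature-associated parameter updates affect the output logits, addressing the bias and instability of conventional occlusion-based attribution.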

Comments 12 pages, 7 figures, 1 table

Abstract

Occlusion-based attribution methods provide an intuitive way to estimate feature importance by perturbing input features and measuring the resulting change in model output. However, their reliability is strongly affected by how feature removal is implemented: externally selected baselines can introduce bias, out-of-distribution samples, and unstable explanations, while in nonlinear models the occlusion of a set of features can also alter the contribution of non-occluded features. We refer to this effect as attribution shift, as the attribution scores of the non-occluded features drift from their initial values. To address these issues, which make explanations unstable, we introduce XtrAIn, a training-guided attribution method that transfers the occlusion operation from the input space to the parameter space. Instead of replacing input values with hand-crafted baselines, XtrAIn follows the model's training trajectory and measures how feature-associated parameter updates affect the output logits. We further introduce Xstep, a lightweight approximation for reducing computational cost, and XtrAIn+, a target-focused variant that emphasizes updates aligned with the target class. Experiments on controlled image datasets and PAM50 breast-cancer subtype classification show that the proposed methods produce cleaner and more interpretable attribution patterns than standard attribution baselines. Overall, XtrAIn provides a training-aware perspective on feature attribution and offers a useful diagnostic tool for studying how feature-level evidence is formed during training.
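
For context, the conventional input-space occlusion that the abstract criticizes can be written in a few lines, which also makes the baseline dependence easy to see. The sketch below is not XtrAIn; `toy_model` is a hypothetical nonlinear function chosen only so that the interaction term makes the scores change with the baseline.

```python
def occlusion_attribution(model, x, baseline):
    """Score each feature by the output drop when it is replaced by its baseline value."""
    full = model(x)
    scores = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        scores.append(full - model(occluded))
    return scores


def toy_model(x):
    # Hypothetical nonlinear model: the x[0] * x[1] interaction couples the two features.
    return x[0] * x[1] + x[2]
```

For the input `[2.0, 3.0, 1.0]`, an all-zeros baseline ranks the first two features equally, while an all-ones baseline ranks the second feature above the first and assigns the third feature no importance at all, so the explanation depends on an externally chosen reference.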

2411.05698 2026-06-10 cs.CV cs.AI cs.LG Replacement

Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification

Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla

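Affiliations * Politecnico di Milano

AI summary Proposes the Visual-TCAV framework, which combines Concept Activation Vectors with Integrated Gradients to generate class-agnostic saliency maps and estimate concept attributions, aligning better with ground-truth explanations than TCAV in a controlled experiment.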

Comments Accepted in TMLR

Abstract

Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is challenging due to the size and complexity of these models. State-of-the-art saliency methods generate local explanations highlighting the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction. On the other hand, concept-based methods, such as TCAV, provide insights into how sensitive the network is to a human-defined concept but cannot compute its attribution in a specific prediction nor show its location within the input image. We introduce Visual-TCAV, a novel explainability framework aiming to bridge the gap between these methods by providing both local and global explanations. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate class-agnostic saliency maps that show where the network recognizes a certain concept. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. We evaluate the method's faithfulness via a controlled experiment where the ground truth for explanations is known, showing better ground truth alignment than TCAV. Our code is available at https://github.com/DataSciencePolimi/Visual-TCAV.
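
Visual-TCAV is described as building on a generalization of Integrated Gradients. As background, the standard method can be sketched with a midpoint Riemann sum over the straight path from a baseline to the input. This is the textbook formulation on a toy function with a hand-written gradient, not the paper's generalized variant; `grad_fn`, `toy_grad`, and the step count are illustrative assumptions.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the model output at `point`.
    """
    avg_grad = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = (k - 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Scale the path-averaged gradient by the input displacement.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_grad(point):
    # Gradient of the hypothetical model f(x) = x[0] ** 2 + 3 * x[1].
    return [2.0 * point[0], 3.0]
```

For `x = [2.0, 1.0]` and a zero baseline, the attributions sum to the difference between the model output at the input and at the baseline, which is the completeness property that makes the method useful for attributing an output to its inputs.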

2601.19210 2026-06-10 cs.CV Replacement

Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP

Sen Nie, Jie Zhang, Zhuo Wang, Shiguang Shan, Xilin Chen

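Affiliations * University of Electronic Science and Technology of China

AI summary Proposes Contrastive Spectral Rectification (CSR), which exploits the feature inconsistency of adversarial examples under frequency attenuation and optimizes a rectification perturbation with a spectral-guided contrastive objective, outperforming prior methods by an average of 18.1% against strong APGD attacks across 16 classification benchmarks with modest inference overhead.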

Comments Accepted by ICML 2026

Abstract

Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model's inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong APGD with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.

2604.06893 2026-06-10 cs.CV cs.LG Updated version

Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

能量正则化的空间遮蔽:一种增强视觉模型鲁棒性和可解释性的新方法

Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah

Affiliations * DAVID Lab, UVSQ, Paris-Saclay University(DAVID实验室,UVSQ,巴黎-萨克雷大学) LIPN, UMR CNRS 7030, Sorbonne Paris Nord University(LIPN,UMR CNRS 7030,索邦巴黎北大学) LISN, Paris-Saclay University(LISN,巴黎-萨克雷大学)

AI summary Proposes the Energy-Regularized Spatial Masking (ERSM) framework, which recasts feature selection as a differentiable energy minimization problem to obtain more robust and interpretable vision models.

Comments 8 pages

Details
AI Chinese abstract

深度卷积神经网络通过密集空间特征图的彻底处理取得了显著性能,但这种暴力策略引入了显著的计算冗余并鼓励依赖于虚假背景相关性。为此,我们提出能量正则化空间遮蔽(ERSM),一种新的框架,将特征选择重新公式化为可微能量最小化问题。通过在标准卷积骨干中嵌入轻量级能量-遮蔽层,每个视觉标记被分配一个由两个竞争力组成的标量能量:内在的Unary重要性成本和Pairwise空间一致性惩罚。不同于以往的剪枝方法,ERSM允许网络自主发现针对每个输入的最佳信息密度平衡。我们验证了ERSM在卷积架构上的有效性,证明其产生新兴稀疏性、改进对结构遮挡的鲁棒性,并产生高度可解释的空间遮蔽,同时保持分类准确性。此外,我们表明所学的能量排名在删除基于鲁棒性测试中显著优于基于幅度的剪枝,揭示ERSM作为一种内在去噪机制,能够在无像素级监督的情况下隔离语义物体区域。

English abstract

Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
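
The unary-plus-pairwise structure described above can be illustrated with a generic mask energy. The abstract does not state ERSM's functional form, so the linear unary term, the squared-difference neighbour penalty, the weight `lam`, and the name `mask_energy` are all assumptions made for this sketch rather than the authors' formulation.

```python
def mask_energy(unary, mask, lam=1.0):
    """Energy of a soft spatial mask on a 2D grid (illustrative form only).

    unary, mask: 2D lists of equal shape; mask values lie in [0, 1].
    """
    h, w = len(mask), len(mask[0])
    energy = 0.0
    for y in range(h):
        for x in range(w):
            # Unary term: cost of keeping this token.
            energy += unary[y][x] * mask[y][x]
            # Pairwise term: penalize disagreement with the right and bottom
            # neighbours, so each neighbouring pair is counted once.
            if x + 1 < w:
                energy += lam * (mask[y][x] - mask[y][x + 1]) ** 2
            if y + 1 < h:
                energy += lam * (mask[y][x] - mask[y + 1][x]) ** 2
    return energy
```

A spatially uniform mask incurs no pairwise penalty, so under this form isolated holes in the mask are discouraged unless the unary saving outweighs them.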

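2606.02224 2026-06-10 cs.CV Updated version

Chroma Clues: Leveraging Color Statistics to Detect Synthetic Images

颜色线索:利用颜色统计检测合成图像

Lea Uhlenbrock, Davide Cozzolino, Christian Riess

Affiliations * Deutsche Forschungsgemeinschaft(德国研究基金会)

AI summary Exploits the weakness of generative models in reproducing natural color statistics. Using six hand-crafted color transformations and a learned, task-optimized color transform, it defines color-sensitive pixel-level or patch-level features that detect synthetic images with high generalization accuracy and robustness.

Details
AI Chinese abstract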

AI合成图像的演变和传播正以前所未有的速度进行。图像生成器在完美模仿自然图像的目标上取得了快速进展,这也挑战了图像取证。在这项工作中,我们利用了当前生成模型中一个未被充分探索的线索,即它们在模仿自然图像的颜色统计方面的弱点。我们首先展示了用于训练图像生成器的LPIPS损失对色度的敏感性低于亮度,这可能导致合成图像颜色的统计差异。基于这一观察,我们随后引入了六种手工设计的颜色变换和一种学习任务优化颜色变换的方法,以统计上暴露生成的图像。这些变换可以以多种方式使用。首先,我们在像素级或块级定义了颜色敏感特征。一个简单、可解释的分类器使用这些特征实现了平均泛化准确率93.27%,并对六种后处理具有强鲁棒性。其次,我们证明了这些变换在自然和合成图像区域中表现出特征性的视觉噪声模式,从而实现直观的视觉图像评估。第三,我们证明了这些变换可以增强生成图像中的颜色模式,以改进多类归因。

English abstract

The evolution and dissemination of AI-synthesized images is occurring at an unprecedented rate. Image generators are making rapid progress in their goal of perfectly imitating natural images, which also challenges image forensics. In this work, we exploit an underexplored cue in current generative models, namely their weakness to imitate color statistics of natural images. We first show that the LPIPS loss used for training image generators is less sensitive to chrominance than to luminance, which may lead to statistical discrepancies in the colors of synthetic images. Building on this observation, we then introduce six hand-crafted color transformations and a method to learn a task-optimized color transform to statistically expose generated images. These transformations can be used in various ways. First, we define color-sensitive features at pixel-level or patch-level. A simple, interpretable classifier achieves with these features an average generalization accuracy of 93.27% and strong robustness against six types of post-processing. Second, we demonstrate that the transformations exhibit characteristic visual noise patterns in natural and synthetic image areas, which enables an intuitive visual image evaluation. Third, we demonstrate that the transforms can enhance color patterns in generated images for improved multiclass attribution.
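
The luminance/chrominance distinction behind the LPIPS observation can be made concrete with a standard color-space conversion. The sketch below uses the ITU-R BT.601 full-range RGB-to-YCbCr formulas (the ones used in JFIF); it only illustrates how chrominance is separated from luminance and is not one of the paper's six hand-crafted transforms. `chroma_variance` is a hypothetical feature name.

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 full-range conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def chroma_variance(pixels):
    """Variance of the Cb and Cr channels over a patch of (r, g, b) pixels."""
    converted = [rgb_to_ycbcr(*p) for p in pixels]

    def var(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    return var([c[1] for c in converted]), var([c[2] for c in converted])
```

A patch of pure grays varies only in luminance, so both chroma variances come out as essentially zero; statistics of this kind are one simple way to look at color independently of brightness.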

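2604.13776 2026-06-10 cs.CY cs.CL cs.CR cs.CV Updated version

Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking

谁被标记?AI内容水印中的多元评估差距

Alexander Nemecek, Osama Zafar, Yuqiao Xu, Wenbiao Li, Erman Ayday

Affiliations * Case Western Reserve University(凯斯西储大学)

AI summary Argues that AI content watermarking can be systematically biased across languages, cultures, and demographic groups. Proposes three evaluation dimensions (cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics) and holds that fairness auditing must precede watermark deployment.

Comments 7 pages. Accepted at the Multimodal Alignment for a Pluralistic Society (MAPS) Workshop, CVPR 2026

Details
AI Chinese abstract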

水印正成为AI内容认证的默认机制,治理政策和框架将其引用为内容溯源的基础设施。然而,在文本、图像和音频模态中,水印信号强度、可检测性和鲁棒性取决于内容本身的统计特性,而这些特性在不同语言、文化视觉传统和人口统计群体间存在系统性差异。我们研究了这种内容依赖性如何产生特定模态的偏差路径。通过回顾各模态的主要水印基准,我们发现除一个例外,没有基准报告跨语言、文化内容类型或人群组的性能。为解决此问题,我们提出了多元水印基准测试的三个具体评估维度:跨语言检测一致性、文化多样性内容覆盖以及检测指标的人口统计分解。我们认为水印是多元对齐管道的一部分,应遵循相同的评估标准。我们将此与当前强制部署水印但未要求公平性评估的治理框架联系起来。我们的立场是评估必须先于部署,并且应用于AI模型的相同偏差审计要求应扩展到验证层。

English abstract

Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We argue that watermarking is part of the pluralistic alignment pipeline and should be held to the same evaluation standards. We connect this to governance frameworks currently mandating watermarking deployment without requiring fairness evaluation. Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer.
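
Demographic disaggregation of detection metrics amounts to reporting the detector's rates per group instead of one pooled number. A minimal sketch, assuming each record carries a group label (for example a language code), whether the content was actually watermarked, and whether the detector flagged it; the record layout and both function names are hypothetical.

```python
from collections import defaultdict


def detection_rate_by_group(records):
    """records: iterable of (group, is_watermarked, flagged) tuples.

    Returns the true-positive rate of the detector for each group.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, is_watermarked, flagged in records:
        if is_watermarked:  # the true-positive rate only looks at watermarked items
            totals[group] += 1
            hits[group] += int(flagged)
    return {g: hits[g] / totals[g] for g in totals}


def parity_gap(rates):
    """Largest difference in detection rate between any two groups."""
    return max(rates.values()) - min(rates.values())
```

A pooled detection rate can look acceptable while one group sits far below the others, which is the kind of gap the proposed evaluation dimensions are meant to surface.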

11. Datasets, Benchmarks, Evaluation, and Training Methods (23 papers)

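2606.09882 2026-06-10 cs.CV cs.LG New submission

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

WHU-Infra3D:面向3D路边基础设施清单的全栈多模态数据集与基准

Chong Liu, Luxuan Fu, Xuyu Feng, Zhen Dong, Bisheng Yang

Affiliations * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS)(信息工程测绘遥感国家重点实验室) Wuhan University(武汉大学)

AI summary Introduces WHU-Infra3D, a multi-modal benchmark dataset covering 53.8 km across three cities. It fuses panoramic imagery with LiDAR point clouds, provides 2D-3D instance association and cross-frame tracking, and supports infrastructure status diagnosis and attribute recognition, filling a gap in datasets for automated maintenance.

Details
AI Chinese abstract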

数字孪生城市的范式正从粗略的视觉映射转向更精确、可操作的城市资产数字化。然而,现有数据集主要关注粗略的视觉感知,缺乏自动化基础设施维护所需的严格多模态对齐和属性及状态诊断。为弥合这一差距,我们引入了WHU-Infra3D,一个大规模、多模态的基准数据集,专门用于路边基础设施清单。覆盖三个城市53.8公里,WHU-Infra3D独特地集成了全景图像和LiDAR点云,并具有严格的2D-3D实例关联和跨帧跟踪。该数据集包含超过17.5万个多视图2D边界框以及数千个3D基础设施实例,提供了超过18.1万个详细的属性和状态注释(例如,锈蚀、遮挡),以支持运行健康评估。我们在五个核心任务上建立了全面的基线:2D检测、2D跨视图匹配、3D地理识别、3D点云分割和属性识别。广泛的评估暴露了当前模型在长尾缺陷状态上的显著跨城市领域差距和固有脆弱性,使WHU-Infra3D成为推进可扩展、AI驱动的城市基础设施清单和生命周期管理的重要试验场。WHU-Infra3D数据集可在以下网址获取:https://github.com/WHU-USI3DV/WHU-Infra3D

English abstract

The paradigm of digital twin cities is shifting from coarse visual mapping toward more precise and actionable digitization of urban assets. However, existing datasets predominantly focus on coarse visual perception, lacking the strict multi-modal alignment and attribute and status diagnosis required for automated infrastructure maintenance. To bridge this gap, we introduce WHU-Infra3D, a large-scale, multi-modal benchmark dataset dedicated to roadside infrastructure inventory. Covering 53.8 km across three cities, WHU-Infra3D uniquely integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking. Comprising over 175k multi-view 2D bounding boxes alongside thousands of 3D infrastructure instances, the dataset provides over 181k detailed attribute and status annotations (e.g., rust, occlusion) to empower operational health assessment. We establish comprehensive baselines across five core tasks: 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition. Extensive evaluations expose significant cross-city domain gaps and inherent vulnerabilities of current models on long-tailed defective statuses, establishing WHU-Infra3D as an essential testbed for advancing scalable, AI-driven urban infrastructure inventory and lifecycle management. The WHU-Infra3D dataset is available at https://github.com/WHU-USI3DV/WHU-Infra3D.

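2606.10066 2026-06-10 cs.CV cs.AI cs.LG New submission

A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

公共医学视觉语言基准中预训练污染的受控审计

Bruce Changlong Xu, Lan Wu, Alexander Ryu

AI summary The audit finds image-side source overlap and a text-side canonical-order exchangeability signal in public medical VLM benchmarks, but manual adjudication does not confirm pixel-level duplicates, and the cohort-relative membership-inference detectors prove unreliable on small medical-VLM cohorts.

Comments 30 pages, 7 figures, 9 tables. Preprint

Details
AI Chinese abstract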

医学视觉语言模型(VLM)在公共基准上进行评估,这些基准的图像和问答对多年来一直可自由下载,但报告准确度假设这些示例在预训练中不存在。我们对SLAKE-En、PathVQA、VQA-RAD以及一个辅助的公共OmniMedVQA镜像上的开放VLM进行了审计,使用了四种检测器系列:图像侧近邻重叠(针对PMC-OA-beta)、规范顺序可交换性、队列相对Min-K%++尾部富集以及跨模型Top-K重叠。我们发现SLAKE-En上存在可测量的图像侧源重叠:SigLIP-B-16标记了19.8%的图像,SigLIP-SO400M标记了4.2%,而域外对照产生0/2000个标记。人工裁定显示,相同模态、相同投影的匹配对应不同患者,而非经过验证的像素级重复,因此我们将其解释为源或分布重叠,而非确认的每图像记忆。在文本侧,Qwen2.5-VL在SLAKE-En上显示出规范顺序可交换性信号,该信号在顺序消融和外部非医学基线中仍然存在。在OmniMedVQA镜像上,五个医学和通用VLM触发了可交换性,而BLIP-2保持干净。相比之下,队列相对Min-K%++尾部富集和跨模型Top-K重叠在外部预域基线中崩溃:BLIP-2重现了明显的正信号,尽管缺乏合理的医学VQA暴露。我们得出结论,这些队列相对检测器作为小规模医学VLM队列上的独立成员推理信号是不可靠的。

English abstract

Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We audit open VLMs on SLAKE-En, PathVQA, VQA-RAD, and an auxiliary public OmniMedVQA mirror using four detector families: image-side near-neighbour overlap against PMC-OA-beta, canonical-order exchangeability, cohort-relative Min-K%++ tail enrichment, and cross-model top-K overlap. We find measurable image-side source overlap on SLAKE-En: 19.8% of images are flagged under SigLIP-B-16 and 4.2% under SigLIP-SO400M, while out-of-domain controls produce 0/2000 flags. Manual adjudication shows same-modality, same-projection matches to different patients rather than verified pixel-level duplicates, so we interpret this as source or distributional overlap rather than confirmed per-image memorization. On the text side, Qwen2.5-VL on SLAKE-En shows a canonical-order exchangeability signal that survives ordering ablation and external non-medical baselines. On the OmniMedVQA mirror, exchangeability fires for five medical and general VLMs while BLIP-2 remains clean. In contrast, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap collapse under an external pre-domain baseline: BLIP-2 reproduces the apparent positive signals despite lacking plausible medical-VQA exposure. We conclude that these cohort-relative detectors are unreliable as standalone membership-inference signals on small medical-VLM cohorts.
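
For readers unfamiliar with the Min-K% family of detectors, the basic score is the mean log-probability of the least likely tokens in a sequence. The sketch below shows only that plain Min-K% idea, assuming per-token log-probabilities are already available; it does not reproduce the Min-K%++ normalization or the cohort-relative tail-enrichment statistic used in the audit, and `min_k_score` is a hypothetical name.

```python
import math


def min_k_score(token_logprobs, k=0.2):
    """Mean log-probability of the fraction k of least likely tokens."""
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

A higher (less negative) score means that even the hardest tokens were predicted confidently, which such detectors read as a hint of prior exposure; the audit's finding is that signals built on this idea did not hold up against an external baseline.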

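2606.10107 2026-06-10 cs.CV q-bio.QM New submission

Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

最大匹配精度:利用全局最优匹配的实例分割评估指标

Kaden Stillwagon, Alexandra D. VandeLoo, Craig R. Forest

AI summary Proposes Maximum Matching Accuracy (MMA), which uses globally optimal one-to-one matching and per-pixel normalization to overcome the discontinuity, low sensitivity, and non-optimal matching of existing metrics in cell segmentation evaluation, yielding more stable, sensitive, and interpretable scores.

Details
AI Chinese abstract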

可靠评估实例分割模型需要准确且一致反映分割质量的指标。然而,生物成像中最广泛使用的指标存在根本性的数学缺陷:硬交并比阈值导致不连续、低灵敏度的评分;逐对象归一化在对象大小变化下扭曲分数;以及贪婪或一对多匹配过程产生非最优、顺序依赖的对应关系。这些特性共同导致在常见失败模式(如细胞分裂、细胞合并和细胞边界不精确)下产生不直观且不可靠的模型排名。我们提出最大匹配精度(MMA),一种无阈值连续指标,它找到预测对象与真实对象之间的全局最优一对一匹配,并使用逐像素归一化聚合总重叠。我们在三个实验(合成失败案例、渐进式破坏测试和模型排名比较)中评估MMA与AP@50、PQ、SEG和AJI。MMA产生的分数比现有替代方案更稳定、更敏感、更可解释,为生物细胞成像中的公平实例分割基准测试提供了原则性基础。

English abstract

Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or one-to-many matching procedures that yield non-optimal, order-dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold-free continuous metric that finds a globally optimal one-to-one matching between predicted and ground truth objects and aggregates total overlap using per-pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.
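
To make the matching step concrete, here is a minimal sketch of a score built on a globally optimal one-to-one matching. The abstract does not give MMA's exact formula, so the normalization used here (total matched intersection divided by the number of pixels covered by any predicted or ground-truth object) is an assumption, and `matching_score` is a hypothetical name. A brute-force search over permutations stands in for an assignment solver such as SciPy's `linear_sum_assignment` and is only practical for a handful of objects.

```python
from itertools import permutations


def matching_score(pred, gt):
    """pred, gt: lists of sets of pixel coordinates, one set per object."""
    # Pad the shorter list with empty objects so surplus objects stay unmatched.
    n = max(len(pred), len(gt))
    p = pred + [set()] * (n - len(pred))
    g = gt + [set()] * (n - len(gt))
    # Intersection size for every (prediction, ground truth) pair.
    inter = [[len(a & b) for b in g] for a in p]
    # Globally optimal one-to-one matching: maximize the total intersection.
    best = max(
        sum(inter[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    # Assumed per-pixel normalization: every pixel covered by any object.
    covered = set().union(*p, *g)
    return best / len(covered) if covered else 1.0
```

With a single four-pixel cell that the model splits into two halves, only one half can be matched, so the score drops to 0.5 instead of jumping across a hard IoU threshold.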

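2606.10136 2026-06-10 cs.CV New submission

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

iSAGE: 一种通过稀疏点监督进行遥感语义分割的人机协同框架

Osmar Luiz Ferreira de Carvalho, Osmar Abilio de Carvalho Junior, Anesmar Olino de Albuquerque, Daniel Guerreiro e Silva

AI summary Proposes the iSAGE framework, in which experts click on model-error pixels rather than arbitrary pixels, matching dense supervision without auxiliary machinery. On the BsB Aerial and ISPRS Vaihingen datasets it reaches performance comparable to dense supervision at extremely low annotation rates.

Comments 47 pages, 8 tables, 6 figures

Details
AI Chinese abstract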

遥感中的语义分割需要昂贵的像素级标注,且由于模型很少能在传感器、平台或地理区域间迁移,几乎每个问题都需要新的数据集。现有的人机协同框架通过辅助机制(伪标签、传播、CRF、基础模型提示、辅助头)将稀疏点击扩展为密集监督,这些机制均基于模型的预测分布。在该分布中,一个自信的错误像素与一个自信的正确像素在结构上无法区分,因此任何读取该分布的规则都无法区分两者;区分信号位于模型外部。本文假设,专家针对模型错误(而非任意像素)的点击足以匹配密集监督,无需扩展机制。iSAGE(基于专家指导的迭代稀疏标注)在一个集成的开源平台上实现了这一假设,其中错误加权损失放大了每次点击的梯度,而标注记录本身即为数据集,可扩展、可纠正、可审计。实验采用最小努力策略:每帧每类最多一个标注像素。在BsB Aerial上,iSAGE恢复了密集监督的97.2%(在0.040%的像素上达到74.79% mIoU),并呈现出对比性的类别动态:无定形类别(渗透区域)从种子点开始饱和,而小类别(汽车)需要后期迭代的努力。在ISPRS Vaihingen(外部基准)上,iSAGE以0.011%的像素达到76.78% mIoU,匹配密集基线(76.65%)并超越所有已发表方法。在相同流程下,四种输出读取机制(预算1-100倍的oracle熵、阈值0.90-0.99的伪标签、基于CRF的传播、均匀随机)比iSAGE低7.4至14.5个百分点。在调查的31种方法中,iSAGE是唯一无需辅助机制即可运行的迭代式人机协同框架。

English abstract

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1-100x, pseudo-labels across thresholds 0.90-0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.
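
Sparse point supervision with an error-weighted loss can be sketched as a cross-entropy evaluated only at clicked pixels. The abstract does not specify the weighting, so the rule below (a fixed larger weight when the click contradicts the model's current prediction) and the names `sparse_click_loss` and `error_weight` are assumptions for illustration.

```python
import math


def sparse_click_loss(probs, clicks, error_weight=5.0):
    """Weighted cross-entropy over clicked pixels only.

    probs: dict mapping pixel -> list of class probabilities.
    clicks: dict mapping pixel -> expert label; unclicked pixels contribute nothing.
    """
    total, weight_sum = 0.0, 0.0
    for pixel, label in clicks.items():
        p = probs[pixel]
        predicted = max(range(len(p)), key=p.__getitem__)
        # Assumed weighting: clicks that correct a model error count more.
        w = error_weight if predicted != label else 1.0
        total += -w * math.log(p[label])
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

Because unclicked pixels never enter the sum, the supervision signal comes entirely from the expert and not from the model's own predictive distribution, which is the point the abstract makes about auxiliary machinery.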

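2606.10142 2026-06-10 cs.CV New submission

DB-3DME: From Dataset to Benchmark for Human-aligned Automatic 3D Mesh Evaluation

DB-3DME:从数据集到基准测试,实现与人类对齐的自动3D网格评估

Nanshan Jia, Zhenyu Zhao, Sui Huang, Jingshen Wang, Zeyu Zheng

AI summary Introduces the DB-3DME dataset and benchmark, comprising 2,619 synthetic 3D meshes with human ratings. Fine-tuning the visual encoder of a VLM improves its evaluation performance, substantially outperforming existing pre-trained models.

Comments CVPR 2026 workshop paper. 10 pages, 3 figures, 6 tables. Dataset available at GitHub and Hugging Face

Details
AI Chinese abstract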

近年来3D生成的进展在真实性、可控性和效率上取得了显著提升,但3D资产的评估仍未被充分探索。现有的评估范式,包括人工评估、学习指标和视觉语言模型(VLM)作为评判者,在成本、可扩展性、分辨率处理或任务特定对齐方面存在局限性。在这项工作中,我们专注于3D网格评估,并引入了DB-3DME,即用于3D网格评估的数据集和基准。DB-3DME包含2,619个合成3D网格,并配有关于几何和提示遵从性的人工评分。利用该数据集,我们系统地基准测试了最先进的VLM,并发现3D表示的视觉编码是与人类对齐的评估性能的关键因素。受此发现启发,我们通过调整视觉编码器同时冻结语言模型,微调了一个开放权重的VLM——Qwen-2.5-VL-7B,用于3D网格评估。微调后的模型在多个评估维度上显著优于现有的预训练VLM,为自动3D网格评估建立了新的基准。我们在GitHub和Hugging Face上公开发布了基准数据集,以促进未来的研究。

English abstract

Recent advances in 3D generation have led to substantial improvements in realism, controllability, and efficiency, yet the evaluation of 3D assets remains underexplored. Existing evaluation paradigms, including human evaluation, learned metrics, and vision-language models (VLMs) as judges, suffer from limitations in cost, scalability, resolution handling, or task-specific alignment. In this work, we focus on 3D mesh evaluation and introduce DB-3DME, the Dataset and Benchmark for 3D Mesh Evaluation. DB-3DME contains 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. Using this dataset, we systematically benchmark state-of-the-art VLMs and identify visual encoding of 3D representations as a key factor for human-aligned evaluation performance. Motivated by this finding, we fine-tune an open-weight VLM, Qwen-2.5-VL-7B, for 3D mesh evaluation by adapting the visual encoder while freezing the language model. The fine-tuned model substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions, establishing a new benchmark for automatic 3D mesh evaluation. We publicly release the benchmark dataset on GitHub and Hugging Face to facilitate future research.

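2606.10174 2026-06-10 cs.CV New submission

A Large Scale Open-Source Image and Video Dataset for Robust Wildfire Detection and Classification

用于鲁棒野火检测与分类的大规模开源图像和视频数据集

Emadeldeen Hamdan, Yingyi Luo, B. Ugur Toreyin, Erdem Koyuncu, Adam J. Watts, Ugur Gudukbay, Ahmet Enis Cetin

AI summary Introduces GWFP, a large-scale open-source wildfire image and video dataset, and benchmarks multiple convolutional and transformer-based architectures together with the HTE-ResNet approach, reporting strong cross-dataset generalization.

Details
AI Chinese abstract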

野火检测与监测对于减缓火势蔓延和减少环境及基础设施损害至关重要。本文介绍了GWFP(全球野火预防数据集),这是一个大规模、开源的野火图像和视频数据集,旨在支持早期火灾和烟雾检测研究。GWFP包含地理多样化的野火场景,包括火焰、烟雾、水雾/雾环境条件、近红外(NIR)图像、余烬以及从全球真实场景中收集的具有挑战性的负样本。为了评估数据集的鲁棒性和跨域泛化能力,我们在域内和跨数据集设置下对多种卷积和基于Transformer的架构进行了基准测试。此外,我们探索了使用Hadamard增强残差连接(HTE-ResNet)的轻量级频率-空间特征交互,以分析域偏移条件下的表示鲁棒性。实验结果表明,该方法在真实世界野火监测应用中具有强大的跨数据集泛化能力和实用价值。数据集和源代码将在接收后公开发布。

English abstract

Wildfire detection and monitoring are critical for mitigating fire spread and reducing environmental and infrastructural damage. In this work, we introduce GWFP (Global Wildfire Prevention Dataset), a large-scale, open-source dataset of wildfire images and videos designed to support early fire and smoke detection research. GWFP contains geographically diverse wildfire scenes, including flames, smoke, Waterdog/Fog environmental conditions, Near Infrared (NIR) imagery, Ember, and challenging negative samples collected from real-world scenarios worldwide. To evaluate dataset robustness and cross-domain generalization, we benchmark multiple convolutional and transformer-based architectures across both in-domain and cross-dataset settings. Additionally, we explore lightweight frequency-spatial feature interaction using Hadamard-enhanced residual connections (HTE-ResNet) to analyze representation robustness under domain-shift conditions. Experimental results demonstrate strong cross-dataset generalization and practical utility for real-world wildfire monitoring applications. The dataset and source code will be publicly released upon acceptance.

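2606.10196 2026-06-10 cs.CV cs.AI New submission

Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

Fisher引导的自适应微调渐进参数选择

Ghodsiyeh Rostami, Po-Han Chen, Mahdi S. Hosseini

AI summary Proposes the FisherAdapTune framework, which progressively selects parameter groups by tracking the temporal drift of their Fisher geometry and freezes stabilized groups to reduce the generalization error bound while preserving the remaining adaptation dynamics. It improves in-distribution performance and zero-shot transfer on segmentation tasks.

Details
AI Chinese abstract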

参数高效微调(PEFT)旨在使用少量可训练参数子集来适应预训练模型,然而,现有大多数方法从固定的架构启发式中选择该子集,而不是使用动态的、任务感知的标准。我们引入了FisherAdapTune,一个Fisher引导的自适应微调框架,通过追踪参数组Fisher几何的时间漂移来渐进选择参数组。从微调的PAC-Bayesian视角出发,我们将泛化误差界分解为Fisher加权更新成本,并表明曲率贡献已稳定的参数组可以被冻结,以减少误差界而不中断剩余的适应动态。FisherAdapTune使用连续Fisher分布之间的尺度不变Jensen-Shannon距离来制定这一标准,从而产生一个自适应的活动参数集。我们在下游分割任务上评估了我们的方法,结果表明FisherAdapTune在多种设置下提升了分布内性能和零样本迁移,验证了Fisher结构漂移是高效、任务感知适应的有用信号。我们公开发布了代码(https://github.com/AtlasAnalyticsLab/FisherAdapTune),以促进我们提出方法的进一步应用。

English abstract

Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset; however, most existing methods choose this subset from fixed architectural heuristics rather than using dynamic, task-aware criteria. We introduce FisherAdapTune, a Fisher-guided Adaptive Fine-Tuning framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry. Starting from a PAC-Bayesian view of fine-tuning, we decompose the generalization error bound into Fisher-weighted update costs and show that parameter groups whose curvature contribution has stabilized can be frozen to reduce the error bound without interrupting the remaining adaptation dynamics. FisherAdapTune formulates this criterion with a scale-invariant Jensen-Shannon distance between consecutive Fisher distributions, yielding an adaptive active parameter set. We evaluate our approach on a downstream segmentation task, and results show FisherAdapTune improves the in-distribution performance and zero-shot transfer in multiple settings, validating that Fisher structural drift is a useful signal for efficient, task-aware adaptation. We release our code publicly at https://github.com/AtlasAnalyticsLab/FisherAdapTune to enable further application of our proposed approach.
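
The freezing criterion can be illustrated with a small sketch. It assumes that each parameter group's Fisher information is summarized as a non-negative vector, that normalizing this vector into a distribution is what makes the comparison scale-invariant, and that a group counts as stable once the Jensen-Shannon distance between consecutive snapshots falls below a threshold. The function names and the threshold value are hypothetical and not taken from the released code.

```python
import math


def normalize(values):
    """Turn non-negative Fisher values into a probability distribution."""
    total = sum(values)
    return [v / total for v in values]


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    # Clamp at zero to absorb tiny negative values from rounding.
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def stable_groups(prev, curr, threshold=0.05):
    """Names of groups whose Fisher distribution has stopped drifting."""
    return [
        name for name in curr
        if js_distance(normalize(prev[name]), normalize(curr[name])) < threshold
    ]
```

In this sketch a group whose Fisher values are merely rescaled between snapshots has zero drift and becomes a freezing candidate, while a group whose mass moves to different parameters stays active.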

2606.10488 2026-06-10 cs.CV New submission

5% > 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning

5% > 100%: 平坦性偏好是您进行多模态参数高效微调所需的一切

Yifan Zhu, Can Lin, Hangjie Yuan, Zixiang Zhao, Pengfei Zhang, Tao Feng, Zhonghong Ou

Affiliations * Beijing University of Posts and Telecommunications(北京邮电大学) Zhejiang University(浙江大学) ETH Zürich(苏黎世联邦理工学院) Anhui University of Science and Technology(安徽理工大学) Tsinghua University(清华大学) State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室)

AI summary Reveals a flatness preference common to parameter-efficient fine-tuning methods, in which a small fraction of sharp dimensions dominates generalization, and proposes FlatPO to flatten these dimensions and improve generalization.

Details
AI Chinese abstract

参数高效微调(PEFT)方法为将大型模型适应特定领域的多模态下游任务提供了一种简化和高效的工具。尽管这些方法在实践中证明了其实际效果,但其主要方面仍未得到充分探索。因此,我们仍然对各种PEFT方法中的潜在泛化机制以及如何进一步增强它们感到好奇。在本文中,我们揭示了各种PEFT中普遍存在的平坦性偏好,其中一小部分尖锐维度主导了PEFT的泛化。这一发现暗示了一种有吸引力的可能性:我们可能只需关注这一小部分尖锐维度而非所有维度,就能获得更好的泛化。此外,我们提出了平坦性偏好优化(FlatPO)来平坦化这些关键的尖锐维度,使各种PEFT朝向更好的泛化。大量实验证明了我们的发现和所提方法的有效性。代码可在以下网址获取:https://github.com/Can-Lin/FlatPO

English abstract

Parameter-Efficient Fine-Tuning (PEFT) methods provide a streamlined and efficient tool for adapting large models to domain-specific multimodal downstream tasks. Although these methods have proved effective in practice, their underlying principles remain under-explored. We therefore ask what generalization mechanisms underlie the various PEFT methods and how they can be further enhanced. In this paper, we reveal the flatness preference widely present in various PEFTs, where a small fraction of sharp dimensions dominates the generalization of PEFT. This finding suggests an appealing possibility: we may obtain better generalization by attending only to this small fraction of sharp dimensions instead of all of them. Furthermore, we propose Flatness Preference Optimization (FlatPO) to flatten these key sharp dimensions, leading various PEFTs toward better generalization. Extensive experiments demonstrate the effectiveness of our findings and the proposed method. Code is available at https://github.com/Can-Lin/FlatPO.

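2606.10620 2026-06-10 cs.CV cs.AI New submission

Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

图像模型能想象时间吗?ImageTime:通过时空一致性探究视觉世界建模的新基准

Xinrui Wu, Lichen Huang

Affiliations * University of Science and Technology of China(中国科学技术大学)

AI summary Introduces the ImageTime benchmark, which uses a four-keyframe protocol (initial state, action onset, transition state, final state) to evaluate the spatiotemporal consistency of image generation models, revealing where they succeed and fail at maintaining coherent visual world states.

Details
AI Chinese abstract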

图像生成模型现在能够生成高质量的静态图像,但它们表示视觉世界随时间变化的能力仍然知之甚少。实际工作流程如故事板、逐步插图、参考引导编辑和视频预可视化要求模型在多个视觉状态之间保持身份、对象、空间关系和因果顺序。现有评估主要衡量单图像正确性、组合对齐或视频质量,而未明确图像模型是否能连贯地想象一个时间有序的过程。我们引入ImageTime,一个诊断基准,使用时空一致性作为图像生成中视觉世界建模的行为探针。给定一个动作指令,以及可选地指定初始状态的参考图像,模型必须生成一张包含四个有序关键状态的图像:初始状态、动作开始、过渡状态和最终状态。这个四关键帧协议比单图像生成在时间上要求更高,同时避免了密集视频动态的混淆。ImageTime通过渐进能力层次组织任务,并将每个场景分解为阶段状态谓词、跨帧时间约束和禁止的因果违规。GPT-5.5在结构化的VLM-as-judge协议下对所有生成的图像进行评分,产生可解释的能力分数、诊断子分数和失败标签。通过多家族基准测试,ImageTime揭示了当前图像生成系统在要求随时间维持连贯视觉世界状态时成功、失败和漂移的地方。

English abstract

Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

2606.10666 2026-06-10 cs.CV cs.DB New submission

Analyzing Training-Free Corruption Detection for Object Detection Datasets

分析目标检测数据集的无训练损坏检测

Christian Sieberichs, Simon Geerkens, Thomas Waschulzik, Viswanathan Ramesh, Alexander Braun

Affiliations * University of Applied Sciences Düsseldorf(杜塞尔多夫应用科学大学) Siemens Mobility GmbH(西门子交通有限公司) Goethe University Frankfurt(法兰克福大学)

AI summary Studies training-free feature-space methods for detecting annotation errors in object detection datasets, finding that they reliably expose semantic errors while positional errors remain difficult to detect.

Comments Accepted at DataCV Workshop, Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Details
AI Chinese abstract

注释错误在计算机视觉数据集中普遍存在,并且会显著降低在其上训练的系统的性能,特别是在目标检测等复杂任务中。存在多种识别注释错误的方法,包括无训练的特征空间方法,这些方法提供了一种快速且可解释的方式来分析注释。然而,对于包含语义和空间信息的目标检测注释,其行为在很大程度上仍未探索。在这项工作中,我们分析了基于特征空间的方法在检测目标检测数据集中的注释错误时的适用性。通过调整现有的特征空间方法,我们表明此类方法可靠地暴露语义错误,而位置错误仍然难以检测。我们使用VOC2012和KITTI,在多个预训练嵌入模型、合成噪声类型(对称、非对称和位置)以及真实世界注释错误上评估了这种行为。所有代码和真实世界损坏数据均可在以下仓库公开获取:https://github.com/ChristianSieberichs/BoundingBox_corruption_detection

English abstract

Annotation errors are widespread in computer vision datasets and can significantly degrade the performance of systems trained on them, particularly in complex tasks such as object detection. Several approaches exist to identify annotation errors, including training-free feature-space methods, which provide a fast and interpretable way to analyze annotations. However, their behavior on object detection annotations, which include semantic and spatial information, remains largely unexplored. In this work, we analyze the applicability of feature-space-based approaches for detecting annotation errors in object detection datasets. By adapting an existing feature-space method, we show that such approaches reliably expose semantic mislabels, while positional errors remain difficult to detect. We evaluate this behavior across multiple pretrained embedding models, synthetic noise types (symmetric, asymmetric, and positional), and real-world annotation errors using VOC2012 and KITTI. All code and real-world corruptions are publicly available at the following repository: https://github.com/ChristianSieberichs/BoundingBox_corruption_detection
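
A training-free feature-space check can be sketched as a nearest-neighbour vote: a box is flagged when its class label disagrees with the majority label of its closest neighbours in embedding space. This is a generic neighbour-disagreement heuristic chosen for illustration, not necessarily the method adapted in the paper; the embeddings are assumed to come from some pretrained model, and `flag_suspicious_labels` is a hypothetical name.

```python
import math
from collections import Counter


def flag_suspicious_labels(embeddings, labels, k=3):
    """Indices whose label differs from the majority label of their k nearest
    neighbours (Euclidean distance) in feature space."""
    flagged = []
    for i, e in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(e, embeddings[j]))
        votes = Counter(labels[j] for j in others[:k])
        if not votes:  # a single sample has no neighbours to compare against
            continue
        majority, _ = votes.most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged
```

A check like this only looks at what the crop contains, so a box with the right class but a shifted position can still land next to its correct neighbours, which fits the reported difficulty with positional errors.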

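2606.10790 2026-06-10 cs.CV New submission

A Multimodal RGB and Events Dataset for Hand Detection in First-Person View

第一人称视角下用于手部检测的多模态RGB和事件数据集

Bharghav Kota, Yulia Sandamirskaya

Affiliations * Zurich University of Applied Sciences(苏黎世应用科技大学)

AI summary To address the motion blur of conventional cameras on moving robotic systems in low light, proposes multi-modal hand detection that combines event-based and RGB cameras, and uses a synthetic event dataset to reach performance comparable to existing methods.

Details
AI Chinese abstract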

现有的手部检测算法基于图像工作,检测率受限于相机的帧率。在移动机器人系统的手部检测应用中,传统相机会导致运动模糊,尤其是在较暗的光照条件下。我们可以利用事件相机,它具有高动态范围、高时间分辨率和低功耗的特点。最近的研究表明,使用事件相机和帧相机的立体设置可以提高检测精度和带宽-延迟权衡。在目标检测和识别任务中使用事件相机的主要瓶颈是训练数据量相对较少。在这项工作中,我们提出了一种方法以及一个从自我中心、第一人称视角合成的示例性事件手部数据集。数据使用v2e工具箱从现有的RGB Egohands数据集合成。通过改变v2e工具箱的参数,提供不同光照条件和尺度的数据集版本。使用微调后的YOLOv8模型生成地面真值检测,该模型应用于Egohands数据集中的RGB图像,并在高时间分辨率事件上进行插值。我们使用多模态数据集,利用现有的使用事件和RGB相机多模态设置的目标检测算法进行手部检测,并展示了与最先进方法相当的性能。

English abstract

Existing hand detection algorithms work on images and the detection rate is restricted by the frame rate of the camera. In hand detection applications for moving robotic systems, conventional cameras cause motion blur, especially in darker lighting conditions. We can leverage the use of event-based cameras which possess a high dynamic range, high temporal resolution, and low power consumption. Recent work has shown that using a stereo setup of an event-based and a frame-based camera improves detection accuracy and the bandwidth-latency tradeoff. The main bottleneck in using event-based cameras in object detection and recognition tasks is a relatively low amount of training data. In this work, we propose a methodology and an exemplary synthetic event-based hand dataset from an egocentric, first-person view perspective. The data is synthesized from the existing RGB Egohands dataset with the v2e toolbox. Parameters of the v2e toolbox are varied to provide versions of the dataset with different lighting conditions and scales. Ground truth detections are generated with a fine-tuned YOLOv8 model which is applied to the RGB images in the Egohands dataset and interpolated on the high-temporal resolution events. We use the multi-modal dataset to perform hand detection with existing object detection algorithms which use a multi-modal setup of event and RGB cameras and demonstrate performance comparable to the state-of-the-art.
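
The principle behind synthesizing events from RGB video can be sketched in a few lines: an event fires at a pixel when its log intensity changes by more than a contrast threshold between frames. This is a heavily simplified illustration of the idea, without the noise model, timestamp interpolation, or multiple events per pixel that a tool such as v2e provides; `frames_to_events` and its parameters are hypothetical.

```python
import math


def frames_to_events(prev_frame, next_frame, threshold=0.2, eps=1e-3):
    """Return (x, y, polarity) events for pixels whose log intensity changed
    by at least `threshold` between two grayscale frames (values in [0, 1])."""
    events = []
    for y, (row_a, row_b) in enumerate(zip(prev_frame, next_frame)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            # eps keeps the logarithm defined for fully black pixels.
            delta = math.log(b + eps) - math.log(a + eps)
            if abs(delta) >= threshold:
                events.append((x, y, 1 if delta > 0 else -1))
    return events
```

Working in log intensity means the same relative brightness change triggers an event in dark and bright regions alike, which is one reason event data stays informative in poor lighting.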

2606.10894 2026-06-10 cs.CV New submission

The 1st PortraitCraft Challenge: A CVPR 2026 Workshop Competition on Portrait Composition Understanding and Generation

首届PortraitCraft挑战赛:CVPR 2026研讨会肖像构图理解与生成竞赛

Zijie Lou, Youyun Tang, Xiaochao Qu, Haoxiang Li, Ting Liu, Luoqi Liu, Xun Zhu, Zheng Zhang, Xi Chen, Miao Li, Ji Wu, Dizhe Zhang, Xian Ge, Sujia Wang, Ruiyang Zhang, Jiaming Wang, Xianshun Wang, Lu Qi, Boao Kang, Wei Zhou, Jinghui Sun, Zhenyu Yan, Jiliang Zhao, Rui Yang, Yipo Huang, Boyuan Liu, Shanglin Li, Zifan Xie, Yichen Zhang, Anlan Wang, Wenfeng Lin, Mingyu Guo, Dong Li, Xinghao Wang, Yanting Li, Shanzhao Tong, Shuai He, Qiu Zhou, Yongqi Yang, Taoyang Mu, Dianqiao Lei, Anlong Ming, Huadong Ma

发表机构 * CVPR 2026

AI总结 提出PortraitCraft挑战赛,包含构图理解与生成两个赛道,并发布约5万张肖像数据集,推动肖像美学与可控图像生成研究。

详情
AI中文摘要

本文介绍了首届PortraitCraft挑战赛的概况,该挑战赛是CVPR 2026的官方竞赛之一。挑战赛聚焦于肖像构图理解与生成,旨在推动AI在肖像美学分析和可控图像合成方面的研究。与主要关注全局美学评分的现有数据集和任务不同,PortraitCraft引入了一个统一的评估框架,包含两个互补赛道。赛道1要求模型进行结构化肖像构图理解,赛道2要求模型在显式构图约束下从结构化构图描述生成肖像图像。为支持该挑战赛,我们构建并公开发布了一个大规模肖像构图数据集,包含约50,000张精心策划的真实肖像图像,提供多级监督。本报告描述了挑战赛设置、评估协议、数据集组成和最终结果,并分析了提交方案的技术特点。PortraitCraft挑战赛为肖像构图理解与生成研究提供了一个标准化和可复现的平台,有望推动肖像美学和可控图像生成领域的进一步发展。

英文摘要

This paper presents an overview of the inaugural PortraitCraft Challenge, held as one of the official competitions at CVPR 2026. The challenge focuses on portrait composition understanding and generation, aiming to advance AI research in portrait aesthetics analysis and controllable image synthesis. Unlike existing datasets and tasks that primarily focus on global aesthetic scoring, PortraitCraft introduces a unified evaluation framework comprising two complementary tracks. Track 1 requires models to perform structured portrait composition understanding, and Track 2 requires models to generate portrait images from structured composition descriptions under explicit compositional constraints. To support the challenge, we constructed and publicly released a large-scale portrait composition dataset consisting of approximately 50,000 curated real portrait images, providing multi-level supervision. This report describes the challenge setup, evaluation protocols, dataset composition, and final results, along with an analysis of the technical characteristics of the submitted solutions. The PortraitCraft Challenge provides a standardized and reproducible platform for research on portrait composition understanding and generation, and is expected to foster further progress in the fields of portrait aesthetics and controllable image generation.

2606.10905 2026-06-10 cs.CV 新提交

Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

超越模型规模:通过训练小模型探究视觉上下文学习中的差距

Sunil Khatri, Steven Landgraf, Markus Ulrich, Simon Reiß

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 通过训练仅1百万参数的小模型,挑战大规模视觉上下文学习模型,揭示任务编码、预训练任务和评估指标方面的适应性能力测量差距。

详情
AI中文摘要

视觉上下文学习(VICL)旨在推动自适应视觉模型的发展,使其能够基于少量示例在测试时适应新任务。受自然语言处理研究中上下文学习历史的影响,当前VICL方法通常采用大规模模型和数据扩展作为关键要素。然而,这些要素是否是视觉模型形成上下文学习能力的关键尚不清楚。为了对这类大模型进行压力测试,我们用一个极端的反例挑战它们:我们训练了一个仅含1百万参数和7万张图像的微小视觉上下文模型。我们将这个容量严重受限的小模型与7000倍大的VICL模型在不同自适应设置下进行比较:(1)具有小分布偏移的图像数据,(2)未见过的任务编码,以及(3)全新的任务,即VICL所设想的场景。由于小模型和大模型之间训练资源的巨大差距,我们的实验展示了在任务编码方式、预训练中使用的任务以及评估指标选择方面,自适应能力测量存在的不足。当前VICL基准测试中的这些差距凸显了在自适应能力评估方面进行创新的必要性。

英文摘要

Visual in-Context Learning (VICL) aims at making progress towards adaptive vision models that can, based on a few examples, adapt to a new task at test time. Following the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is to treat model and data scaling as key ingredients. Yet it is not clear whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely $1$ million parameters and a modest amount of $70,000$ images. We compare the results of this severely capacity-capped tiny model to $7,000\times$ larger VICL models in different adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments showcase shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in the evaluation of adaptive capabilities.

2606.10967 2026-06-10 cs.CV 新提交

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

视觉上下文学习何去何从?跨领域与任务的统一基准

Pradnya Halady, Jiale Wei, Zdravko Marinov, Alexander Jaus, Simon Reiß

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 针对视觉上下文学习评估局限于预训练镜像任务的问题,构建跨领域和任务的统一基准VIBE,在14个数据集和12个任务上测试6个模型,揭示其适应能力、局限性及失败模式。

详情
AI中文摘要

视觉上下文学习被提出作为通向动态模型的一种途径,这类模型可以根据提供的上下文生成预测,从而在测试时适应新的视觉任务。然而,对这些模型适应能力的评估一直局限于狭窄的设置,这些设置主要照搬预训练中的任务或图像领域,因而并不需要真正的适应。我们通过构建一个广泛的视觉上下文基准(VIBE),重点关注多样化的成像领域和广泛的任务,来解决这一差距。借此,我们能够更清晰地了解视觉上下文模型在面对新的图像和任务分布时的适应能力。我们在14个数据集和12个任务上对六个模型进行了压力测试(总共探索了106个数据集-任务组合),并在统一的、可复现的评估协议下,以单样本(one-shot)设置进行比较。我们的评估揭示了视觉上下文学习现状的关键见解,包括局限性、系统性失败模式和有前景的方向。为了促进更广泛的评估,我们将公开发布我们的VIBE工具包。

英文摘要

Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby adapt to new vision tasks at test time. Yet the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we are able to get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress-test six models on $14$ datasets and $12$ tasks (in total, we explore $106$ dataset-task combinations) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

2606.11129 2026-06-10 cs.CV 新提交

WorldOlympiad: Can Your World Model Survive a Triathlon?

WorldOlympiad:你的世界模型能经受铁人三项考验吗?

Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang

发表机构 * Zhejiang University(浙江大学) DAMO Academy, Alibaba Group(阿里巴巴达摩院) The Hong Kong University of Science and Technology(香港科技大学) Monash University(莫纳什大学) TRE, Alibaba Group(阿里巴巴TRE)

AI总结 提出WorldOlympiad基准,从物理忠实性、几何一致性和交互保真度三个维度诊断视频世界模型,揭示现有模型在物理推理、3D一致性和长程交互方面的显著不足。

Comments Project Page: https://alibaba-damo-academy.github.io/WorldOlympiad/, Code: https://github.com/alibaba-damo-academy/WorldOlympiad

详情
AI中文摘要

我们介绍WorldOlympiad,一个用于诊断基于视频的世界模型在物理忠实性、几何一致性和交互保真度方面的基准。现有基准通常关注视觉质量、语义对齐或短期时间一致性,但很少能洞察生成视频是否遵循物理规则、保持连贯的3D结构以及支持长程可控交互。为弥补这一空白,WorldOlympiad将世界模型评估分解为三个互补维度。物理赛道使用对象分割和MLLM作为评判者,评估生成视频是否遵循力学、热现象和材料属性中的可解释规则。几何赛道通过高斯泼溅重建生成视频,评估结构一致性、跨视角连贯性和相机轨迹对齐。交互赛道评估生成序列是否遵循复杂动作提示并在连续视频块间保持平滑连贯的过渡。WorldOlympiad进一步涵盖三个主要下游场景,包括游戏、机器人和通用真实世界视频,捕捉从交互控制、具身操作到开放域运动和相机动态的多样化挑战。这些赛道和场景共同构成了一个可扩展且可解释的评估套件,揭示了超越通用视频质量的失败模式。对最先进模型的实验揭示了物理推理、3D一致性和长程交互方面的显著差距,强调了为生成式世界模型制定更结构化评估协议的必要性。

英文摘要

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

2606.10255 2026-06-10 eess.IV cs.CV cs.DL cs.LG physics.bio-ph 交叉投稿

POPSICLE: Benchmark Datasets for Segmentation and Localization in CryoET

POPSICLE: 用于冷冻电镜断层扫描中分割和定位的基准数据集

Jonathan Schwartz, Utz Heinrich Ermel, C. Braxton Owens, Zhuowen Zhao, Ariana Peck, Gus L. W. Hart, Grant J. Jensen, Bridget Carragher, Dari Kimanius

发表机构 * Biohub Brigham Young University

AI总结 提出POPSICLE基准套件,基于CryoET数据门户构建,涵盖真核和原核系统、纯化与原位样本,支持体素分割和稀疏定位任务,旨在解决冷冻电镜断层扫描中缺乏标准化基准的问题。

详情
AI中文摘要

冷冻电镜断层扫描(cryoET)通过直接可视化完整细胞内的分子结构,将分子架构与细胞组织在天然环境中联系起来,已成为结构和细胞生物学中的强大工具。然而,实现cryoET的全部潜力日益依赖于计算分析,特别是机器学习(ML)的进步,以解释其复杂且信息丰富的数据。尽管进展迅速,cryoET的ML开发仍受限于缺乏标准化、良好注释的基准。现有评估通常规模小、任务特定且孤立构建,限制了方法间的稳健比较。在此,我们提出POPSICLE,一个基于CryoET数据门户(一个开放、ML就绪的断层数据、元数据和注释库)构建的cryoET分割和大分子定位基准套件。POPSICLE涵盖真核和原核系统、纯化和完全原位样本,以及密集体素分割和稀疏定位任务。基于动态数据资源,它可随着新数据集和注释的出现而扩展。基线实验揭示了模型排名在不同任务间的显著变化,强调了需要针对cryoET独特特征定制的基准,而非从相邻生物医学成像领域借鉴的评估实践。因此,POPSICLE为cryoET中可重复的ML评估提供了开放且可扩展的基础。

英文摘要

Cryo-electron tomography (cryoET) has emerged as a powerful tool in structural and cellular biology by enabling direct visualization of macromolecular structures within intact cells, thereby linking molecular architecture to cellular organization in a native context. Realizing the full potential of cryoET, however, increasingly depends on advances in computational analysis, particularly machine learning (ML), to interpret its complex and information-rich data. Despite rapid progress, ML development for cryoET remains bottlenecked by the lack of standardized, well-annotated benchmarks. Existing evaluations are typically small, task-specific, and assembled in isolation, limiting robust comparisons across methods. Here, we present POPSICLE, a benchmark suite for cryoET segmentation and macromolecular localization built from the CryoET Data Portal, an open, ML-ready repository of tomographic data, metadata, and annotations. POPSICLE spans eukaryotic and prokaryotic systems, both purified and fully in situ samples, and dense voxel-wise segmentation as well as sparse localization tasks. Built on a living data resource, it can expand as new datasets and annotations become available. Baseline experiments reveal substantial variation in model rankings across tasks, underscoring the need for benchmarks tailored to the unique characteristics of cryoET rather than evaluation practices adapted from adjacent biomedical imaging domains. POPSICLE thus provides an open and extensible foundation for reproducible ML evaluation in cryoET.

2504.03118 2026-06-10 cs.CV cs.AI 版本更新

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

NuWa: 为边缘设备导出轻量级类别特定视觉Transformer

Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Hai Jin, Yun Yang

发表机构 * National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab(大数据技术与系统国家工程研究中心、服务计算技术与系统实验室、集群与网格计算实验室) Swinburne University of Technology(斯威本科技大学) Deakin University(迪金大学)

AI总结 针对边缘设备只需识别特定类别的问题,提出NuWa方法,通过自知识净化去除有害权重,并利用闭式优化高效导出紧凑ViT,无需重训练即可提升类别精度并加速推理。

Comments Accepted at CVPR 2026

详情
AI中文摘要

视觉Transformer(ViT)通常需要压缩以部署在资源受限的边缘设备(如无人机和智能车辆)上。然而,现有的模型压缩方法忽略了许多边缘设备仅需特定类别的知识用于其应用。因此,导出的全类别ViT保留了冗余知识,在这些类别上表现次优。我们发现,简单地将校准数据集替换为类别特定数据不足以解决此问题,因为这些方法面临两个根本限制。首先,它们忽略了存在对类别有害的权重,这些权重干扰特化,而移除它们可以提升类别特定性能。其次,目标类别的多样性和边缘设备的资源约束需要大量定制模型。现有方法耗时且计算成本高,因此不可扩展。在这项工作中,我们提出NuWa,一种成本高效的方法,通过从基础ViT导出小型ViT来应对这些挑战,适用于具有特定类别需求的边缘设备。NuWa执行自知识净化以剪除对类别有害的权重,并通过闭式优化高效导出紧凑ViT。无需剪枝后重训练,导出的边缘ViT在类别特定精度上超越基础ViT,并加速推理。综合实验表明,NuWa在类别特定任务上比最先进的无训练剪枝方法精度高出高达29.00%。与性能最佳的依赖训练剪枝方法相比,NuWa实现了33.69倍的剪枝加速,并将剪枝成本降低高达99.83%,平均精度损失仅为0.61%。项目页面:https://github.com/CGCL-codes/NuWa。

英文摘要

Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69x pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss. Project Page: https://github.com/CGCL-codes/NuWa.

2505.11034 2026-06-10 cs.CV cs.AI cs.LG 版本更新

CleanPatrick: A Benchmark for Image Data Cleaning

CleanPatrick: 图像数据清洗基准

Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly

发表机构 * University of Basel(巴塞尔大学) Lucerne University of Applied Sciences and Arts(卢塞恩应用科学大学) University Hospital of Basel(巴塞尔大学医院) Northwestern University(西北大学) Northeast Dermatology Associates(东北皮肤科诊所) Medical University of Vienna(维也纳医科大学) Banner Health(Banner健康系统)

AI总结 提出首个大规模图像数据清洗基准CleanPatrick,基于Fitzpatrick17k皮肤病数据集,收集大量众包标注并采用项目反应理论聚合,将问题检测形式化为排序任务,评估多种方法。

Comments Accepted at Journal of Data-centric Machine Learning Research (DMLR)

详情
AI中文摘要

鲁棒的机器学习依赖于干净的数据,然而当前的图像数据清洗基准依赖于合成噪声或狭窄的人类研究,限制了比较和现实相关性。我们引入CleanPatrick,这是图像领域首个大规模数据清洗基准,基于公开的Fitzpatrick17k皮肤病学数据集构建。我们收集了来自933名医学众包工作者的496,377个二元标注,识别出离题样本(4%)、近似重复(21%)和标签错误(32%),并采用受项目反应理论启发的聚合模型,随后经过专家审查以获得高质量的真实标签。CleanPatrick将问题检测形式化为排序任务,并采用反映真实审计流程的标准排序指标。我们基准测试了经典异常检测器、感知哈希、SSIM、Confident Learning、NoiseRank、FINE、BHN和SelfClean。在CleanPatrick上,自监督表示在近似重复检测方面表现出色,经典方法在受限审查预算下实现了有竞争力的离题检测,而在保守的人类判断下检测不合理标签对于细粒度医学分类仍然具有挑战性。通过发布数据集和评估框架,CleanPatrick使得图像清洗策略的系统比较成为可能。

英文摘要

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (32%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and employs standard ranking metrics that mirror real audit workflows. We benchmark classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, FINE, BHN, and SelfClean. On CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and detecting implausible labels under conservative human judgment remains challenging for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies.
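The abstract frames issue detection as a ranking task scored with "standard ranking metrics" but does not name them here. As an illustration only, the sketch below computes two common choices, average precision and precision at a review budget k; treating these as the benchmark's metrics is an assumption, and the function names are hypothetical.

```python
def average_precision(scores, is_issue):
    """Average precision of a ranking: higher score = more likely an issue.

    scores   : per-sample issue scores from a cleaning method
    is_issue : per-sample ground-truth booleans
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if is_issue[i]:
            hits += 1
            precision_sum += hits / rank   # precision at this recall point
    return precision_sum / hits if hits else 0.0


def precision_at_k(scores, is_issue, k):
    """Fraction of true issues among the k highest-scored samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_issue[i] for i in order[:k]) / k


# Toy ranking: issues sit at ranks 1 and 3 out of 4 samples.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_labels = [True, False, True, False]
toy_ap = average_precision(toy_scores, toy_labels)
toy_p_at_2 = precision_at_k(toy_scores, toy_labels, 2)
```

Precision at k corresponds to an auditor who can inspect only the top k flagged samples, which is the constrained-review-budget situation the abstract refers to.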

2512.11995 2026-06-10 cs.CV cs.AI cs.LG 版本更新

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

V-REX: 通过问题链进行探索性视觉推理的基准测试

Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou

发表机构 * University of Maryland, College Park(马里兰大学学院市分校) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出V-REX基准,通过问题链将多步探索推理分解为规划和遵循能力,评估视觉语言模型在复杂开放任务中的表现。

Comments 28 pages

详情
AI中文摘要

尽管许多视觉语言模型(VLM)被开发用于回答定义明确、目标高度具体的简单问题(如大多数基准测试所示),但在实践中,它们通常难以处理复杂的开放式任务,这些任务通常需要在视觉空间中进行多轮探索和推理。这种视觉思维路径不仅像AI侦探一样提供逐步探索和验证,还能对最终答案产生更好的解释。然而,由于中间步骤的探索空间巨大,这些路径难以评估。为弥补这一差距,我们开发了一个评估套件“多步探索视觉推理(V-REX)”,它由一个具有挑战性的视觉推理任务基准和一个评估协议组成。V-REX涵盖了跨不同领域的丰富应用场景。V-REX将多步探索推理转化为问题链(CoQ),并解耦了VLM的能力:(1)规划:通过选择一系列探索性问题来分解开放式任务;(2)遵循:顺序回答精心策划的CoQ以收集信息,从而推导出最终答案。通过每步策划有限的问题和答案选项,V-REX实现了对中间步骤的可靠定量和细粒度分析。通过评估最先进的专有和开源VLM,我们揭示了持续的扩展趋势、规划与遵循能力之间的显著差异,以及多步探索推理中巨大的改进空间。

英文摘要

While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capabilities into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing state-of-the-art proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.

2601.22763 2026-06-10 cs.CV 版本更新

Is Task-Specific Training Necessary for Anomaly Detection?

异常检测是否需要任务特定训练?

Xingwu Zhang, Guanxuan Li, Paul Henderson, Gerardo Aragon-Camarasa, Zijun Long

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出基于检索的异常检测框架RAD,无需任务特定训练,通过多级检索匹配记忆库中的无异常特征,在多个基准上达到最优性能,挑战了任务特定训练的必要性。

详情
AI中文摘要

当前最先进的多类无监督异常检测(MUAD)方法依赖于训练编码器-解码器模型来重建无异常特征。然而,我们认为这种任务特定训练在分布偏移下成本高昂,并且基于重建的残差评分进一步面临保真度-稳定性困境。现有的免训练替代方案在MUAD中仍然容易受到跨类别和跨区域不匹配的影响。受这些限制的启发,我们提出了基于检索的异常检测(RAD),一种无需任务特定训练的框架,它将无异常特征存储在记忆库中,并通过多级检索检测异常,将测试补丁与记忆库进行匹配。实验表明,RAD在四个既定基准(MVTec-AD、VisA、Real-IAD、3D-ADAM)的标准和少样本设置下均达到了最先进的性能。在MVTec-AD上,RAD仅使用单个无异常图像即可达到96.7%的像素AUROC,而RAD的全数据性能为98.5%。这些发现共同推翻了MUAD需要任务特定训练的假设,表明最先进的异常检测可以通过免训练的基于记忆的检索实现。我们的代码可在 https://github.com/longkukuhi/RAD 获取。

英文摘要

Current state-of-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder-decoder models to reconstruct anomaly-free features. However, we argue that such task-specific training is costly under distribution shifts, and that reconstruction-based residual scoring further faces a fidelity-stability dilemma. Existing training-free alternatives, in turn, remain prone to cross-category and cross-region mismatches in MUAD. Motivated by these limitations, we propose Retrieval-based Anomaly Detection (RAD), a framework that needs no task-specific training: it stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image, compared with 98.5% for RAD in the full-data setting. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with training-free memory-based retrieval. Our code is available at https://github.com/longkukuhi/RAD.
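The core idea of memory-based retrieval can be illustrated with a single-level nearest-neighbour lookup. This is a deliberately simplified sketch, not RAD's multi-level retrieval: the cosine-similarity scoring, the function names, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math


def _unit(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def patch_anomaly_scores(memory, patches):
    """Score each test patch by its distance to the closest memory entry.

    memory  : anomaly-free feature vectors (the memory bank)
    patches : feature vectors of test patches
    Returns 1 - max cosine similarity per patch, so 0 means an exact
    directional match and larger values mean more anomalous.
    """
    bank = [_unit(m) for m in memory]
    scores = []
    for patch in patches:
        query = _unit(patch)
        best = max(sum(a * b for a, b in zip(query, m)) for m in bank)
        scores.append(1.0 - best)
    return scores


# Toy memory with two "normal" directions; the second patch matches neither.
toy_memory = [[1.0, 0.0], [0.0, 1.0]]
toy_patches = [[2.0, 0.0], [1.0, 1.0]]
toy_scores = patch_anomaly_scores(toy_memory, toy_patches)
```

No parameters are trained here, which is the property the abstract emphasises. In practice the features would come from a frozen pretrained backbone, and the brute-force search would be replaced by an indexed one.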

2602.09809 2026-06-10 cs.CV 版本更新

SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

SciFlow-Bench:通过逆解析评估结构感知的科学图表生成

Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Huawei Cloud BU(华为云业务部) Zhongguancun Academy(中关村学院) Beijing Key Laboratory of Data Intelligence and Security (Peking University)(北京数据智能与安全重点实验室(北京大学))

AI总结 提出SciFlow-Bench基准,通过逆解析将生成的图表图像转换为结构化图进行比较,以结构可恢复性而非视觉相似性评估科学图表生成。

详情
AI中文摘要

科学图表传达显式的结构信息,然而现代文本到图像模型通常生成视觉上合理但结构错误的结果。现有基准要么依赖图像中心或主观指标,对结构不敏感,要么评估中间符号表示而非最终渲染图像,导致基于像素的图表生成研究不足。我们引入SciFlow-Bench,一个结构优先的基准,用于直接从像素级输出评估科学图表生成。基于真实科学PDF构建,SciFlow-Bench将每个源框架图与规范真值图配对,并在闭环往返协议下将模型作为黑盒图像生成器进行评估,该协议将生成的图表图像逆解析回结构化图以进行比较。该设计通过结构可恢复性而非仅视觉相似性进行强制评估,并由一个协调规划、感知和结构推理的分层多智能体系统实现。实验表明,保持结构正确性仍然是一个基本挑战,特别是对于具有复杂拓扑的图表,强调了结构感知评估的必要性。

英文摘要

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.

2409.02426 2026-06-10 cs.LG cs.CV 版本更新

Breaking the Curse of Dimensionality: Diffusion Models Efficiently Learn Low-Dimensional Distributions

打破维度诅咒:扩散模型高效学习低维分布

Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出新数学框架,证明扩散模型通过等价于子空间聚类,能以线性于内在维度的样本复杂度学习低维分布,避免维度诅咒。

Comments 37 pages, 8 figures, 2 tables

详情
AI中文摘要

尽管扩散模型在广泛的生成任务中取得了经验上的成功,但其学习数据分布能力的基本原理仍不清楚。在这项工作中,我们开发了一个新的数学框架,解释了扩散模型如何能够从有限数量的训练样本中有效学习低维分布,而不受维度诅咒的影响。具体来说,受图像数据内在低维结构的启发,我们在理论上分析了一个数据分布被建模为低秩高斯混合的场景。在合适的网络参数化下,我们表明优化扩散模型的训练目标等价于在训练样本上解决经典子空间聚类问题,其中每个子空间基对应于一个高斯分量的低秩协方差。这种等价性使我们能够证明,学习底层分布的样本复杂度与数据的内在维度呈线性关系,而不是与环境维度呈指数关系。我们的理论发现得到了经验证据的进一步支持,这些证据展示了在合成和真实世界图像数据集上的泛化相变现象。此外,我们建立了学习到的子空间基与图像数据语义属性之间的对应关系,为可控图像生成提供了原则性基础。

英文摘要

Despite their empirical success across a wide range of generative tasks, the fundamental principles underlying the ability of diffusion models to learn data distributions are poorly understood. In this work, we develop a new mathematical framework that explains how diffusion models can effectively learn low-dimensional distributions from a finite number of training samples without suffering from the curse of dimensionality. Specifically, motivated by the intrinsic low-dimensional structure of image data, we theoretically analyze a setting in which the data distribution is modeled as a mixture of low-rank Gaussians. Under suitable network parameterization, we show that optimizing the training objective of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples, where each subspace basis corresponds to the low-rank covariance of a Gaussian component. This equivalence allows us to show that the sample complexity for learning the underlying distribution scales linearly with the intrinsic dimension of the data, rather than exponentially with the ambient dimension. Our theoretical findings are further supported by empirical evidence that demonstrates phase transition phenomena in generalization on both synthetic and real-world image datasets. Moreover, we establish a correspondence between the learned subspace bases and semantic attributes of image data, providing a principled foundation for controllable image generation.

2411.02817 2026-06-10 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs

条件 Vendi 分数:生成式 AI 模型和 LLM 的提示感知多样性评估

Mohammad Jalali, Azim Ospanov, Amin Gohari, Farzan Farnia

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong(计算机科学与工程系,香港中文大学) Department of Information Engineering, The Chinese University of Hong Kong(信息工程系,香港中文大学)

AI总结 针对文本提示引导的生成模型,提出条件 Vendi 和条件 RKE 分数,通过条件熵分离模型自身多样性,并证明收敛性及在多个任务中恢复真实多样性排序。

详情
AI中文摘要

由文本提示引导的生成模型在保真度和提示对齐方面被广泛评估,但其产生多样化输出的能力仍未被充分探索。现有的多样性指标(如基于核矩阵的 von Neumann 和 Rényi 熵的 Vendi 和 RKE)是为无条件模型开发的,无法区分提示引起的变异和模型引起的变异。我们通过引入 Conditional-Vendi 和 Conditional-RKE 来解决这一差距,这些多样性度量源自正半定矩阵的条件熵。这些分数在提示引导生成中分离出模型引起的多样性,其中 Conditional-RKE 具有 $O(1/\sqrt{n})$ 的收敛速度。对于 Conditional-Vendi,我们引入了一种截断谱近似,产生可扩展且一致的估计。在文本到图像、图像字幕和 LLM 任务上的实验表明,条件分数恢复了真实多样性排序,并且还可以引导扩散模型生成更多样化的样本。代码库可从 https://github.com/mjalali/conditional-vendi 获取。

英文摘要

Generative models guided by text prompts are widely evaluated for fidelity and prompt alignment, yet their ability to produce diverse outputs remains underexplored. Existing diversity metrics such as Vendi and RKE, which are based on the von Neumann and Rényi entropies of kernel matrices, were developed for unconditional models and cannot distinguish prompt-induced from model-induced variability. We address this gap by introducing Conditional-Vendi and Conditional-RKE, diversity measures derived from the conditional entropy of positive semidefinite matrices. These scores isolate model-induced diversity in prompt-guided generation, with Conditional-RKE enjoying an $O(1/\sqrt{n})$ convergence rate. For Conditional-Vendi, we introduce a truncated-spectrum approximation that yields scalable and consistent estimates. Experiments on text-to-image, image-captioning, and LLM tasks show that the conditional scores recover ground-truth diversity orderings and can also guide diffusion models toward more diverse samples. The codebase is available at https://github.com/mjalali/conditional-vendi.
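For orientation, the unconditional order-2 RKE score that the conditional variants build on can be computed without an eigendecomposition: for a kernel with unit diagonal it equals the inverse squared Frobenius norm of K/n. The sketch below shows only that unconditional baseline. The Gaussian kernel, the bandwidth value, and the function names are assumptions, and the paper's conditional scores are not reproduced here.

```python
import math


def gaussian_kernel(samples, bandwidth):
    """Gaussian kernel matrix with unit diagonal (kernel choice is an assumption)."""
    n = len(samples)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
                      / (2.0 * bandwidth ** 2))
             for j in range(n)]
            for i in range(n)]


def rke_score(kernel):
    """Order-2 Renyi kernel entropy score: 1 / ||K / n||_F^2.

    With a unit-diagonal kernel this ranges from 1 (all samples identical)
    to n (all samples mutually orthogonal in feature space).
    """
    n = len(kernel)
    frob_sq = sum((kernel[i][j] / n) ** 2 for i in range(n) for j in range(n))
    return 1.0 / frob_sq


# Two tight clusters far apart behave like roughly two modes.
points = [[0.0], [0.01], [10.0], [10.01]]
two_mode_score = rke_score(gaussian_kernel(points, bandwidth=0.5))
```

The Vendi score replaces the order-2 entropy with the von Neumann entropy of the eigenvalues of K/n, which requires an eigensolver and is omitted here to keep the sketch dependency-free.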

12. 其他/综合视觉 22 篇

2606.06709 2026-06-10 cs.CV 新提交

USU-Corn-WeedDB: A UAV RGB Image Dataset for Multi-Species Weed Detection in Forage Corn

USU-Corn-WeedDB:用于饲料玉米多物种杂草检测的无人机RGB图像数据集

Utsav Bhandari, Saroj Burlakoti, Rhonda Miller, Sierra Young, Eric Westra, Aaron Etienne

发表机构 * Department of Applied Sciences, Technology, and Education, Utah State University(应用科学、技术和教育系,犹他州立大学) Department of Plants, Soils & Climate, Utah State University(植物、土壤与气候系,犹他州立大学) Department of Civil and Environmental Engineering, Utah State University(土木与环境工程系,犹他州立大学)

AI总结 为解决饲料玉米生产中杂草检测数据集稀缺问题,构建了USU-Corn-WeedDB无人机RGB图像数据集,包含三种杂草的10539个标注实例和8000张未标注图像,并验证了多种YOLO及RT-DETR模型的检测性能。

Comments 8 pages, 4 figures, 1 table

详情
AI中文摘要

饲料玉米生产中的杂草压力导致产量损失高达31.5%,然而基于无人机图像和深度学习的特定地点杂草管理(SSWM)系统仍受限于缺乏田间代表性训练数据集。我们提出了USU-Corn-WeedDB,这是一个公开可用的无人机RGB图像数据集,采集自犹他州Cache Valley的一个商业饲料玉米田,旨在支持有监督和半监督学习框架下的多类别杂草检测。RGB图像于2025年6月27日使用Autel EVO II Dual 640T V2无人机在距地面约10米高度采集,地面采样距离约为0.48厘米/像素。总共366张全分辨率图像被切分为8800个640×640像素的图块。其中,800张图像被手动标注了三种杂草:藜(Chenopodium album)、反枝苋(Amaranthus retroflexus)和狗尾草(Setaria viridis),共包含10539个边界框实例,其余8000个图块作为未标注池用于半监督实验。该数据集反映了自然的类别不平衡,其中反枝苋占标注实例的53.86%,这是有意保留以模拟真实田间条件。为验证数据集实用性,我们在相同条件下训练了28个目标检测模型,涵盖YOLOv8、YOLOv9、YOLOv10、YOLO11、YOLO26和RT-DETR五个架构家族,未进行超参数调优。测试集mAP@0.5范围为0.773至0.840,轻量级模型取得了与边缘部署无人机系统相关的竞争性能。USU-Corn-WeedDB公开于 https://doi.org/10.5281/zenodo.20044178。

英文摘要

Weed pressure in forage corn production causes yield losses of up to 31.5%, yet site-specific weed management (SSWM) systems built on UAV imagery and deep learning remain constrained by the scarcity of field-representative training datasets. We present USU-Corn-WeedDB, a publicly available UAV RGB image dataset collected from a commercial forage corn field in Cache Valley, Utah, designed to support multi-class weed detection under both supervised and semi-supervised learning frameworks. RGB imagery was acquired on 27 June 2025 using an Autel EVO II Dual 640T V2 drone at ~10 m above ground level, yielding a ground sampling distance of approximately 0.48 cm/pixel. A total of 366 full-resolution images were tiled into 8,800 patches at 640 x 640-pixel resolution. Of these, 800 images were manually annotated for three weed species: common lambsquarters (Chenopodium album), redroot pigweed (Amaranthus retroflexus), and green foxtail (Setaria viridis), comprising 10,539 bounding-box instances, with the remaining 8,000 tiles retained as an unlabeled pool for semi-supervised experiments. This dataset reflects a natural class imbalance in which redroot pigweed constitutes 53.86% of annotated instances, which was preserved intentionally to mirror real field conditions. To validate dataset utility, we trained 28 object detection models spanning five architecture families including YOLOv8, YOLOv9, YOLOv10, YOLO11, YOLO26, and RT-DETR under identical conditions without hyperparameter tuning. Test set mAP@0.5 ranged from 0.773 to 0.840, with lightweight models achieving competitive performance relevant to edge-deployed UAV systems. USU-Corn-WeedDB is publicly available at https://doi.org/10.5281/zenodo.20044178.
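The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above; the derived quantities (tile footprint, approximate pigweed count, mean instances per labeled tile) are computed here for illustration and are not reported by the authors.

```python
# Figures taken from the abstract.
labeled_tiles = 800
unlabeled_tiles = 8_000
tile_side_px = 640
gsd_cm_per_px = 0.48          # approximate ground sampling distance
annotated_instances = 10_539
pigweed_share = 0.5386        # redroot pigweed fraction of instances

# Labeled and unlabeled tiles should add up to the stated 8,800 patches.
total_tiles = labeled_tiles + unlabeled_tiles

# Ground footprint of one tile side, in metres (derived, approximate).
tile_side_m = tile_side_px * gsd_cm_per_px / 100

# Approximate redroot pigweed instance count implied by the share (derived).
pigweed_instances = round(annotated_instances * pigweed_share)

# Mean annotated instances per labeled tile (derived).
mean_instances_per_tile = annotated_instances / labeled_tiles
```

Each 640-pixel tile therefore covers roughly 3 m on a side, and a labeled tile holds about 13 boxes on average, which gives a sense of the object density a detector faces on this data.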

2606.10431 2026-06-10 cs.CV cs.AI 新提交

Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

视觉辅助的基础模型解决多任务车辆路径问题

Shuangchun Gui, Zhiguang Cao, Wen Song, Yew-Soon Ong

发表机构 * School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院) Institute of Marine Science and Technology, Shandong University(山东大学海洋科学与技术研究院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Centre for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research(新加坡科技研究局高性能计算研究所前沿人工智能研究中心)

AI总结 提出视觉辅助基础模型VaFM,通过将约束编码为图像并融合图节点嵌入,同时解决16种VRP变体,在复杂约束变体上超越现有方法。

Comments Accepted by TNNLS

详情
AI中文摘要

多任务车辆路径问题在提升各行业和服务部门效率中扮演关键角色。这些问题包含多种变体,在满足多样化客户约束的同时优化路径成本。现有的多任务VRP求解器仅利用基于图的模态,限制了其处理多约束变体的能力。作为表示复杂语义的格式,视觉模态在编码多样VRP约束方面展现出巨大潜力。这促使我们从视觉图像中学习补丁级语义,然后将其集成到基于图的模型中,以同时解决多种VRP变体。然而,直接将此方法应用于多任务VRP面临三个挑战:1)现有VRP图像缺乏约束表示,这对多任务VRP至关重要;2)单个补丁的固定感受野无法有效适应不同任务的需求;3)约束间像素分布不平衡可能导致模型忽略像素较少的约束。本文提出视觉辅助基础模型(VaFM)以应对这些挑战。在视觉模态中,针对所有约束定制的输入图像由卷积神经网络编码。获得的补丁嵌入与基于图的节点融合以生成解,并设计辅助任务解决像素不平衡问题。VaFM的性能在16种不同VRP变体上进行了评估。实验结果表明,VaFM优于最先进的方法,尤其是在具有复杂约束的变体上。

英文摘要

Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format to represent complex semantics, vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from the vision images, and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalanced issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.

2606.10811 2026-06-10 cs.CV 新提交

Deep learning for echo sounder data

深度学习用于回声测深仪数据

Ketil Malde

发表机构 * Ketil Malde

AI总结 本文探讨深度学习在声学数据(如回声图)中的应用,指出由于声学数据特性,需开发专用方法而非简单复用图像处理模型,并强调缺乏标准数据集和格式是主要障碍。

详情
AI中文摘要

毫无疑问,在过去十年中,机器学习领域的技术已经彻底改变了我们处理和解释数据的方式,尤其是图像和文本。对于水下观测,声学是主要的信息来源,自然地,深度学习方法已被应用于回声图和其他声学数据,但迄今为止成果相当有限。在此,我们认为,由于声学数据的固有特性,重大进展可能需要研究超越简单复用图像处理模型和技术的深度学习方法。目前,方法开发的突破潜力受到缺乏标准数据格式和组织方式的阻碍,更甚的是缺乏具有既定性能目标的现成高质量数据集。为了推动该领域的发展,这些不足应得到纠正。

英文摘要

There is no doubt that over the last decade, techniques from the field of machine learning have revolutionized how we process and interpret data, especially images and text. For underwater observations, acoustics is a primary source of information, and naturally, deep learning methods have been applied to echograms and other acoustic data, but so far with rather modest results. Here, we argue that due to intrinsic properties of acoustic data, substantial advances will likely require research into deep learning methods beyond mere recycling of models and techniques from image processing. Currently, the potential for breakthroughs in method development is hindered by the lack of standard data formats and organization, and even more by the lack of readily available, high-quality data sets with established performance goals. To advance the field, these shortcomings should be remedied.

2606.10874 2026-06-10 cs.CV math.QA quant-ph 新提交

Schmidt Decomposition-Based Methods for Efficient Quantum Image Encoding

基于Schmidt分解的高效量子图像编码方法

Ana-Maria Pangeva, Yassine Ferhi, Alexander Geng, Andreas Weinmann, Desislava Ivanova, Ali Moghiseh

AI总结 针对量子图像编码在NISQ设备上电路复杂度过高的问题,提出基于Schmidt分解的低秩近似方法,在保持图像质量的同时显著降低电路深度和门数量,FRQI模型实现97%的深度缩减且MSE仅约0.27。

详情
AI中文摘要

在量子图像处理中,一个基本步骤是将经典图像数据编码为量子态。这可以通过诸如量子图像的灵活表示(FRQI)、量子概率图像编码(QPIE)和新颖增强量子表示(NEQR)等方法实现。然而,在真实量子硬件上,这些编码会迅速导致电路具有大量门、大电路深度和高量子比特使用量,这对于嘈杂中等规模量子(NISQ)设备来说是一个问题。在这项工作中,我们研究了通过Schmidt分解公式化的低秩状态近似是否有助于降低这种复杂性。该方法仅保留量子态纠缠结构中最显著的部分,使状态准备更高效,同时保留大部分图像信息。我们比较了三种编码技术在其原始形式和低秩近似下的性能,评估了电路深度、CNOT计数、MSE和重建图像的视觉质量等指标。结果揭示了准确性与资源效率之间有意义的权衡,其中FRQI模型实现了97%的电路深度缩减,同时保持了近乎完美的重建(MSE约为0.27)。这证明了低秩技术在近期硬件上推进实用量子图像处理的潜力。

英文摘要

In quantum image processing, a fundamental step is encoding classical image data into quantum states. This can be achieved using methods such as Flexible Representation of Quantum Images (FRQI), Quantum Probability Image Encoding (QPIE), and Novel Enhanced Quantum Representation (NEQR). However, on real quantum hardware, these encodings can quickly lead to circuits with many gates, large circuit depth, and high qubit usage, which is a problem for Noisy Intermediate-Scale Quantum (NISQ) devices. In this work, we investigate whether low-rank state approximation, formulated via Schmidt decomposition, can help reduce this complexity. The method keeps only the most significant parts of a quantum state's entanglement structure, making state preparation more efficient while preserving most of the image information. We compare the three encoding techniques in their original form and with low-rank approximation, evaluating metrics such as circuit depth, CNOT count, MSE, and visual quality of reconstructed images. The results reveal meaningful trade-offs between accuracy and resource efficiency, with the FRQI model achieving a 97 percent reduction in circuit depth while maintaining a near-perfect reconstruction (MSE of about 0.27). This demonstrates the potential of low-rank techniques for advancing practical quantum image processing on near-term hardware.
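
The low-rank idea can be written out as a short worked equation. This is a generic statement of Schmidt truncation for a bipartite pure state, not the paper's specific circuit construction; the bipartition into subsystems $A$ and $B$ and the retained rank $r$ are notation assumed here.

```latex
% Schmidt decomposition of a normalized bipartite state (rank R):
|\psi\rangle = \sum_{i=1}^{R} \lambda_i \, |u_i\rangle_A |v_i\rangle_B ,
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_R > 0 ,
\qquad \sum_{i=1}^{R} \lambda_i^2 = 1 .

% Keep the r largest Schmidt coefficients and renormalize:
|\tilde{\psi}_r\rangle
  = \frac{1}{\sqrt{\sum_{i=1}^{r} \lambda_i^2}}
    \sum_{i=1}^{r} \lambda_i \, |u_i\rangle_A |v_i\rangle_B .

% Fidelity of the truncated state with the original:
\left| \langle \psi | \tilde{\psi}_r \rangle \right|^2
  = \sum_{i=1}^{r} \lambda_i^2 .
```

The discarded weight $\sum_{i>r} \lambda_i^2$ is what the truncation trades for a cheaper state-preparation circuit, which is one way to read the accuracy versus resource trade-off described in the abstract.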

2606.09842 2026-06-10 cs.HC cs.AI cs.CV 交叉投稿

Integrated Real-Time Motion Tracking and AI Analysis for Athletic Performance Optimization

集成实时运动跟踪与AI分析以优化运动表现

Parth Agrawal, Ronit, Sagar Kumar, Aashish Bhambri

发表机构 * Department of Computer Science(计算机科学系) Department of Computer Science and Engineering(计算机科学与工程系) Chandigarh University(昌迪加尔大学)

AI总结 本文综述了实时人体姿态估计在运动分析中的方法,并开发了一个轻量级原型系统,利用MediaPipe框架提供实时反馈,以优化运动表现。

Comments 6 pages, 10 figures, 2 tables, IC2E3-2026 conference

详情
AI中文摘要

在真实世界环境中应用人体姿态估计(HPE)仍然是一项具有挑战性的任务。本文探讨并综述了实时HPE方法及其在个体运动分析中的局限性,同时开发了一个实用的轻量级原型用于真实世界的测试和使用。从传统的基于标记的运动捕捉系统发展到现代可访问且适应性强的无标记深度学习方法,本文综述了平衡精度和效率的基础架构。我们还比较了算法框架(如自顶向下、自底向上、单阶段方法等)在实际部署指标上的表现,包括推理延迟、帧率、平均关节位置误差和时间抖动,以指导运动应用的模型选择过程。作为我们的主要贡献,我们提出了一个模块化的轻量级软件原型,该原型使用MediaPipe HPE框架结合多种特定于运动的逻辑,为非专业用户提供实时洞察和基于AI的反馈。我们以最小的计算资源推导出运动洞察并提供反馈,同时展示了性能和可靠性指标。最后,我们提出了其他未来研究方向,如结合传感器和AR/VR。这项工作面向研究人员、工程师、运动科学家等,既作为技术资源,也作为实现类似或改进的实时HPE分析系统以增强运动表现或其他目的的有效蓝图。

英文摘要

Applying Human Pose Estimation (HPE) in real-world environments remains a challenging task. This paper surveys real-time HPE approaches and their limitations in sports analysis for individuals, and develops a practical lightweight prototype for real-world testing and use. Tracing the evolution from older marker-based motion capture systems to modern, accessible, and adaptable markerless deep learning approaches, the survey examines the foundational architectures that balance precision and efficiency. We also compare algorithmic frameworks (top-down, bottom-up, one-stage approaches, etc.) on practical deployment metrics such as inference latency, frame rate, mean per-joint position error, and temporal jitter to guide model selection for sports applications. As our main contribution, we propose a modular, lightweight software prototype that uses the MediaPipe HPE framework with exercise-specific logic to deliver real-time insights and AI-based feedback to non-expert users. We derive sports insights and provide feedback with minimal computational resources, and report performance and reliability metrics. Finally, we suggest future research directions such as combining sensors and AR/VR. This work serves researchers, engineers, sport scientists, and others as both a technical resource and a blueprint for implementing a similar or improved real-time HPE analysis system for athletic performance enhancement or other purposes.

2606.09849 2026-06-10 cs.HC cs.CV 交叉投稿

Sketch-to-Layout: A Human-Centric Computational Agent for Constraint-Aware Synthesis of Modular Photobioreactors

草图到布局:面向约束感知的模块化光生物反应器合成的人本计算代理

Xiujin Liu, Shuqi Li, Yuxin Lin

发表机构 * Qrafty Technology Inc.(Qrafty技术公司) University of Michigan(密歇根大学)

AI总结 提出一种计算框架,将用户草图转化为模块化光生物反应器立面布局,通过约束满足问题(CSP)求解器实现近实时合成,并引入弱监督藻类健康监测管道,实现碳中性和自主维护。

Comments 13 pages, 6 figures

详情
AI中文摘要

建筑集成光生物反应器(PBR)为碳中和建筑提供了一条途径,但其部署受到配置复杂性和生物维护的阻碍。本文提出了一种模块化PBR立面系统,由协调设计意图与物理有效性的计算框架驱动。我们引入了具有集成容器和管道几何结构的“碳中和砖块”;整体流体通道实现了“即插即用”组装。为了应对14种模块几何结构的组合复杂性,我们开发了一个计算草图到布局代理,将布局合成公式化为约束满足问题(CSP)。利用CP-SAT引擎,该代理将稀疏的用户草图视为软先验,同时强制执行端口对齐和全局连通性等硬约束。这使得非专家能够在近实时内合成可制造配置。此外,为了促进自主维护,我们提出了一种弱监督藻类健康监测管道。通过采用混合CNN-注意力骨干和时间排序损失,该系统无需绝对真实标签即可从照片中量化生物活力。实验表明,CSP求解器在高达15x15的网格尺度上实现了95.5%的成功率。定性评估证实该框架在确保操作完整性的同时保留了设计语义。长期测试表明,视觉模块产生的健康轨迹与14天生物周期一致,这表明将交互式合成与低成本计算机视觉相结合可以普及可扩展的碳捕获系统。

英文摘要

Building-integrated photobioreactors (PBRs) offer a pathway for carbon-neutral architecture, yet deployment is hindered by configuration complexity and biological maintenance. This paper presents a modular PBR facade system powered by a computational framework reconciling design intent with physical validity. We introduce 'carbon-neutralization bricks' featuring integrated vessel-and-conduit geometry; monolithic fluid channels enable 'plug-and-play' assembly. To navigate the combinatorial complexity of 14 modular geometries, we develop a Computational Sketch-to-Layout Agent that formulates layout synthesis as a Constraint Satisfaction Problem (CSP). Using the CP-SAT engine, the agent treats sparse user sketches as soft priors while enforcing hard constraints like port alignment and global connectivity. This allows non-experts to synthesize fabrication-ready configurations in near real-time. Furthermore, to facilitate autonomous maintenance, we propose a weakly supervised algae health monitoring pipeline. By employing a hybrid CNN-attention backbone and a temporal ranking loss, the system quantifies biological vitality from photographs without absolute ground-truth labels. Experiments demonstrate the CSP solver achieves a 95.5% success rate on grid scales up to 15 x 15. Qualitative evaluations confirm the framework preserves design semantics while ensuring operational integrity. Long-term tests show the vision module produces health trajectories aligned with 14-day biological cycles, suggesting that integrating interactive synthesis with low-cost computer vision can democratize scalable carbon capture systems.

2606.10050 2026-06-10 cs.GR cs.CV 交叉投稿

Continuous Neural Reparameterization as a Deep Geometric Prior for Robust Fixed-Chart UV Repair

连续神经重参数化作为鲁棒固定图表UV修复的深度几何先验

Mohammad Sadegh Salehi

发表机构 * Zero One Creative London, UK(伦敦零一创意公司)

AI总结 提出将固定图表UV展开视为连续神经重参数化,使用未训练的SIREN网络优化几何目标,结合谱初始化、Tutte残差预热等策略,实现零翻转的鲁棒图表求解。

详情
AI中文摘要

传统的UV展开依赖于几何畸变能量的直接优化,可能因无效初始化、局部最小值或拓扑翻转而失败。我们将固定图表UV展开重新定义为连续神经重参数化:一个未训练的SIREN将每个顶点的网格特征映射到UV坐标,其权重针对几何目标进行优化。实际贡献是一个鲁棒的图表求解器配方,结合了Laplace-Beltrami谱输入、Tutte残差预热、$C^2$行列式扩展、单射性屏障以及有效性检查的重试/回退路由,而非声称任何单一组件能保证有效性或应取代重切割方法。NTK-LBO诊断表明,谱条件改变更新几何,尤其在初始化和中秩子空间,但本身不能预测图表成功。在紧凑预切割图表和47图表分层Thingi10K/xatlas切割基准上,神经求解器在所有紧凑图表上产生零翻转,并在42/47个分层求解中有效零翻转。与BFF和OptCuts的比较明确了范围:允许时重切割可以更快且畸变更低,而神经求解器针对提供图表的有效性和验证优先的图集构建。在Amara Spatial生成的网格上,完整的图集构建路径在25个资产集上提供打包图集覆盖,并在大规模Rust图集运行中通过回退路由实现1000/1000严格局部有效且零UV翻转的图集。

英文摘要

Traditional UV unwrapping relies on direct optimization of geometric distortion energies and can fail through invalid initialization, local minima, or topological foldovers. We recast fixed-chart UV unwrapping as continuous neural reparameterization: an untrained SIREN maps per-vertex mesh features to UV coordinates, and its weights are optimized for a geometric objective. The practical contribution is a robust chart-solver recipe, combining Laplace-Beltrami spectral inputs, Tutte residual warm-up, a $C^2$ determinant extension, an injectivity barrier, and validity-checked retry/fallback routing, rather than a claim that any single component guarantees validity or that recutting methods should be replaced. NTK-LBO diagnostics show that spectral conditioning changes update geometry, especially at initialization and mid-rank subspaces, but does not by itself predict chart success. On compact pre-cut charts and a 47-chart stratified Thingi10K/xatlas-cut benchmark, the neural solver produces zero flips on all compact charts and 42/47 valid zero-flip stratified solves. BFF and OptCuts comparisons sharpen the scope: recutting can be faster and lower-distortion when allowed, while the neural solver targets supplied-chart validity and validation-first atlas construction. On Amara Spatial generated meshes, the full atlas construction path gives packed-atlas coverage on a 25-asset set and 1000/1000 strict locally valid atlases with zero UV flips in a large-scale Rust atlas run after fallback routing.

2606.10223 2026-06-10 cs.SD cs.AI cs.CV 交叉投稿

Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

双分支门控融合用于开放集音频深度伪造源追踪

Awais Khan, Kutub Uddin, Khalid Malik

AI总结 针对开放集音频深度伪造源追踪问题,提出双分支门控融合框架,结合XLSR-53和CORES描述符,通过输入条件门控自适应加权,实现域内高精度和域外鲁棒泛化。

详情
AI中文摘要

将合成语音归因于其原始系统仍然是一个开放挑战:闭集模型无法拒绝未见过的合成器并产生过度自信的预测。为了解决这个问题,我们提出了一个双分支门控融合框架,将XLSR-53与CORES配对,CORES是一个66维描述符,与之前仅使用线性滤波器组(LFB)的工作不同,它跨越倒谱、振荡、节奏、能量和频谱维度,以捕获互补的合成伪影。我们的分析表明,XLSR-53在域内(ID)保持判别性,而CORES在分布偏移(OOD)下稳定泛化,但由于SSL表示不平衡,它们的简单拼接失败。为了解决这个问题,一个输入条件门控在联合训练下自适应地加权每个分支,使用交叉熵、用于ID/OOD分离的能量边际损失和门控多样性项。在MLAAD基准上,我们的系统实现了97.6%的ID准确率、4.9%的EERc,并且相对于Interspeech 2025基线,FPR95相对降低了83.5%。

英文摘要

Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.

2606.10407 2026-06-10 cs.SD cs.CV q-bio.QM 交叉投稿

Time-frequency localization of bird calls in dense soundscapes

密集声景中鸟鸣的时频定位

Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre

发表机构 * Acoustic Research Laboratory, National University of Singapore(新加坡国立大学声学研究实验室) Tropical Marine Science Institute, National University of Singapore(新加坡国立大学热带海洋科学研究所) School of Marine Science and Technology, Northwestern Polytechnical University(西北工业大学航海学院)

AI总结 将鸟鸣检测视为频谱图上的目标检测任务,训练YOLO11模型在密集热带声景中定位鸟鸣,并引入IoMin评估指标,在分布内和分布外数据上均优于基线。

详情
AI中文摘要

被动声学监测能够大规模观测野生动物,但大多数生物声学分类器仅预测时间窗口内的物种存在,而无法在时间或频率上精确定位发声,限制了后续分析。我们将鸟鸣检测视为频谱图上的目标检测任务,训练YOLO11模型在新加坡密集热带声景中定位鸟鸣。此外,我们引入了一个开源的基于浏览器的标注工具,并提出了Intersection over Minimum (IoMin)评估指标,该指标比标准IoU更好地处理模糊的声学边界,更适合当前问题。最佳YOLO模型在新加坡的分布内声景中几乎将基线性能翻倍(81.8% vs. 42.1% IoMin@50 F1分数),同时在夏威夷的未见分布外录音上仍优于基线(58.6% vs. 48.6%)。这些结果表明,目标检测框架是复杂声景中动物发声时频定位的一种有前景的方法。

英文摘要

Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
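
To make the difference between IoU and IoMin concrete, here is a minimal sketch on time-frequency boxes. The abstract names the metric but does not give its formula, so the definition used below (intersection area divided by the area of the smaller box) is an assumption inferred from the name, and the box layout `(t0, f0, t1, f1)` is hypothetical.

```python
def _overlap(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def box_area(box):
    """Area of a box given as (t0, f0, t1, f1)."""
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def intersection(a, b):
    """Intersection area of two time-frequency boxes."""
    return _overlap(a[0], a[2], b[0], b[2]) * _overlap(a[1], a[3], b[1], b[3])


def iou(a, b):
    """Standard intersection over union."""
    inter = intersection(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a, b):
    """Assumed IoMin: intersection over the area of the smaller box."""
    smaller = min(box_area(a), box_area(b))
    return intersection(a, b) / smaller if smaller > 0 else 0.0


if __name__ == "__main__":
    loose_label = (0.0, 0.0, 2.0, 2.0)   # generous annotation
    tight_pred = (0.0, 0.0, 1.0, 1.0)    # prediction fully inside the label
    print(iou(loose_label, tight_pred), iomin(loose_label, tight_pred))
```

Under this assumed definition, a prediction that sits entirely inside a loosely drawn label scores 1.0 on IoMin while IoU penalizes the size mismatch, which matches the abstract's motivation of tolerating ambiguous acoustic boundaries.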

2606.10611 2026-06-10 cs.LG cs.CV 交叉投稿

Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

几何感知强化学习用于二维不规则排样

Auguste Lehuger, Guillaume Henon-Just

发表机构 * Valeo Brain(法雷奥大脑)

AI总结 提出Polygons Transformer架构与组合优化强化学习框架,使智能体从数据中学习几何先验,在二维不规则排样中达到与最先进启发式算法Sparrow竞争的面积利用率。

Comments 15 pages, 4 figures, 5 tables. Under review at the European Workshop on Reinforcement Learning (EWRL)

详情
AI中文摘要

针对二维不规则排样问题的传统启发式求解器存在一个根本性限制:它们对多边形几何是盲目的,依赖引导式暴力搜索在连续放置空间中导航,几何指导极少。本文认为,强化学习具有独特优势来克服这一瓶颈。通过将优化策略与几何感知神经编码器配对,智能体可以直接从数据中自动发现丰富的几何先验,利用这些学到的直觉来战略性地引导探索。为实现这一点,我们引入了Polygons Transformer(PoT),这是一种新颖的架构,能够编码二维连续矢量几何,同时允许跨多边形注意力。我们将这种新颖架构与组合优化强化学习(CORL)训练框架相结合,以寻找最优解。为了支持这一范式,我们发布了一个源自复杂地理轮廓的开源训练数据集以及一个专门的评估基准。我们的实证验证表明,训练后的智能体在面积利用率方面与最先进的启发式求解器Sparrow高度竞争,证明强化学习可以成功发现并利用几何感知来完成精确的空间任务。

英文摘要

Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force search to navigate the continuous placement space with minimal geometric guidance. In this paper, we argue that Reinforcement Learning is uniquely positioned to overcome this bottleneck. By pairing an optimization policy with a geometry-aware neural encoder, an agent can automatically discover rich geometric priors directly from data, utilizing these learned intuitions to strategically guide exploration. To realize this, we introduce the Polygons Transformer (PoT), a novel architecture that encodes 2D continuous vector geometries while allowing cross-polygon attention. We couple this architecture with a Combinatorial Optimization Reinforcement Learning (CORL) training framework to find optimal solutions. To support this paradigm, we release an open-source training dataset derived from complex geographic contours alongside a dedicated evaluation benchmark. Our empirical validation demonstrates that our trained agent achieves area utilization performance highly competitive with Sparrow, the state-of-the-art heuristic solver, proving that reinforcement learning can successfully discover and exploit geometric awareness for precise spatial tasks.

2606.11120 2026-06-10 cs.AI cs.CV 交叉投稿

Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

蒙特卡洛传球搜索:利用轨迹生成进行足球3D反事实传球评估

Andrew Kang, Priya Narasimhan

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出蒙特卡洛传球搜索(MCPS),结合价值模型、世界模型和反事实策略,基于3D轨迹数据评估足球传球,通过两种执行盈余分数实现分布感知的传球分析。

Comments CVPR 2026, CVSports Workshop

详情
AI中文摘要

我们将足球传球评估重新定义为类似蒙特卡洛树搜索(MCTS)的评估问题,其组成部分大多以不同名称存在于文献中:价值模型(控球价值)、世界模型(带球交互的多智能体轨迹)以及反事实动作策略(带噪声的传球变体采样)。基于德甲联赛首个公开的高保真3D球轨迹跟踪数据集,我们引入了蒙特卡洛传球搜索(MCPS),该方法推断每个观察到的传球的踢球参数,采样执行变体和选项变体,使用球条件世界模型将每个候选向前滚动直到下一次球交互,并通过学习到的价值模型对结果进行评分,以获得所获价值的分布。该分布通过两种互补的执行盈余分数(基于均值和基于百分位的分数)实现分布感知的归因,用于分析和排名。为了使世界模型在有限的公开数据下具有样本效率,我们改编了来自自动驾驶的离散令牌自回归轨迹生成器(SMART),并表明与基线相比,它在最佳20次预测准确性上表现强劲,同时支持完全假设的展开以进行下游评估。我们已发布模型检查点和代码。

英文摘要

We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high-fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball-conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution-aware attribution with two complementary execution-surplus scores used for analysis and ranking: mean-based and percentile-based scores. To make the world model sample-efficient under limited public data, we adapt a discrete-token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best-of-20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.
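
The sample-then-score loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: `toy_value` replaces the ball-conditioned world-model rollout and learned value model, the kick is reduced to a `(speed, angle)` pair, and the two score definitions (observed value minus the mean of the variants, and the fraction of variants the observed pass beats) are assumptions about what "mean-based" and "percentile-based" could mean, not the paper's formulas.

```python
import random


def sample_variants(kick, n, noise, rng):
    """Jitter an inferred (speed, angle) kick to get n execution variants."""
    speed, angle = kick
    return [
        (speed + rng.gauss(0.0, noise), angle + rng.gauss(0.0, noise))
        for _ in range(n)
    ]


def toy_value(kick):
    """Hypothetical stand-in for 'roll forward, then score with a value model'."""
    speed, angle = kick
    return -abs(speed - 15.0) - abs(angle)


def mean_surplus(observed_value, variant_values):
    """Assumed mean-based score: observed value minus the variants' mean."""
    return observed_value - sum(variant_values) / len(variant_values)


def percentile_score(observed_value, variant_values):
    """Assumed percentile-based score: share of variants the observed pass beats."""
    below = sum(1 for v in variant_values if v < observed_value)
    return below / len(variant_values)


if __name__ == "__main__":
    rng = random.Random(0)
    observed_kick = (14.0, 0.1)
    values = [toy_value(k) for k in sample_variants(observed_kick, 200, 1.0, rng)]
    observed = toy_value(observed_kick)
    print(mean_surplus(observed, values), percentile_score(observed, values))
```

Keeping the full list of sampled values, rather than only their mean, is what makes both scores computable from the same rollouts.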

2502.09928 2026-06-10 cs.CV cs.AI 版本更新

Deep Tree Tensor Networks

深度树张量网络

Chang Nie

发表机构 * Nanjing University of Science and Technology(南京理工大学)

AI总结 提出深度树张量网络(DTTN),通过多线性运算捕获指数阶特征交互,在多个基准上超越现有方法。

详情
AI中文摘要

源自量子物理的张量网络(TNs)已被广泛用作指数机器和参数分解器用于识别任务。典型的TN模型,如矩阵乘积态(MPS),在自然图像识别中尚未取得成功应用。当它们被使用时,主要是在现有网络中压缩参数,从而失去了捕获指数阶特征交互的独特能力。本文提出了一种名为深度树张量网络(DTTN)的新架构,它通过多线性运算捕获跨特征的$2^L$阶乘法交互,同时本质上展开为具有参数共享属性的树状TN拓扑。DTTN由多个反对称交互模块(AIMs)堆叠而成,这种设计便于高效实现。此外,我们的理论分析证明了量子启发的TN模型与多项式/多线性网络在特定条件下的等价性。我们认为DTTN可以促进该领域内更具可解释性的研究。所提出的模型在多个基准和领域上进行了评估,显示出优于同行方法和最先进架构的性能。我们的代码公开于 https://github.com/NieCha/deep_tree_tensor_network 。

英文摘要

Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parametric decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image recognition. When employed, they primarily serve to compress parameters within pre-existing networks, thereby losing their distinctive capability to capture exponential-order feature interactions. This paper introduces a novel architecture named Deep Tree Tensor Network (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a tree-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interaction modules (AIMs), and this design facilitates efficient implementation. Furthermore, our theoretical analysis demonstrates the equivalence between quantum-inspired TN models and polynomial/multilinear networks under specific conditions. We posit that the DTTN could catalyze more interpretable research within this field. The proposed model is evaluated across multiple benchmarks and domains, demonstrating superior performance compared to both peer methods and state-of-the-art architectures. Our code is publicly available at https://github.com/NieCha/deep_tree_tensor_network.

2509.25017 2026-06-10 cs.LG cs.CV 版本更新

Uncertainty-Aware Deep Learning for Wildfire Danger Forecasting

不确定性感知的深度学习用于野火危险预测

Spyros Kondylatos, Nikolas Papadopoulos, Gustau Camps-Valls, Ioannis Papoutsis

发表机构 * Aix-Marseille University(艾克斯-马赛大学) University of Cambridge(剑桥大学) University of Malaga(马拉加大学) University of Crete(希腊克里特大学)

AI总结 提出不确定性感知深度学习框架,联合捕获认知不确定性和偶然不确定性,提升短期野火危险预测的准确性和可靠性,F1分数提高2.3%,预期校准误差降低2.1%。

详情
AI中文摘要

野火是最严重的自然灾害之一,对人类和自然生态系统构成重大威胁。日益增长的野火风险增加了对不仅准确而且可靠的预测模型的需求。深度学习在预测野火危险方面显示出潜力;然而,其采用受到对其预测可靠性的担忧的阻碍,部分源于缺乏不确定性量化。为应对这一挑战,我们提出了一个不确定性感知的深度学习框架,该框架联合捕获认知(模型)和偶然(数据)不确定性,以增强短期野火危险预测。在次日预测中,与确定性基线相比,我们表现最佳的模型将F1分数提高了2.3%,并将预期校准误差降低了2.1%,从而提升了预测技能和校准能力。我们的实验证实了不确定性估计的可靠性,并展示了它们在决策支持中的实际效用,包括识别拒绝低置信度预测的不确定性阈值,以及生成伴随不确定性层的良好校准的野火危险图。将预测范围延长至十天,我们观察到偶然不确定性随时间增加,表明环境条件的更大变异性,而认知不确定性保持稳定。最后,我们表明,尽管两种不确定性类型在低不确定性情况下可能是冗余的,但在更具挑战性的条件下它们提供互补的见解,强调了联合建模对稳健野火危险预测的价值。总之,我们的方法显著提高了野火危险预测的准确性和可靠性,推动了可信赖的野火深度学习系统的发展。

英文摘要

Wildfires are among the most severe natural hazards, posing a significant threat to both humans and natural ecosystems. The growing risk of wildfires increases the demand for forecasting models that are not only accurate but also reliable. Deep Learning (DL) has shown promise in predicting wildfire danger; however, its adoption is hindered by concerns over the reliability of its predictions, some of which stem from the lack of uncertainty quantification. To address this challenge, we present an uncertainty-aware DL framework that jointly captures epistemic (model) and aleatoric (data) uncertainty to enhance short-term wildfire danger forecasting. In next-day forecasting, our best-performing model improves the F1 Score by 2.3% and reduces the Expected Calibration Error by 2.1% compared to a deterministic baseline, enhancing both predictive skill and calibration. Our experiments confirm the reliability of the uncertainty estimates and illustrate their practical utility for decision support, including the identification of uncertainty thresholds for rejecting low-confidence predictions and the generation of well-calibrated wildfire danger maps with accompanying uncertainty layers. Extending the forecast horizon up to ten days, we observe that aleatoric uncertainty increases with time, reflecting greater variability in environmental conditions, while epistemic uncertainty remains stable. Finally, we show that although the two uncertainty types may be redundant in low-uncertainty cases, they provide complementary insights under more challenging conditions, underscoring the value of their joint modeling for robust wildfire danger prediction. In summary, our approach significantly improves the accuracy and reliability of wildfire danger forecasting, advancing the development of trustworthy wildfire DL systems.
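
Two quantities mentioned above, the Expected Calibration Error and threshold-based rejection of uncertain predictions, can be sketched as follows. This uses the common equal-width binned form of ECE; the bin count, the function names, and the rejection rule are assumptions and may not match the paper's exact setup.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities of the chosen class, in [0, 1].
    correct: booleans, True where the prediction was right.
    """
    if not confidences:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def keep_below_uncertainty(uncertainties, threshold):
    """Indices of predictions whose uncertainty does not exceed the threshold."""
    return [i for i, u in enumerate(uncertainties) if u <= threshold]


if __name__ == "__main__":
    confs = [0.9] * 10
    hits = [True] * 9 + [False]
    print(expected_calibration_error(confs, hits))
    print(keep_below_uncertainty([0.05, 0.40, 0.10], threshold=0.10))
```

A model that says 0.9 and is right nine times out of ten has an ECE near zero, while a model that is always fully confident and always wrong has an ECE of 1.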

2603.04852 2026-06-10 cs.AI cs.CV 版本更新

Non-Parametric Structural Priors for Geometry Theorem Prediction

几何定理预测的非参数结构先验

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

发表机构 * School of Artificial Intelligence, Beijing Normal University, Beijing, China(北京师范大学人工智能学院) Engineering Research Center of Intelligent Technology(智能技术与教育应用工程研究中心) Beijing Key Laboratory of Artificial Intelligence for Education, Beijing, China(北京人工智能教育重点实验室) Baidu, Beijing, China(百度)

AI总结 针对几何定理预测中参数模型泛化性差的问题,提出定理前驱图作为非参数结构先验,通过上下文学习实现无训练定理预测,在FormalGeo7k上达到89.29%准确率。

详情
AI中文摘要

多步定理预测是几何问题求解中的核心挑战。现有的神经符号方法严重依赖有监督参数模型,这些模型对不断发展的定理库泛化能力有限。在这项工作中,我们通过上下文学习(ICL)的视角探索无训练定理预测。我们识别出一个关键的可扩展性瓶颈,称为结构漂移:随着推理深度的增加,普通ICL的性能急剧下降,通常降至接近零。我们将这种失败归因于LLM无法恢复潜在拓扑依赖关系,导致无结构探索。为解决此问题,我们提出定理前驱图,将历史解轨迹中的时间依赖关系编码为有向图,并施加显式拓扑约束,从而在推理过程中有效剪枝搜索空间。结合检索增强的图构建和逐步符号执行器,我们的方法使LLM能够在没有任何基于梯度的优化的情况下充当结构化规划器。在FormalGeo7k基准上的实验表明,我们的方法达到了89.29%的准确率,显著优于ICL基线,并与最先进的有监督模型相匹配。这些结果表明,显式结构先验为扩展基于LLM的符号推理提供了一个有前景的方向。

英文摘要

Multi-step theorem prediction is a central challenge in geometry problem solving. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
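
A minimal sketch of the precedence-graph idea follows. The abstract does not spell out the graph construction, so the rule used here (an edge from each theorem to the one applied immediately after it in a historical trace) and the pruning rule (keep only successors of the last applied theorem) are assumptions, and the theorem names are hypothetical.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Directed graph with an edge a -> b when b directly follows a in a trace."""
    graph = defaultdict(set)
    for trace in traces:
        for earlier, later in zip(trace, trace[1:]):
            graph[earlier].add(later)
    return graph


def allowed_next(graph, applied, candidates):
    """Prune candidates to those that followed the last applied theorem before."""
    if not applied:
        return list(candidates)  # no constraint before the first step
    successors = graph.get(applied[-1], set())
    return [c for c in candidates if c in successors]


if __name__ == "__main__":
    traces = [
        ["parallel_property", "triangle_angle_sum", "sine_theorem"],
        ["triangle_angle_sum", "cosine_theorem"],
    ]
    graph = build_precedence_graph(traces)
    options = ["sine_theorem", "parallel_property", "cosine_theorem"]
    print(allowed_next(graph, ["triangle_angle_sum"], options))
```

In a full system the pruned candidate list would be handed to the LLM at each step, so the model chooses among structurally plausible theorems rather than the whole library.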

2605.30370 2026-06-10 cs.NE cs.AI cs.CV cs.LG 版本更新

Updating the standard neuron model in artificial neural networks

更新人工神经网络中的标准神经元模型

Raul Mohedano, Thomas Batard, Erik Velasco-Salido, Ramsses De Los Santos Mendoza, Jorge H. Martínez, Stacey Levine, Marcelo Bertalmío

发表机构 * Spanish National Research Council (CSIC)(西班牙国家研究理事会(CSIC)) Center for Research in Mathematics (CIMAT)(数学研究中心(CIMAT)) Universidad Autónoma de Madrid (UAM)(马德里自治大学(UAM)) National Science Foundation (NSF)(国家科学基金会(NSF))

AI总结 本文用更真实的皮层细胞模型替代标准点神经元模型,在不增加参数的情况下,提升了人工神经网络的表达能力、鲁棒性和学习速度,并减少了记忆化和所需训练数据量。

Comments Acknowledgments included in the manuscript

详情
AI中文摘要

自20世纪50年代诞生以来,人工神经网络(ANNs)一直使用当时神经科学中流行的所谓点神经元模型,希望这种类比能够更好地模拟大脑功能。多年来,神经科学文献表明点神经元模型过于简单,无法正确表示许多基本的神经过程;然而,ANNs中的标准神经元模型仍然保持不变。在这里,我们用一个非常新的皮层细胞模型替代它,并通过理论分析和实验结果证明,仅仅通过使用更真实的神经单元元素而不增加参数数量,所得到的ANNs提供了许多重要优势,包括增强的表达能力、鲁棒性和学习速度,以及减少记忆化和所需的训练数据量。

英文摘要

From their inception in the 1950s, artificial neural networks (ANNs) have used the so-called point neuron model then prevalent in neuroscience, in the hope that this analogy would allow for a better emulation of brain function. Over the years the neuroscience literature has shown that the point neuron model is too simplistic to properly represent many fundamental neural processes; however, the standard neuron model in ANNs remains the same. Here we replace it with a very recent model of cortical cells and demonstrate through theoretical analyses and experimental results how, simply by using a more realistic neural unit without increasing the number of parameters, the resulting ANNs offer a number of important advantages, including increases in expressivity, robustness, and learning speed, and a reduction in memorization and in the amount of training data needed.

2602.16898 2026-06-10 cs.RO cs.AI cs.CV cs.LG 版本更新

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

MALLVI:一种多智能体框架用于集成通用机器人操作

Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

发表机构 * Department of Electrical Engineering, Sharif University of Technology(电气工程系,谢里夫大学)

AI总结 MALLVI通过多智能体协作实现闭环反馈驱动的机器人操作,提升泛化能力和零样本任务成功率。

Comments Some fundamental changes in text and codebase

详情
AI中文摘要

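利用大语言模型(LLM)进行机器人操作任务规划是一个新兴领域。以往的方法依赖专用模型、微调或提示调优,且通常以开环方式运行,缺乏稳健的环境反馈,因此在动态环境中较为脆弱。MALLVI提出了一个多智能体大语言与视觉框架,实现闭环反馈驱动的机器人操作。给定自然语言指令和环境图像,MALLVI为机械臂生成可执行的原子动作。动作执行后,视觉语言模型(VLM)评估环境反馈,并决定是重复该过程还是进入下一步。MALLVI不依赖单一模型,而是协调Decomposer、Localizer、Thinker和Reflector等专用智能体,以管理感知、定位、推理和高层规划。可选的Descriptor智能体提供初始状态的视觉记忆。Reflector仅重新激活相关智能体,以支持有针对性的错误检测与恢复,避免完全重新规划。仿真和真实环境中的实验表明,迭代式闭环多智能体协作提升了泛化能力,并提高了零样本操作任务的成功率。代码见 https://github.com/iman1234ahmadi/MALLVI 。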

英文摘要

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI.

2603.04056 2026-06-10 cs.CV cs.RO 版本更新

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

动态底栖环境中的长期视觉定位:数据集、基于足迹的真值与视觉地点识别基准

Martin Kvisvik Larsen, Oscar Pizarro

发表机构 * Department of Marine Technology(海洋技术系) Norwegian University of Science and Technology(挪威科技大学) Trondheim, Norway(特隆赫姆,挪威)

AI总结 本文提出一个用于底栖环境长期视觉定位的精选数据集和基于足迹的真值方法,评估了八种最先进的视觉地点识别方法,发现其在该数据集上的Recall@K显著低于已有基准。

详情
Journal ref
Frontiers in Robotics and AI Volume 13 (2026) 1821019
AI中文摘要

长期视觉定位有望降低自主水下航行器(AUV)光学底栖监测的成本并提高制图质量。尽管具有这一潜力,底栖环境中的长期视觉定位仍研究不足,主要原因是缺乏可用于基准测试的精选数据集。此外,由于地理参考精度和图像足迹有限,需要精确的几何信息才能获得准确的真值。在本文中,我们针对这些不足,提出了一个面向底栖环境长期视觉定位的精选数据集,以及一种为近天底水下影像的视觉定位结果生成真值的新方法。我们的数据集包含来自五个底栖参考站点的地理参考AUV影像,这些站点在长达六年的时间跨度内被重复访问;数据集包括原始和颜色校正的立体影像、相机标定以及亚分米级配准的相机位姿。据我们所知,这是首个涵盖多个站点和透光带栖息地的长期视觉定位精选水下数据集。我们的真值方法估计三维海底图像足迹,并将足迹重叠的相机视图关联起来,确保真值关联反映共享的视觉内容。基于该数据集和真值,我们对八种最先进的视觉地点识别(VPR)方法进行了基准测试,发现其在我们数据集上的Recall@K显著低于已有的陆地和水下基准。最后,我们将基于足迹的真值与传统的基于位置的真值进行比较,结果表明,在地形崎岖且高度变化的站点上,基于距离阈值的真值可能高估VPR的Recall@K。总之,该精选数据集、真值方法和VPR基准为推进动态底栖环境中的长期视觉定位奠定了基础。

英文摘要

Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
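To make the difference between footprint-based and distance-threshold ground truth concrete, here is a minimal sketch. It assumes axis-aligned rectangular footprints on a flat seafloor, a simplification of the 3D footprints the abstract describes. The function names and the toy coordinates are illustrative and not taken from the paper.

```python
from math import hypot

def rect_overlap_area(a, b):
    """Overlap area of two axis-aligned rectangles (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def footprint_positives(query_fp, db_fps, min_overlap=0.0):
    """Database indices whose footprint overlaps the query footprint."""
    return {i for i, fp in enumerate(db_fps)
            if rect_overlap_area(query_fp, fp) > min_overlap}

def distance_positives(query_xy, db_xys, radius):
    """Database indices whose camera position lies within `radius`."""
    return {i for i, (x, y) in enumerate(db_xys)
            if hypot(x - query_xy[0], y - query_xy[1]) <= radius}

def recall_at_k(ranked_lists, positive_sets, k):
    """Share of queries (with at least one positive) that have a hit in the top k."""
    hits = total = 0
    for ranked, positives in zip(ranked_lists, positive_sets):
        if not positives:
            continue  # queries without any positive are skipped
        total += 1
        if positives.intersection(ranked[:k]):
            hits += 1
    return hits / total if total else 0.0

# Toy case: image 0 is close but taken at low altitude (small footprint),
# image 1 is farther away but taken at high altitude (large footprint).
query_xy, query_fp = (0.0, 0.0), (-1.0, -1.0, 1.0, 1.0)
db_xys = [(1.5, 0.0), (3.0, 0.0)]
db_fps = [(1.2, -0.3, 1.8, 0.3), (0.5, -2.5, 5.5, 2.5)]
ranked = [[0, 1]]  # hypothetical retrieval order for the single query

by_distance = [distance_positives(query_xy, db_xys, radius=2.0)]
by_footprint = [footprint_positives(query_fp, db_fps)]
recall_distance = recall_at_k(ranked, by_distance, k=1)
recall_footprint = recall_at_k(ranked, by_footprint, k=1)
```

In this toy case the distance threshold counts the nearby image as a correct match although it shares no seafloor area with the query, so Recall@1 is 1.0 under distance-based ground truth and 0.0 under footprint-based ground truth.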

2510.15470 2026-06-10 cs.CV cs.IR 版本更新

MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval

MSAM:多语义自适应挖掘用于跨模态无人机视频-文本检索

Jinghao Huang, Yaxiong Chen, Ganchao Liu

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) School of Computer Science and Artificial Intelligence, Wuhan University of Technology(武汉理工大学计算机科学与人工智能学院) School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University(西北工业大学人工智能、光学与电子学院(iOPEN))

AI总结 本文提出MSAM方法,通过多语义自适应学习机制提升无人机视频-文本跨模态检索性能,采用细粒度交互和自适应语义构建模块增强特征表示鲁棒性。

详情
AI中文摘要

随着无人机技术的发展,视频数据量迅速增加,亟需高效的语义检索方法。本文首次系统地提出并研究无人机视频-文本检索(DVTR)任务。无人机视频具有俯视视角、强结构同质性以及目标组合语义表达多样等特点,这对现有面向地面视角设计的跨模态方法提出了挑战,因此有必要针对无人机场景设计专门的检索机制。为此,我们提出名为多语义自适应挖掘(MSAM)的新方法。MSAM引入多语义自适应学习机制,整合帧间动态变化并从特定场景区域提取丰富的语义信息,从而增强对无人机视频内容的深度理解和推理。该方法依赖于词与无人机视频帧之间的细粒度交互,整合自适应语义构建模块、分布驱动的语义学习项和多样性语义项,加深文本与无人机视频模态的交互并提升特征表示的鲁棒性。为减少无人机视频复杂背景的干扰,我们引入了跨模态交互特征融合池化机制,专注于目标区域的特征提取和匹配,以最小化噪声影响。在两个自建的无人机视频-文本数据集上进行的广泛实验表明,MSAM在无人机视频-文本检索任务中优于其他现有方法。源代码和数据集将公开发布。

英文摘要

With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.

2508.05769 2026-06-10 cs.CV 版本更新

Improving Masked Style Transfer using Blended Partial Convolution

通过混合部分卷积改进遮蔽风格迁移

Seyed Hadi Seyed, Ayberk Cansever, David Hart

发表机构 * East Carolina University(东卡罗来纳大学)

AI总结 本文提出基于部分卷积的风格迁移网络,精准应用于目标区域,并通过内部混合技术弥补区域选择的不完美,提升视觉和量化效果。

详情
Journal ref
IEEE ACCESS Vol. 14 2026
AI中文摘要

随着基于卷积和Transformer的神经网络的发展,艺术风格迁移早已成为可能。大多数算法将艺术风格迁移应用于整幅图像,但个别用户可能只需要对图像中的特定区域进行风格迁移。标准做法是在风格化之后简单地对图像施加掩码。本文表明,这种做法往往无法恰当地捕捉目标区域的风格特征。我们提出了一种基于部分卷积的风格迁移网络,能够准确地将风格特征仅应用于目标区域。此外,我们还提出了网络内部的混合技术,以弥补区域选择的不完美。我们通过SA-1B数据集中的示例表明,该方法在视觉和定量指标上均改进了风格化效果。代码公开于 https://github.com/davidmhart/StyleTransferMasked 。

英文摘要

Artistic style transfer has long been possible with the advancements of convolution- and transformer-based neural networks. Most algorithms apply the artistic style transfer to the whole image, but individual users may only need to apply a style transfer to a specific region in the image. The standard practice is to simply mask the image after the stylization. This work shows that this approach tends to improperly capture the style features in the region of interest. We propose a partial-convolution-based style transfer network that accurately applies the style features exclusively to the region of interest. Additionally, we present network-internal blending techniques that account for imperfections in the region selection. We show that this visually and quantitatively improves stylization using examples from the SA-1B dataset. Code is publicly available at https://github.com/davidmhart/StyleTransferMasked.
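As background for the partial-convolution idea mentioned in this abstract, the sketch below implements the generic partial convolution operation (masked inputs, renormalised by the number of valid pixels, with an updated mask). It is an illustration on a single channel in pure Python, not the authors' network. The function name and the toy values are assumptions.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution with 'valid' padding.

    Only pixels where mask == 1 contribute. Each output is rescaled by
    window_size / valid_count so that missing pixels do not darken it.
    Returns (output, updated_mask). The updated mask is 1 wherever the
    window contained at least one valid pixel.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    window_size = kh * kw
    out = [[0.0] * out_w for _ in range(out_h)]
    new_mask = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            valid = 0
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    m = mask[i + di][j + dj]
                    valid += m
                    acc += kernel[di][dj] * image[i + di][j + dj] * m
            if valid > 0:
                out[i][j] = acc * (window_size / valid) + bias
                new_mask[i][j] = 1
    return out, new_mask


# Region of interest (mask == 1) holds the value 2.0; the rest holds 100.0.
image = [[2.0, 2.0, 100.0],
         [2.0, 2.0, 100.0],
         [2.0, 2.0, 100.0]]
mask = [[1, 1, 0],
        [1, 1, 0],
        [1, 1, 0]]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]  # 3x3 mean filter

out, new_mask = partial_conv2d(image, mask, kernel)

# Ordinary convolution over the same window, ignoring the mask.
plain = sum(kernel[r][c] * image[r][c] for r in range(3) for c in range(3))
```

With the mean filter, the partial convolution returns approximately 2.0 for the window, so the values outside the region do not leak in, whereas the ordinary convolution mixes in the 100.0 column.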

2407.09510 2026-06-10 cs.CV 版本更新

3DGS.zip: A survey on 3D Gaussian Splatting Compression Methods

3DGS.zip:3D高斯泼溅压缩方法综述

Milena T. Bagdasarian, Paul Knoll, Yi-Hsin Li, Florian Barthel, Anna Hilsmann, Peter Eisert, Wieland Morgenstern

发表机构 * Fraunhofer HHI(弗劳恩霍夫海因里希·赫兹研究所) Humboldt-Universität zu Berlin(柏林洪堡大学) Technische Universität Berlin(柏林工业大学)

AI总结 本文综述了3DGS压缩方法,探讨了压缩与紧缩技术,旨在提高3DGS的效率和实用性,通过减少文件大小和高斯数量来优化质量和性能。

Comments 3D Gaussian Splatting compression survey; 3DGS compression; updated discussion; new approaches added; new illustrations

详情
Journal ref
Computer Graphics Forum, Volume 44, Issue 2 (2025)
AI中文摘要

3D高斯泼溅(3DGS)已成为实时辐射场渲染的前沿技术,在质量和速度方面均达到最先进水平。3DGS将场景建模为三维高斯的集合,并优化其附加属性以符合场景的几何和视觉特性。尽管在渲染速度和图像保真度方面具有优势,3DGS显著的存储和内存需求限制了其在移动设备或头显上的应用。为应对这些挑战并提升3DGS的实用性,本文对压缩(compression)和紧缩(compaction)技术进行了全面而详细的考察。我们将现有方法分为两类:压缩侧重于减小文件大小,紧缩旨在减少高斯数量。两类方法都力求保持或提升质量,同时最小化各自的目标量:压缩对应文件大小,紧缩对应高斯数量。我们介绍了所分析方法背后的基本数学概念,以及关键的实现细节和设计选择。本文详细讨论了各方法之间的异同及其各自的优缺点,并基于关键性能指标和数据集建立了统一的比较框架。由于这些方法是在短时间内并行发展的,目前尚无全面的比较;本综述首次提出了评估3DGS压缩技术的统一框架。我们维护一个网站,并将定期更新新出现的方法: https://w-m.github.io/3dgs-compression-survey/ 。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a cutting-edge technique for real-time radiance field rendering, offering state-of-the-art performance in terms of both quality and speed. 3DGS models a scene as a collection of three-dimensional Gaussians, with additional attributes optimized to conform to the scene's geometric and visual properties. Despite its advantages in rendering speed and image fidelity, 3DGS is limited by its significant storage and memory demands. These high demands make 3DGS impractical for mobile devices or headsets, reducing its applicability in important areas of computer graphics. To address these challenges and advance the practicality of 3DGS, this survey provides a comprehensive and detailed examination of compression and compaction techniques developed to make 3DGS more efficient. We classify existing methods into two categories: compression, which focuses on reducing file size, and compaction, which aims to minimize the number of Gaussians. Both methods aim to maintain or improve quality, each by minimizing its respective attribute: file size for compression and Gaussian count for compaction. We introduce the basic mathematical concepts underlying the analyzed methods, as well as key implementation details and design choices. Our report thoroughly discusses similarities and differences among the methods, as well as their respective advantages and disadvantages. We establish a consistent framework for comparing the surveyed methods based on key performance metrics and datasets. Specifically, since these methods have been developed in parallel and over a short period of time, currently, no comprehensive comparison exists. This survey, for the first time, presents a unified framework to evaluate 3DGS compression techniques. We maintain a website that will be regularly updated with emerging methods: https://w-m.github.io/3dgs-compression-survey/ .
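The distinction between compaction (fewer Gaussians) and compression (fewer bytes) can be illustrated with a toy sketch. Opacity-threshold pruning and uniform 8-bit quantisation are used here only as simple stand-ins chosen for illustration. They are assumptions, not a summary of the surveyed methods, and the field names are hypothetical.

```python
import struct

def compact(gaussians, min_opacity):
    """Compaction stand-in: keep only Gaussians at or above an opacity threshold."""
    return [g for g in gaussians if g["opacity"] >= min_opacity]

def quantize(values, lo, hi, bits=8):
    """Compression stand-in: map floats in [lo, hi] to integer codes."""
    levels = (1 << bits) - 1
    return [round((min(max(v, lo), hi) - lo) / (hi - lo) * levels) for v in values]

def dequantize(codes, lo, hi, bits=8):
    """Map integer codes back to approximate float values."""
    levels = (1 << bits) - 1
    return [lo + c / levels * (hi - lo) for c in codes]


gaussians = [
    {"position": (0.10, 0.20, 0.30), "opacity": 0.90},
    {"position": (0.50, 0.40, 0.10), "opacity": 0.02},
    {"position": (0.80, 0.70, 0.60), "opacity": 0.75},
    {"position": (0.25, 0.95, 0.05), "opacity": 0.01},
]

# Compaction reduces the number of primitives.
kept = compact(gaussians, min_opacity=0.05)

# Compression reduces the bytes needed per stored attribute.
coords = [c for g in kept for c in g["position"]]
raw_bytes = struct.pack(f"<{len(coords)}f", *coords)  # float32 storage
codes = quantize(coords, 0.0, 1.0)
packed = bytes(codes)                                  # one byte per value
restored = dequantize(codes, 0.0, 1.0)
max_error = max(abs(a - b) for a, b in zip(coords, restored))
```

The two steps are independent: pruning changes how many Gaussians are stored, while quantisation changes how many bytes each stored value takes, at the cost of a bounded rounding error.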

2408.07922 2026-06-10 cs.CV cs.LG 版本更新

A Deep Features-Based Approach Using Modified ResNet50 and Gradient Boosting for Visual Sentiments Classification

基于改进ResNet50和梯度提升的深度特征方法用于视觉情感分类

Arslan Bisharat, Muhammad Mubeen, Arslan Akram, Saadullah Farooq Abbasi, Muhammad Salman Ali, Muhammad Usman Tariq

发表机构 * Department of Computer Science(计算机科学系) Loyola University Chicago(芝加哥洛约拉大学) University of the People(人民大学) The Superior University Lahore(拉合尔超级大学) University of Birmingham(伯明翰大学)

AI总结 本文提出一种结合改进ResNet50提取深度特征和梯度提升算法的情感分类方法,通过两个基准数据集验证,优于现有深度学习和机器学习模型。

Comments 4 pages, 4 figures, 3 tables, IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR) 2024

详情
AI中文摘要

视觉情感分析(VSA)用途广泛,这是其日益受到关注的原因之一。由于以往研究主要集中在文本等单一模态的情感分析(SA)上,高效处理包含视觉信息的社交媒体数据并不容易。此外,大多数视觉情感研究未能充分地对情感进行分类,因为它们主要关注简单合并各模态属性,而未深入研究其间的复杂关系。为此,本文提出融合深度学习与机器学习算法的方法。本研究采用基于深度特征的多类分类方法,从改进的ResNet50中提取深度特征,并使用梯度提升算法对包含情感内容的照片进行分类。该方法在两个基准数据集CrowdFlower和GAPED上进行了全面评估。最后,将所提出的方法与最先进的深度学习和机器学习模型进行了比较。与现有最先进的方法相比,所提出的方法在上述数据集上表现出色。

英文摘要

The versatile nature of Visual Sentiment Analysis (VSA) is one reason for its rising profile. Efficiently managing social media data that contains visual information is difficult because previous research has concentrated on Sentiment Analysis (SA) of single modalities, such as text. In addition, most visual sentiment studies do not adequately classify sentiment because they mainly focus on simply merging modal attributes without investigating their intricate relationships. This motivated a fusion of deep learning and machine learning algorithms. In this research, a deep feature-based method for multiclass classification is used: deep features are extracted from a modified ResNet50, and a gradient boosting algorithm classifies photos containing emotional content. The approach is thoroughly evaluated on two benchmark datasets, CrowdFlower and GAPED. Finally, the proposed strategy is compared against cutting-edge deep learning and machine learning models. Compared with state-of-the-art approaches, the proposed method demonstrates exceptional performance on the datasets presented.

2305.19369 2026-06-10 eess.IV cs.CV physics.med-ph 版本更新

The Brain Tumor Segmentation (BraTS) Challenge 2023: Glioma Segmentation in Sub-Saharan Africa Patient Population (BraTS-Africa)

2023年脑肿瘤分割(BraTS)挑战:撒哈拉以南非洲患者群体的胶质瘤分割(BraTS-Africa)

Maruf Adewole, Jeffrey D. Rudie, Anu Gbadamosi, Oluyemisi Toyobo, Confidence Raymond, Dong Zhang, Olubukola Omidiji, Rachel Akinola, Mohammad Abba Suwaid, Adaobi Emegoakor, Nancy Ojo, Kenneth Aguh, Chinasa Kalaiwo, Gabriel Babatunde, Afolabi Ogunleye, Yewande Gbadamosi, Kator Iorpagher, Evan Calabrese, Mariam Aboian, Marius Linguraru, Jake Albrecht, Benedikt Wiestler, Florian Kofler, Anastasia Janas, Dominic LaBella, Anahita Fathi Kzerooni, Hongwei Bran Li, Juan Eugenio Iglesias, Keyvan Farahani, James Eddy, Timothy Bergquist, Verena Chung, Russell Takeshi Shinohara, Walter Wiggins, Zachary Reitman, Chunhao Wang, Xinyang Liu, Zhifan Jiang, Ariana Familiar, Koen Van Leemput, Christina Bukas, Maire Piraud, Gian-Marco Conte, Elaine Johansson, Zeke Meier, Bjoern H Menze, Ujjwal Baid, Spyridon Bakas, Farouk Dako, Abiodun Fatade, Udunna C Anazodo

发表机构 * Medical Artificial Intelligence Laboratory (MAI Lab)(医学人工智能实验室(MAI实验室)) Department of Radiation Biology, Radiotherapy and Radiodiagnosis, University of Lagos(拉各斯大学放射生物学、放射治疗与放射诊断系) Department of Radiology, University of California, San Diego(加州大学圣地亚哥分校放射科) Crestview Radiology Limited(Crestview放射科有限公司) Lagos University Teaching Hospital(拉各斯大学教学医院) Lagos State University Teaching Hospital, Ikeja, Lagos, Nigeria(拉各斯州立大学教学医院,伊凯贾,拉各斯,尼日利亚) NSIA-Kano Diagnostic Center, Kano Nigeria(NSIA-卡诺诊断中心,卡诺,尼日利亚) Nnamdi Azikiwe University Teaching Hospital, Nnewi, Anambra State, Nigeria(纳姆迪·阿齐基韦大学教学医院,恩纽伊,阿南布拉州,尼日利亚) Federal Medical Centre, Abeokuta, Ogun State, Nigeria(阿贝奥库塔联邦医疗中心,奥贡州,尼日利亚) Federal Medical Centre, Umahia, Abia State, Nigeria(乌马希亚联邦医疗中心,阿比亚州,尼日利亚) National Hospital Abuja, FCT, Nigeria(阿布贾国家医院,联邦首都区,尼日利亚) Benue State University Teaching Hospital, Markurdi, Benue State, Nigeria(贝努埃州立大学教学医院,马库尔迪,贝努埃州,尼日利亚) Duke University Medical Center, Department of Radiology, USA(杜克大学医学中心,放射科,美国) University of California San Francisco, CA, USA(加州大学旧金山分校,CA,美国) Yale University, New Haven, CT, USA(耶鲁大学,纽黑文,CT,美国) Children's National Hospital, Washington DC, USA(国家儿童医院,华盛顿特区,美国) George Washington University, Washington DC, USA(乔治·华盛顿大学,华盛顿特区,美国) Sage Bionetworks, USA(Sage Bionetworks,美国) Department of Neuroradiology, Technical University of Munich, Munich, Germany(慕尼黑工业大学神经放射科,慕尼黑,德国) Helmholtz Research Center, Munich, Germany(亥姆霍兹研究中心,慕尼黑,德国) Duke University Medical Center, Department of Radiation Oncology, USA(杜克大学医学中心,放射肿瘤科,美国) Children’s Hospital of Philadelphia, University of Pennsylvania, Philadelphia, PA, USA(费城儿童医院,宾夕法尼亚大学,费城,PA,美国) Center for AI and Data Science for Integrated Diagnostics (AI2D) & Center for Biomedical Image Computing and Analytics (CBICA), University of Pennsylvania, Philadelphia, PA, USA(人工智能与数据科学整合诊断中心(AI2D)及生物医学影像计算与分析中心(CBICA),宾夕法尼亚大学,费城,PA,美国) Athinoula A Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Boston, MA, USA(Athinoula A Martinos生物医学影像中心,马萨诸塞总医院,波士顿,MA,美国) University of Zurich, Switzerland(苏黎世大学,瑞士) Cancer Imaging Program, National Cancer Institute, National Institutes of Health, Bethesda, MD 20814, USA(癌症成像计划,国家癌症研究所,国家卫生研究院,贝塞斯达,MD 20814,美国) Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania, Philadelphia, USA(临床流行病学与生物统计学中心,宾夕法尼亚大学,费城,美国) Department of Applied Mathematics and Computer Science, Technical University of Denmark, Denmark(应用数学与计算机科学系,丹麦技术大学,丹麦) Mayo Clinic, MN, USA(梅奥诊所,MN,美国) Precision FDA, U.S. Food and Drug Administration, Silver Spring, MD, USA(Precision FDA,美国食品药品监督管理局,银泉,MD,美国) Booz Allen Hamilton, McLean, VA, USA(Booz Allen Hamilton,麦克莱恩,VA,美国) Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA(放射科,佩雷尔曼医学院,宾夕法尼亚大学,费城,PA,美国) Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA(病理学与实验室医学系,佩雷尔曼医学院,宾夕法尼亚大学,费城,PA,美国) Center for Global Health, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA(全球健康中心,佩雷尔曼医学院,宾夕法尼亚大学,费城,宾夕法尼亚,美国) Montreal Neurological Institute, McGill University, Montreal, Canada(蒙特利尔神经病学研究所,麦吉尔大学,蒙特利尔,加拿大) Department of Medicine, University of Cape Town, South Africa(医学系,开普敦大学,南非) Department of Radiation Medicine, University of Cape Town, South Africa(放射医学系,开普敦大学,南非)

AI总结 研究探讨了在资源有限的撒哈拉以南非洲地区,利用先进机器学习方法进行胶质瘤分割的可行性,旨在改进该地区胶质瘤的诊断和治疗。

Comments arXiv admin note: text overlap with arXiv:2107.02314

详情
AI中文摘要

胶质瘤是最常见的原发性脑肿瘤。尽管胶质瘤相对罕见,但它是最致命的癌症类型之一,确诊后生存期不足2年。胶质瘤诊断困难、治疗棘手,且对常规疗法具有内在耐药性。多年来,旨在改善胶质瘤诊断和治疗的大量研究降低了全球北方的死亡率,而低收入和中等收入国家(LMICs)患者的生存机会却没有改变,撒哈拉以南非洲(SSA)人群的情况则明显更差。胶质瘤患者的长期生存与在脑MRI上识别出相应的病理特征并经组织病理学确认有关。自2012年以来,脑肿瘤分割(BraTS)挑战赛一直在评估用于检测、表征和分类胶质瘤的最先进机器学习方法。然而,这些最先进的方法能否在SSA广泛应用尚不清楚,原因在于当地大量使用质量较低的MRI设备,图像对比度和分辨率较差;更重要的是,患者往往在疾病晚期才就诊,而且SSA的胶质瘤具有独特特征(即疑似更高的大脑胶质瘤病(gliomatosis cerebri)发生率)。因此,BraTS-Africa挑战赛提供了一个独特的机会,通过BraTS挑战赛将SSA的脑MRI胶质瘤病例纳入全球努力,以开发和评估用于资源有限环境中胶质瘤检测和表征的计算机辅助诊断(CAD)方法,而在这些环境中,CAD工具更有可能改变医疗状况。

英文摘要

Gliomas are the most common type of primary brain tumors. Although gliomas are relatively rare, they are among the deadliest types of cancer, with survival of less than 2 years after diagnosis. Gliomas are challenging to diagnose, hard to treat, and inherently resistant to conventional therapy. Years of extensive research to improve diagnosis and treatment of gliomas have decreased mortality rates across the Global North, while chances of survival among individuals in low- and middle-income countries (LMICs) remain unchanged and are significantly worse in Sub-Saharan Africa (SSA) populations. Long-term survival with glioma is associated with the identification of appropriate pathological features on brain MRI and confirmation by histopathology. Since 2012, the Brain Tumor Segmentation (BraTS) Challenge has evaluated state-of-the-art machine learning methods to detect, characterize, and classify gliomas. However, it is unclear if the state-of-the-art methods can be widely implemented in SSA given the extensive use of lower-quality MRI technology, which produces poor image contrast and resolution, and, more importantly, the propensity for late presentation of disease at advanced stages as well as the unique characteristics of gliomas in SSA (i.e., suspected higher rates of gliomatosis cerebri). Thus, the BraTS-Africa Challenge provides a unique opportunity to include brain MRI glioma cases from SSA in global efforts through the BraTS Challenge to develop and evaluate computer-aided-diagnostic (CAD) methods for the detection and characterization of glioma in resource-limited settings, where CAD tools are more likely to transform healthcare.