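arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.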

2606.09871 2026-06-10 cs.CV cs.AI cs.LG New submission

Kwai Keye-VL-2.0 Technical Report

Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang

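Affiliations: Kuaishou Group

AI summary: Proposes Keye-VL-2.0, an open-source MoE multimodal foundation model that is the first to adapt DeepSeek Sparse Attention to a GQA architecture, supports lossless 256K-context processing, and addresses catastrophic forgetting in multi-task alignment through cross-modal multi-teacher on-policy distillation combined with Context-RL and Video-RL, reaching state-of-the-art results among similar-scale models on long-video understanding and agent tasks.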

Comments 31 pages, 11 figures

Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

2606.10819 2026-06-10 cs.CV cs.AI New submission

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

Affiliations: National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing (SBIIP), Beijing Institute of Technology; Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences; Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology; School of Mechatronical Engineering, Beijing Institute of Technology; School of Earth and Space Sciences, Peking University; School of Electronics, Peking University; School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing

AI summary: Proposes Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities and nine task categories through full-granularity vision-language alignment, spatial-linguistic isomorphic serialization, and progressive cross-modality adaptation, matching or outperforming 4B-72B models on multiple benchmarks.

Abstract

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG testset for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS testset for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
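For readers unfamiliar with the grounding metric quoted above, P@0.5 is conventionally the share of predicted boxes whose intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below illustrates that convention only; the `(x1, y1, x2, y2)` box format, the function names, and the inclusive threshold are assumptions made for illustration, not details taken from the Earth-OneVision paper or the OPT-RSVG benchmark.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= threshold)
    return hits / len(preds)


# Hypothetical data: the first prediction is exact, the second barely overlaps.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
score = precision_at_iou(preds, gts)
```

Some benchmarks use a strict inequality at the threshold, so the exact comparison should be checked against the benchmark's own evaluation script.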

2606.10887 2026-06-10 cs.CV New submission

From Perception to Decision: Information Flow of Auditory and Visual Perception in Multimodal Large Language Models

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

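AI summary: Studies the pathways and integration mechanisms of audio and visual information flow in audio-visual large language models (AVLLMs), identifies two routing patterns (a sequential flow and parallel streams), and shows that tokens can be discarded once their information has been transferred, improving efficiency.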

Comments 40 pages, 29 figures

Abstract

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2606.10400 2026-06-10 cs.CL cs.CV Cross-list

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

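Affiliations: Lossfunk; Indian Institute of Technology Roorkee; Raeth AI

AI summary: Builds a 540-image benchmark that generates four phrasing variants per image to measure how much vision-language models rely on textual priors; every model degrades on the hardest variant, with open models degrading most, and a no-image ablation together with further analyses confirms genuine image dependence.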

Comments 17 pages, 7 figures, Submitted to EMNLP 2026

Abstract

Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

2606.10803 2026-06-10 cs.CL cs.AI cs.CV Cross-list

Envision4D: Envisioning the Visual Future of Autonomous Driving via Feed-Forward 4D Gaussian Splatting

Qi Song, Yifei He, Chi Zhang, Zheng Fu, Xuhe Zhao, Mengmeng Yang, Kun Jiang, Rui Huang, Diange Yang

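Affiliations: Tsinghua University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes Envision4D, a fully self-supervised feed-forward framework that achieves pose-free future extrapolation through future pose prediction, in-layer temporal attention, and conditioned motion lifting, reaching state-of-the-art performance in forecasting dynamic autonomous-driving scenes.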

Comments Project Page: https://maggiesong7.github.io/research/Envision4D/

Abstract

Forecasting the future evolution of dynamic scenes is crucial in autonomous driving. However, existing feed-forward paradigms are primarily designed for interpolation. When extended to future extrapolation, they suffer from ghosting artifacts under large displacements and are constrained by simplified motion assumptions or strict future priors. To overcome these challenges, we propose Envision4D, a fully self-supervised feed-forward framework for pose-free future extrapolation. Specifically, we introduce a Future Pose Prediction module that infers future camera parameters via an iterative denoising process. Furthermore, to capture non-linear dynamics, we propose In-layer Temporal Attention and employ Conditioned Motion Lifting, which transforms the highly uncertain extrapolation process into robust relational mappings. Finally, a Progressive Training Strategy is utilized to stabilize unsupervised motion learning against error accumulation. Extensive experiments demonstrate that Envision4D achieves state-of-the-art performance, significantly outperforming existing methods in future view synthesis.

2605.29662 2026-06-10 cs.CV cs.RO New submission

SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

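Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary: To address the problem that existing pruning methods for accelerating vision-language-action inference overlook visual information needed by deep layers, proposes the SAFE-Pruner framework, which performs forward-looking token pruning using future-layer attention cues and semantic attention consistency, achieving up to 1.89x speedup with a success-rate drop below 1.7% in simulation and real-world experiments.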

Abstract

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.
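The contrast the abstract draws between shallow-layer pruning and forward-looking pruning can be illustrated with a toy top-k selection. This is not SAFE-Pruner's algorithm: the linear blend, the `alpha` weight, the `keep_ratio` parameter, and the idea of supplying a ready-made per-token forecast are all assumptions made purely for illustration.

```python
def prune_tokens(shallow_scores, forecast_scores, keep_ratio=0.5, alpha=0.5):
    """Return the sorted indices of visual tokens to keep.

    shallow_scores:  per-token saliency observed at the pruning layer.
    forecast_scores: hypothetical per-token estimate of deep-layer saliency.
    alpha:           weight on the forecast (0.0 reduces to shallow-only pruning).
    """
    if len(shallow_scores) != len(forecast_scores):
        raise ValueError("score lists must have the same length")
    n = len(shallow_scores)
    if n == 0:
        return []
    k = max(1, round(n * keep_ratio))
    blended = [(1 - alpha) * s + alpha * f
               for s, f in zip(shallow_scores, forecast_scores)]
    # Rank tokens by blended saliency, keep the top k, restore sequence order.
    ranked = sorted(range(n), key=lambda i: blended[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical scores: token 3 looks unimportant early but matters in deep layers.
shallow = [0.5, 0.3, 0.15, 0.05]
forecast = [0.05, 0.1, 0.05, 0.8]
kept_shallow_only = prune_tokens(shallow, forecast, keep_ratio=0.5, alpha=0.0)
kept_forward_looking = prune_tokens(shallow, forecast, keep_ratio=0.5, alpha=0.5)
```

With these made-up numbers, shallow-only pruning drops token 3 while the blended ranking retains it, which is the failure mode the abstract says forward-looking pruning is meant to avoid.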

2606.10025 2026-06-10 cs.RO cs.CV cs.LG Cross-list

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan, Shubham Tulsiani, David Held

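AI summary: Proposes the GHOST framework, which generalizes visuomotor manipulation policies by factorizing control into high-level sub-goal prediction and a low-level goal-conditioned controller, and uses human demonstrations to adapt to novel objects and task variations.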

Comments Accepted at RSS 2026

Abstract

We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
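The spatial interface described in the abstract (project a predicted 3D goal into the image plane, then represent it as a heatmap) can be sketched with a pinhole camera model. The code below is a minimal illustration under stated assumptions: the goal is already expressed in the camera frame, the intrinsics `fx, fy, cx, cy` and the Gaussian width `sigma` are hypothetical values, and only the goal position (not the full end-effector pose) is encoded. It is not the paper's implementation.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def gaussian_heatmap(u, v, width, height, sigma=2.0):
    """Height x width grid peaking at pixel (u, v) with an isotropic Gaussian."""
    return [
        [math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]


# Hypothetical sub-goal 2 m in front of the camera, with made-up intrinsics.
u, v = project_point((0.5, -0.25, 2.0), fx=80.0, fy=80.0, cx=32.0, cy=24.0)
heat = gaussian_heatmap(u, v, width=64, height=48)
```

A heatmap of this kind has the same spatial layout as the camera image, so it can be stacked with the image channels as an extra conditioning input.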

2606.10299 2026-06-10 cs.AI cs.CV cs.MA Cross-list

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Doeon Kwon, Junho Bang

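Affiliations: Space Zero, Inc.

AI summary: Shows experimentally that in spatial-query settings geometry must lead memory recall while visibility must be judged separately from recall, and proposes computing the visibility predicate with a ray-versus-voxel DDA.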

Comments 23 pages, 6 figures

Abstract

Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
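The visibility predicate in the abstract separates "remembered" from "currently visible" by walking a ray through stored voxel geometry. The sketch below shows a generic grid traversal in the style of a digital differential analyzer; the set-of-occupied-voxels representation, the `voxel_size` parameter, and the choice to ignore the target's own voxel are assumptions for illustration and are not taken from the authors' system.

```python
import math


def is_visible(origin, target, occupied, voxel_size=1.0):
    """True if no occupied voxel lies between the origin's and the target's voxels.

    origin, target: (x, y, z) world coordinates.
    occupied:       set of integer (i, j, k) voxel indices that block sight.
    """
    cur = [math.floor(c / voxel_size) for c in origin]
    end = [math.floor(c / voxel_size) for c in target]
    direction = [t - o for o, t in zip(origin, target)]

    # The ray is origin + t * direction, with the target at t = 1.
    step, t_max, t_delta = [], [], []
    for axis in range(3):
        d = direction[axis]
        if d > 0:
            step.append(1)
            t_max.append(((cur[axis] + 1) * voxel_size - origin[axis]) / d)
            t_delta.append(voxel_size / d)
        elif d < 0:
            step.append(-1)
            t_max.append((cur[axis] * voxel_size - origin[axis]) / d)
            t_delta.append(-voxel_size / d)
        else:
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)

    while cur != end:
        axis = t_max.index(min(t_max))  # next voxel boundary the ray crosses
        if t_max[axis] > 1.0:
            break  # ray segment ended before reaching another voxel
        cur[axis] += step[axis]
        t_max[axis] += t_delta[axis]
        if cur != end and tuple(cur) in occupied:
            return False  # an intermediate voxel blocks the line of sight
    return True


# Hypothetical wall occupying the column of voxels at x = 2.
wall = {(2, y, 0) for y in range(5)}
in_front = is_visible((0.5, 0.5, 0.5), (1.5, 0.5, 0.5), wall)
behind = is_visible((0.5, 0.5, 0.5), (4.5, 0.5, 0.5), wall)
```

A recall query would still return a memory anchored behind the wall; a check like this only answers whether that location can currently be seen from the agent's position.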

2606.10614 2026-06-10 cs.RO cs.CV cs.LG Cross-list

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

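Affiliations: KAIST

AI summary: Proposes the Dexterous Point Policy framework, which learns dexterous manipulation policies from human videos through a unified 3D keypoint representation, requires no robot demonstrations, and reaches a 75% success rate on real-robot tasks.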

Abstract

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

2606.10818 2026-06-10 cs.RO cs.CV Cross-list

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen, Yilun Du

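Affiliations: Harvard University; Stanford University

AI summary: Proposes the IMPACT framework, which decouples forceful manipulation tasks into task planning and internal-model-based predictive control, and demonstrates advantages in success rate, generalization, safety, and energy efficiency through simulation and real-world experiments.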

Comments Project website: https://gao-jiawei.com/IMPACT/

Abstract

Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.
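The abstract's remark that implicit approaches generalize poorly across object weights can be seen in a one-dimensional simplification. Assume a pure stiffness (spring-like) impedance controller holding a payload against gravity: the force comes only from the gap between the commanded target and the actual position, so the gap needed at rest scales with the payload mass. The model, the function names, and the numbers below are illustrative assumptions, not the paper's controller.

```python
def steady_state_offset(mass_kg, stiffness_n_per_m, gravity=9.81):
    """Target offset (m) a pure stiffness controller needs to hold a mass at rest.

    At rest the spring-like force K * offset balances the weight m * g.
    """
    if stiffness_n_per_m <= 0:
        raise ValueError("stiffness must be positive")
    return mass_kg * gravity / stiffness_n_per_m


def sag_with_unseen_mass(train_mass_kg, test_mass_kg, stiffness_n_per_m, gravity=9.81):
    """Position error (m) when an offset tuned for one mass is reused for another."""
    return (steady_state_offset(test_mass_kg, stiffness_n_per_m, gravity)
            - steady_state_offset(train_mass_kg, stiffness_n_per_m, gravity))


# Hypothetical numbers: offsets learned with a 2 kg tool, then a 4 kg tool is used.
offset_2kg = steady_state_offset(2.0, 500.0)
sag = sag_with_unseen_mass(2.0, 4.0, 500.0)
```

Under this toy model, a policy that has implicitly memorized the offset for one payload leaves a residual error proportional to the mass difference when the payload changes.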

2510.14836 2026-06-10 cs.CV cs.RO Replacement

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Beijing Zhongke Huiling Robot Technology Co.

AI summary: Proposes the QDepth-VLA framework, which strengthens the spatial perception and reasoning of VLA models through an auxiliary depth prediction task, improving manipulation performance in simulation and real-world tasks.

2603.20850 2026-06-10 cs.CV cs.RO Replacement

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan

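Affiliations: Meta Reality Labs; Rutgers University

AI summary: Proposes the Glove2Hand framework, which translates videos of multi-modal sensing gloves into photorealistic bare hands while preserving physical interaction dynamics; introduces a 3D Gaussian hand model and a diffusion-based hand restorer, and creates the HandSense dataset, improving downstream task performance.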

Comments CVPR 2026 Highlight. This version includes the motion retarget process in the appendix

Abstract

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

2512.06628 2026-06-10 cs.RO cs.CV Replacement

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li

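Affiliations: Tsinghua University; X Square Robot; Sun Yat-sen University; HKUST

AI summary: Proposes MIND-V, a hierarchical world model that synthesizes physically plausible long-horizon robotic manipulation videos through semantic reasoning, a behavioral semantic bridge, and motor video generation, combined with RL-based physical alignment.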

Abstract

Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.
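The GRPO post-training stage mentioned in the abstract relies on scoring a group of sampled outputs and normalizing each reward against the group. The sketch below shows only that group-relative normalization step as it is commonly described for GRPO; the reward values are hypothetical stand-ins for a scalar score such as the PFC reward, and the use of the population standard deviation with a small `eps` is an implementation assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    if len(rewards) < 2:
        raise ValueError("a group needs at least two sampled outputs")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against zero
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts generated from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts scoring above the group mean receive positive advantages and are reinforced, while those below are discouraged; when every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal.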

2606.10166 2026-06-10 cs.CV New submission

Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization

Quang Long Ho Ngo, Zimin Xia, Alexandre Alahi

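AI summary: Proposes a module that fuses satellite imagery with planimetric maps through cross-modal conditioning and a patch-level fusion rule, reducing mean localization error by 30.13%.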

Abstract

Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders, and we demonstrate that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality's encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.

2606.10876 2026-06-10 cs.CV 新提交

Advancing Wood Identification in the Philippines: Utilizing the Xylorix Platform for Efficient AI Model Development and Deployment for Five Key Species

推进菲律宾木材识别：利用Xylorix平台高效开发和部署五种关键树种的AI模型

Rosalie C. Mendoza, Vivian C. Daracan, Arlene D. Romano, Ronniel D. Manalo, Xin Jie Tang, Yi Hong Wong, Yong Haur Tay

发表机构 * College of Forestry and Natural Resources, University of the Philippines Los Banos（菲律宾大学洛桑分校林业与自然资源学院）； Agritix

AI总结本研究利用Xylorix平台，让无编程经验的木材科学家为五种菲律宾硬木开发并部署宏观木材识别AI模型，AUC达0.969-1.000，四种达AA级，证明非程序员可构建适合现场部署的可靠模型。

AI中文摘要

非法采伐和木材贸易在菲律宾持续构成重大挑战，准确的木材物种识别对执法至关重要，但受限于专业设备和专业知识。本研究旨在评估木材科学家能否在没有编程专业知识的情况下，利用Xylorix平台开发和部署宏观木材识别的AI模型，聚焦五种菲律宾硬木：Mangium (Acacia mangium Willd.)、Rain Tree [Samanea saman (Jacq.) Merr.]、Banuyo (Wallaceodendron celebicum Koord.)、Tindalo [Afzelia rhomboidea (Blanco) Vidal] 和 Ipil [Intsia bijuga (Colebr.) O. Kuntze]。二元分类器使用来自260个标本的10,663张经过验证的横截面图像进行训练，并通过标本级平均评分进行评估，以模拟操作现场条件。ROC曲线下面积（AUC）值范围为0.969（Ipil）到1.000（Mangium），平均精度（AP）值范围为0.589（Rain Tree）到1.000（Mangium）。五个物种中有四个达到AA级（AUC和AP均≥0.90）；Rain Tree获得AE级（AUC≥0.90，AP<0.60），原因是其正测试集较小（3个标本）导致AP压缩。所有五个分类器以近乎完美的保真度将目标标本排在非目标标本之上。标本级错误分析显示，Ipil有9个假阴性，主要源于局部图像伪影；Rain Tree有3个假阳性，Tindalo有1个假阳性，由共享的族级解剖特征引起。这些发现表明，非程序员可以利用Xylorix平台构建操作可靠的木材识别模型，适用于供应链检查点的现场部署。

英文摘要

Illegal logging and timber trade continue to pose significant challenges in the Philippines, where accurate wood species identification is essential for enforcement but limited by the need for specialised equipment and expertise. This study aims to evaluate whether AI models for macroscopic wood identification can be developed and deployed by wood scientists without programming expertise using the Xylorix platform, focusing on five Philippine hardwood species: Mangium (Acacia mangium Willd.), Rain Tree [Samanea saman (Jacq.) Merr.], Banuyo (Wallaceodendron celebicum Koord.), Tindalo [Afzelia rhomboidea (Blanco) Vidal], and Ipil [Intsia bijuga (Colebr.) O. Kuntze]. Binary classifiers were trained on 10,663 verified cross-section images from 260 specimens and evaluated using specimen-level mean scoring to mirror operational field conditions. Area Under the ROC Curve (AUC) values ranged from 0.969 (Ipil) to 1.000 (Mangium), and Average Precision (AP) values ranged from 0.589 (Rain Tree) to 1.000 (Mangium). Four of five species achieved AA grade (AUC and AP both ≥ 0.90); Rain Tree received AE (AUC ≥ 0.90, AP < 0.60) due to AP compression from its small positive test set (3 specimens). All five classifiers rank their target specimens above non-target specimens with near-perfect fidelity. Specimen-level error analysis revealed 9 false negatives for Ipil, primarily stemming from localized image artifacts, as well as 3 false positives for Rain Tree and 1 false positive for Tindalo caused by shared tribal-level anatomical traits. These findings demonstrate that non-programmers can leverage the Xylorix platform to construct operationally reliable wood identification models suitable for field deployment at supply chain checkpoints.
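
The specimen-level mean scoring and the AUC/AP pair described in this abstract can be sketched as follows. This is a minimal illustration, not the Xylorix implementation: the function names and the toy scores are assumptions, and the metrics are computed with their textbook definitions (AUC as the probability that a positive outranks a negative, AP as the mean precision at each positive's rank).

```python
from collections import defaultdict


def specimen_mean_scores(image_scores):
    """Average per-image classifier scores into one score per specimen.

    image_scores: iterable of (specimen_id, score) pairs (hypothetical format).
    """
    buckets = defaultdict(list)
    for specimen_id, score in image_scores:
        buckets[specimen_id].append(score)
    return {sid: sum(vals) / len(vals) for sid, vals in buckets.items()}


def roc_auc(scores, labels):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy case: one positive specimen ranked second among ten specimens.
toy_scores = [0.9, 0.8] + [0.1] * 8
toy_labels = [0, 1] + [0] * 8
print(roc_auc(toy_scores, toy_labels), average_precision(toy_scores, toy_labels))
```

In the toy case a single negative outranking the only positive leaves AUC high while AP drops to one half, which is the kind of AP compression the abstract attributes to a small positive test set.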

2509.19936 2026-06-10 cs.CV 版本更新

CapStARE: Capsule-based Sequential Architecture for Robust and Efficient Gaze Estimation

CapStARE: 基于胶囊的序列架构实现鲁棒高效的目光估计

Miren Samaniego, Igor Rodriguez, Elena Lazkano

发表机构 * University of the Basque Country（巴斯克大学）

AI总结提出CapStARE，结合冻结ConvNeXt骨干、注意力路由胶囊和双GRU解码器，在ETH-XGaze等数据集上实现实时高精度目光估计，兼顾空间鲁棒性与计算效率。

Comments Preprint for Pattern Recognition Journal

AI中文摘要

人类目光估计对于人机交互、社交机器人和辅助系统等应用至关重要。然而，在非约束环境中实现准确、可解释且实时的性能仍然具有挑战性。现有的基于外观的方法通常在空间鲁棒性、计算效率和上下文信息的有效利用之间面临权衡。为了解决这一问题，我们引入了CapStARE，一种基于胶囊的架构，它结合了用于高效特征提取的冻结ConvNeXt骨干网络、用于结构化面部推理的基于注意力路由的胶囊形成，以及用于短时域观测窗口上轻量级序列建模的双GRU解码器。这种设计保留了可解释的部分-整体面部关系，同时通过局部上下文一致性提高了预测稳定性。实验结果表明，该方法在ETH-XGaze（3.36）和MPIIFaceGaze（2.65）上表现强劲，同时在Gaze360（9.06）上也具有竞争力的泛化能力，且所有测试均实现实时推理（<10毫秒）。这些发现表明，所提出的方法为现实交互环境中基于外观的目光估计提供了一个实用且鲁棒的框架。相关代码和实验结果公开于：https://github.com/toukapy/capsStare

英文摘要

Human gaze estimation is essential for applications such as human-computer interaction, social robotics, and assistive systems. However, achieving accurate, interpretable, and real-time performance in unconstrained environments remains challenging. Existing appearance-based methods often face trade-offs between spatial robustness, computational efficiency, and effective use of contextual information. To address this, we introduce CapStARE, a capsule-based architecture that combines a frozen ConvNeXt backbone for efficient feature extraction, capsule formation with attention-based routing for structured facial reasoning, and dual GRU decoders for lightweight sequential modeling over short-horizon observation windows. This design preserves interpretable part-whole facial relationships while improving prediction stability through local contextual consistency. Experimental results demonstrate strong performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65), while also generalizing competitively on Gaze360 (9.06), all with real-time inference (<10 ms). These findings suggest that the proposed method provides a practical and robust framework for appearance-based gaze estimation in real-world interactive environments. The related code and experimental results are publicly available at: https://github.com/toukapy/capsStare

2511.19706 2026-06-10 eess.IV cs.CV 版本更新

Selective Disk Bispectrum: A Complete and Rotation Invariant Image Descriptor

选择性圆盘双谱：一种完备且旋转不变的图像描述符

Adele Myers Lantow, Nina Miolane

发表机构 * Department of Physics（物理系）； Department of Electrical and Computer Engineering（电气与计算机工程系）； University of California, Santa Barbara（加州大学圣芭芭拉分校）

AI总结提出选择性圆盘双谱（SDB），一种复值旋转不变向量，在保持图像除方向外所有信息的同时，降低了计算复杂度，并验证了其在噪声分类和多参考对齐中的鲁棒性。

AI中文摘要

旋转不变性是许多计算机视觉任务的基本要求。历史上，这种归纳偏置通过手工设计的旋转不变表示来编码。这些表示紧凑、可解释且计算快速，但以描述能力为代价。最近，架构通过学习表示来实现归纳偏置。这些表示高度描述性，实现了强大的经验性能，但以效率和可解释性为代价。在这项工作中，我们提出了两种范式交叉点上的替代方案。我们引入了选择性圆盘双谱（SDB），一种复值旋转不变向量，它保留了图像除方向外的所有信息。我们的关键理论贡献是选择性圆盘双谱、其逆变换、其（降低的）空间和计算复杂度（与完整圆盘双谱相比），以及其在噪声下的期望和方差。此外，我们提出了数值SDB近似，并为其准确性和旋转不变性提供了理论保证。在经验上，我们验证了SDB在噪声分类任务中的不变性和鲁棒性。我们在旋转图像的多参考对齐上测试了我们的重建算法。

英文摘要

Rotation invariance is a fundamental requirement across many computer vision tasks. Historically, this inductive bias has been encoded through hand-crafted rotation-invariant representations. These are compact, interpretable, and fast to compute, but they come at the cost of descriptive power. More recently, architectures have achieved this inductive bias through learned representations. These are highly descriptive and achieve strong empirical performance, at the cost of efficiency and interpretability. In this work, we propose an alternative at the intersection of both paradigms. We introduce the selective disk bispectrum (SDB), a complex-valued rotation-invariant vector that preserves all information about the image except its orientation. Our key theoretical contributions are the selective disk bispectrum, its inversion, its (reduced) spatial and computational complexities (compared to the full disk bispectrum), and its expectation and variance under noise. Furthermore, we propose a numerical SDB approximation and provide theoretical guarantees for its accuracy and rotation invariance. Empirically, we validate SDB's invariance and robustness to noise on classification tasks, and we test our reconstruction algorithm on multi-reference alignment of rotated images.
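
The invariance idea behind a bispectrum can be illustrated on a simpler object than the disk. The sketch below uses the classical bispectrum of a cyclic signal (think of samples taken around a circle at equally spaced angles), where a rotation becomes a cyclic shift. It is not the selective disk bispectrum from the paper, and all function names are assumptions for illustration. A shift multiplies Fourier coefficient F(k) by a phase that is linear in k, so the product F(k1) F(k2) conj(F(k1+k2)) has zero net phase and is unchanged.

```python
import cmath


def dft(x):
    """Plain O(n^2) discrete Fourier transform of a real or complex list."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def bispectrum(x):
    """Full cyclic bispectrum B[k1][k2] = F(k1) * F(k2) * conj(F(k1 + k2 mod n))."""
    n = len(x)
    f = dft(x)
    return [
        [f[k1] * f[k2] * f[(k1 + k2) % n].conjugate() for k2 in range(n)]
        for k1 in range(n)
    ]


def cyclic_shift(x, s):
    """Rotate the samples by s positions (the discrete analogue of a rotation)."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
rotated = cyclic_shift(signal, 2)
b_original = bispectrum(signal)
b_rotated = bispectrum(rotated)
max_gap = max(
    abs(b_original[i][j] - b_rotated[i][j])
    for i in range(len(signal))
    for j in range(len(signal))
)
print("largest bispectrum difference after rotation:", max_gap)
```

The Fourier coefficients themselves do change under the shift; only the triple product is invariant. The full table has n squared entries, which is the cost the abstract says the selective variant reduces.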

2606.10328 2026-06-10 cs.CV cs.AI 新提交

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

内容诱导的空间-光谱聚合网络用于遥感图像变化检测

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University（控制科学与工程学院，山东大学）

AI总结提出内容引导的空间-光谱集成网络(CSI-Net)，通过空间推理、光谱差异和内容引导集成模块融合全局空间细节与光谱差异信息，有效抑制未变化区域差异，在三个数据集上取得最优性能。

AI中文摘要

空间和光谱信息的整合有利于提高变化检测性能。然而，现有方法无法有效抑制未变化区域中空间和光谱差异的影响。为了解决这些问题，本文提出了一种内容引导的空间-光谱集成网络（CSI-Net），用于融合全局空间细节和光谱差异信息。具体而言，所提出的CSI-Net由空间推理（SR）模块、光谱差异（SD）模块和内容引导集成（CGI）模块组成。在SR模块中，通过级联图卷积块学习空间信息以进行全局建模。SD模块负责提取光谱特征，通过计算特征的均值和方差来减少未变化区域中光谱差异的影响。此外，为了有效集成空间-光谱特征，我们设计了CGI模块以进一步利用它们的互补信息。在该模块中，引入高层内容信息作为引导，以实现适当的交互。由于高效的空间-光谱融合，所提出的CSI-Net能够更好地学习变化特征，同时实现对光谱差异的抑制。在LEVIR-CD、WHU-CD和CLCD数据集上的实验结果表明，与最先进方法相比，所提出的CSI-Net产生了更好的性能，并且适用于不同场景。

英文摘要

The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while achieving a suppression of spectral differences. Experimental results on LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net produces better performance compared to state-of-the-art methods, and is applicable to different scenarios.

2606.10329 2026-06-10 cs.CV cs.AI 新提交

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

地震中的建筑变化检测：一种多尺度交互网络和一个变化检测数据集

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University（控制科学与工程学院，山东大学）

AI总结针对地震后短期成像间隔导致的变化检测难题，构建了土耳其地震变化检测数据集（TUE-CD），并提出多尺度特征交互网络（MSI-Net），通过联合交叉注意力和多尺度偏移校准模块，有效缓解侧视问题，提升变化检测精度。

AI中文摘要

作为最具破坏性的自然灾害之一，近年来地震袭击了世界许多国家，造成了严重的经济损失。变化检测（CD）可应用于震后损伤评估，因为它能从多时相遥感图像中推断出被破坏的变化区域。此外，短成像间隔的变化检测将更好地满足地震后紧急救援的需求。然而，由于缺乏短成像间隔的数据集，当前基于深度神经网络的方法的能力受到限制。为了满足灾后即时救援的需求，我们创建了一个变化检测数据集——土耳其地震变化检测数据集（TUE-CD），用于评估地震后短期内的建筑损坏情况。由于后事件图像的采集间隔短，不同时相图像的成像角度不同，导致了一些侧视问题。为了应对这些挑战，我们提出了一种多尺度特征交互网络（MSI-Net），用于双时相特征之间的高效交互，并减轻侧视问题的影响。具体来说，所提出的MSI-Net由联合交叉注意力（JCA）模块、多尺度偏移校准（MOC）模块和特征集成（FeI）模块组成。JCA模块统一了通道交叉注意力和空间联合注意力，以实现充分的特征交互。MOC模块进一步估计偏移量，以将双时相图像与多尺度特征对齐。最后，通过FeI模块融合校准后的特征和多尺度特征，用于变化区域的预测。在WHU-CD、CLCD和构建的TUE-CD数据集上的实验表明，所提出的MSI-Net比考虑的最先进的变化检测方法提供了更好的结果。

英文摘要

As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment as it can infer destroyed change regions from multi-temporal remote sensing images. Furthermore, the CD with short imaging interval will better satisfy the needs of the emergency rescues after earthquakes. However, the capability of current methods built on deep neural networks is limited because the dataset with short imaging interval is absent. To meet post-disaster immediate relief, we create a CD dataset, Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle is different for different temporal images, which leads to some side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features, as well as mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD dataset indicate that the proposed MSI-Net provides better results than considered state-of-the-art CD methods.

2606.10696 2026-06-10 cs.CV 新提交

Don't waste SAM

不要浪费 SAM

Nermeen Abou Baker, Uwe Handmann

发表机构 * Ruhr West University of Applied Sciences - Dept of Computer Science（鲁尔西部应用科学大学计算机科学系）

AI总结本文评估了SAM在垃圾分割任务中的泛化能力，通过微调SAM-ViT-H模型，在三个数据集上显著提升IoU，表明微调SAM作为基础模型对下游任务至关重要。

Comments Published at European Symposium on Artificial Neural Networks (ESANN2023), Computational Intelligence and Machine Learning. Bruges (Belgium)

DOI: 10.14428/esann/2023.ES2023-116

AI中文摘要

Meta AI 最近发布了 Segment Anything Model (SAM)，该模型在各种任务中展示了卓越的零样本图像分割性能，具有显著的准确性。尽管 SAM 无法在多个研究领域提供精确的分割，但它仍然是支持分割流程的宝贵起点，特别是对于需要大量高级技能标注的任务。本研究旨在使用三个垃圾分割数据集评估 SAM 和微调 SAM 模型的泛化能力。尽管这些数据集是从真实场景中捕获的（与 SAM 预训练的数据相同），但它们带来了若干挑战，包括遮挡、可变形物体、透明物体以及易与背景混淆的物体。我们的发现表明，微调的 SAM-ViT-H 模型在 Zerowaste 和 TACO 数据集上优于最先进的方法，IoU 显著提高了 +30，并且非常接近 TrashCan 1.0 的性能水平，仅相差 -1.44。在评估这些流行的垃圾数据集后，很明显，微调 SAM 作为基础模型是为下游垃圾分割任务提供更好泛化能力的关键步骤。因此，SAM 不应被忽视或浪费。

英文摘要

Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline, particularly for tasks that require extensive annotation by highly skilled annotators. This study aims to evaluate the generalization of SAM and fine-tuned SAM models using three waste segmentation datasets. Although they are captured from real scenes, like the data SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state of the art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches the state-of-the-art performance level on TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.
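
The IoU figures quoted in this abstract are overlap scores between a predicted mask and a ground-truth mask. A minimal sketch of that metric on binary masks follows; the function name and the nested-list mask format are assumptions for illustration, and the paper's exact evaluation protocol (per-class averaging, thresholds) is not stated in the abstract.

```python
def mask_iou(pred, target):
    """Intersection over Union of two binary masks given as nested lists of 0/1.

    Returns 1.0 when both masks are empty, a common convention rather than
    something taken from the paper.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
print(mask_iou(predicted, ground_truth))
```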

2606.10699 2026-06-10 cs.CV cs.AI 新提交

Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

使用YOLOv12模型验证生产线上网线（跳线）中导线的正确颜色顺序

Amin Doroodchi, Danial Soleimany

发表机构 * Computer Department, Islamic Azad University, Beyza Branch（伊斯兰阿扎德大学计算机系，贝兹分校）； R&D at Nedaye Sabz Company, Isfahan Branch（Nedaye Sabz 公司研发部，伊斯法罕分公司）

AI总结针对网线生产中导线颜色顺序检测问题，提出基于YOLOv12的目标检测模型，实现高精度实时验证，减少人工错误。

AI中文摘要

在网络电缆的生产过程中，确保标准连接器内部线对的正确颜色顺序对电缆的最终性能至关重要，因为任何错位或颜色顺序错误都可能导致缺陷产品并造成巨大成本。基于数字显微镜目视检查的传统检测方法通常耗时、繁琐且容易出错。在本研究中，开发了一种基于第十二版YOLO目标检测模型的智能系统，用于识别跳线中导线的位置并验证其正确的颜色顺序。使用的数据集包括从网络连接器显微视图中捕获的2500张图像，其中70%用于训练，15%用于验证，15%用于测试。所提出的模型利用单阶段架构和学习过程中的注意力机制，实现了约98%精度的导线检测。此外，总体平均准确率、分类精度和召回率分别约为95%、99%和98%。结果表明，该系统能够在生产线上可靠地实时验证导线颜色顺序的正确性，无需人工干预，从而减少人为错误并提高制造效率。

英文摘要

In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process.
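
Once a detector has located and classified each wire, checking the color sequence reduces to sorting the detections along the connector and comparing with a reference order. The sketch below shows that post-processing step only. The detection tuple format, the confidence threshold, and the choice of the T568B pinout as the reference are all assumptions; the abstract does not say which wiring standard or which decision rule the authors use.

```python
# T568B pin order, pins 1 to 8 (assumed reference; the paper may use another standard).
T568B = [
    "white-orange", "orange", "white-green", "blue",
    "white-blue", "green", "white-brown", "brown",
]


def verify_sequence(detections, expected=T568B, min_conf=0.5):
    """Check that detected wires, read left to right, match the expected order.

    detections: list of (label, x_center, confidence) tuples (hypothetical format).
    Returns (is_correct, observed_order).
    """
    kept = [d for d in detections if d[2] >= min_conf]
    observed = [label for label, _, _ in sorted(kept, key=lambda d: d[1])]
    return observed == expected, observed


# Detections arrive in arbitrary order; x_center gives the position in the connector.
good_cable = [(label, 10.0 * i, 0.95) for i, label in enumerate(T568B)]
good_cable.reverse()
print(verify_sequence(good_cable))
```

A missing or duplicated wire also fails the check, because the observed list then differs from the reference in length.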

2606.10769 2026-06-10 cs.CV 新提交

ZODS-RS -- Zero-training Oriented Detection & Segmentation for Remote Sensing

ZODS-RS -- 面向遥感的零训练目标检测与分割

Zuan Gu, Tianhan Gao, Langxu Zhao

发表机构 * Northeastern University, China（东北大学）

AI总结提出一种无需训练的封闭式管道ZODS-RS，通过原型纯化、旋转尺度等变匹配和不确定性感知像素合并，统一了遥感图像的水平框检测与实例分割，在多个数据集上取得优异性能。

AI中文摘要

遥感与无人机应用需要模型能够跨平台和视角泛化，而无需特定任务训练。然而，无训练管道在处理有向几何、尺度/旋转变化以及拥挤的港口或机场时常常失败，并且很少统一检测与分割。我们提出ZODS-RS，一种无训练、封闭式的管道，输出水平框（HBB）和实例掩码。基于DINOv3密集特征和SAM风格的提议，ZODS-RS链式包含：PP（通过Tyler协方差进行原型纯化）、R-SEM（使用可分离核和全局匈牙利分配的旋转尺度等变匹配）以及UAM（具有自适应先验和可选负原型的不确定性感知逐像素合并）。一个轻量级的CWLA融合多个DINOv3层。在FAIR1M（HBB）上，我们获得mAP@0.50:0.95 = 13.06和AP_S = 2.93（船舶/飞机类别平均）；在xView（HBB）上，我们报告mAP = 16.69。在我们的无人机数据集上，ZODS-RS实现了掩码mIoU = 31.10，并在单张5090上将小目标AP相对于Grounded-SAM提升了+30.70。这项工作为航空影像中的水平框检测加实例分割提供了统一的、无需训练的解决方案；提供了与DINOv3紧密耦合的PP/R-SEM/UAM的显式封闭形式公式；并在小目标和拥挤目标以及跨域迁移下展示了一致的增益，同时保持部署简单。

英文摘要

Remote-sensing and UAV applications need models that generalize across platforms and viewpoints without task-specific training. Yet training-free pipelines often falter on oriented geometry, scale/rotation variation, and crowded ports or airfields, and rarely unify detection and segmentation. We introduce ZODS-RS, a training-free, closed-form pipeline that outputs horizontal boxes (HBB) and instance masks. Built on DINOv3 dense features and SAM-style proposals, ZODS-RS chains: PP (prototype purification via Tyler covariance), R-SEM (rotation-scale equivariant matching with separable kernels and global Hungarian assignment), and UAM (uncertainty-aware pixelwise merging with adaptive priors and optional negative prototypes). A lightweight CWLA fuses multiple DINOv3 layers. On FAIR1M (HBB) we obtain mAP@0.50:0.95 = 13.06 and AP_S = 2.93 (class-averaged over ship/airplane); on xView (HBB) we report mAP = 16.69. On our UAV dataset, ZODS-RS achieves mask mIoU = 31.10 and improves small-object AP by +30.70 over Grounded-SAM on a single 5090. This work offers a unified, no-training solution for horizontal-box detection plus instance segmentation in aerial imagery; provides explicit closed-form formulations for PP/R-SEM/UAM tightly coupled with DINOv3; and demonstrates consistent gains on small and crowded targets and under cross-domain shifts while keeping deployment simple.

2606.10940 2026-06-10 cs.CV cs.AI cs.LG 新提交

Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals

民主化相机陷阱AI：用于检测英国哺乳动物的开源模型

Paul Fergus, Philip Stephens, Russell A. Hill, Lee Oliver, Katie Appleby, Sarah Beatham, Naomi Davies Walsh, Stuart Nixon, Naomi Matthews, Chris Sutherland, Kelly Hitchcock

发表机构 * Liverpool John Moores University（利物浦约翰摩尔斯大学）； Durham University（杜伦大学）； MammalWeb（哺乳动物网）； Game & Wildlife Conservation Trust（狩猎与野生动物保护信托）； National Trust（国家信托）； Animal and Plant Health Agency（动物和植物卫生局）； Chester Zoo（切斯特动物园）； University of St Andrews（圣安德鲁斯大学）； Nottingham Trent University（诺丁汉特伦特大学）

AI总结发布一个针对31类（28种英国常见哺乳动物和鸟类）的开源目标检测模型，基于YOLO26x在48,165个标注实例上训练，mAP@0.5达0.984，旨在降低生态学家使用AI的门槛。

Comments 15 Pages, 4 Figures

AI中文摘要

相机陷阱已成为生物多样性监测的基石，但将大量图像转化为可用生态数据的人工智能通常被锁定在商业平台之后，或针对与不列颠群岛不相符的动物群进行训练。为了消除障碍并提高采用率，我们发布了一个针对31类（28种英国常见哺乳动物和鸟类，以及人类、校准杆和车辆等实用类）的开源目标检测模型，该模型基于从多个地点经过十年运营部署（通过Conservation AI及其后续项目Trap Tracker）收集的48,165个标注实例的精选数据集。该模型是YOLO26x检测器，在80/10/10的类别分层划分上进行训练和测试，在保留的验证集上，IoU为0.5时平均精度为0.984（IoU 0.5-0.95时为0.956），精确率为0.988，召回率为0.965。在未见过的保留测试集上，31个类别的平均物种置信度范围为0.96至0.99，假阴性率为0.17%，主要集中在困难的夜间、远处或遮挡图像中。这些指标来自与训练相同站点和相机池的数据，因此在新站点的性能留待未来工作。我们以非商业许可发布ONNX格式的训练权重，支持本地桌面和实时相机，明确面向没有机器学习经验的生态学家。此发布是对过去十年中开发的多个付费模型的有意制衡。

英文摘要

Camera traps have become a cornerstone of biodiversity monitoring, but the artificial intelligence that turns vast quantities of images into usable ecological data is often locked behind commercial platforms or trained on fauna that does not match that of the British Isles. In an attempt to remove barriers and increase uptake, we release an open-source object detection model for 31 classes: 28 common UK mammal and bird species, plus utility classes for humans, calibration poles, and vehicles. The classes are drawn from a curated dataset of 48,165 labelled instances assembled from multiple sites over a decade of operational deployment through Conservation AI and its successor, Trap Tracker. The model, a YOLO26x detector trained and tested on an 80/10/10 class-stratified split, achieves a mean Average Precision of 0.984 at Intersection over Union (IoU) of 0.5 (0.956 at IoU 0.5-0.95) on the held-out validation set, with precision 0.988 and recall 0.965. On an unseen held-out test split, mean per-species confidence ranged from 0.96 to 0.99 across the 31 classes, with a 0.17% false-negative rate concentrated in difficult night-time, distant, or occluded images. These metrics come from the same pool of sites and cameras as the training data, so performance at entirely new sites is left to future work. We release the trained weights in ONNX format under a non-commercial licence, with local desktop and real-time camera support, aimed explicitly at ecologists with no machine-learning experience. This release is a deliberate counterweight to the multiple paid-for models that have been developed over the last decade.
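
An 80/10/10 class-stratified split, as mentioned in this abstract, keeps each class's share roughly equal across the training, validation, and test sets. The sketch below shows one simple way to do that for single-label items. It is an illustration under assumptions, not the authors' procedure: the function name, the rounding rule, and the toy class counts are hypothetical, and real detection images can contain several classes at once, which this sketch does not handle.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split item indices into train/val/test while preserving class proportions.

    labels: one class label per item. Returns three lists of indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = ["fox"] * 50 + ["badger"] * 30 + ["deer"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the abstract notes that all splits share the same sites and cameras, a split grouped by site rather than by class would be the natural variant for estimating performance at new locations.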

2507.02513 2026-06-10 cs.CV 版本更新

Automatic Labelling for Low-Light Pedestrian Detection

低光照行人检测的自动标注

Dimitrios Bouzoulas, Eerik Alamikkotervo, Risto Ojala

发表机构 * Energy and Mechanical Engineering, Aalto University（阿尔托大学能源与机械工程系）

AI总结提出一种自动红外-RGB流水线，利用红外检测生成标签训练低光照行人检测模型，在KAIST数据集上优于真实标签。

AI中文摘要

RGB图像中的行人检测是行人安全的关键任务，因为自动驾驶车辆和高级驾驶辅助系统中最常见的传感器是RGB相机。低光照行人检测缺乏大型公共数据集和自动标注流水线。本研究提出一种自动红外-RGB流水线作为解决方案。该流水线包括：1) 红外检测，使用微调的红外行人检测模型；2) 标签转移过程，将红外检测结果转移到对应的RGB图像；3) 使用生成的标签训练低光照RGB行人检测的目标检测模型。研究使用KAIST数据集进行。评估中，三个目标检测模型DETR、YOLO和RCNN在生成的标签和真实标签上分别训练。在未见过的图像上比较时，结果显示，在mAP@50和LAMR指标上，基于生成标签训练的模型在6个案例中的5个优于基于真实标签训练的模型，并且在所有案例中mAP@50-95指标均优于真实标签。获得的结果表明，所提出的自动标注流水线可用于低光照行人检测数据集的可扩展标注。本研究的源代码可在GitHub上获取：https://github.com/BouzoulasDimitrios/IR-RGB-autoamed-low-light-pedestrian-labelling

英文摘要

Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. Low-light pedestrian detection lacks large public datasets and auto-labelling pipelines. This research proposes a solution in the form of an automated infrared-RGB pipeline. The pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) a label transfer process from the infrared detections to their RGB counterparts; and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For evaluation, three object detection models, DETR, YOLO, and RCNN, were trained on generated and ground-truth labels. When compared on previously unseen images, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 5 out of 6 cases for the mAP@50 and LAMR metrics, and outperformed them on mAP@50-95 in all cases. The results indicate that the proposed auto-labelling pipeline could be used for scalable annotation of low-light datasets for pedestrian detection. The source code for this research is available on GitHub: https://github.com/BouzoulasDimitrios/IR-RGB-autoamed-low-light-pedestrian-labelling
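
The label-transfer step of such a pipeline can be sketched as follows, under the assumption that each infrared frame is pixel-aligned with its RGB counterpart, so a box keeps the same relative position in both images. The function name, the detection tuple format, the confidence threshold, and the YOLO-style text output are illustrative assumptions and are not taken from the authors' repository.

```python
def transfer_labels(ir_detections, ir_size, min_conf=0.5, class_id=0):
    """Turn infrared detections into normalized label lines for the paired RGB image.

    ir_detections: list of (x1, y1, x2, y2, confidence) boxes in infrared pixels.
    ir_size: (width, height) of the infrared image.
    Returns lines in the form "class x_center y_center width height", all in [0, 1],
    which stay valid for an aligned RGB image of any resolution.
    """
    width, height = ir_size
    lines = []
    for x1, y1, x2, y2, conf in ir_detections:
        if conf < min_conf:
            continue  # drop weak detections so they do not become noisy labels
        x_center = (x1 + x2) / 2 / width
        y_center = (y1 + y2) / 2 / height
        box_w = (x2 - x1) / width
        box_h = (y2 - y1) / height
        lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}")
    return lines


detections = [(100, 50, 200, 250, 0.9), (300, 100, 320, 160, 0.2)]
for line in transfer_labels(detections, ir_size=(640, 480)):
    print(line)
```

If the two cameras were not registered, a calibration-based warp would be needed before this step; that case is outside this sketch.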

2603.11917 2026-06-10 cs.CV 版本更新

PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

PicoSAM3：实时传感器内感兴趣区域分割

Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno

发表机构 * ETH Zürich（苏黎世联邦理工学院）； IBM Research（IBM研究院）

AI总结PicoSAM3是一款轻量级实时传感器内分割模型，结合密集CNN架构、感兴趣区域提示编码和知识蒸馏，实现低延迟高精度分割。

AI中文摘要

实时、在设备上的分割对于延迟敏感且注重隐私的应用至关重要，如智能眼镜和物联网设备。我们介绍了PicoSAM3，一个针对边缘和传感器内执行优化的轻量级可提示视觉分割模型，包括在索尼IMX500视觉传感器上的部署。PicoSAM3拥有1.3M参数，结合密集CNN架构、感兴趣区域提示编码、高效通道注意力（Efficient Channel Attention）以及来自SAM2和SAM3的知识蒸馏。在COCO和LVIS数据集上，PicoSAM3分别达到65.45%和64.01%的mIoU，在相似或更低的复杂度下优于现有基于SAM和面向边缘的基线模型。INT8量化模型的精度下降可忽略不计，同时在IMX500上实现了延迟为11.82ms的实时传感器内推理，完全符合其内存和算子限制。消融研究显示，相比监督训练，从大型SAM模型进行知识蒸馏可使mIoU提升高达14.5%，并证明高质量、空间灵活的可提示分割可直接在传感器层面实现。

英文摘要

Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.

2606.09967 2026-06-10 cs.CV 新提交

ABot-Earth 0.5: Generative 3D Earth Model

ABot-Earth 0.5：生成式3D地球模型

Ming Qian, Tianjian Ouyang, Mingchao Sun, Zijian Wang, Jincheng Xiong, Jiarong Han, Yongchang Zhang, Jiawei Zhang, Xu Wang, Yu Liu, Luyang Tang, Fei Yu, Zengye Ge, Mengmeng Du, Yuan Liu, Nianfei Fan, Song Wang, Yingliang Peng, Chunxue Jia, Yang Liu, Shiying Zeng, Haozhe Shi, Junnan Lai, Hongyu Pan, Zheng Wu, Ning Guo, Mu Xu, Hang Zhang

AI总结提出ABot-Earth 0.5框架，利用3D高斯泼溅从卫星图像生成大规模无缝3D环境，每平方公里合成时间低于10分钟，支持实时交互可视化，降低3D重建成本。

Comments From Amap-cvlab, Alibaba. Official page: https://abot-earth.amap.com/

AI中文摘要

我们提出ABot-Earth 0.5，一个生成式3D框架，旨在从普遍存在的、地理参考的卫星图像中合成大规模无缝3D环境。为此，我们提出了一种新颖的生成模型，直接使用3D高斯泼溅（3DGS）表示。该模型在多样化的真实世界城市重建语料库上进行训练，学习生成逼真的几何和纹理。在推理时，它仅以卫星图像为条件合成新颖的3D场景，可扩展速率低于每平方公里10分钟，同时表现出卓越的真实感。该框架设计为易于访问，集成了分层细节级别（LOD）结构，允许在基于Web的地图引擎上进行实时交互式可视化。这种高保真模拟沙箱有效缓解了模拟到现实的领域差距，支持关键的具身人工智能下游应用，如闭环无人机导航。通过提供超低成本和高效的解决方案，ABot-Earth 0.5显著降低了大规模3D重建的技术和财务障碍，并推动了全球数字地球可视化的未来。

英文摘要

We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions, learning to generate realistic geometry and textures. At inference, it synthesizes novel 3D scenes conditioned solely on satellite imagery at a scalable rate of under 10 minutes per square kilometer, while demonstrating exceptional realism. The framework is designed for accessibility, with integrated hierarchical level-of-detail (LOD) structures that permit real-time, interactive visualization on web-based map engines. This high-fidelity simulation sandbox effectively mitigates the sim-to-real domain gap, enabling critical downstream Embodied AI applications like closed-loop UAV navigation. By providing an ultra-low-cost and high-efficiency solution, ABot-Earth 0.5 significantly lowers the technical and financial barriers to large-scale 3D reconstruction and empowers the future of global digital earth visualization.

2606.10183 2026-06-10 cs.CV cs.AI cs.MM 新提交

Making Time Editable in Video Diffusion Transformers

在视频扩散Transformer中实现时间可编辑性

Konstantin Kuklev, Viacheslav Vasilev, Alexander Kunitsyn, Andrei Ivaniuta, Denis Dimitrov

AI总结提出一种时间控制方法，通过轻量级时间模块扩展预训练DiT，实现运动速度和时序结构的编辑，无需重新设计骨干网络。

2606.10450 2026-06-10 cs.CV cs.LG 新提交

通过定制化概念嵌入改进前景条件外绘中的文本-实例对齐

Yihao Zhao, Xuan Han, Bin He, Mingyu You

AI总结针对前景条件外绘中文本驱动方法产生的伪影问题，提出定制化概念嵌入扩散框架，通过实例感知损失和语义保持提示模板定制概念嵌入，显著减少伪影并提升图像质量。

AI中文摘要

为了展示产品，商家通常需要花费大量成本制作高质量的展示图像。前景条件外绘（FCO）满足了这一需求，允许用户通过调整文本提示，以低成本为前景实例创建所需的背景。然而，现有的文本驱动FCO方法在其输出中存在关键缺陷，最明显的是伪影，即合成背景中与前景实例共享相同语义的区域。这种伪影降低了物体的显著性并降低了图像质量。我们将问题归因于给定实例与文本派生概念嵌入之间的不对齐。为了解决这个问题，我们提出了定制化概念嵌入扩散（CCE-Diffusion）框架。其核心是CCE模块，用于定制概念嵌入，弥合通用名词语义与特定视觉实例之间的差距。实例感知损失指导模块的优化，而语义保持提示模板防止定制化嵌入扭曲提示中的其他词。定性和定量评估均表明，CCE-Diffusion显著减少了输出中的伪影。作为即插即用组件，CCE模块可以集成到各种FCO方法中，提升其性能。

英文摘要

To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object's prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module's optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.

FG-Attn：在视频扩散模型中利用细粒度稀疏注意力

Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结针对视频扩散模型中注意力层计算开销大的问题，提出FG-Attn，一种低开销的细粒度稀疏注意力机制，在MxN块粒度上跳过分数计算，实现最高2.45倍加速。

AI中文摘要

使用扩散Transformer进行媒体生成可能需要在极长序列上计算注意力，其中注意力层占生成延迟的大部分。利用注意力图中的稀疏性为降低这一成本提供了有前景的机会。在这项工作中，我们展示了视频生成模型中扩散Transformer的注意力图表现出显著的细粒度稀疏性。然而，现有的稀疏注意力方法要么过于粗粒度，留下大量未处理的冗余计算，要么在更细粒度上产生高开销。我们提出FG-Attn，一种新颖的低开销细粒度稀疏注意力机制，它在MxN块的粒度上跳过分数计算，其中N>=1且M>=16，每个块是M个查询和N个键之间查询-键点积的结果。FG-Attn解决了GPU上稀疏注意力内核中硬件利用率不足的关键挑战，同时避免了不规则内存访问和冗余操作的开销。FG-Attn可以完全取代现有的稀疏注意力方法，并将块稀疏注意力方法扩展到现代GPU上的更细粒度。在70%稀疏度下，FG-Attn比最先进的FlashInfer最高快2.45倍，注意力内核时间平均减少14.7%。FG-Attn将端到端视频生成相对于Flash Attention 3加速最高1.40倍（平均1.18倍）。

英文摘要

Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps offers a promising opportunity to reduce this cost. In this work, we show that attention maps in diffusion transformers exhibit significant fine-grained sparsity in video generation models. Existing sparse attention methods, however, are either too coarse-grained, leaving a large fraction of redundant computation unaddressed, or incur high overheads at finer granularity. We propose FG-Attn, a novel, low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of an MxN tile, where N>=1 and M>=16, and where each tile is the result of query-key dot products between M queries and N keys. FG-Attn addresses the key challenge of hardware underutilization in sparse attention kernels on GPUs, without incurring the overheads of irregular memory access and redundant operations. FG-Attn can fully supersede existing sparse attention methods and extend block sparse attention methods to finer granularities on modern GPUs. At 70% sparsity, FG-Attn is up to 2.45X faster than the state-of-the-art FlashInfer, and reduces attention kernel time by 14.7% on average. FG-Attn speeds up end-to-end video generation times by up to 1.40X (1.18X on average) over Flash Attention 3.
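
What it means to skip score computation at the granularity of an MxN tile can be shown with a small CPU reference. The sketch below is plain Python, not a GPU kernel and not the FG-Attn implementation: the function names are hypothetical, the keep mask is taken as given (the abstract does not describe how tiles are selected), and the toy sizes use M=2 for readability even though the paper states M>=16.

```python
import math


def tile_sparse_attention(Q, K, V, keep, M, N):
    """Softmax attention that only evaluates query-key scores inside kept tiles.

    Q, K, V: lists of equal-length vectors. keep[a][b] says whether the tile made of
    query block a (M consecutive queries) and key block b (N consecutive keys) is
    computed. Every query must have at least one kept tile.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    outputs = []
    for i, q in enumerate(Q):
        scores = {}
        for j, k in enumerate(K):
            if keep[i // M][j // N]:  # skipped tiles cost no dot product at all
                scores[j] = scale * sum(a * b for a, b in zip(q, k))
        top = max(scores.values())
        weights = {j: math.exp(s - top) for j, s in scores.items()}
        norm = sum(weights.values())
        outputs.append(
            [sum(w * V[j][d] for j, w in weights.items()) / norm for d in range(len(V[0]))]
        )
    return outputs


def skipped_fraction(keep):
    """Share of tiles whose scores are never computed."""
    total = sum(len(row) for row in keep)
    kept = sum(1 for row in keep for flag in row if flag)
    return 1 - kept / total


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
keep = [[True, False, False, False],   # queries 0-1 attend only to key 0
        [False, False, False, True]]   # queries 2-3 attend only to key 3
print(tile_sparse_attention(Q, K, V, keep, M=2, N=1), skipped_fraction(keep))
```

With N=1 each tile is a column strip of M queries against a single key, which is finer than the square blocks used by typical block-sparse attention; mapping such narrow tiles efficiently onto GPU hardware is the challenge the abstract highlights.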

2512.08180 2026-06-10 cs.CV 版本更新

SARA: 语义自适应关系对齐用于视频扩散模型

Jiesong Lian, Zixiang Zhou, Ruizhe Zhong, Yuan Zhou, Qinglin Lu, Rui Wang, Long Hu, Yixue Hao, Baoru Huang

发表机构 * Tencent Hunyuan（腾讯混元）

AI总结提出SARA方法，通过文本条件显著性引导令牌对监督，提升视频扩散模型的文本对齐和运动质量。

AI中文摘要

最近的视频扩散模型（VDMs）能够合成视觉上令人信服的片段，但仍会丢失实体、错误绑定属性并削弱提示中指定的交互。表示对齐目标如VideoREPA和MoAlign通过从冻结的视觉基础模型中蒸馏时空令牌关系来改进细粒度文本跟随，但其成对监督预算由视觉或运动线索分配，而非根据每对与提示的相关性。我们提出SARA，语义自适应关系对齐，它保持对冻结VFM目标的令牌关系蒸馏（TRD），并添加一个文本条件显著性来决定哪些令牌对携带监督。一个轻量级的Stage 1对齐器使用每个实体的SAM 3.1掩码监督和InfoNCE正则化器进行训练，其连续显著性通过一个对路由算子融合到TRD中：只要令牌对的两个端点中有一个是显著的，该算子就为其分配权重，从而将监督导向主体-主体和主体-背景对，远离背景-背景对。在Wan2.2持续训练设置中，SARA在13维VLM评估标准、公共VBench基准和盲用户研究中，在文本对齐和运动质量上均优于SFT、VideoREPA和MoAlign。项目页面：https://saradit.github.io/

英文摘要

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study. Project page: https://saradit.github.io/.
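
The pair-routing idea in this abstract (a token pair receives weight whenever either endpoint is salient) can be sketched as follows. The abstract does not give the operator's formula, so the choice `max(s_i, s_j)` is an assumption, as are the function names and the use of cosine similarity as the token relation.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def pair_weights(saliency):
    """Assumed routing rule: a pair is weighted by its more salient endpoint."""
    n = len(saliency)
    return [[max(saliency[i], saliency[j]) for j in range(n)] for i in range(n)]

def routed_relation_loss(student, teacher, saliency):
    """Weighted squared error between student and teacher token relations."""
    w = pair_weights(saliency)
    num = den = 0.0
    for i in range(len(student)):
        for j in range(len(student)):
            diff = cosine(student[i], student[j]) - cosine(teacher[i], teacher[j])
            num += w[i][j] * diff * diff
            den += w[i][j]
    return num / den if den > 0 else 0.0

teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
student = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # only pair (1, 2) is wrong
loss_subject_0 = routed_relation_loss(student, teacher, [1.0, 0.0, 0.0])
loss_subject_1 = routed_relation_loss(student, teacher, [0.0, 1.0, 0.0])
```

With token 0 as the only salient token, the mismatched pair (1, 2) is background-background and contributes nothing; making token 1 salient brings that pair under supervision.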

2606.08674 2026-06-10 cs.CV cs.AI 版本更新

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

BioVid: 具有生物行为语义理解的自回归视频生成

Tsung-Wei Pan, Jung-Hua Wang

发表机构 * Department of Electrical Engineering, National Taiwan Ocean University（国立台湾海洋大学电机工程学系）； AI research center, National Taiwan Ocean University（国立台湾海洋大学人工智能研究中心）

AI总结提出BioVid，一种数据驱动的自回归视频生成框架，通过FSQ-R3GAN分词器和因果Transformer学习生物行为的自然时长分布，无需预设长度约束。

AI中文摘要

现有的视频生成框架将序列时长视为外部指定参数——固定的帧数或文本提示——生成的片段在时间边界上与真实行为数据的统计结构脱节。这一假设与生物行为根本不一致，因为动作时长在个体和实例之间自然变化，并编码在数据本身中。我们提出BioVid，一种数据驱动的自回归视频生成框架，直接从训练数据中学习生物行为的时序结构，包括其自然长度分布。在第一阶段，有限标量量化GAN（FSQ-R3GAN）分词器将每个视频帧编码为紧凑的离散表示，结合R3GAN的稳定化相对论式训练目标和FSQ的保证码本利用率，实现高保真空间重建且不会发生码本崩溃。在第二阶段，因果Transformer自回归地对生成的令牌序列建模，并在行为事件达到语义闭合时学习发出序列结束（EOS）令牌，终止分布自然地从训练数据中涌现，而非来自任何人为指定的约束。在人类饮水行为数据集（NTU RGB+D, A001, n=94）上的实验表明，BioVid生成的长度分布与保留测试数据的分布紧密匹配，与真实分布的Wasserstein-1距离为1.24——相比之下，固定长度基线为6.05，VideoGPT为15.48——同时保持有竞争力的空间保真度。

英文摘要

Existing video generation frameworks treat sequence duration as an externally prescribed parameter -- fixed frame counts or text prompts -- producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, including their natural length distributions. In the first stage, a Finite Scalar Quantization GAN (FSQ-R3GAN) tokenizer encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ's guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, with the termination distribution emerging naturally from the training data rather than any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) demonstrate that BioVid's generated length distribution closely matches that of held-out test data, achieving a Wasserstein-1 distance of 1.24 against the ground truth -- compared to 6.05 for a fixed-length baseline and 15.48 for VideoGPT -- while maintaining competitive spatial fidelity.
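
The Wasserstein-1 distance reported above compares two empirical distributions of clip lengths. A minimal sketch of that metric, computed as the area between the two empirical CDFs, is shown below; the function name and the toy length values are illustrative assumptions, not data from the paper.

```python
def wasserstein_1(xs, ys):
    """W1 distance between two 1-D empirical distributions.

    Integrates |F_x(t) - F_y(t)| over t, where F_x and F_y are the
    empirical CDFs of the two samples (sample sizes may differ).
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        while i < len(xs) and xs[i] <= left:   # count samples <= left
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total

# Toy example: generated lengths versus a fixed-length baseline (made-up values).
real_lengths = [8, 10, 12]
fixed_baseline = [10, 10, 10]
shifted = [9, 11, 13]
d_fixed = wasserstein_1(real_lengths, fixed_baseline)
d_shifted = wasserstein_1(real_lengths, shifted)
```

A fixed-length generator collapses the length distribution to a single point, which is why it scores poorly on this metric even when its mean length is right.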

2310.05264 2026-06-10 cs.LG cs.CV 版本更新

The Emergence of Reproducibility and Generalizability in Diffusion Models

扩散模型中可重复性与泛化性的出现

Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, Qing Qu

AI总结研究发现扩散模型在相同初始噪声和确定性采样器下，不同模型输出高度相似，且这种可重复性在记忆和泛化两种训练模式下均存在，对训练效率、模型隐私等有重要启示。

Comments NeurIPS Diffusion Model Workshop 2023 (best paper award), the Forty-first International Conference on Machine Learning (ICML 2024)

AI中文摘要

在这项工作中，我们研究了扩散模型的一个有趣且普遍的现象，我们称之为“一致模型可重复性”：给定相同的起始噪声输入和确定性采样器，不同的扩散模型通常会产生非常相似的输出。我们通过全面的实验证实了这一现象，这意味着不同的扩散模型一致地达到相同的数据分布和评分函数，无论扩散模型框架、模型架构或训练过程如何。更引人注目的是，我们的进一步研究表明，扩散模型学习到的不同分布受到训练数据大小的影响。这一点得到了以下事实的支持：模型可重复性表现在两种不同的训练机制中：（i）“记忆机制”，其中扩散模型过拟合到训练数据分布，以及（ii）“泛化机制”，其中模型学习底层数据分布。我们的研究还发现，这一有价值的特性推广到许多扩散模型的变体，包括用于条件使用、解决逆问题和模型微调的变体。最后，我们的工作提出了许多有趣的理论问题供未来研究，并强调了关于训练效率、模型隐私和扩散模型受控生成的实际意义。

英文摘要

In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term as "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently reach the same data distribution and scoring function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models are learning distinct distributions affected by the training data size. This is supported by the fact that the model reproducibility manifests in two distinct training regimes: (i) "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional use, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications regarding training efficiency, model privacy, and the controlled generation of diffusion models.

2601.08379 2026-06-10 cs.LG cs.AI cs.CV 版本更新

MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

MMD Guidance: 基于最大均值差异引导的无训练分布适应扩散模型

Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出MMD Guidance，一种无训练方法，通过最大均值差异梯度引导扩散模型采样，实现与参考数据分布对齐，无需重新训练。

AI中文摘要

预训练扩散模型已成为无条件及条件样本生成的有力生成先验，但其输出常偏离用户特定目标数据的特征。这种不匹配在领域适应任务中尤为突出，此时仅有少量参考样本可用且重新训练扩散模型不可行。现有推理时引导方法可调整采样轨迹，但通常优化替代目标（如分类器似然）而非直接对齐目标分布。我们提出MMD Guidance，一种无训练机制，通过生成样本与参考数据集之间的最大均值差异（MMD）梯度增强反向扩散过程。MMD能从有限数据中提供可靠分布估计，实践中方差低，且可高效微分，特别适合引导任务。我们的框架通过乘积核自然扩展到条件生成模型中的提示感知适应。此外，由于引导在潜在扩散模型（LDM）的潜在空间中进行，因此可高效计算。在合成及真实世界基准上的实验表明，MMD Guidance能在保持样本保真度的同时实现分布对齐。项目代码见 github.com/matinamehdizadeh/MMD-Guidance

英文摘要

Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity. The project code is available at github.com/matinamehdizadeh/MMD-Guidance.
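
The quantity behind this guidance signal is the squared MMD between generated and reference samples and its gradient with respect to the generated samples. The sketch below uses an RBF kernel and the biased estimator, and applies a plain gradient step to the samples; it does not reproduce the authors' integration into the reverse diffusion process, and the names `mmd2`, `mmd2_grad`, and `guidance_step`, the bandwidth, and the step size are all assumptions.

```python
import math

def rbf(x, y, sigma):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between sample sets X and Y."""
    n, m = len(X), len(Y)
    kxx = sum(rbf(a, b, sigma) for a in X for b in X) / (n * n)
    kyy = sum(rbf(a, b, sigma) for a in Y for b in Y) / (m * m)
    kxy = sum(rbf(a, b, sigma) for a in X for b in Y) / (n * m)
    return kxx + kyy - 2.0 * kxy

def mmd2_grad(X, Y, sigma=1.0):
    """Analytic gradient of mmd2 with respect to each generated sample in X."""
    n, m = len(X), len(Y)
    grads = []
    for x in X:
        g = [0.0] * len(x)
        for b in X:   # X-X term: x appears in both kernel arguments, hence 2/n^2
            k = rbf(x, b, sigma)
            for c in range(len(x)):
                g[c] += (2.0 / (n * n)) * k * (-(x[c] - b[c]) / sigma ** 2)
        for y in Y:   # cross term with the reference set, weight -2/(n m)
            k = rbf(x, y, sigma)
            for c in range(len(x)):
                g[c] -= (2.0 / (n * m)) * k * (-(x[c] - y[c]) / sigma ** 2)
        grads.append(g)
    return grads

def guidance_step(X, Y, step=0.5, sigma=1.0):
    """Move generated samples one gradient-descent step toward lower MMD."""
    grads = mmd2_grad(X, Y, sigma)
    return [[xc - step * gc for xc, gc in zip(x, g)] for x, g in zip(X, grads)]

generated = [[0.0], [0.2]]
reference = [[1.0], [1.2]]
before = mmd2(generated, reference)
after = mmd2(guidance_step(generated, reference), reference)
```

The cross term pulls generated samples toward the reference set, while the X-X term pushes them apart from each other, which discourages collapse onto a single reference point.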

2606.10019 2026-06-10 cs.CV cs.AI cs.RO 新提交

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

广义CVO：基于二阶黎曼优化的快速无对应局部点云配准

Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits

AI总结提出一种基于几何表面结构和再生核希尔伯特空间嵌入的无对应局部点云配准方法，采用二阶流形优化实现高达10倍加速，在LiDAR和RGB-D跟踪及物体配准中显著降低漂移并提升鲁棒性。

Comments 16 pages, 12 figures

AI中文摘要

我们提出了一种快速且无需对应关系的局部点云配准方法，该方法利用了几何表面结构和再生核希尔伯特空间（RKHS）嵌入。该方法将点云表示为具有逐点各向异性核的连续函数，这些核编码了局部几何信息。这种公式化在沿表面法线方向改善对齐的同时，放松了沿切线方向的对齐。为了解决由此产生的配准问题，我们提出了一种具有近似黎曼海森矩阵的二阶流形优化方案，与先前基于无对应RKHS方法中使用的一阶求解器相比，实现了高达10倍的加速。我们展示了在多种室内外数据集上改进的帧到帧LiDAR和RGB-D跟踪精度。在驾驶领域的LiDAR跟踪配准任务中，我们在具有挑战性的特征稀疏环境下实现了平移和旋转漂移均减少超过55%。在物体配准基准测试中，我们展示了相比基于ICP的方法更强的鲁棒性，并且在优化全局初始化时（尤其是在中等错位情况下）获得了进一步的提升。

英文摘要

We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of more than 55% in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.

2606.10364 2026-06-10 cs.CV 新提交

Benchmarking stereo reconstruction for 3D printable Martian terrain models

用于3D打印火星地形模型的立体重建基准测试

Josephine Wang

发表机构 * MIT Cambridge, MA, USA（麻省理工学院，马萨诸塞州剑桥，美国）

AI总结针对火星图像低纹理、不规则和部分观测的特点，评估从NASA好奇号图像估计立体深度、补全几何并导出可打印网格的流程，发现基准精度不直接迁移到火星地形重建，几何补全存在局部保真度与全局连通性的权衡。

Comments 9 pages, 7 figures, CVPR End-to-End 3D Workshop 2026

AI中文摘要

从火星车图像重建可打印的3D模型具有挑战性，因为火星地形纹理低、不规则且部分被观测。我们评估了一个流程，该流程从NASA好奇号图像估计立体深度，补全几何，并导出水密OBJ网格。在Middlebury数据集上，RAFT-Stereo优于半全局块匹配（SGBM），将视差MAE从3.22像素降低到0.73像素，并将有效预测覆盖率从76.3%提高到100.0%。然而，在好奇号图像上，RAFT更密集的视差显示出较弱的边缘对齐和更高的光度重投影误差，表明基准精度不能直接迁移到火星地形重建。几何补全展示了局部保真度与全局连通性之间的权衡。我们发现，alpha形状保留了准确但碎片化的结构，泊松重建产生更连贯的网格但增加了无支撑表面，而确定性扩散填充基线介于两者之间但对立体质量敏感。总体而言，标准立体和补全方法可以产生火星地形的可打印近似，但可靠的重建需要更强的领域特定验证。

英文摘要

Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT's denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.
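
The two Middlebury numbers quoted above (disparity MAE in pixels and valid prediction coverage) can be computed along the following lines. The abstract does not spell out the exact definitions, so this sketch assumes MAE is averaged over pixels where both prediction and ground truth exist, and coverage is the share of ground-truth pixels that received a prediction; the function name and sample values are illustrative.

```python
def disparity_metrics(pred, gt):
    """Return (MAE in pixels, coverage fraction) for flat disparity lists.

    pred: predicted disparities, None where the matcher gave no valid value.
    gt:   ground-truth disparities, None where no ground truth exists.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt)
              if p is not None and g is not None]
    labelled = [p for p, g in zip(pred, gt) if g is not None]
    valid = sum(1 for p in labelled if p is not None)
    mae = sum(errors) / len(errors) if errors else float("nan")
    coverage = valid / len(labelled) if labelled else 0.0
    return mae, coverage

# Made-up values: one pixel without a prediction, one without ground truth.
pred = [10.0, None, 12.5, 8.0]
gt = [11.0, 9.0, 12.0, None]
mae, coverage = disparity_metrics(pred, gt)
```

Because MAE is only measured where a prediction exists, a sparse matcher can look accurate while leaving holes, so the two numbers are best read together.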

2606.10395 2026-06-10 cs.CV 新提交

Efficient RWKV-based Representation Learning for 3D Point Clouds

基于RWKV的高效三维点云表示学习

Yun Liu, Xuefeng Yan, Liangliang Nan, Xianzhi Li, Peng Li, Zhe Zhu, Honghua Chen, Mingqiang Wei

发表机构 * School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics（南京航空航天大学计算机科学与技术学院）； Shenzhen Institute of Research, Nanjing University of Aeronautics and Astronautics（南京航空航天大学深圳研究院）； Collaborative Innovation Center of Novel Software Technology and Industrialization（新型软件技术与产业化协同创新中心）； Urban Data Science section, Delft University of Technology（代尔夫特理工大学城市数据科学部）； Huazhong University of Science and Technology（华中科技大学）

AI总结提出P-RWKV模块，通过局部感知扩展和空间上下文增强，将RWKV从序列建模适配到3D点云，实现线性复杂度的全局依赖建模，在多项任务中以更低计算成本取得竞争性能。

AI中文摘要

最近提出的接收加权键值（RWKV）模型结合了RNN风格的循环，为建模全局依赖提供了Transformer二次自注意力的线性复杂度替代方案。然而，当直接应用于点云时，原本为序列文本开发的RWKV难以有效捕捉局部几何结构和建模空间依赖。为了解决这个问题，我们提出了P-RWKV模块，它在保持RWKV效率优势的同时，弥合了序列建模与不规则3D几何之间的差距。它包含一个局部感知扩展（LPE）组件，用于沿时空序列扩展上下文感知，以及一个空间上下文增强（SCE）组件，用于增强空间意识。为了验证P-RWKV在点云理解中的有效性，我们构建了PointER，一个单模态自监督表示学习框架，其编码器由堆叠的P-RWKV模块组成。此外，我们将P-RWKV扩展到跨模态设置，并将所提出的核心子模块集成到多种架构中，展示了强大的即插即用灵活性和架构通用性。大量实验表明，P-RWKV模块及其关键子模块在各种任务中以较低的计算成本和推理延迟取得了有竞争力的性能。代码将在论文录用后发布。

英文摘要

The recent receptance weighted key value (RWKV) model combines RNN-style recurrence, offering a linear-complexity alternative to Transformers' quadratic self-attention for modeling global dependencies. However, when directly applied to point clouds, RWKV, originally developed for sequential text, struggles to capture local geometric structures and model spatial dependencies effectively. To address this, we propose the P-RWKV block, which bridges the gap between sequence modeling and irregular 3D geometry while preserving the efficiency advantages of RWKV. It consists of a Local Perception Expansion (LPE) component to expand contextual perception along the spatio-temporal sequence and a Spatial Context Enhancement (SCE) component to strengthen spatial awareness. To validate the effectiveness of P-RWKV for point cloud understanding, we construct PointER, a single-modality self-supervised representation learning framework whose encoder is composed of stacked P-RWKV blocks. Furthermore, we extend P-RWKV to a cross-modality setting and integrate the proposed core sub-modules into multiple architectures, demonstrating strong plug-and-play flexibility and architectural generality. Extensive experiments show that the P-RWKV block and its key sub-modules achieve competitive performance across various tasks with lower computational cost and inference latency. Code will be released upon acceptance.

2606.10478 2026-06-10 cs.CV 新提交

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

3D-CoS：基于VLM代码合成的新型3D重建范式

Yuhao Wang, Puyi Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yu Cheng

发表机构 * Shanghai Jiao Tong University（上海交通大学）； The Chinese University of Hong Kong（香港中文大学）； Microsoft（微软）； University of Oxford（牛津大学）

AI总结提出3D代码合成（3D-CoS）范式，将3D资产表示为可执行的Blender代码，利用VLM进行程序化重建，实现高可控性和局部编辑能力。

Comments Preprint. 24 pages, 11 figures

AI中文摘要

最近的3D重建和编辑系统大多基于隐式或显式表示，如NeRF、点云或网格。尽管这些表示能够实现高保真渲染，但它们本质上是低层次的，难以通过编程控制。相比之下，我们提出并系统评估了一种新的3D重建范式——3D代码合成（3D-CoS），其中3D资产被构建为可执行的Blender代码，这是一种可编程且可解释的媒介。为了评估当前VLM使用代码表示3D对象的能力，我们在统一协议下评估了代表性的开源和闭源VLM在基于代码的重建中的表现。我们进一步引入了一套结构化的代码合成工作流，包括基于蓝图的规划、Blender API文档的检索增强生成（RAG）、少样本几何演示以及用于逐部分代码生成的组件级Agent工作流。为了展示这种表示的独特优势，我们进一步评估了局部文本驱动的修改，并将我们的基于代码的编辑与基于点云的3D编辑基线进行了比较。我们的研究表明，代码作为3D表示提供了强大的可控性和局部性，在目标编辑评估中产生了更强的编辑保真度和更好的未编辑区域保持。我们的工作还分析了这种范式的潜力，描绘了当前VLM在程序化3D建模中的能力边界，并强调了代码合成作为可编辑3D重建的一个有前景的方向。

英文摘要

Most recent 3D reconstruction and editing systems operate on implicit and explicit representations such as NeRF, point clouds, or meshes. While these representations enable high-fidelity rendering, they are fundamentally low-level and hard to control programmatically. In contrast, we propose and systematically evaluate a new 3D reconstruction paradigm, 3D Code Synthesis (3D-CoS), where 3D assets are constructed as executable Blender code, a programmatic and interpretable medium. To assess how well current VLMs can use code to represent 3D objects, we evaluate representative open-source and closed-source VLMs in code-based reconstruction under a unified protocol. We further introduce a suite of structured code-synthesis workflows, including blueprint-based planning, Retrieval-Augmented Generation (RAG) over Blender API documentation, few-shot geometric demonstrations, and a component-level Agent workflow for part-wise code generation. To demonstrate the unique advantages of this representation, we further evaluate localized text-driven modifications and compare our code-based edits with a point-cloud-based 3D editing baseline. Our study shows that code as a 3D representation offers strong controllability and locality, yielding stronger edit fidelity and better preservation of unedited regions in our targeted editing evaluation. Our work also analyzes the potential of this paradigm, delineates the current capability frontier of VLMs for programmatic 3D modeling, and highlights code synthesis as a promising direction for editable 3D reconstruction.

2606.10541 2026-06-10 cs.CV 新提交

GRAR: Glass-induced Reflection Artifact Removal in LiDAR Point Clouds

GRAR: LiDAR点云中玻璃引起的反射伪影去除

Wanpeng Shao, Zeyi Guo, Bo Zhang, Yifei Xue, Tie Ji, Yizhen Lao

发表机构 * College of Computer Science and Electronic Engineering, Hunan University（湖南大学计算机科学与电子工程学院）； School of Design, Hunan University（湖南大学设计学院）； School of Finance and Statistics, Hunan University（湖南大学金融与统计学院）

AI总结提出两阶段框架，先利用多模态视觉基础模型和几何线索精确分割玻璃区域，再基于物理驱动的反射感知局部-全局几何相似性描述符去除反射伪影，在多个公开数据集上优于现有方法。

AI中文摘要

在城市环境中采集的地面激光扫描（TLS）点云经常受到玻璃引起的反射伪影的影响，严重降低了后续应用的质量。现有的反射伪影去除方法通常依赖于理想的反射对称性假设，但其性能受限于不准确的玻璃估计和不足的几何表示。为了解决这些问题，我们提出了一种新颖的统一框架，旨在实现鲁棒的反射伪影去除：在第一阶段，我们利用多模态视觉基础模型生成初始玻璃掩膜，然后使用几何线索进行细化以获得高精度的玻璃区域，随后进行玻璃补全以恢复透明表面上由于无回波测量导致的缺失区域；在第二阶段，我们提出了一种物理驱动的描述符，称为反射感知局部-全局几何相似性（RE-LGGS），该描述符基于实际的激光反射几何，并使用基于PCA的局部形状表示联合编码多尺度几何结构和方向一致性，从而显著提高了对不完美观测的鲁棒性。在多个公开TLS数据集上的大量实验表明，我们的框架在反射伪影去除方面始终优于最先进的方法。

英文摘要

Terrestrial Laser Scanning (TLS) point clouds captured in urban environments frequently suffer from glass-induced reflection artifacts, severely degrading downstream applications. Existing reflection artifact removal methods generally rely on ideal reflection symmetry assumptions, yet their performance is limited by inaccurate glass estimation and insufficient geometric representations. To address these issues, we propose a novel unified framework aimed at robust reflection artifact removal. In the first stage, we leverage a multi-modal vision foundation model to produce initial glass masks, which are then refined using geometric cues to achieve high-precision glass regions, followed by glass completion to recover missing regions caused by no-return measurements on transparent surfaces. In the second stage, we propose a physics-driven descriptor, termed Reflection-aware Local-Global Geometric Similarity (RE-LGGS), which is grounded in actual laser reflection geometry and jointly encodes multi-scale geometric structures and orientation consistency using PCA-based local shape representations, thereby significantly improving robustness against imperfect observations. Extensive experiments on multiple public TLS datasets demonstrate that our framework consistently outperforms state-of-the-art methods in reflection artifact removal.

2606.10594 2026-06-10 cs.CV 新提交

Segment and Select: Vision-Language Segmentation in 3D Scenarios

Segment and Select: 3D场景中的视觉-语言分割

Yulin Chen, Zhihang Zhong, Yuenan Hou

发表机构 * Shanghai AI Laboratory（上海人工智能实验室）； University of Science and Technology of China（中国科学技术大学）； Shanghai Jiaotong University（上海交通大学）

AI总结提出SEGA3D范式，通过掩码候选生成器、大语言模型和语义空间选择器实现3D场景中基于语言指令的细粒度分割，在ScanNet和Matterport3D上分别提升8.3和5.3 mIoU。

Comments The core idea is to reformulate 3D vision-language segmentation as the segment-and-select paradigm (free from the superpoint dependency)

AI中文摘要

3D视觉-语言分割旨在根据语言指令和视觉观察在3D场景中分割目标对象。现有技术严重依赖粗糙的超点表示来降低计算复杂度，这导致分割质量差和对象边界混乱。本文提出用于3D视觉-语言分割的SEGment-And-Select（SEGA3D）范式，该范式直接操作于细粒度视觉信息，无需依赖超点。具体而言，我们首先利用掩码候选生成器提供细粒度的类别掩码候选，显著提高候选掩码相对于超点对应物的质量。然后，利用大语言模型（LLM）基于语言描述和视觉特征生成语义和空间信息。LLM输出和视觉特征被输入到语义-空间选择器（SSS）以产生排名最高的掩码候选。最后，设计循环验证模块（LVM）从选定的候选掩码中产生分割掩码。我们的SEGA3D在ScanRefer、ScanNet和Matterport3D基准测试中取得了有竞争力的性能。值得注意的是，我们的SEGA3D在ScanNet和Matterport3D上分别超过最佳性能对手8.3 mIoU和5.3 mIoU。代码将在发表后提供。

英文摘要

3D vision-language segmentation aims to segment target objects in 3D scenarios according to the linguistic instructions and visual observations. Prior art heavily relies on the coarse superpoint representation to reduce the computation complexity, which suffers from poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation that directly operates on the fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over the superpoint counterparts. Then, a Large Language Model (LLM) is utilized to generate the semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Eventually, the Loopback Verification Module (LVM) is designed to yield the segmentation mask from the selected candidate masks. Our SEGA3D attains competitive performance on ScanRefer, ScanNet and Matterport3D benchmarks. Notably, our SEGA3D surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Codes will be available upon publication.

2606.10602 2026-06-10 cs.CV 新提交

Globally Localizing Lunar Rover in Pixels via Graph Alignment

通过图对齐在像素级全局定位月球车

Mao Chen, Xu Yang, Chuankai Liu, Xiangkai Zhang, Xiaoxue Wang, Zheng Bo, Zuoyu Zhang, Zhiyong Liu

发表机构 * The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所多模态人工智能系统国家重点实验室）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Beijing Aerospace Control Center（北京航天飞行控制中心）； Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences（中国科学院空间应用工程与技术中心）

AI总结提出WARG框架，利用统一图学习和重投影图匹配解决月球车跨视角定位中的实体纠缠、视角差异和仿真到真实域偏移问题，在玉兔二号真实数据上实现1.68米定位误差。

AI中文摘要

精确的月球车定位是自主月球探测的前提，然而全球导航卫星系统（GNSS）信号的缺失以及局部定位方法的累积漂移严重限制了远程任务。跨视角定位通过匹配月球车视角和卫星视角图像提供了一种有前景的无漂移全局解决方案。然而，月球环境为对应点对齐带来了独特挑战，包括实体间纠缠、视角间差异以及仿真到真实的域偏移。为了解决这些挑战，我们提出了重投影图扭曲对齐（WARG），一个利用统一图学习和重投影图匹配实现鲁棒跨视角对齐的框架。在合成的LuSNAR数据集上预训练后，WARG的平均测试误差为0.32米，并在合成月球南极区域展现出鲁棒的零样本泛化能力，误差为3.63米。更重要的是，在玉兔二号月球车的真实数据上验证时，WARG在100米×100米的搜索区域内实现了1.68米的定位误差，相当于在空间分辨率为1.40米/像素的低分辨率卫星图像中达到近像素级精度。除了精度，WARG计算高效，仅含1.56M参数，是之前轻量级模型的16.12%，在NVIDIA RTX A6000 GPU上运行频率为5.49 Hz，接近GNSS级更新频率。最后，我们观察到WARG通过跨视角定位学习自然发展出低级空间感知能力，包括语义分割和结构推理，突显其作为以最小标注成本实现空间智能的有前景范式的潜力。源代码见：https://github.com/maochen-casia/warg

英文摘要

Precise rover localization is a prerequisite for autonomous lunar exploration, yet the absence of Global Navigation Satellite System (GNSS) signals and the cumulative drift of local localization methods severely constrain long-range missions. Cross-view localization provides a promising drift-free global solution by matching rover-view and satellite-view imagery. However, the lunar environment poses unique challenges for correspondence alignment, including inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift. To address these challenges, we propose Warped Alignment of Reprojected Graphs (WARG), a framework that leverages unified graph learning and reprojected graph matching for robust cross-view alignment. Pretrained on the synthetic LuSNAR dataset, WARG achieves an average test error of 0.32 m and demonstrates robust zero-shot generalization to the synthetic lunar south pole region with an error of 3.63 m. More importantly, when validated on real-world data from the YuTu-2 rover, WARG achieves a localization error of 1.68 m within a 100 m x 100 m search area, corresponding to nearly one-pixel precision in low-resolution satellite imagery with a spatial resolution of 1.40 m/pixel. Beyond accuracy, WARG is computationally efficient, containing only 1.56M parameters, corresponding to 16.12% of previous lightweight models, and operating at 5.49 Hz on an NVIDIA RTX A6000 GPU, approaching GNSS-level update frequency. Finally, we observe that WARG naturally develops low-level spatial awareness, including semantic segmentation and structural reasoning, through cross-view localization learning, highlighting its potential as a promising paradigm for spatial intelligence with minimal annotation cost. The source code is available at https://github.com/maochen-casia/warg.

2606.10988 2026-06-10 cs.CV cs.GR 新提交

AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects

AnimaSpark: 一种用于任意3D对象动画的前馈方法

Yiming Zhao, Haoyu Sun, Aoyu Wang

发表机构 * ByteDance（字节跳动）

AI总结提出AnimaSpark管道，通过将带骨骼的3D模型渲染为多层图像表示，输入视频生成模型，再提取2D关键点运动并提升至3D空间，实现类别无关的3D动画生成，在文本-运动对齐、运动质量和计算效率上优于现有方法。

AI中文摘要

尽管生成式AI的最新进展显著加速了静态3D模型创建流程，但类别无关的3D动画合成仍然是3D资产生产中的主要瓶颈。当前的类别无关动画生成方法在推理速度、运动质量和文本提示遵循方面存在关键限制，使得该过程仍依赖于劳动密集型的手工艺术。为解决这些挑战，本文介绍了AnimaSpark，一种用于类别无关3D动画生成的新型管道。我们的方法受以下关键洞察驱动：对于3D世界中的许多基本运动，相应的关节变换通常可以在二维子空间内有效建模。该管道首先将带骨骼的静态3D模型渲染为其网格和骨架的多层图像表示，随后将其输入视频生成模型。然后，我们在生成的视频上采用关键点跟踪算法，捕获投影到相机观察平面上的骨骼关节运动。在最后阶段，我们从这些跟踪的关键点中提取平面平移和旋转，并将其从2D域提升到3D空间以驱动角色动画。全面评估表明，我们的方法在关键指标（包括文本-运动对齐、运动质量和计算效率）上优于现有最先进技术。

英文摘要

While recent advancements in generative AI have substantially accelerated static 3D model creation workflows, the synthesis of category-agnostic 3D animations remains a significant bottleneck in 3D asset production. Current methods for category-agnostic animation generation exhibit critical limitations in inference speed, motion quality, and adherence to textual prompts, thereby leaving the process dependent on labor-intensive manual artistry. To address these challenges, this paper introduces AnimaSpark, a novel pipeline for category-agnostic 3D animation generation. Our approach is motivated by the key insight that for many fundamental motions in the 3D world, the corresponding joint transformations can often be effectively modeled within a two-dimensional subspace. The pipeline begins by rendering a rigged static 3D model into multi-layered image representations of its mesh and skeleton, which are subsequently fed into a video generation model. We then employ a keypoint tracking algorithm on the generated video to capture the motion of the skeletal joints projected onto the camera's viewing plane. In the final stage, we distill the planar translations and rotations from these tracked keypoints and lift them from the 2D domain into 3D space to animate the character. Comprehensive evaluations reveal that our method achieves superior performance over existing state-of-the-art techniques across key metrics, including text-motion alignment, quality of motion, and computational efficiency.

2504.18424 2026-06-10 cs.CV 版本更新

LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

LaRI: 用于单视图3D几何推理的分层射线交点

Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka

发表机构 * ETH Zurich（苏黎世联邦理工学院）； Adobe Research（Adobe研究）

AI总结提出LaRI方法，通过分层点图预测射线与多个表面的交点，实现单次前馈的完整场景重建，支持物体级和场景级任务。

Comments Project page: https://ruili3.github.io/lari

Journal ref: ICML 2026

ObjSplat: 几何感知的高斯面元用于主动物体重建

Yuetao Li, Zhizhou Jia, Yu Zhang, Qun Hao, Shaohui Zhang

发表机构 * School of Optics and Photonics, Beijing Institute of Technology（光学与光子学学院，北京理工大学）； School of Optoelectronic Engineering, Changchun University of Science and Technology（光电工程学院，长春理工大学）

AI总结提出ObjSplat框架，利用高斯面元统一表示，通过几何感知视点评估和下一最佳路径规划器，实现高效高保真的主动物体重建。

Comments Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat, Project Page: https://li-yuetao.github.io/ObjSplat-page/

DOI: 10.1109/TASE.2026.3700105

AI中文摘要

自主高保真物体重建是创建数字资产和弥合机器人领域仿真与现实差距的基础。我们提出ObjSplat，一个主动重建框架，利用高斯面元作为统一表示，逐步重建未知物体，同时具有逼真的外观和准确的几何。针对传统基于不透明度或深度线索的局限性，我们引入了几何感知视点评估管线，明确建模背面可见性和遮挡感知的多视图共视性，即使在几何复杂的物体上也能可靠地识别未充分重建的区域。此外，为了克服贪婪规划策略的局限性，ObjSplat采用下一最佳路径（NBP）规划器，在动态构建的空间图上执行多步前瞻。通过联合优化信息增益和移动成本，该规划器生成全局高效的轨迹。在仿真和真实世界文物上的大量实验表明，ObjSplat在几分钟内生成物理一致的模型，与最先进方法相比，实现了卓越的重建保真度和表面完整性，同时显著减少了扫描时间和路径长度。项目页面：https://li-yuetao.github.io/ObjSplat-page/

英文摘要

Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/

2606.10021 2026-06-10 cs.CV 新提交

SpineReport: Automated 3D Quantification and Reporting of Lumbar Spine Degeneration on MRI

SpineReport: MRI上腰椎退变的自动化3D量化与报告

Nathan Molinier, Adrian A. Marth, Reto Sutter, Christoph Germann, Jacob A. Connolly, Mathieu Guay-Paquet, Nathan D. Schilaty, Kenneth A. Weber, Julien Cohen-Adad

AI总结提出SpineReport开源框架，利用鲁棒解剖分割从腰椎MRI中提取3D形态和信号特征，生成个体化报告，在中央管狭窄评估中AUC达0.95。

Comments Submitted to Medical Image Analysis

AI中文摘要

腰椎疾病是全球致残的主要原因，但MRI上退变的可靠量化仍具挑战。临床实践中，分析主要在二维（2D）中进行，因为手动三维（3D）评估耗时。然而，2D测量重复性有限，尤其当解剖结构不与成像平面对齐时。现有自动化方法通常局限于2D、依赖离散分级或缺乏鲁棒性和可解释性。我们介绍SpineReport，一个用于腰椎MRI全面3D形态测量的开源全自动框架。利用鲁棒解剖分割，该方法从关键结构中提取定量指标，包括椎管、脊髓、椎骨、椎间盘和椎间孔。这些指标包括形态和信号特征，支持跨受试者和纵向评估。SpineReport进一步生成个体化报告，允许与队列分布比较，提高脊柱形态的可解释性和客观表征。临床相关性根据放射科医生报告的中央管、侧隐窝和椎间孔狭窄严重程度分级进行评估。指标与中央管狭窄严重程度强相关，T2加权脑脊液信号表现最佳（AUC = 0.95）。椎管前后径和面积比也显示出强相关性和高区分能力（AUC > 0.80）。对于侧隐窝狭窄，相关性中等，侧方脑脊液信号最具信息量（AUC = 0.73）。尽管感兴趣区域提取鲁棒，但未观察到与椎间孔狭窄的显著关联。SpineReport作为开放获取工具发布：https://ivadomed.github.io/SpineReport/

Abstract

Lumbar spine conditions are a leading cause of disability worldwide, yet reliable quantification of degeneration from MRI remains challenging. In clinical practice, analysis is predominantly performed in two dimensions (2D), as manual three-dimensional (3D) assessment is time-consuming. However, 2D measurements suffer from limited reproducibility, particularly when anatomical structures are not aligned with the imaging plane. Existing automated approaches are often restricted to 2D, rely on discrete grading, or lack robustness and interpretability. We introduce SpineReport, an open-source, fully automated framework for comprehensive 3D morphometric analysis of lumbar spine MRI. Leveraging robust anatomical segmentations, the method extracts quantitative metrics from key structures, including the spinal canal, spinal cord, vertebrae, intervertebral discs, and foramina. These include both morphological and signal-based features, enabling cross-subject and longitudinal assessment. SpineReport further generates subject-specific reports that allow comparison with cohort distributions, improving interpretability and objective characterization of spinal morphology. Clinical relevance was evaluated against radiologist-reported severity grades for central canal, lateral recess, and foraminal stenosis. Metrics showed strong associations with central canal stenosis severity, with T2-weighted CSF signal providing the highest performance (AUC = 0.95). Canal AP diameter and area ratios also demonstrated strong correlations and high discriminative ability (AUC > 0.80). For lateral recess stenosis, associations were moderate, with lateral CSF signal being the most informative (AUC = 0.73). No significant associations were observed for foraminal stenosis despite robust region-of-interest extraction. SpineReport is released as an open-access tool: https://ivadomed.github.io/SpineReport/


2606.10088 2026-06-10 cs.CV New submission

Interpretable Temporal Facial-Region Motion Analysis for In-the-Wild Parkinson's Disease Video Classification

Riyadh Almushrafy

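AI summary: Proposes temporal motion descriptors based on facial-region keypoints for lightweight, interpretable PD video classification on the YouTubePD benchmark, reaching a balanced accuracy of 0.826.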

Comments 22 pages, 6 figures. Submitted to Biomedical Signal Processing and Control


Abstract

Reduced facial expressivity is a common motor manifestation of Parkinson's disease (PD), often described as hypomimia or facial bradykinesia. This paper examines whether temporal motion descriptors extracted from facial-region keypoints can support in-the-wild PD-related video classification on the YouTubePD benchmark. Each video is represented using geometric descriptors from 14 predefined facial regions. Static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU sequence baseline are compared under the same binary classification protocol. To assess stability and interpretability, the study includes seed-robustness analysis, region-level ablation, and permutation importance. The best result is obtained with normalized velocity descriptors and a Random Forest classifier, reaching a balanced accuracy of 0.826 and an AUROC of 0.855 on the held-out test split. Across 10 random seeds, this representation remains stable, with balanced accuracy of 0.810 +/- 0.018 and AUROC of 0.855 +/- 0.005. Overall, the results suggest that normalized facial-region motion is a lightweight and interpretable representation for YouTubePD video classification. The study is framed as a benchmark-level analysis and does not claim clinical severity assessment or MDS-UPDRS facial-expression scoring.
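As a rough illustration of the "normalized velocity descriptors" mentioned in this abstract, the sketch below computes per-landmark speed statistics from a keypoint sequence. The abstract does not specify the normalization or the summary statistics, so the function name, the choice of two reference landmarks for scale normalization, and the mean/standard-deviation summary are all assumptions made for illustration only.

```python
import numpy as np

def normalized_velocity_descriptor(keypoints, scale_idx=(0, 1)):
    """Summarize landmark motion in a video clip.

    keypoints: array of shape (T, K, 2), K 2D landmarks over T frames.
    scale_idx: indices of two reference landmarks whose distance sets the
               per-frame scale (a hypothetical normalization choice).
    Returns a vector of shape (2 * K,): mean speed, then speed std, per landmark.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    a, b = scale_idx
    # Per-frame scale, so the descriptor does not depend on face size in pixels.
    scale = np.linalg.norm(keypoints[:, a] - keypoints[:, b], axis=-1)
    scale = np.maximum(scale, 1e-8)
    normed = keypoints / scale[:, None, None]
    # Frame-to-frame displacement and its magnitude (speed) per landmark.
    velocity = np.diff(normed, axis=0)
    speed = np.linalg.norm(velocity, axis=-1)
    return np.concatenate([speed.mean(axis=0), speed.std(axis=0)])
```

A fixed-length vector like this could then be fed to a tree-based classifier such as the Random Forest named in the abstract.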


2606.10115 2026-06-10 cs.CV New submission

Improving PET/CT-Based Whole-Body Lesion Segmentation Using Prediction Uncertainty-Augmented Models

Bashirul Azam Biswas, Biratal Raj Wagle, Zhihan Yang, Marc A. Seltzer, Matthew E. Maeder, James B. Yu, Indrani Bhattacharya

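AI summary: Proposes an uncertainty-aware framework that combines Bayesian ensembling, voxel-wise uncertainty quantification, and uncertainty-augmented training to improve the robustness and lesion detection of whole-body PET/CT lesion segmentation, validated on the AutoPET-III and Deep-PSMA datasets.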

Comments 32 pages, 10 figures, 5 tables


Abstract

Accurate lesion segmentation from whole-body Positron Emission Tomography (PET)/Computed Tomography (CT) scans is essential for cancer staging and treatment planning. PET provides functional metabolic information with different radiotracers, while CT offers anatomical localization. Lesion delineation from PET/CT imaging is clinically challenging due to subtle imaging features, confounders, and inter-reader variability. Existing deep learning approaches suffer from training-related stochasticity, inconsistent predictions, and missed lesions in high tumor-burden cases, and they lack uncertainty quantification, limiting their clinical reliability. Using nnU-Net as a baseline, we propose an uncertainty-aware framework for whole-body PET/CT lesion segmentation that integrates (1) Bayesian ensembling to reduce training stochasticity, (2) voxel-wise uncertainty quantification with epistemic and aleatoric decomposition, and (3) epistemic uncertainty-augmented training to improve lesion detection. Two public datasets, AutoPET-III (1,611 scans) and Deep-PSMA (200 scans), comprising FDG and PSMA studies across multiple cancer types, are used for training and evaluation. Bayesian ensembling improves robustness and performance over deterministic nnU-Net models on the unseen AutoPET-III test set. Uncertainty maps highlight regions of model disagreement and correlate with misclassifications, particularly false positives. Uncertainty-augmented training improves lesion recovery at the cost of increased false-positive volume (FPVol), reflecting a precision-recall trade-off. A case-adaptive routing strategy further improves Dice by selecting between the base and augmented models. To our knowledge, this is the first study to systematically investigate uncertainty quantification in multi-tracer, pan-cancer PET/CT segmentation and to combine Bayesian ensembling with uncertainty-aware modeling for this task.
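The epistemic/aleatoric decomposition named in this abstract is often computed from an ensemble with an entropy-based split. The sketch below shows that common formulation for binary (lesion vs. background) voxel probabilities. Whether the paper uses exactly this decomposition is not stated in the abstract, so treat it as an assumption; the function names are illustrative.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def decompose_uncertainty(member_probs):
    """Split voxel-wise predictive uncertainty into two parts.

    member_probs: array of shape (M, ...) holding the foreground probability
                  predicted by each of M ensemble members.
    Returns (total, aleatoric, epistemic), each of shape (...).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_prob = member_probs.mean(axis=0)
    total = binary_entropy(mean_prob)                       # entropy of the averaged prediction
    aleatoric = binary_entropy(member_probs).mean(axis=0)   # average entropy of each member
    epistemic = total - aleatoric                           # disagreement between members
    return total, aleatoric, epistemic
```

With this split, voxels where members confidently disagree get high epistemic uncertainty, while voxels where every member is unsure get high aleatoric uncertainty.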


2606.10372 2026-06-10 cs.CV New submission

ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment

Xianye Xiao, Yulong Zou, Yujie Luo, Taihui Yu, Cun-Jing Zheng, Yuan-ming Geng, Shuihua Wang, Yudong Zhang, Jin Hong

Affiliations: School of Mathematics and Computer Sciences, Nanchang University; School of Information Engineering, Nanchang University; Department of Radiology, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University; Department of Stomatology, Zhujiang Hospital, Southern Medical University; Department of Biological Sciences, School of Science, Xi'an Jiaotong Liverpool University; School of Computer Science and Engineering, Southeast University

AI summary: Proposes ClinReadNet, a framework that mimics radiologists' reading habits by combining a Sobel ordinal quality network with a (shifted) window multi-scale temperature multi-head self-attention module and a hierarchical ranked probability score loss, achieving SOTA performance on the LDCTIQAG2023 dataset.


Abstract

In abdominal CT imaging, developing a low-dose, no-reference image quality assessment (No-reference IQA) model that mimics doctors' reading habits for evaluating CT image quality has significant practical value. This paper proposes a novel deep learning-based framework, ClinReadNet, whose design aligns with the clinical reading logic of radiologists: first, it introduces the Sobel ordinal quality network (SOQN) module, which can simultaneously focus on edge details highly relevant to image quality and the quality distribution pattern of the entire image, accurately matching the clinical image-reading judgment habit of "considering both local details and overall context"; second, the framework integrates the (shifted) window multi-scale temperature multi-head self-attention ((S)W-MTMSA) module, which further replicates the radiologists' image-reading process of shifting from overall scanning to local focusing, and accurately locks in regions of interest through multi-sharpness attention; third, it designs the hierarchical ranked probability score (HRPS) loss function, which combines the dual logics of coarse classification and fine classification, while paying attention to the distance information between grading labels, effectively improving the performance of image quality assessment. Experiments conducted on the LDCTIQAG2023 dataset show that the proposed method achieves the current state-of-the-art (SOTA) performance: the values of Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC) reach 0.9507, 0.9554, and 0.8629 respectively, with the sum of their absolute values (Score) being 2.7690, outperforming existing methods.
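To make the "distance information between grading labels" in this abstract concrete, here is a minimal sketch of the plain ranked probability score (RPS) for ordered grades. It is not the hierarchical HRPS loss from the paper, whose exact form the abstract does not give; the function name and the unnormalized variant (some definitions divide by the number of grades minus one) are choices made for illustration.

```python
import numpy as np

def ranked_probability_score(probs, true_grade):
    """Plain RPS for one sample.

    probs: length-C predicted probabilities over C ordered quality grades.
    true_grade: integer index of the true grade.
    The score compares cumulative distributions, so probability mass placed
    far from the true grade is penalized more than mass placed nearby.
    """
    probs = np.asarray(probs, dtype=float)
    target = np.zeros_like(probs)
    target[true_grade] = 1.0
    cdf_diff = np.cumsum(probs) - np.cumsum(target)
    return float(np.sum(cdf_diff ** 2))
```

For five grades with true grade 0, a confident prediction of grade 1 scores 1.0 while a confident prediction of grade 4 scores 4.0, which a standard cross-entropy loss would not distinguish.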


2606.10378 2026-06-10 cs.CV New submission

FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation

Jiawei Liu, Zhijiang Wan, Junhua Hu, Rongli Zhang, Zhongbiao Xu, Yankun Cao, Yuan Chen, Jin Hong

Affiliations: Ji luan Academy, Nanchang University; School of Information Engineering, Nanchang University; State Key Laboratory of Water Cycle and Water Security, China Institute of Water Resources and Hydropower Research; Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong; Department of Radiotherapy, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Southern Medical University; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University; Department of Pediatrics, Shandong Provincial Hospital Affiliated to Shandong First Medical University

AI summary: Proposes the Frequency-Spatial Synergy Network (FSS-Net), which integrates wavelet transform, multi-domain attention, and edge enhancement, achieving a Dice score of 96.46% on carotid ultrasound datasets and effectively segmenting carotid arteries and identifying plaque.


Abstract

Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates for high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low SNR conditions, outperforming several state-of-the-art methods. The method accurately segments carotid arteries in ultrasound images, effectively identifies carotid atherosclerotic plaque, and is further validated on other tasks (such as breast cancer segmentation), suggesting good clinical potential for identifying abnormal tissue masses in ultrasound images.
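For readers unfamiliar with the Dice score (DSC) reported in this abstract, the sketch below computes it for a pair of binary masks. The function name and the convention of returning 1.0 when both masks are empty are illustrative choices, not details taken from the paper.

```python
import numpy as np

def dice_score(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return float(2.0 * intersection / denom)
```

A reported DSC of 96.46% corresponds to a value of 0.9646 on this 0-to-1 scale.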


2606.10735 2026-06-10 cs.CV physics.med-ph New submission

Patient-Level Diagnosis of Acute Myeloid Leukemia via Deep Learning Analysis of Bone Marrow Smear

Yuqi Ma, Tianyi Wang, Weihua Meng, Hongru Chen, Fajin Tao, Qunxian Lu, Lin An, Xiaodong Mo, Gen Yang

Affiliations: State Key Laboratory of Nuclear Physics and Technology, School of Physics, Peking University; Peking University People’s Hospital, Peking University Institute of Hematology, National Clinical Research Center for Hematologic Disease, Beijing Key Laboratory of Hematopoietic Stem Cell Transplantation; Shanghai Dishuo Beiken Biotechnology Co., Ltd.

AI summary: Proposes a cell-to-patient deep learning pipeline that detects cells with YOLO, classifies Composite Blast-like Cells (CBLC) with EfficientNet-B0, and aggregates cell-level predictions into patient-level CBLC ratios for AML-assisted diagnosis, with ensemble weighted F1-scores of 0.87 to 0.91 across the three external validation centers.

Comments 4 figures


Abstract

Bone marrow smear review remains important for acute myeloid leukemia (AML) assessment, but manual single-cell interpretation is labor-intensive and patient-level diagnosis requires aggregation of many cellular observations. We present a cell-to-patient deep learning pipeline for AML-assisted diagnosis from bone marrow smear images. The study included 258 patients from six anonymized centers, including a main cohort of 169 patients from Centers 1-3 and an external validation cohort of 89 patients from Centers 4-6. A 16-category cell annotation vocabulary was used to describe the global cellular composition, including granulocytic, monocytic, erythroid, lymphoid, eosinophilic, and other cells. Rather than identifying strict AML blasts or leukemic blasts, the model targets an expert-defined composite category termed Composite Blast-like Cells (CBLC), comprising N, N1, M, M1, R, R1, J, and J1 according to the project-wide morphological standard. A fixed YOLO-based segmentation module detected cells, predicted contours were matched to expert polygon annotations by contour IoU, and standardized single-cell crops were generated. An EfficientNet-B0 classifier was trained through a two-stage GT-to-YOLO and YOLO-to-YOLO strategy with class-imbalance correction, center-border regularization, and morphology-assisted supervision. Cell-level predictions were aggregated into patient-level CBLC ratios for AML-oriented diagnostic support. The pipeline achieved stable internal validation and maintained external generalization, with ensemble weighted F1-scores of 0.9076, 0.8696, and 0.9124 on Centers 4, 5, and 6, respectively.
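The aggregation step in this abstract (cell-level predictions to a patient-level CBLC ratio) can be sketched as a simple count. The eight CBLC class codes come from the abstract; the function name and any non-CBLC label strings are hypothetical, and the abstract gives no decision threshold on the ratio, so none is assumed here.

```python
from collections import Counter

# Class codes listed in the abstract as forming the composite CBLC category.
CBLC_CLASSES = {"N", "N1", "M", "M1", "R", "R1", "J", "J1"}

def patient_cblc_ratio(cell_labels):
    """Fraction of a patient's classified cells that fall in the CBLC category.

    cell_labels: list of predicted class codes, one per detected cell.
    """
    if not cell_labels:
        raise ValueError("at least one classified cell is required")
    counts = Counter(cell_labels)
    cblc_count = sum(n for label, n in counts.items() if label in CBLC_CLASSES)
    return cblc_count / len(cell_labels)
```

For example, a patient with four classified cells of which two are predicted as "N" and "M1" would get a ratio of 0.5.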


2606.10756 2026-06-10 cs.CV physics.med-ph New submission

DD-INR: Dynamics-Driven Implicit Neural Representation for Accelerated Whole-Brain Functional MRI Reconstruction

Qiaoxin Li, Caini Pan, Pierre-Antoine Comby, Chaithya Giliyar, Philippe Ciuciu

Affiliations: MIND, Inria, Palaiseau, France; Neurospin, CEA Paris Saclay, France; CEA NeuroSpin, Paris-Saclay University, CNRS BAOBAB, Gif-sur-Yvette, France

AI summary: Proposes DD-INR, a framework that splits fMRI data into a static background and a dynamic component and models the dynamics with an implicit neural representation, enabling accelerated fMRI reconstruction with improved image quality and recovery of activation patterns.


Journal ref: MICCAI 2026 - 29th International Conference on Medical Image Computing and Computer Assisted Intervention, Sep 2026, Strasbourg, France


Abstract

Accelerated acquisition of fMRI enables enhanced detection of neurovascular (BOLD) activity in the brain, but image reconstruction becomes challenging with high k-space undersampling: Task-evoked BOLD signals are small in magnitude, which traditional anatomical MRI reconstruction methods fail to recover, as they favor spatial accuracy over temporal fidelity. We present DD-INR, a Dynamics-Driven Implicit Neural Representation framework tailored for accelerated fMRI that benefits from incoherent time-varying sampling and a tailored spatiotemporal prior, outperforming traditional methods, demonstrated in simulation and in-vivo acquisition, both in terms of image quality and retrieval of activation patterns. DD-INR achieves this by splitting the fMRI data into a static background and a temporally varying dynamic component, representing only the dynamics with a dedicated INR, thereby focusing the model's capacity on activation-relevant changes while remaining compact. In general, DD-INR provides a promising framework for accelerated fMRI reconstruction, with the potential to improve the sensitivity and robustness of fMRI studies within practical scan time limits. The source code is available at https://github.com/JoosenLi/DD-INR.


2606.10778 2026-06-10 cs.CV New submission

From Patches to Patients: A study of the tile-to-slide performance transferability in Digital Pathology

Sofiène Boutaj, Leo Fillioux, Maria Vakalopoulou, Stergios Christodoulidis, Pierre Marza

Affiliations: Université Paris-Saclay, CentraleSupélec, Gustave Roussy, INSERM, IHU PRISM, Cancer Data Science Unit; Université Paris-Saclay, CentraleSupélec, MICS Laboratory

AI summary: Studies whether tile-level linear probing can serve as a reliable proxy for slide-level performance; benchmarking 19 foundation models on 42 slide-level and 16 tile-level tasks shows that tile and slide performance are highly correlated and that tile-level benchmarking can effectively shortlist candidate models.

Comments Accepted to MICCAI 2026


Abstract

Foundation Models (FMs) have recently redefined the state-of-the-art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal foundation model (FM) for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run full slide-level pipelines for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile probing metrics against slide-level outcomes using ABMIL and Mean Pooling aggregations. We observe a high correlation between tile and slide performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of WSI success. Sensitivity analyses show that transferability is stable across models and is more influenced by cohort sizes and numbers of tiles per slide than by average task difficulty. We also measure the agreement in best performing models between tile and slide-level tasks, showing tile benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.
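One way to check whether tile-level scores rank models the same way as slide-level scores is a rank correlation across models. The sketch below computes Spearman's rho with NumPy only. The abstract says "high correlation" without naming the coefficient, so the use of Spearman here is an assumption, and this simple version does not handle tied scores.

```python
import numpy as np

def spearman_rho(tile_scores, slide_scores):
    """Spearman rank correlation between two per-model score lists.

    tile_scores, slide_scores: equal-length sequences, one entry per model.
    Ties are not handled; each score is assumed to be distinct.
    """
    # argsort of argsort converts scores to 0-based ranks.
    tile_ranks = np.argsort(np.argsort(tile_scores))
    slide_ranks = np.argsort(np.argsort(slide_scores))
    return float(np.corrcoef(tile_ranks, slide_ranks)[0, 1])
```

A value near 1 would mean that shortlisting encoders by tile-level probing picks the same models that slide-level evaluation would favor.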


2606.11001 2026-06-10 cs.CV New submission

IPSM-Bench: A New Intermediate Phase Segmentation Benchmark in Microstructure Images of Zinc-Based Absorbable Biomaterials

Jinglin Xu, Shangyan Zhao, Jiabo Wang, Xinghong Mu, Yulong Lei, Jiacheng Zhang, Hongbo Sun, Yageng Li

Affiliations: School of Artificial Intelligence, University of Science and Technology Beijing; School of Advanced Materials Innovation, University of Science and Technology Beijing; China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; School of Materials Science and Engineering, University of Science and Technology Beijing; Institute of Materials Intelligent Technology, Liaoning Academy of Materials; School of Big Data and Software Engineering, Chongqing University

AI summary: To address data scarcity, low contrast, and other challenges in intermediate phase segmentation of zinc alloys, builds IPSM-Bench, the largest high-quality dataset for the task, and proposes SCoP-SAM, a spatial context prior-guided SAM method that achieves state-of-the-art segmentation performance.

Comments Accepted by IJCAI 2026


Abstract

Zinc-based alloys are indispensable emerging absorbable metallic biomaterials, and their macroscopic performance is governed by microstructural characteristics. Intermediate phases, key microstructural constituents, are pivotal in regulating mechanical and functional properties. However, intermediate phase segmentation in zinc alloy microstructures faces formidable challenges: scarce annotated datasets, low contrast, difficulty detecting small targets, and heterogeneous morphologies. To this end, we construct IPSM-Bench, the largest high-quality dataset for zinc-alloy intermediate phase segmentation. Furthermore, we propose SCoP-SAM, a new Spatial Context Prior-guided SAM method that leverages the gradient structure and grayscale properties of intermediate phases to capture spatial context priors and incorporates them into the entire SAM encoding-decoding process, improving segmentation performance. Based on the proposed IPSM-Bench, we establish a new benchmark for intermediate phase segmentation to systematically evaluate state-of-the-art (SOTA) methods and advance research on zinc alloy microstructure analysis. Extensive experiments on IPSM-Bench and additional public alloy benchmarks demonstrate that our SCoP-SAM not only achieves SOTA performance for zinc-alloy intermediate phase segmentation but also generalizes remarkably well to other alloy scenarios.


2606.11012 2026-06-10 cs.CV New submission

An Uncertainty Estimation Framework for Dose Accumulation in Adaptive Radiotherapy: Application to CBCT-Guided Radiotherapy for Cervical Cancer

Cedric Hemon, Delphine Lebret, Jean-Claude Nunes, Valentin Boussot, Karine Peignaux, Nathalie Mesgouez-Nebout, Chantal Hanzen, Antoine Simon, Anaïs Barateau, Renaud de Crevoisier, Caroline Lafond

Affiliations: Univ. Rennes, CLCC Eugène Marquis, INSERM, LTSI - UMR 1099; Department of Radiation Oncology, Centre Georges-Francois Leclerc; Institut de Cancérologie de l’Ouest–Site Paul Papin; CLCC Henri Becquerel

AI summary: Proposes the IMPACT-DoseAcc framework, which quantifies DIR uncertainty with two strategies (Bayesian segmentation guidance and an ensemble of segmentation models) and propagates it to cumulative dose metrics; applied to CBCT-guided oART for cervical cancer, it shows calibrated uncertainty that correlates with geometric error.

Comments Under revision


Abstract

Background and purpose: oART enables daily plan adaptation to interfraction anatomical variations, but cumulative dose estimation remains limited by DIR, segmentation, and anatomical uncertainties. We introduce IMPACT-DoseAcc, an uncertainty-aware dose accumulation framework, within IMPACT for semantic feature-driven image analysis. The framework is modality- and disease-agnostic and is applied to CBCT-guided oART for cervical cancer (LACC). Material and Methods: Nine LACC patients were retrospectively analyzed using daily CBCT-derived virtual CTs for dose recalculation. IMPACT-DoseAcc focuses on uncertainty from DIR, without modeling vCT-generation uncertainty. Two DIR uncertainty strategies were tested within IMPACT-Reg: a Bayesian segmentation-guided approach using one probabilistic model to quantify anatomical uncertainty, and an ensemble of segmentation models targeting structures to capture epistemic variability. Voxel-wise uncertainty maps were propagated through dose warping and accumulation to generate probabilistic dose-volume histograms. Ensemble uncertainty was quantified from voxel-wise standard deviation across deformation fields, and geometric error was assessed using surface distance between warped and validated contours. Anatomical-variability weighting refined aggregation. Results: Ensemble DIR uncertainty correlated with geometric error, with Pearson coefficients of 0.63 for CTVt and 0.66 for bladder. For CTVt, pDVHs achieved 96.3 +/- 3.9% coverage, showing calibration of propagated uncertainty. Weighting stabilized estimates across fractions and organs. Conclusions: IMPACT-DoseAcc propagates registration-driven uncertainty to cumulative dose metrics, improving interpretation of accumulated dose under anatomical variations. Its 3DSlicer integration supports reproducible, uncertainty-informed ART workflows.


2606.11106 2026-06-10 cs.CV cs.AI New submission

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

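Affiliations: Hamad Bin Khalifa University; HMC (Hamad Medical Corporation); Advanced AlRazi Diagnostic Center; Sidra Medicine

AI summary: Proposes FADA, a unified vision-language model that selectively distills knowledge from four domain-specific foundation models to perform fetal ultrasound interpretation, classification, detection, and segmentation; it is trainable on a single consumer GPU, needs no external labels, and runs offline on a smartphone.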


Abstract

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.


2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG Cross-list

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

Affiliations: Center Algoritmi / LASI, University of Minho, Braga, Portugal; Institute for Artificial Intelligence in Medicine, University Medicine Essen, Essen, Germany; Institute of Medical Informatics / Dept. of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Germany; Faculty of Computer Science, University of Duisburg-Essen, Essen, Germany

AI summary: Proposes ++nnU-Net, which performs data augmentation via image registration, generating warped images before preprocessing and training, and improves the Dice coefficient by up to about 22% across five 2D datasets.

Comments 7 pages, 1 figure, 2 tables


英文摘要

The nnU-Net has demonstrated continued success in medical segmentation tasks, which rely heavily on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentations. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks, and creates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git
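
The abstract reports its gains as Dice Similarity Coefficient scores. As a reminder of what that metric measures, here is a minimal sketch for two binary masks. The function name and the empty-mask convention are assumptions for illustration and are not taken from the ++nnU-Net code.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0
    return float(2.0 * intersection / denom)
```

A score of 1 means the masks coincide and 0 means they do not overlap at all.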

2505.23341 2026-06-10 cs.CV updated version

Dual-stream attention-guided learning for weakly supervised whole slide image classification

Daoxi Cao, Hangbei Cheng, Yijin Li, Ruolin Zhou, Xuehan Zhang, Xinyi Li, Binwei Li, Xuancheng Gu, Jianan Zhang, Xueyu Liu, Yongfei Wu

Affiliations: College of Computer Science and Technology, College of Data Science, Taiyuan University of Technology; College of Humanities, Law and Foreign Languages, Taiyuan University of Technology; College of Artificial Intelligence, Taiyuan University of Technology; School of Cyberspace Security, Beijing University of Posts and Telecommunications; School of Mathematics, Taiyuan University of Technology

AI summary: Proposes a dual-stream attention-guided learning framework that uses a teacher-student dual-stream architecture and attention-guided pseudo labels to identify critical regions and model instance relationships in whole slide images under weak supervision; it outperforms existing methods on synthetic and real pathology datasets.

Abstract

Whole slide images (WSIs) play a crucial role in cancer diagnosis due to their ultra-high resolution and rich morphological information, and multiple instance learning (MIL) has become a prevalent paradigm for coping with the massive size of WSIs and the scarcity of fine-grained instance-level annotations. However, most existing MIL methods struggle to accurately identify diagnostically critical local regions (instances) using only slide-level labels, and they struggle to model the relationships between instances efficiently. To address these shortcomings, we propose a Dual-Stream Attention-Guided Learning (DSAGL) framework. DSAGL bridges slide-level supervision and instance-level learning through a teacher-student dual-stream architecture, and mitigates instance ambiguity by generating attention-guided pseudo labels. The framework employs a shared lightweight encoder to efficiently model long-range dependencies and an attention-based fusion mechanism to enhance sensitivity to sparse, informative regions. Extensive experiments on synthetic benchmarks and real-world pathological WSI datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL methods, achieving superior discriminative performance and robustness under weak supervision.
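
To make the idea of attention-guided pseudo labels concrete, the sketch below turns per-instance attention scores into instance labels for one slide. It is only an illustration of the general MIL idea: the top-k selection rule, the function name, and the parameters are assumptions, and the abstract does not state how DSAGL actually derives its pseudo labels.

```python
import numpy as np

def attention_pseudo_labels(scores: np.ndarray, slide_label: int, top_k: int = 2) -> np.ndarray:
    """Return per-instance pseudo labels (1 = likely positive, 0 = negative).

    scores: raw attention logits, one per instance (patch) in the slide.
    A negative slide has only negative instances under the MIL assumption.
    For a positive slide, the top_k most-attended instances are marked positive
    (a hypothetical rule chosen for this sketch).
    """
    labels = np.zeros(scores.shape[0], dtype=int)
    if slide_label == 0:
        return labels
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    top = np.argsort(weights)[::-1][:top_k]  # indices of the most-attended instances
    labels[top] = 1
    return labels
```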

2509.05913 2026-06-10 cs.CV updated version

A fine-grained attention and geometric correspondence model for musculoskeletal risk classification in athletes using multimodal visual and skeletal features

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Tamanna Shermin, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam

Affiliations: Department of Computer Science and Engineering, United International University; Department of Data Science and Artificial Intelligence, Monash University; Faculty of Science and Technology, Charles Darwin University; Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory, Dhaka

AI summary: Proposes ViSK-GAT, a multimodal framework that fuses image and skeletal-coordinate features and uses a fine-grained attention module and a geometric correspondence module to classify athletes' musculoskeletal risk into eight classes, with key metrics above 93%.

Comments Published in Computers and Electrical Engineering

DOI: 10.1016/j.compeleceng.2026.111281
Journal ref: Computers and Electrical Engineering, Vol. 138, 111281, 2026

Abstract

Musculoskeletal disorders pose significant risks to athletes, and early risk assessment is essential for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research introduces ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework that classifies musculoskeletal risk using both visual and skeletal coordinate-based features. A custom multimodal dataset (MusDis-Sports) was created by combining images and skeletal coordinates, with each sample labeled into eight risk categories based on the Rapid Entire Body Assessment (REBA) system. ViSK-GAT integrates two innovative modules: the Fine-Grained Attention Module (FGAM), which refines intra-modal features through self-attention before fusion, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal alignment between image features and coordinates. The model achieved robust performance, with all key metrics exceeding 93%. Probability distribution error metrics also showed a low Root Mean Squared Error (RMSE) of 0.1205 and a Mean Absolute Error (MAE) of 0.0156. ViSK-GAT consistently outperformed state-of-the-art (SOTA) deep learning backbones and showed its potential to advance artificial intelligence-driven musculoskeletal risk assessment and enable timely interventions in sports.

2602.01951 2026-06-10 cs.CV updated version

Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network

Shuyang Wu, Yifu Qiu, Ines P Nearchou, Sandrine Prost, Jonathan A Fallowfield, Hakan Bilen, Timothy J Kendall

Affiliations: Institute for Regeneration and Repair, University of Edinburgh; School of Informatics, University of Edinburgh; Indica Labs, 8700 Education Pl NW, Bldg. B Albuquerque, US; Medical School, University of St Andrews

AI summary: Proposes the Multi-scale Pyramidal Network (MSPN), a plug-and-play module that enables progressive multi-scale whole-slide image analysis from a single high-magnification input; grid-based remapping and a Coarse Guidance Network learn coarse context, consistently improving MIL performance across tasks and frameworks.

Abstract

Multiple-instance Learning (MIL) is commonly used for computational pathology (CPath), where multi-scale features are essential for capturing both fine cellular details and broad tissue architecture. However, existing multi-scale MIL approaches typically rely on inflexible multi-magnification inputs or computationally expensive architectures. As pre-trained foundation models (FMs) become the trend for feature extraction and boost lightweight models, we rethink and explore a more efficient multi-scale MIL method. In this paper, we propose the Multi-scale Pyramidal Network (MSPN), a plug-and-play module for attention-based MIL. MSPN introduces progressive multi-scale whole-slide image analysis using only a single high-magnification input. It consists of (1) grid-based remapping that aggregates high-magnification features to derive spatially-aware coarse feature maps, and (2) the Coarse Guidance Network (CGN) that learns coarse contexts. We benchmark MSPN as an add-on module to 4 attention-based frameworks on 5 clinically relevant tasks with 2 foundation models and a pre-trained MIL framework. Our results demonstrate that MSPN consistently improves MIL across the compared configurations and tasks, while being lightweight and easy to use.
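
The grid-based remapping step can be pictured as pooling patch features by their position on the slide grid. The following is a minimal sketch under stated assumptions: mean pooling, integer (row, col) patch coordinates, and the name `grid_remap` are all illustrative choices, since the abstract does not specify the aggregation MSPN uses.

```python
import numpy as np

def grid_remap(features: np.ndarray, coords: np.ndarray, factor: int = 2):
    """Average patch features that fall into the same coarse grid cell.

    features: (N, D) patch embeddings extracted at high magnification.
    coords:   (N, 2) integer (row, col) grid positions of the patches.
    factor:   number of fine cells per coarse cell along each axis.
    Returns (coarse_coords, coarse_features), one row per occupied coarse cell.
    """
    coarse = coords // factor
    cells, inverse = np.unique(coarse, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep a flat index across NumPy versions
    sums = np.zeros((cells.shape[0], features.shape[1]))
    np.add.at(sums, inverse, features)  # accumulate features per coarse cell
    counts = np.bincount(inverse, minlength=cells.shape[0])
    return cells, sums / counts[:, None]
```

Only occupied cells are returned, so background regions without patches add no rows to the coarse feature map.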

2604.28095 2026-06-10 cs.CV updated version

UHR-Net: An Uncertainty-Aware Hypergraph Refinement Network for Medical Image Segmentation

Shuokun Cheng, Jinghao Shi, Kun Sun

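Affiliations: School of Computer Sciences, China University of Geosciences (Wuhan)

AI summary: To address blurred lesion boundaries and hard-to-segment small lesions, proposes UHR-Net, which combines uncertainty-oriented instance contrastive pretraining with an uncertainty-guided hypergraph refinement block and achieves consistent gains on five public datasets.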

Comments 12 pages, 4 figures, 4 tables

Abstract

Accurate lesion segmentation is crucial for clinical diagnosis and treatment planning. However, lesions often resemble surrounding tissues and exhibit ill-defined boundaries, leading to unstable predictions in boundary/transition regions. Moreover, small-lesion cues can be diluted by multi-scale feature extraction, causing under- or over-segmentation. To address these challenges, we propose an Uncertainty-Aware Hypergraph Refinement Network (UHR-Net). First, we introduce an Uncertainty-Oriented Instance Contrastive (UO-IC) pretraining strategy that couples geometry-aware copy-paste augmentation with hard-negative mining of lesion-like background regions to improve instance-level discrimination for small and visually ambiguous lesions. Second, we design an Uncertainty-Guided Hypergraph Refinement (UGHR) block, which derives an entropy-based uncertainty map from a coarse probability map to guide hypergraph refinement. By splitting hyperedge prototypes into foreground and background groups, UGHR decouples higher-order interactions and improves refinement in ambiguous regions. Experiments on five public benchmarks demonstrate consistent gains over strong baselines. Code is available at: https://github.com/CUGfreshman/UHR-Net.
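
The UGHR block is described as deriving an entropy-based uncertainty map from a coarse probability map. A minimal sketch of that one step for a binary foreground probability is shown below; the normalization to [0, 1] and the function name are assumptions, and the hypergraph refinement itself is not reproduced here.

```python
import numpy as np

def entropy_uncertainty(prob: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Per-pixel binary entropy of a foreground probability map, scaled to [0, 1].

    Values near 1 mark ambiguous pixels (p close to 0.5), which typically sit on
    lesion boundaries; values near 0 mark confident pixels.
    """
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    h = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return h / np.log(2.0)  # divide by the maximum entropy of a binary variable
```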

2606.09681 2026-06-10 cs.CV updated version

GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development

Tianyu Lin, Jooyoung Ryu, Puvada Sreevarsha, Rahul Srinivasaragavan, Riya Satavlekar, Susan Kim, Nidhi Soley, Yujie Yan, Ishan Vatsaraj, Carl Harris, Aimon Rahman, Vishal Patel, Joseph Greenstein, Casey Taylor, Kemar E. Green

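Affiliations: Whiting School of Engineering, Johns Hopkins University; Department of Neurology, Johns Hopkins Medicine

AI summary: Proposes the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis; a deep learning classifier trained on the synthetic data distinguishes normal from abnormal saccadic accuracy on real clinical data with an AUROC of 0.76.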

Abstract

Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.

2412.16758 2026-06-10 physics.med-ph cs.CV updated version

Training Set Augmentation and Biology-Aware Harmonization Improve Radiomic Models for Lung Cancer Prediction in Indeterminate Nodules

Claire Huchthausen, Menglin Shi, Gabriel L. A. de Sousa, James Larner, Einsley Janowski, Jonathan Colen, Krishni Wijesooriya

Affiliations: Department of Radiation Oncology, University of Virginia School of Medicine; Department of Physics, University of Virginia; Department of Physics, Massachusetts Institute of Technology; Department of Biomedical Engineering, Northwestern University; Department of Radiation Oncology, University of Virginia; Old Dominion University

AI summary: To address low malignancy rates in early-development pulmonary nodules and variable image acquisition, the training set is augmented with later-development nodules and acquisition effects are corrected with biology-aware harmonization, significantly improving the radiomic models' predictive performance (ROC-AUC 0.74).

Comments 22 pages, 5 figures, plus supplemental material; updated with the accepted version of the manuscript

Abstract

CT radiomics-based machine learning has the potential to predict lung cancer in pulmonary nodules (PNs) earlier than standard-of-care methods. Low malignancy rates in early-development PNs and variable image acquisition hinder development of radiomic models for diagnosing these PNs. To address these challenges, we augmented training using later-development PNs and harmonized for acquisition effects. We examined early-development benign and malignant PNs (n=106) below the sensitivity of standard-of-care diagnosis. Classifiers predicting malignancy performed near chance when trained on ComBat-harmonized radiomic features from only early-development PNs. We then augmented training with later-development benign and malignant PNs (n=225). We evaluated whether harmonization must incorporate biology that impacts acquisition effects in added training data. To correct variability from four acquisition protocols, we compared: 1) biology-unaware harmonization, 2) harmonizing with a covariate distinguishing early-development, later-development benign, and later-development malignant datasets, 3) harmonizing each dataset separately. Models trained using augmentation, but biology-unaware harmonization, failed to improve consistently. Augmented training data harmonized with a covariate (ROC-AUC 0.74 [0.69-0.79]) or separately (ROC-AUC 0.71 [0.66-0.77]) yielded higher test ROC-AUC (DeLong, p<=0.05) and PR-AUC (Wilcoxon, p<=0.05). In a proof-of-principle methodological study, we demonstrate with a small single-center dataset that combining radiomic features from later-development benign and malignant PNs requires biology-aware harmonization.

2507.22017 2026-06-10 eess.IV cs.CV updated version

Cyst-X: A Multi-Center MRI Benchmark and Federated Learning Framework for Malignancy-Risk Stratification of Pancreatic Cystic Neoplasm

Hongyi Pan, Gorkem Durak, Elif Keles, Ziliang Hong, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchan Huang, Candice W. Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci

Affiliations: Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University; Istanbul Faculty of Medicine, Istanbul University; Department of Biomedical Engineering and Radiology, University of Wisconsin-Madison; Department of Preventive Medicine, Northwestern University; Division of Gastroenterology and Hepatology, New York University; Department of Electrical, Electronic and Computer Engineering, University of Catania; NVIDIA; Department of Radiology, Columbia University; Department of Radiology and Nuclear Medicine, Erasmus Medical Center; Department of Gastroenterology and Hepatology, Erasmus Medical Center; Department of Radiology, New York University; Division of Gastroenterology and Hepatology, Mayo Clinic Florida; Department of Gastroenterology and Hepatology, Northwestern University

AI summary: Proposes Cyst-X, a multi-center MRI benchmark and federated learning framework for IPMN malignancy-risk stratification that combines the PanSegNet segmenter with a 3D DenseNet-121 classifier; it reaches an AUC of 0.85 in internal cross-validation, with performance comparable to radiologists.

Abstract

Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we introduce Cyst-X, a multi-center MRI benchmark and a federated learning framework for IPMN malignancy-risk stratification. The dataset comprises 1,461 abdominal MRI scans from 764 patients at seven international centers, with three-tier malignancy labels anchored in histopathology or three-year imaging follow-up and expert pancreas segmentations. The pipeline couples the PanSegNet pancreas segmenter with a 3D DenseNet-121 classifier and a parallel radiomics predictor. On internal cross-validation, the deep learning classifier reached a mean area under the receiver operating characteristic curve (AUC) of 0.85 (95% confidence interval 0.84-0.86) on T2-weighted MRI for high-risk versus low- or no-risk discrimination, with the average precision rising from a prevalence baseline of 0.23 to 0.64. This performance was preserved (AUC 0.85, FedProx) when training was distributed across institutions without exchange of raw patient images. Benchmarked against three blinded radiologists on a 629-case reader subset evaluated under imaging-only conditions, the classifier matched or exceeded sensitivity at comparable specificity. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset, segmentation masks, and trained models as the first large-scale, multi-centre MRI resource for pancreatic cystic neoplasm analysis.
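
The federated result is reported with FedProx, which adds a proximal term to each client's local objective so that local weights stay close to the current global model. The toy sketch below shows that mechanism on a least-squares problem. The linear model, the hyperparameter values, and the function names are assumptions for illustration and have nothing to do with the Cyst-X training code.

```python
import numpy as np

def local_update(w_global, X, y, mu=0.1, lr=0.1, steps=50):
    """Client-side gradient descent on a least-squares loss plus the FedProx term."""
    w = w_global.copy()
    n = X.shape[0]
    for _ in range(steps):
        grad_loss = X.T @ (X @ w - y) / n  # gradient of 0.5 * mean squared error
        grad_prox = mu * (w - w_global)    # gradient of (mu / 2) * ||w - w_global||^2
        w -= lr * (grad_loss + grad_prox)
    return w

def aggregate(client_weights, client_sizes):
    """Server-side average of client weights, weighted by local sample counts."""
    sizes = np.asarray(client_sizes, dtype=float)
    return np.average(np.stack(client_weights), axis=0, weights=sizes)
```

Only weight vectors cross the network in this scheme; the arrays `X` and `y` never leave the client, which mirrors the abstract's point that raw patient images are not exchanged.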

2606.10640 2026-06-10 cs.CV new submission

ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

Hao Liu, Ruping Cao, Kun Wang, Zhiran Li, Fan Liu, Yupeng Hu, Liqiang Nie

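Affiliations: Shandong University; Southeast University; Harbin Institute of Technology (Shenzhen)

AI summary: Proposes ChartLens, a dual-branch framework that improves chart data recovery and summary factuality through structure-aware CSV verification and correction and text-retention-guided summary refinement; it ranked first in Track 2 of the DataMFM Challenge.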

Abstract

In this report, we present our champion solution for the DataMFM Challenge Track 2: Chart Understanding. This track requires models to recover structured chart data and generate faithful natural-language summaries from chart images. To address the complementary requirements of accurate data extraction and factual narration, we propose ChartLens, a dual-branch framework for chart data correction and summary refinement. ChartLens consists of two key modules: Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR). SAVC improves the reliability of structured data extraction through verification and correction, while TRSR enhances summary generation by preserving critical textual and numerical evidence from charts. By combining model adaptation, correction-based generation, and OCR-assisted evidence grounding, ChartLens improves both structured data recovery and summary factuality. On the test set, our final system achieves an overall score of 69.10 and ranks first in Track 2, demonstrating its effectiveness for accurate chart understanding. Our code will be released at: https://github.com/iLearn-Lab/CVPRW26-ChartLens.
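
One way to picture the verification half of a module like SAVC is a structural check on the extracted CSV before any correction is attempted. The sketch below is hypothetical: the two rules (consistent row width, numeric data cells) and the function name are assumptions, and the abstract does not describe the checks ChartLens actually performs.

```python
import csv
import io

def verify_csv(text: str):
    """Return a list of human-readable problems found in a chart-data CSV string."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows:
        return ["empty table"]
    problems = []
    width = len(rows[0])  # the header row defines the expected number of cells
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {i}: expected {width} cells, found {len(row)}")
            continue
        for cell in row[1:]:  # assume the first column holds category labels
            try:
                float(cell.replace("%", "").replace(",", ""))
            except ValueError:
                problems.append(f"line {i}: non-numeric value {cell!r}")
    return problems
```

A correction step could then regenerate or repair only the flagged lines rather than the whole table.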

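2606.10701 2026-06-10 cs.CV new submission

ChartREG++: A Benchmark and Improvements for Chart Referring Expression Grounding with Diverse Referring Cues and Multi-Target References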

Tianhao Niu, Ziyu Han, Xuan Dong, Qingfu Zhu, Wanxiang Che

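Affiliations: Research Center for Social Computing and Interactive Robotics

AI summary: To address the limitations of existing chart referring expression grounding benchmarks, proposes a benchmark that supports multiple localization forms, multi-target references, diverse cues, and diverse chart types; a code-driven synthesis pipeline generates pixel-level instance masks used to train an instance segmentation model that is integrated into a multimodal grounding framework, improving performance over baselines.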

Abstract

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision-language models, yet most prior work focuses on natural images. In contrast, existing benchmarks related to chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank cues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

2606.10275 2026-06-10 cs.CV new submission

FoA-SR: Faithful or Aesthetic? Profile-Aware Preference Optimization for Real-World Image Super-Resolution

Amjad Mahdi Alqarni, Peizhong Ju

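Affiliations: Department of Computer Science; University of Kentucky

AI summary: Proposes FoA-SR, a preference-optimization approach to real-world image super-resolution that fine-tunes separate adapters for Faithful and Aesthetic profiles; experiments on RealSR and DIV2K confirm that the two restoration strategies are separable.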

Comments 17 pages, 6 figures, 9 tables. Preprint

Abstract

Real-world image super-resolution (SR) is often designed with a single restoration objective, despite the current capacity of generative models to produce multiple high-quality reconstructions for the same input. In this paper, we argue that the best restoration strategy is subject to the specific restoration profile: a Faithful restoration prioritizes reference consistency, structure preservation, and hallucination suppression, whereas an Aesthetic restoration prioritizes visually pleasing and natural-looking details. We propose FoA-SR, a novel preference optimization approach to real-world SR based on profiles. To achieve this goal, FoA-SR starts with our supervised FLUX.2-based SR adapter (Flux2SR) trained with LR latent conditioning, flow matching, and image-space reconstruction losses for paired LR-to-HR image super-resolution. Following the development of the shared supervised super-resolution adapter, FoA-SR generates a shared stochastic candidate pool for each input image and ranks the same candidates using profile-specific Faithful and Aesthetic rewards to mine winner-loser pairs. These pairs are used to fine-tune separate LoRA adapters while keeping the base model frozen. Experiments on RealSR and DIV2K show that FoA-SR can steer the same SR adapter towards distinct restoration objectives: a Faithful adapter improves reference-consistent metrics while an Aesthetic adapter boosts metrics that measure perceptual quality without reference. Our candidate-pool analysis shows that Faithful and Aesthetic rewards frequently select different winners, and a Hybrid-LoRA ablation shows that collapsing both profiles into one reward yields an implicit compromise rather than explicit profile control.
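
The pair-mining step, in which one shared candidate pool is ranked twice under different rewards, can be sketched in a few lines. The best-versus-worst selection rule and the toy reward fields (`psnr`, `aes`) are assumptions for illustration; the abstract does not specify how FoA-SR picks pairs or computes its rewards.

```python
def mine_pairs(candidates, faithful_reward, aesthetic_reward):
    """Rank one shared candidate pool under two rewards.

    Returns a (winner, loser) pair per profile, taking the highest- and
    lowest-scoring candidate under each reward (a hypothetical selection rule).
    """
    def best_and_worst(reward):
        ranked = sorted(candidates, key=reward, reverse=True)
        return ranked[0], ranked[-1]

    return {
        "faithful": best_and_worst(faithful_reward),
        "aesthetic": best_and_worst(aesthetic_reward),
    }

# Toy pool with precomputed scores standing in for real reward models.
pool = [
    {"id": "a", "psnr": 30.0, "aes": 0.2},
    {"id": "b", "psnr": 27.0, "aes": 0.9},
    {"id": "c", "psnr": 25.0, "aes": 0.5},
]
pairs = mine_pairs(pool, lambda c: c["psnr"], lambda c: c["aes"])
```

In this toy pool the two profiles pick different winners (`a` for faithful, `b` for aesthetic), which is the situation the candidate-pool analysis in the abstract describes.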

2606.10350 2026-06-10 cs.CV new submission

Multi-Angular Reflectance Anisotropy Observed from UAV Multispectral Imagery

Zhenqiang Qin, Chenguang Dai, Min Wang, Xian Li

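Affiliations: University of Information Engineering

AI summary: Proposes a geometry-aware multi-angular observation extraction workflow that quantifies observation-geometry effects from a BRDF perspective; camera parameters are refined via SFM and homogeneous regions are reprojected to jointly extract multi-band reflectance and observation geometry, revealing reflectance variability of 119-137% in the red-edge and near-infrared bands.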

Abstract

UAV multispectral imagery naturally contains multi-angular observations due to low flight altitude and wide field-of-view imaging, which may introduce geometry-driven radiometric variability. This study proposes a geometry-aware multi-angular observation extraction workflow to quantify observation-geometry effects from a BRDF perspective. Specifically, camera intrinsics and extrinsics are refined via structure-from-motion (SFM), and homogeneous regions annotated on an orthomosaic are reprojected onto multiple raw sub-images acquired from different viewpoints. This enables joint extraction of multi-band reflectance and observation geometry parameters for the same ground targets under varying viewing directions. The extracted observations are further analyzed using band-wise polar visualization in the (VZA, RAA) domain. Results on a grassland target show clear reflectance anisotropy across ten bands, with red-edge and near-infrared bands exhibiting 119-137% variability between maximum and minimum reflectance, indicating non-negligible observation-geometry effects on radiometric consistency.
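
The observation geometry in the (VZA, RAA) domain follows from the camera pose recovered by SFM and the target position. Below is a minimal sketch of that computation, assuming local east-north-up coordinates in metres, azimuths measured clockwise from north, a view azimuth defined from target to sensor, and a known solar azimuth. These conventions and the function name are assumptions; the paper may define its angles differently.

```python
import math

def view_angles(camera_xyz, target_xyz, sun_azimuth_deg):
    """Return (VZA, RAA) in degrees for a camera observing a ground target.

    VZA is the angle between the target-to-camera direction and the vertical.
    RAA is the absolute difference between view and solar azimuth, folded into [0, 180].
    """
    dx = camera_xyz[0] - target_xyz[0]  # east offset, target to camera
    dy = camera_xyz[1] - target_xyz[1]  # north offset, target to camera
    dz = camera_xyz[2] - target_xyz[2]  # camera height above the target
    vza = math.degrees(math.atan2(math.hypot(dx, dy), dz))
    view_azimuth = math.degrees(math.atan2(dx, dy)) % 360.0
    raa = abs(view_azimuth - sun_azimuth_deg) % 360.0
    if raa > 180.0:
        raa = 360.0 - raa
    return vza, raa
```

At exact nadir the view azimuth is undefined, so RAA carries no meaning there even though the function still returns a number.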

2606.10373 2026-06-10 cs.CV new submission

PF-Trans: Physics-Embedded Frequency-Aware Transformer for Spectral Reconstruction

Yuzhe Gui, Tianzhu Liu, Yanfeng Gu, Xian Li

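Affiliations: National Natural Science Foundation of China

AI summary: To address spectral aliasing in snapshot broadband filter array imaging, proposes the Physics-embedded Frequency-aware Transformer (PF-Trans), which ensures physical fidelity through mask injection and a gray-scale consistency loss and adds a Dual-domain Block with a parallel FFT branch to suppress frequency-domain artifacts; PSNR reaches 48.50 dB on the GF-5 Shanghai dataset.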

Abstract

Snapshot Broadband Filter Array (BFA) imaging provides high light throughput for spectral reconstruction but introduces severe spectral aliasing due to complex modulation. Current deep learning approaches, limited to spatial denoising, often fail to address the global frequency-specific degradations caused by the mask structure. To address this, we propose a Physics-embedded Frequency-aware Transformer (PF-Trans) for high-fidelity remote sensing spectral reconstruction. Our method explicitly integrates the physical sensing model through mask injection and a gray-scale consistency loss to ensure physical fidelity. Furthermore, we introduce a Dual-domain Block with a parallel Fast Fourier Transform (FFT) branch, enabling the network to perceive and suppress aliasing artifacts in the frequency domain. Extensive experiments on multiple datasets demonstrate that PF-Trans achieves state-of-the-art performance, reaching a Peak Signal-to-Noise Ratio (PSNR) of up to 48.50 dB on the GF-5 Shanghai dataset and significantly outperforming comparison methods.
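
A block with a parallel FFT branch can be sketched as a local spatial filter plus a per-frequency reweighting of the same feature map. The version below is a single-channel NumPy illustration under stated assumptions: the 3x3 spatial filter, the multiplicative frequency weights, and the additive fusion are generic choices, not the PF-Trans Dual-domain Block as published.

```python
import numpy as np

def dual_domain_block(x, spatial_kernel, freq_weight):
    """Sum of a spatial branch and a parallel frequency branch for one (H, W) map.

    spatial_kernel: (3, 3) weights of a zero-padded local filter (cross-correlation).
    freq_weight:    array matching np.fft.rfft2(x).shape; scales each frequency bin,
                    so every output pixel depends on the whole input.
    """
    h, w = x.shape
    padded = np.pad(x, 1)
    spatial = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            spatial += spatial_kernel[i, j] * padded[i:i + h, j:j + w]
    spectrum = np.fft.rfft2(x)
    frequency = np.fft.irfft2(spectrum * freq_weight, s=x.shape)
    return spatial + frequency
```

Because each frequency weight touches every pixel, the FFT branch gives a global receptive field in one step, which is why such a branch suits periodic aliasing patterns that local filters see only piecewise.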

2606.10628 2026-06-10 cs.CV new submission

Leveraging Metric Depth for Relative Depth Prediction

Xiaoyang Bi, Shuaikun Liu, Zhaohong Liu, Yuxin Yang, Zhe Zhao, Mengshi Qi, Liang Liu, Huadong Ma

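Affiliations: Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia; Beijing University of Posts and Telecommunications

AI summary: To address the scarcity of training samples for relative depth prediction in soccer scenes, proposes using the zero-shot ability of pretrained models to learn metric depth, achieving a score of 2.68×10^{-3} on the challenge set.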

2606.11032 2026-06-10 cs.CV New submission

U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training

Zhiwen Yang, Jiayin Li, Hao Lu, Hui Zhang, Zihua Wang, Bingzheng Wei, Yan Xu

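Affiliations: School of Biological Science and Medical Engineering, Beihang University; Department of Biomedical Engineering, Tsinghua University; School of Aerospace Engineering, Tsinghua University; ByteDance Inc.

AI summary: To address the performance degradation of PET image denoising models under distribution shift, the paper proposes U-TTT, which integrates Test-Time Training (TTT) layers that adjust parameters dynamically through self-supervision, and designs a dual-domain adaptation mechanism (spatial and frequency TTT layers), achieving state-of-the-art denoising and generalization on unseen dose levels and scanners.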

Abstract

Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of generalization stems from the conventional paradigm of fixed-parameter models that cannot adapt to variations in test data (e.g., dose levels or scanner types) after training. To overcome this limitation and achieve robust generalization, we introduce U-TTT, a novel U-shaped model that integrates Test-Time Training (TTT) layers to dynamically adjust model parameters during inference through self-supervision, thereby adapting to the specific characteristics of each test instance. Furthermore, to comprehensively capture the complex degradations of 3D PET data, U-TTT features a dual-domain adaptation mechanism comprising a Spatial Test-Time Training (S-TTT) layer and a Frequency Test-Time Training (F-TTT) layer. The S-TTT layer captures and corrects spatial structural degradations, while the F-TTT layer suppresses global noise spectra and restores delicate high-frequency details. Extensive experiments demonstrate that U-TTT achieves state-of-the-art PET denoising performance and exhibits superior generalization under challenging distribution shifts, including both unseen dose levels and unseen scanners. Our code will be available at https://github.com/Yaziwel/U-TTT.
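
The general mechanism of test-time training, adapting parameters on each test instance with a self-supervised loss, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the U-TTT layers: the one-parameter neighbour-averaging denoiser, the blind-spot objective, and the names `blind_spot_loss`, `adapt_at_test_time`, and `noisy_scan` are hypothetical.

```python
def blind_spot_loss(w, x):
    """Self-supervised loss: predict each interior sample from its two neighbours."""
    n = len(x) - 2
    return sum((w * 0.5 * (x[i - 1] + x[i + 1]) - x[i]) ** 2
               for i in range(1, len(x) - 1)) / n


def blind_spot_grad(w, x):
    """Derivative of blind_spot_loss with respect to w."""
    n = len(x) - 2
    return sum(2 * (w * 0.5 * (x[i - 1] + x[i + 1]) - x[i]) * 0.5 * (x[i - 1] + x[i + 1])
               for i in range(1, len(x) - 1)) / n


def adapt_at_test_time(w0, x, lr=0.05, steps=20):
    """Run a few gradient steps on the test instance itself; no labels are used."""
    w = w0
    for _ in range(steps):
        w -= lr * blind_spot_grad(w, x)
    return w


def denoise(w, x):
    """Apply the adapted one-parameter smoother; endpoints are copied unchanged."""
    return [x[0]] + [w * 0.5 * (x[i - 1] + x[i + 1]) for i in range(1, len(x) - 1)] + [x[-1]]


w0 = 0.5  # stand-in for a weight "pretrained" on a different dose level
noisy_scan = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.1]  # made-up test instance
w_adapted = adapt_at_test_time(w0, noisy_scan)
output = denoise(w_adapted, noisy_scan)
```

The fixed-parameter baseline would apply `w0` to every input. The adapted weight is recomputed per instance, which is the property the abstract credits for robustness to unseen dose levels and scanners.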

2606.11131 2026-06-10 cs.CV New submission

UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors

Zhiwen Yang, Yang Zhou, Haowei Chen, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu

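AI summary: To address the performance drop of existing PET denoising methods when the dose reduction factor (DRF) varies, the paper proposes UniPET, which uses a style alignment network and a region-aware learning strategy to deliver high-quality denoising across DRFs with state-of-the-art performance.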

Abstract

Most existing deep learning-based PET image denoising methods assume a fixed and known dose reduction factor (DRF) for low-dose PET images. However, these methods encounter significant performance degradation when the DRF varies beyond the assumed one in practical applications. To address the challenge posed by varied DRFs, several preliminary studies focus on the task of universal PET image denoising, aiming to train a universal model over low-dose data across DRFs. Nonetheless, these vanilla universal models often struggle with misaligned styles present in different DRF data, leading to the "style elimination issue" with a significant over-smoothing effect. To deal with this issue, we innovatively introduce domain generalization to PET image denoising and propose a universal PET image denoising network (UniPET) to achieve high-quality PET image denoising across diverse DRFs. UniPET comprises two primary innovations: a style alignment network (SAN) and a region-aware learning strategy (RALS). Specifically, SAN utilizes style alignment techniques derived from domain generalization to align and recover styles across different DRFs, ensuring the model's generalizability across various DRFs while effectively preserving styles. Furthermore, to enhance style recovery, RALS distinguishes between flat and stylized regions, exclusively conducting adversarial learning on the latter, thereby more effectively guiding the model's focus towards learning stylized regions. It is demonstrated that our proposed UniPET can adaptively recover different DRF styles and achieve high-quality PET image denoising across DRFs. Comprehensive experiments show that UniPET exhibits comparable performance to individual DRF-specific models at specific DRFs and realizes state-of-the-art performance in universal PET image denoising quantitatively, perceptually, and clinically.

2606.11186 2026-06-10 cs.CV New submission

AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu, Wenqi Shao, Ying Fu

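Affiliations: University of Science and Technology of China

AI summary: Proposes AMNet, a unified multimodal framework that uses a Spatial-Spectral Dual-Gated Translator to learn the correspondence between auxiliary modalities and RGB inputs, supports arbitrary modality combinations at inference, and addresses missing auxiliary modalities in low-light video enhancement.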

Comments Accepted at ICML 2026; Project page and code: https://lhfgghc.github.io/LLVE-AMNet

Abstract

Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, in this work, we propose AMNet, a unified multimodal framework for LLVE, to support flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support the robust enhancement. Additionally, to fully facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet could handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality absence conditions. Code and models are available on the project page.

2606.10280 2026-06-10 eess.IV cs.CV Cross-list

Overlapped Wavelet Diffusion for Low-Light Image Enhancement

Fen Peng, Taizo Suzuki, Seisuke Kyochi

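AI summary: Proposes OWDiff, an overlapped wavelet diffusion framework that eliminates blocking artifacts with an overlapped wavelet transform and recovers detail with a low-frequency-guided high-frequency enhancement module, outperforming existing methods on the LOLv1 and LOLv2-real datasets.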

Comments Advance published in IEICE Transactions on Information and Systems. DOI: 10.1587/transinf.2026PCP0006. Code: https://github.com/FinnPeg/Overlapped-Wavelet-Diffusion

DOI: 10.1587/transinf.2026PCP0006
Journal ref: IEICE Transactions on Information and Systems, Advance online publication, 2026

Abstract

In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.
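
Why a Haar transform tends to produce blocking artifacts can be seen from the support of its basis functions. The sketch below is a one-level, 1D orthonormal Haar transform written for illustration (it is not the OWDiff code, and the helper names are assumptions): each coefficient pair covers exactly one non-overlapping two-sample block, so any error a model introduces in a coefficient stays confined to that block.

```python
import math


def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length sequence."""
    s = 1 / math.sqrt(2)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def haar_inverse(low, high):
    """Exact inverse of haar_forward()."""
    s = 1 / math.sqrt(2)
    out = []
    for l, h in zip(low, high):
        out += [s * (l + h), s * (l - h)]
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
low, high = haar_forward(signal)

# Unmodified coefficients reconstruct the signal.
roundtrip_error = max(abs(a - b) for a, b in zip(haar_inverse(low, high), signal))

# Simulate an error on a single low-frequency coefficient (block 1).
perturbed_low = list(low)
perturbed_low[1] += 1.0
edited = haar_inverse(perturbed_low, high)
changed = [i for i in range(len(signal)) if abs(edited[i] - signal[i]) > 1e-9]
```

Here `changed` contains only samples 2 and 3, with a hard edge against the untouched neighbours. A transform whose basis functions overlap adjacent blocks, as the abstract describes for the OWT, spreads such a change across the boundary instead.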

2503.13358 2026-06-10 cs.CV Updated version

One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin

Affiliations: Kandinsky Lab; Mohamed bin Zayed University of Artificial Intelligence; Luzin Research Center; Moscow Independent Research Institute of Artificial Intelligence; Applied AI Institute

AI summary: Proposes the RSD distillation method, which trains a student network so that a fake ResShift model trained on its generated images coincides with the teacher, enabling single-step super-resolution that surpasses the teacher and SinSR on perceptual metrics at a lower parameter and compute cost.

Comments ICML-2026

Abstract

Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a noticeable margin in various perceptual metrics (LPIPS, CLIPIQA, MUSIQ). We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion SR distillation methods with limited computational costs in terms of perceptual quality. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality and requires fewer parameters, GPU memory, and training cost. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K. We provide the code at https://github.com/Daniil-Selikhanovych/RSD.

2501.01481 2026-06-10 eess.IV cs.CV Updated version

Towards Calibrated, Fair, and Accurate Deepfake Detection

Ryan Brown, Chris Russell

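Affiliations: University of Oxford

AI summary: Proposes the Face-Fairness framework, which uses Face-Feature Tuning to achieve fairness in deepfake detection without demographic labels while maintaining or improving overall accuracy.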

Abstract

Deepfake detectors show large performance gaps across demographic groups. Existing fairness approaches require demographic labels, retraining, or sacrifice accuracy. We introduce Face-Fairness (FF), a plug-and-play framework for bias mitigation. Our primary contribution, Face-Feature Tuning (FFT), is the first demographic label-free fairness method demonstrated for deepfake detection: a lightweight calibrator that performs a logit remapping conditioned on frozen face embeddings. We complement FFT with two variants: FF-Max, which maximizes worst-group accuracy when demographics are available, and FF-Discover, which does the same with embedding-discovered groups. Across in-domain and cross-dataset test settings, FF consistently reduces FPR/TPR gaps and improves minimum group accuracy while maintaining (often improving) overall accuracy. The approach is detector-agnostic, adds negligible runtime overhead, and requires no access to identity attributes.
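
The quantities the abstract reports (FPR/TPR gaps and minimum group accuracy) can be computed as follows. This is a minimal sketch with made-up labels and predictions. The function name `group_rates` is hypothetical, and it assumes every group contains at least one real and one fake sample.

```python
def group_rates(labels, preds, groups):
    """Return {group: (fpr, tpr, accuracy)}; labels and preds are 0 (real) or 1 (fake)."""
    stats = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        pos = sum(1 for i in idx if labels[i] == 1)
        neg = len(idx) - pos
        tp = sum(1 for i in idx if labels[i] == 1 and preds[i] == 1)
        fp = sum(1 for i in idx if labels[i] == 0 and preds[i] == 1)
        acc = sum(1 for i in idx if labels[i] == preds[i]) / len(idx)
        stats[g] = (fp / neg, tp / pos, acc)
    return stats


# Made-up example: the detector is perfect on group A and weak on group B.
labels = [0, 0, 1, 1, 0, 0, 1, 1]
preds = [0, 0, 1, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

stats = group_rates(labels, preds, groups)
fpr_gap = max(s[0] for s in stats.values()) - min(s[0] for s in stats.values())
tpr_gap = max(s[1] for s in stats.values()) - min(s[1] for s in stats.values())
min_group_accuracy = min(s[2] for s in stats.values())
```

A fairness method of the kind described would aim to shrink `fpr_gap` and `tpr_gap` and raise `min_group_accuracy` without lowering overall accuracy.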

2606.09909 2026-06-10 cs.CR cs.AI cs.CV Cross-list

Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization

Ziang Xu, Wenbo Yu, Hongyao Yu, Hao Fang, Jiawei Kong, Bin Chen, Hao Wu, Shu-Tao Xia, Zhiyong Wu

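Affiliations: Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School

AI summary: Proposes the Two-Stage Latent Feature Optimization (TS-LFO) attack, which restores the mapping broken by defenses through a latent denoising stage and a latent reconstruction stage, effectively bypassing copyright protection in diffusion-model customization.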

Comments accepted by KDD 2026

Abstract

With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent defense strategy to prevent malicious content forgery in personalized image generation. However, current defenses typically introduce persistent perturbations in the latent space of Latent Diffusion Models (LDMs), which remain susceptible to adaptive bypasses by adversaries. In this paper, we introduce Two-Stage Latent Feature Optimization (TS-LFO), an efficient and effective copyright-stealing attack against protected diffusion-based customization. We begin by observing that existing defenses primarily disrupt the mapping between input images and their latent representations, thereby degrading the model's ability to produce personalized outputs. To counteract this, TS-LFO restores the broken mapping through a two-stage optimization process. In the Latent Denoising Stage, we enhance semantic consistency between latent codes and input images by jointly minimizing a Latent-Image Alignment Loss and a Latent Diffusion Loss with timestep-dependent weights, effectively suppressing the high-frequency noise introduced by defenses. In the Latent Reconstruction Stage, we recover low-frequency semantic information using pixel-level constraints to refine the latent features. Extensive experiments show that TS-LFO consistently bypasses state-of-the-art (SOTA) copyright defenses and outperforms SOTA copyright attacks such as DiffPure, GrIDPure and IMPRESS across diverse settings.

2606.10877 2026-06-10 cs.LG cs.CV Cross-list

XtrAIn: Training-Guided Occlusion for Feature Attribution

Thodoris Lymperopoulos, Ioannis Kakogeorgiou, Denia Kanellopoulou

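Affiliations: NCSR Demokritos

AI summary: Proposes XtrAIn, which moves the occlusion operation from input space to parameter space and follows the model's training trajectory to measure how feature-associated parameter updates affect the output logits, addressing the bias and instability of conventional occlusion-based attribution.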

Comments 12 pages, 7 figures, 1 table

Abstract

Occlusion-based attribution methods provide an intuitive way to estimate feature importance by perturbing input features and measuring the resulting change in model output. However, their reliability is strongly affected by how feature removal is implemented: externally selected baselines can introduce bias, out-of-distribution samples, and unstable explanations, while in nonlinear models the occlusion of a set of features can also alter the contribution of non-occluded features. We refer to this effect as attribution shift, as the attribution scores of the non-occluded features drift from their initial values. To challenge these major issues that render explanations unstable, we introduce XtrAIn, a training-guided attribution method that transfers the occlusion operation from the input space to the parameter space. Instead of replacing input values with hand-crafted baselines, XtrAIn follows the model's training trajectory and measures how feature-associated parameter updates affect the output logits. We further introduce Xstep, a lightweight approximation for reducing computational cost, and XtrAIn+, a target-focused variant that emphasizes updates aligned with the target class. Experiments on controlled image datasets and PAM50 breast-cancer subtype classification show that the proposed methods produce cleaner and more interpretable attribution patterns than standard attribution baselines. Overall, XtrAIn provides a training-aware perspective on feature attribution and offers a useful diagnostic tool for studying how feature-level evidence is formed during training.

2411.05698 2026-06-10 cs.CV cs.AI cs.LG Updated version

Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification

Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla

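Affiliations: Politecnico di Milano

AI summary: Proposes the Visual-TCAV framework, which combines Concept Activation Vectors with Integrated Gradients to generate class-agnostic saliency maps and estimate concept attributions, and aligns with ground-truth explanations more closely than TCAV in a controlled experiment.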

Comments Accepted in TMLR

Abstract

Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is challenging due to the size and complexity of these models. State-of-the-art saliency methods generate local explanations highlighting the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction. On the other hand, concept-based methods, such as TCAV, provide insights into how sensitive the network is to a human-defined concept but cannot compute its attribution in a specific prediction nor show its location within the input image. We introduce Visual-TCAV, a novel explainability framework aiming to bridge the gap between these methods by providing both local and global explanations. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate class-agnostic saliency maps that show where the network recognizes a certain concept. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. We evaluate the method's faithfulness via a controlled experiment where the ground truth for explanations is known, showing better ground truth alignment than TCAV. Our code is available at https://github.com/DataSciencePolimi/Visual-TCAV.

2601.19210 2026-06-10 cs.CV Updated version

Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP

Sen Nie, Jie Zhang, Zhuo Wang, Shiguang Shan, Xilin Chen

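Affiliations: University of Electronic Science and Technology of China

AI summary: Proposes Contrastive Spectral Rectification (CSR), which exploits the feature inconsistency of adversarial examples under frequency attenuation and optimizes a rectification perturbation with a spectral-guided contrastive objective, improving robustness to strong attacks by an average of 18.1% across 16 classification benchmarks with low inference overhead.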

Comments Accepted by ICML 2026

Abstract

Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model's inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong APGD with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.

2604.06893 2026-06-10 cs.CV cs.LG Updated version

Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah

Affiliations: DAVID Lab, UVSQ, Paris-Saclay University; LIPN, UMR CNRS 7030, Sorbonne Paris Nord University; LISN, Paris-Saclay University

AI summary: This paper proposes the Energy-Regularized Spatial Masking framework, which reformulates feature selection as a differentiable energy minimization problem to obtain more robust and interpretable vision models.

Comments 8 pages

Abstract

Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.

2606.02224 2026-06-10 cs.CV Updated version

Chroma Clues: Leveraging Color Statistics to Detect Synthetic Images

Lea Uhlenbrock, Davide Cozzolino, Christian Riess

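Affiliations: Deutsche Forschungsgemeinschaft

AI summary: Exploiting the weakness of generative models in reproducing color statistics, the paper uses hand-crafted and learned, task-optimized color transforms to build color-sensitive pixel-level or patch-level features, detecting synthetic images with high generalization accuracy and robustness.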

Abstract

The evolution and dissemination of AI-synthesized images is occurring at an unprecedented rate. Image generators are making rapid progress in their goal of perfectly imitating natural images, which also challenges image forensics. In this work, we exploit an underexplored cue in current generative models, namely their weakness to imitate color statistics of natural images. We first show that the LPIPS loss used for training image generators is less sensitive to chrominance than to luminance, which may lead to statistical discrepancies in the colors of synthetic images. Building on this observation, we then introduce six hand-crafted color transformations and a method to learn a task-optimized color transform to statistically expose generated images. These transformations can be used in various ways. First, we define color-sensitive features at pixel-level or patch-level. A simple, interpretable classifier achieves with these features an average generalization accuracy of 93.27% and strong robustness against six types of post-processing. Second, we demonstrate that the transformations exhibit characteristic visual noise patterns in natural and synthetic image areas, which enables an intuitive visual image evaluation. Third, we demonstrate that the transforms can enhance color patterns in generated images for improved multiclass attribution.
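
The luminance/chrominance split that the argument relies on can be made concrete with a standard color transform. The abstract does not list the six hand-crafted transforms, so the sketch below uses the familiar BT.601 full-range YCbCr conversion as a stand-in. The function names and the example patches are assumptions for illustration only.

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 full-range conversion for channel values in [0, 1]; Cb and Cr are centred at 0."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = (b - y) / 1.772
    cr = (r - y) / 1.402
    return y, cb, cr


def chroma_variance(pixels):
    """Per-channel variance of (Cb, Cr) over a list of RGB pixels: a simple patch-level feature."""
    chroma = [rgb_to_ycbcr(*p)[1:] for p in pixels]
    n = len(chroma)
    means = [sum(c[k] for c in chroma) / n for k in range(2)]
    return tuple(sum((c[k] - means[k]) ** 2 for c in chroma) / n for k in range(2))


gray = rgb_to_ycbcr(0.5, 0.5, 0.5)   # no chrominance expected
red = rgb_to_ycbcr(1.0, 0.0, 0.0)    # strong Cr expected

flat_patch = [(0.4, 0.4, 0.4)] * 4                      # made-up achromatic patch
colorful_patch = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                  (0.0, 0.0, 1.0), (0.5, 0.5, 0.5)]     # made-up saturated patch
flat_var = chroma_variance(flat_patch)
colorful_var = chroma_variance(colorful_patch)
```

Statistics such as `chroma_variance` computed per pixel neighbourhood or per patch are one plausible form of "color-sensitive feature". The paper's actual features and classifier may differ.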

2604.13776 2026-06-10 cs.CY cs.CL cs.CR cs.CV Updated version

Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking

Alexander Nemecek, Osama Zafar, Yuqiao Xu, Wenbiao Li, Erman Ayday

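Affiliations: Case Western Reserve University

AI summary: This paper shows that AI content watermarking exhibits systematic bias across languages, cultures, and demographic groups, proposes three evaluation dimensions (cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics), and argues that fairness audits must precede watermark deployment.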

Comments 7 pages. Accepted at the Multimodal Alignment for a Pluralistic Society (MAPS) Workshop, CVPR 2026

Abstract

Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We argue that watermarking is part of the pluralistic alignment pipeline and should be held to the same evaluation standards. We connect this to governance frameworks currently mandating watermarking deployment without requiring fairness evaluation. Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer.

2606.09882 2026-06-10 cs.CV cs.LG New submission

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

Chong Liu, Luxuan Fu, Xuyu Feng, Zhen Dong, Bisheng Yang

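Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS); Wuhan University

AI summary: Proposes WHU-Infra3D, a multimodal benchmark dataset covering 53.8 km across three cities that fuses panoramic imagery with LiDAR point clouds, provides 2D-3D instance association and cross-frame tracking, supports infrastructure status diagnosis and attribute recognition, and fills the gap in datasets for automated maintenance.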

Abstract

The paradigm of digital twin cities is shifting from coarse visual mapping toward more precise and actionable digitization of urban assets. However, existing datasets predominantly focus on coarse visual perception, lacking the strict multi-modal alignment and attribute and status diagnosis required for automated infrastructure maintenance. To bridge this gap, we introduce WHU-Infra3D, a large-scale, multi-modal benchmark dataset dedicated to roadside infrastructure inventory. Covering 53.8 km across three cities, WHU-Infra3D uniquely integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking. Comprising over 175k multi-view 2D bounding boxes alongside thousands of 3D infrastructure instances, the dataset provides over 181k detailed attribute and status annotations (e.g., rust, occlusion) to empower operational health assessment. We establish comprehensive baselines across five core tasks: 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition. Extensive evaluations expose significant cross-city domain gaps and inherent vulnerabilities of current models on long-tailed defective statuses, establishing WHU-Infra3D as an essential testbed for advancing scalable, AI-driven urban infrastructure inventory and lifecycle management. The WHU-Infra3D dataset is available at https://github.com/WHU-USI3DV/WHU-Infra3D.

2606.10066 2026-06-10 cs.CV cs.AI cs.LG 新提交

A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

公共医学视觉语言基准中预训练污染的受控审计

Bruce Changlong Xu, Lan Wu, Alexander Ryu

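AI summary: The audit finds image-side source overlap and a text-side canonical-order exchangeability signal in public medical VLM benchmarks, but the flagged image matches are not verified pixel-level duplicates, and cohort-relative membership-inference detectors prove unreliable on small medical-VLM cohorts.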

Comments 30 pages, 7 figures, 9 tables. Preprint

AI中文摘要

医学视觉语言模型（VLM）在公共基准上进行评估，这些基准的图像和问答对多年来一直可自由下载，但报告准确度假设这些示例在预训练中不存在。我们对SLAKE-En、PathVQA、VQA-RAD以及一个辅助的公共OmniMedVQA镜像上的开放VLM进行了审计，使用了四种检测器系列：图像侧近邻重叠（针对PMC-OA-beta）、规范顺序可交换性、队列相对Min-K%++尾部富集以及跨模型Top-K重叠。我们发现SLAKE-En上存在可测量的图像侧源重叠：SigLIP-B-16标记了19.8%的图像，SigLIP-SO400M标记了4.2%，而域外对照产生0/2000个标记。人工裁定显示，相同模态、相同投影的匹配对应不同患者，而非经过验证的像素级重复，因此我们将其解释为源或分布重叠，而非确认的每图像记忆。在文本侧，Qwen2.5-VL在SLAKE-En上显示出规范顺序可交换性信号，该信号在顺序消融和外部非医学基线中仍然存在。在OmniMedVQA镜像上，五个医学和通用VLM触发了可交换性，而BLIP-2保持干净。相比之下，队列相对Min-K%++尾部富集和跨模型Top-K重叠在外部预域基线中崩溃：BLIP-2重现了明显的正信号，尽管缺乏合理的医学VQA暴露。我们得出结论，这些队列相对检测器作为小规模医学VLM队列上的独立成员推理信号是不可靠的。

英文摘要

Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We audit open VLMs on SLAKE-En, PathVQA, VQA-RAD, and an auxiliary public OmniMedVQA mirror using four detector families: image-side near-neighbour overlap against PMC-OA-beta, canonical-order exchangeability, cohort-relative Min-K%++ tail enrichment, and cross-model top-K overlap. We find measurable image-side source overlap on SLAKE-En: 19.8% of images are flagged under SigLIP-B-16 and 4.2% under SigLIP-SO400M, while out-of-domain controls produce 0/2000 flags. Manual adjudication shows same-modality, same-projection matches to different patients rather than verified pixel-level duplicates, so we interpret this as source or distributional overlap rather than confirmed per-image memorization. On the text side, Qwen2.5-VL on SLAKE-En shows a canonical-order exchangeability signal that survives ordering ablation and external non-medical baselines. On the OmniMedVQA mirror, exchangeability fires for five medical and general VLMs while BLIP-2 remains clean. In contrast, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap collapse under an external pre-domain baseline: BLIP-2 reproduces the apparent positive signals despite lacking plausible medical-VQA exposure. We conclude that these cohort-relative detectors are unreliable as standalone membership-inference signals on small medical-VLM cohorts.
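
One of the detector families named above builds on Min-K%++ scores. The following is a minimal sketch of how such a per-example score is commonly computed; it is not the authors' audit code, and the cohort-relative tail-enrichment step is not reproduced. Each observed token's log-probability is standardized by the mean and standard deviation of log-probabilities under the model's next-token distribution, and the lowest `k` fraction of standardized values is averaged. The function name, the toy two-token vocabulary, and the default `k` are illustrative assumptions.

```python
import math


def min_k_pp_score(token_logprobs, vocab_logprobs, k=0.2):
    """Sketch of a Min-K%++-style score (illustrative, not the audited code).

    token_logprobs[t]: log-probability of the token actually observed at step t.
    vocab_logprobs[t]: log-probabilities of every vocabulary entry at step t.
    """
    standardized = []
    for lp, dist in zip(token_logprobs, vocab_logprobs):
        probs = [math.exp(v) for v in dist]
        # Mean and variance of log p under the model's own next-token distribution.
        mu = sum(p * v for p, v in zip(probs, dist))
        var = sum(p * (v - mu) ** 2 for p, v in zip(probs, dist))
        sigma = math.sqrt(var)
        standardized.append((lp - mu) / sigma if sigma > 0 else 0.0)
    # Average the lowest k fraction of standardized token scores.
    n_keep = max(1, int(len(standardized) * k))
    lowest = sorted(standardized)[:n_keep]
    return sum(lowest) / n_keep


# Toy example: a token the model finds likely scores higher than an unlikely one.
dist = [math.log(0.9), math.log(0.1)]
print(min_k_pp_score([math.log(0.9)], [dist], k=1.0))
print(min_k_pp_score([math.log(0.1)], [dist], k=1.0))
```

Higher scores indicate text the model finds unusually expected, which is why such scores are used as membership signals; the abstract's finding is that, taken relative to a small cohort, they can fire even for a model without plausible exposure.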

2606.10107 2026-06-10 cs.CV q-bio.QM 新提交

用于鲁棒野火检测与分类的大规模开源图像和视频数据集

Emadeldeen Hamdan, Yingyi Luo, B. Ugur Toreyin, Erdem Koyuncu, Adam J. Watts, Ugur Gudukbay, Ahmet Enis Cetin

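AI summary: Introduces GWFP, a large-scale open-source wildfire image and video dataset, benchmarks multiple convolutional and transformer-based architectures on it, and explores HTE-ResNet for robust cross-domain detection.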

AI中文摘要

野火检测与监测对于减缓火势蔓延和减少环境及基础设施损害至关重要。本文介绍了GWFP（全球野火预防数据集），这是一个大规模、开源的野火图像和视频数据集，旨在支持早期火灾和烟雾检测研究。GWFP包含地理多样化的野火场景，包括火焰、烟雾、水雾/雾环境条件、近红外（NIR）图像、余烬以及从全球真实场景中收集的具有挑战性的负样本。为了评估数据集的鲁棒性和跨域泛化能力，我们在域内和跨数据集设置下对多种卷积和基于Transformer的架构进行了基准测试。此外，我们探索了使用Hadamard增强残差连接（HTE-ResNet）的轻量级频率-空间特征交互，以分析域偏移条件下的表示鲁棒性。实验结果表明，该方法在真实世界野火监测应用中具有强大的跨数据集泛化能力和实用价值。数据集和源代码将在接收后公开发布。

英文摘要

Wildfire detection and monitoring are critical for mitigating fire spread and reducing environmental and infrastructural damage. In this work, we introduce GWFP (Global Wildfire Prevention Dataset), a large-scale, open-source dataset of wildfire images and videos designed to support early fire and smoke detection research. GWFP contains geographically diverse wildfire scenes, including flames, smoke, Waterdog/Fog environmental conditions, Near Infrared (NIR) imagery, Ember, and challenging negative samples collected from real-world scenarios worldwide. To evaluate dataset robustness and cross-domain generalization, we benchmark multiple convolutional and transformer-based architectures across both in-domain and cross-dataset settings. Additionally, we explore lightweight frequency-spatial feature interaction using Hadamard-enhanced residual connections (HTE-ResNet) to analyze representation robustness under domain-shift conditions. Experimental results demonstrate strong cross-dataset generalization and practical utility for real-world wildfire monitoring applications. The dataset and source code will be publicly released upon acceptance.

2606.10196 2026-06-10 cs.CV cs.AI 新提交

分析目标检测数据集的无训练腐败检测

Christian Sieberichs, Simon Geerkens, Thomas Waschulzik, Viswanathan Ramesh, Alexander Braun

发表机构 * University of Applied Sciences Düsseldorf（杜塞尔多夫应用科学大学）； Siemens Mobility GmbH（西门子交通有限公司）； Goethe University Frankfurt（法兰克福大学）

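AI summary: Studies training-free feature-space methods for detecting annotation errors in object detection datasets, finding that they reliably expose semantic errors while positional errors remain difficult to detect.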

Comments Accepted at DataCV Workshop, Conference on Computer Vision and Pattern Recognition (CVPR) 2026

AI中文摘要

注释错误在计算机视觉数据集中普遍存在，并且会显著降低在其上训练的系统的性能，特别是在目标检测等复杂任务中。存在多种识别注释错误的方法，包括无训练的特征空间方法，这些方法提供了一种快速且可解释的方式来分析注释。然而，对于包含语义和空间信息的目标检测注释，其行为在很大程度上仍未探索。在这项工作中，我们分析了基于特征空间的方法在检测目标检测数据集中的注释错误时的适用性。通过调整现有的特征空间方法，我们表明此类方法可靠地暴露语义错误，而位置错误仍然难以检测。我们使用VOC2012和KITTI，在多个预训练嵌入模型、合成噪声类型（对称、非对称和位置）以及真实世界注释错误上评估了这种行为。所有代码和真实世界腐败数据均可在以下仓库公开获取：https://github.com/ChristianSieberichs/BoundingBox_corruption_detection

英文摘要

Annotation errors are widespread in computer vision datasets and can significantly degrade the performance of systems trained on them, particularly in complex tasks such as object detection. Several approaches exist to identify annotation errors, including training-free feature-space methods, which provide a fast and interpretable way to analyze annotations. However, their behavior on object detection annotations, which include semantic and spatial information, remains largely unexplored. In this work, we analyze the applicability of feature-space-based approaches for detecting annotation errors in object detection datasets. By adapting an existing feature-space method, we show that such approaches reliably expose semantic mislabels, while positional errors remain difficult to detect. We evaluate this behavior across multiple pretrained embedding models, synthetic noise types (symmetric, asymmetric, and positional), and real-world annotation errors using VOC2012 and KITTI. All code and real-world corruptions are publicly available at the following repository: https://github.com/ChristianSieberichs/BoundingBox_corruption_detection
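
To make the idea of a training-free feature-space check concrete, here is a minimal sketch, assuming each annotated box has already been cropped and embedded by some pretrained model. It flags a box when its label disagrees with the labels of its nearest neighbours in embedding space. This is a generic k-nearest-neighbour heuristic chosen for illustration, not necessarily the method the authors adapt; the function name and the toy 2-D embeddings are assumptions.

```python
import math


def knn_label_disagreement(embeddings, labels, k=3):
    """For each item, return the fraction of its k nearest neighbours
    (Euclidean distance in embedding space) that carry a different label.

    Assumes there are more than k items.
    """
    scores = []
    for i, emb in enumerate(embeddings):
        # Distances to every other item, closest first.
        dists = sorted(
            (math.dist(emb, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        disagree = sum(1 for lab in neighbour_labels if lab != labels[i])
        scores.append(disagree / len(neighbour_labels))
    return scores


# Toy data: two well-separated clusters; item 3 sits in the "car" cluster
# but is labelled "person".
embeddings = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["car", "car", "car", "person",
          "person", "person", "person", "person"]
scores = knn_label_disagreement(embeddings, labels, k=3)
suspect = max(range(len(scores)), key=scores.__getitem__)
print(suspect, scores[suspect])
```

A check of this kind reacts to a wrong class label because the crop's features land among another class. A slightly shifted box often still yields features close to its own class, which is one plausible reading of why positional errors are harder to expose.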

2606.10790 2026-06-10 cs.CV 新提交

A Multimodal RGB and Events Dataset for Hand Detection in First-Person View

第一人称视角下用于手部检测的多模态RGB和事件数据集

Bharghav Kota, Yulia Sandamirskaya

发表机构 * Zurich University of Applied Sciences（苏黎世应用科技大学）

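AI summary: To address motion blur from conventional cameras in low light on moving robotic systems, proposes multimodal hand detection that combines event and RGB cameras, and uses a synthetic event dataset to reach performance comparable to existing methods.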

AI中文摘要

现有的手部检测算法基于图像工作，检测率受限于相机的帧率。在移动机器人系统的手部检测应用中，传统相机会导致运动模糊，尤其是在较暗的光照条件下。我们可以利用事件相机，它具有高动态范围、高时间分辨率和低功耗的特点。最近的研究表明，使用事件相机和帧相机的立体设置可以提高检测精度和带宽-延迟权衡。在目标检测和识别任务中使用事件相机的主要瓶颈是训练数据量相对较少。在这项工作中，我们提出了一种方法以及一个从自我中心、第一人称视角合成的示例性事件手部数据集。数据使用v2e工具箱从现有的RGB Egohands数据集合成。通过改变v2e工具箱的参数，提供不同光照条件和尺度的数据集版本。使用微调后的YOLOv8模型生成地面真值检测，该模型应用于Egohands数据集中的RGB图像，并在高时间分辨率事件上进行插值。我们使用多模态数据集，利用现有的使用事件和RGB相机多模态设置的目标检测算法进行手部检测，并展示了与最先进方法相当的性能。

英文摘要

Existing hand detection algorithms work on images and the detection rate is restricted by the frame rate of the camera. In hand detection applications for moving robotic systems, conventional cameras cause motion blur, especially in darker lighting conditions. We can leverage the use of event-based cameras which possess a high dynamic range, high temporal resolution, and low power consumption. Recent work has shown that using a stereo setup of an event-based and a frame-based camera improves detection accuracy and the bandwidth-latency tradeoff. The main bottleneck in using event-based cameras in object detection and recognition tasks is a relatively low amount of training data. In this work, we propose a methodology and an exemplary synthetic event-based hand dataset from an egocentric, first-person view perspective. The data is synthesized from the existing RGB Egohands dataset with the v2e toolbox. Parameters of the v2e toolbox are varied to provide versions of the dataset with different lighting conditions and scales. Ground truth detections are generated with a fine-tuned YOLOv8 model which is applied to the RGB images in the Egohands dataset and interpolated on the high-temporal resolution events. We use the multi-modal dataset to perform hand detection with existing object detection algorithms which use a multi-modal setup of event and RGB cameras and demonstrate performance comparable to the state-of-the-art.
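
The abstract states that frame-level detections are interpolated onto the high-temporal-resolution events but does not say how. As a minimal sketch, assuming plain linear interpolation between the boxes of two consecutive RGB frames (an assumption, as are the function name and the `(x1, y1, x2, y2)` box layout):

```python
def interpolate_box(box_a, t_a, box_b, t_b, t):
    """Linearly interpolate a bounding box to timestamp t.

    box_a, box_b: (x1, y1, x2, y2) detections at frame times t_a < t_b.
    t: an event timestamp with t_a <= t <= t_b.
    """
    if t_a >= t_b or not (t_a <= t <= t_b):
        raise ValueError("expected t_a < t_b and t_a <= t <= t_b")
    w = (t - t_a) / (t_b - t_a)  # 0 at frame A, 1 at frame B
    return tuple((1 - w) * a + w * b for a, b in zip(box_a, box_b))


# A box moving right and down between two frames, queried halfway in time.
print(interpolate_box((0, 0, 10, 10), 0.0, (10, 20, 30, 30), 1.0, 0.5))
# → (5.0, 10.0, 20.0, 20.0)
```

Linear interpolation is only reasonable when the hand moves roughly uniformly between frames; faster or curved motion would need a denser set of reference detections.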

2606.10894 2026-06-10 cs.CV 新提交

The 1st PortraitCraft Challenge: A CVPR 2026 Workshop Competition on Portrait Composition Understanding and Generation

首届PortraitCraft挑战赛：CVPR 2026研讨会肖像构图理解与生成竞赛

Zijie Lou, Youyun Tang, Xiaochao Qu, Haoxiang Li, Ting Liu, Luoqi Liu, Xun Zhu, Zheng Zhang, Xi Chen, Miao Li, Ji Wu, Dizhe Zhang, Xian Ge, Sujia Wang, Ruiyang Zhang, Jiaming Wang, Xianshun Wang, Lu Qi, Boao Kang, Wei Zhou, Jinghui Sun, Zhenyu Yan, Jiliang Zhao, Rui Yang, Yipo Huang, Boyuan Liu, Shanglin Li, Zifan Xie, Yichen Zhang, Anlan Wang, Wenfeng Lin, Mingyu Guo, Dong Li, Xinghao Wang, Yanting Li, Shanzhao Tong, Shuai He, Qiu Zhou, Yongqi Yang, Taoyang Mu, Dianqiao Lei, Anlong Ming, Huadong Ma

发表机构 * CVPR 2026

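AI summary: Presents the PortraitCraft Challenge, with two tracks on composition understanding and generation, and releases a dataset of about 50,000 portraits to advance research on portrait aesthetics and controllable image generation.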

AI中文摘要

本文介绍了首届PortraitCraft挑战赛的概况，该挑战赛是CVPR 2026的官方竞赛之一。挑战赛聚焦于肖像构图理解与生成，旨在推动AI在肖像美学分析和可控图像合成方面的研究。与主要关注全局美学评分的现有数据集和任务不同，PortraitCraft引入了一个统一的评估框架，包含两个互补赛道。赛道1要求模型进行结构化肖像构图理解，赛道2要求模型在显式构图约束下从结构化构图描述生成肖像图像。为支持该挑战赛，我们构建并公开发布了一个大规模肖像构图数据集，包含约50,000张精心策划的真实肖像图像，提供多级监督。本报告描述了挑战赛设置、评估协议、数据集组成和最终结果，并分析了提交方案的技术特点。PortraitCraft挑战赛为肖像构图理解与生成研究提供了一个标准化和可复现的平台，有望推动肖像美学和可控图像生成领域的进一步发展。

英文摘要

This paper presents an overview of the inaugural PortraitCraft Challenge, held as one of the official competitions at CVPR 2026. The challenge focuses on portrait composition understanding and generation, aiming to advance AI research in portrait aesthetics analysis and controllable image synthesis. Unlike existing datasets and tasks that primarily focus on global aesthetic scoring, PortraitCraft introduces a unified evaluation framework comprising two complementary tracks. Track 1 requires models to perform structured portrait composition understanding, and Track 2 requires models to generate portrait images from structured composition descriptions under explicit compositional constraints. To support the challenge, we constructed and publicly released a large-scale portrait composition dataset consisting of approximately 50,000 curated real portrait images, providing multi-level supervision. This report describes the challenge setup, evaluation protocols, dataset composition, and final results, along with an analysis of the technical characteristics of the submitted solutions. The PortraitCraft Challenge provides a standardized and reproducible platform for research on portrait composition understanding and generation, and is expected to foster further progress in the fields of portrait aesthetics and controllable image generation.

2606.10905 2026-06-10 cs.CV 新提交

Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

超越模型规模：通过训练小模型探究视觉上下文学习中的差距

Sunil Khatri, Steven Landgraf, Markus Ulrich, Simon Reiß

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

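AI summary: Trains a tiny model with only 1 million parameters to challenge large-scale visual in-context learning models, revealing gaps in how adaptive capabilities are measured with respect to task encodings, pre-training tasks, and evaluation metrics.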

AI中文摘要

视觉上下文学习（VICL）旨在推动自适应视觉模型的发展，使其能够基于少量示例在测试时适应新任务。受自然语言处理研究中上下文学习历史的影响，当前VICL方法通常采用大规模模型和数据扩展作为关键要素。然而，这些要素是否是视觉模型形成上下文学习能力的关键尚不清楚。为了对这类大模型进行压力测试，我们用一个极端的反例挑战它们：我们训练了一个仅含1百万参数和7万张图像的微小视觉上下文模型。我们将这个容量严重受限的小模型与7000倍大的VICL模型在不同自适应设置下进行比较：（1）具有小分布偏移的图像数据，（2）未见过的任务编码，以及（3）全新的任务，即VICL所设想的场景。由于小模型和大模型之间训练资源的巨大差距，我们的实验展示了在任务编码方式、预训练中使用的任务以及评估指标选择方面，自适应能力测量存在的不足。当前VICL基准测试中的这些差距凸显了在自适应能力评估方面进行创新的必要性。

英文摘要

Visual in-Context Learning (VICL) aims at making progress towards adaptive vision models that can adapt to a new task at test time based on a few examples. Following the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is to treat model and data scaling as key ingredients. Yet it is not clear whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely 1 million parameters and a modest amount of 70,000 images. We compare the results of this severely capacity-capped tiny model to 7,000× larger VICL models in different adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments showcase shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in the evaluation of adaptive capabilities.

2606.10967 2026-06-10 cs.CV 新提交

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

视觉上下文学习何去何从？跨领域与任务的统一基准

Pradnya Halady, Jiale Wei, Zdravko Marinov, Alexander Jaus, Simon Reiß

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

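AI summary: To address the fact that visual in-context learning is evaluated mostly on tasks that mirror pre-training, builds VIBE, a unified benchmark across domains and tasks, and tests 6 models on 14 datasets and 12 tasks, revealing their adaptive capabilities, limitations, and failure modes.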

AI中文摘要

视觉上下文学习被提出作为动态模型的一种途径，这些模型可以根据提供的上下文生成预测，从而在测试时适应新的视觉任务。然而，对这些模型适应能力的评估一直局限于狭窄的设置，主要反映预训练中的任务或图像领域，而实际适应并不需要。我们通过构建一个广泛的视觉上下文基准（VIBE），重点关注多样化的成像领域和广泛的任务，来解决这一差距。借此，我们能够更清晰地了解视觉上下文模型在面对新的图像和任务分布时的适应能力。我们在14个数据集和12个任务上对六个模型进行了压力测试（总共探索了106个数据集-任务组合），并在统一的、可重复的评估协议下，以一次学习设置进行比较。我们的评估揭示了视觉上下文学习现状的关键见解，包括局限性、系统性失败模式和有前景的方向。为了促进更广泛的评估，我们将公开发布我们的VIBE工具包。

英文摘要

Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby adapt to new vision tasks at test time. Yet the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we are able to get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress-test six models on 14 datasets and 12 tasks (in total, we explore 106 dataset-task combinations) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

2606.11129 2026-06-10 cs.CV 新提交

WorldOlympiad: Can Your World Model Survive a Triathlon?

WorldOlympiad：你的世界模型能经受铁人三项考验吗？

Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang

发表机构 * Zhejiang University（浙江大学）； DAMO Academy, Alibaba Group（阿里巴巴达摩院）； The Hong Kong University of Science and Technology（香港科技大学）； Monash University（莫纳什大学）； TRE, Alibaba Group（阿里巴巴TRE）

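AI summary: Proposes the WorldOlympiad benchmark, which diagnoses video world models along three dimensions (physical faithfulness, geometric consistency, and interaction fidelity) and reveals substantial gaps of current models in physical reasoning, 3D consistency, and long-horizon interaction.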

Comments Project Page: https://alibaba-damo-academy.github.io/WorldOlympiad/, Code: https://github.com/alibaba-damo-academy/WorldOlympiad

AI中文摘要

我们介绍WorldOlympiad，一个用于诊断基于视频的世界模型在物理忠实性、几何一致性和交互保真度方面的基准。现有基准通常关注视觉质量、语义对齐或短期时间一致性，但很少能洞察生成视频是否遵循物理规则、保持连贯的3D结构以及支持长程可控交互。为弥补这一空白，WorldOlympiad将世界模型评估分解为三个互补维度。物理赛道使用对象分割和MLLM作为评判者，评估生成视频是否遵循力学、热现象和材料属性中的可解释规则。几何赛道通过高斯泼溅重建生成视频，评估结构一致性、跨视角连贯性和相机轨迹对齐。交互赛道评估生成序列是否遵循复杂动作提示并在连续视频块间保持平滑连贯的过渡。WorldOlympiad进一步涵盖三个主要下游场景，包括游戏、机器人和通用真实世界视频，捕捉从交互控制、具身操作到开放域运动和相机动态的多样化挑战。这些赛道和场景共同构成了一个可扩展且可解释的评估套件，揭示了超越通用视频质量的失败模式。对最先进模型的实验揭示了物理推理、3D一致性和长程交互方面的显著差距，强调了为生成式世界模型制定更结构化评估协议的必要性。

英文摘要

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

2606.10255 2026-06-10 eess.IV cs.CV cs.DL cs.LG physics.bio-ph 交叉投稿

异常检测是否需要任务特定训练？

Xingwu Zhang, Guanxuan Li, Paul Henderson, Gerardo Aragon-Camarasa, Zijun Long

发表机构 * University of California, Berkeley（加州大学伯克利分校）

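AI summary: Proposes RAD, a retrieval-based anomaly detection framework that needs no task-specific training and matches test patches against anomaly-free features in a memory bank through multi-level retrieval, reaching state-of-the-art performance on multiple benchmarks and challenging the need for task-specific training.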

AI中文摘要

当前最先进的多类无监督异常检测（MUAD）方法依赖于训练编码器-解码器模型来重建无异常特征。然而，我们认为这种任务特定训练在分布偏移下成本高昂，并且基于重建的残差评分进一步面临保真度-稳定性困境。现有的免训练替代方案在MUAD中仍然容易受到跨类别和跨区域不匹配的影响。受这些限制的启发，我们提出了基于检索的异常检测（RAD），一种无需任务特定训练的框架，它将无异常特征存储在记忆库中，并通过多级检索检测异常，将测试补丁与记忆库进行匹配。实验表明，RAD在四个既定基准（MVTec-AD、VisA、Real-IAD、3D-ADAM）的标准和少样本设置下均达到了最先进的性能。在MVTec-AD上，RAD仅使用单个无异常图像即可达到96.7%的像素AUROC，而RAD的全数据性能为98.5%。这些发现共同推翻了MUAD需要任务特定训练的假设，表明最先进的异常检测可以通过免训练的基于记忆的检索实现。我们的代码可在 https://github.com/longkukuhi/RAD 获取。

英文摘要

Current state-of-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder-decoder models to reconstruct anomaly-free features. However, we argue that such task-specific training is costly under distribution shifts, and that reconstruction-based residual scoring further faces a fidelity-stability dilemma. Existing training-free alternatives, in turn, remain prone to cross-category and cross-region mismatches in MUAD. Motivated by these limitations, we propose Retrieval-based Anomaly Detection (RAD), a task-specific training-free framework that stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image, compared to 98.5% in RAD's full-data setting. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with training-free memory-based retrieval. Our code is available at https://github.com/longkukuhi/RAD.
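
The core mechanism described above, storing anomaly-free features and scoring test patches by retrieval, can be sketched in a few lines. This is a plain single-level nearest-neighbour lookup written for illustration; RAD's multi-level retrieval is not reproduced, and the function names and toy 2-D features are assumptions.

```python
import math


def build_memory(normal_images):
    """Flatten per-image patch features of anomaly-free images into one memory bank."""
    return [patch for image in normal_images for patch in image]


def patch_scores(test_patches, memory):
    """Anomaly score of each test patch: distance to its nearest memory entry."""
    return [min(math.dist(p, m) for m in memory) for p in test_patches]


def image_score(test_patches, memory):
    """Image-level score: the most anomalous patch decides."""
    return max(patch_scores(test_patches, memory))


# Toy 2-D "features": memory built from one anomaly-free image.
memory = build_memory([[(0.0, 0.0), (1.0, 0.0)]])
test = [(0.0, 0.0), (1.0, 0.1), (4.0, 4.0)]
print(patch_scores(test, memory))
print(image_score(test, memory))
```

No parameters are fitted here: adding a new product category only means appending its anomaly-free features to the memory, which is the practical appeal of training-free retrieval under distribution shift.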

2602.09809 2026-06-10 cs.CV 版本更新

SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

SciFlow-Bench：通过逆解析评估结构感知的科学图表生成

Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang

发表机构 * Peking University（北京大学）； Shanghai Jiao Tong University（上海交通大学）； Huawei Cloud BU（华为云业务部）； Zhongguancun Academy（中关村学院）； Beijing Key Laboratory of Data Intelligence and Security (Peking University)（北京数据智能与安全重点实验室（北京大学））

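AI summary: Proposes the SciFlow-Bench benchmark, which inverse-parses generated diagram images into structured graphs for comparison, evaluating scientific diagram generation by structural recoverability rather than visual similarity.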

AI中文摘要

科学图表传达显式的结构信息，然而现代文本到图像模型通常生成视觉上合理但结构错误的结果。现有基准要么依赖图像中心或主观指标，对结构不敏感，要么评估中间符号表示而非最终渲染图像，导致基于像素的图表生成研究不足。我们引入SciFlow-Bench，一个结构优先的基准，用于直接从像素级输出评估科学图表生成。基于真实科学PDF构建，SciFlow-Bench将每个源框架图与规范真值图配对，并在闭环往返协议下将模型作为黑盒图像生成器进行评估，该协议将生成的图表图像逆解析回结构化图以进行比较。该设计通过结构可恢复性而非仅视觉相似性进行强制评估，并由一个协调规划、感知和结构推理的分层多智能体系统实现。实验表明，保持结构正确性仍然是一个基本挑战，特别是对于具有复杂拓扑的图表，强调了结构感知评估的必要性。

英文摘要

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
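
The round-trip protocol above ends with a comparison between a graph recovered from the generated image and a ground-truth graph. The abstract does not state the metric, so the following is only a sketch of one simple option: an F1 score over directed edges, assuming node labels have already been normalized so that matching nodes compare equal. The function name and the example node names are hypothetical.

```python
def edge_f1(predicted_edges, reference_edges):
    """F1 over directed edges, each given as a (source, target) pair of node labels."""
    pred, ref = set(predicted_edges), set(reference_edges)
    if not pred and not ref:
        return 1.0  # two empty graphs agree trivially
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = [("Encoder", "Fusion"), ("Fusion", "Decoder")]
# The parsed diagram has the right boxes but one arrow drawn backwards.
parsed = [("Encoder", "Fusion"), ("Decoder", "Fusion")]
print(edge_f1(parsed, reference))  # → 0.5
```

A reversed arrow leaves the image visually almost unchanged yet halves this score, which illustrates why a structure-first metric can disagree with image-similarity metrics.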

2409.02426 2026-06-10 cs.LG cs.CV 版本更新

Breaking the Curse of Dimensionality: Diffusion Models Efficiently Learn Low-Dimensional Distributions

打破维度诅咒：扩散模型高效学习低维分布

Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu

发表机构 * University of Science and Technology of China（中国科学技术大学）

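AI summary: Develops a new mathematical framework showing that training a diffusion model is equivalent to subspace clustering, so that low-dimensional distributions can be learned with sample complexity linear in the intrinsic dimension, avoiding the curse of dimensionality.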

Comments 37 pages, 8 figures, 2 tables

AI中文摘要

尽管扩散模型在广泛的生成任务中取得了经验上的成功，但其学习数据分布能力的基本原理仍不清楚。在这项工作中，我们开发了一个新的数学框架，解释了扩散模型如何能够从有限数量的训练样本中有效学习低维分布，而不受维度诅咒的影响。具体来说，受图像数据内在低维结构的启发，我们在理论上分析了一个数据分布被建模为低秩高斯混合的场景。在合适的网络参数化下，我们表明优化扩散模型的训练目标等价于在训练样本上解决经典子空间聚类问题，其中每个子空间基对应于一个高斯分量的低秩协方差。这种等价性使我们能够证明，学习底层分布的样本复杂度与数据的内在维度呈线性关系，而不是与环境维度呈指数关系。我们的理论发现得到了经验证据的进一步支持，这些证据展示了在合成和真实世界图像数据集上的泛化相变现象。此外，我们建立了学习到的子空间基与图像数据语义属性之间的对应关系，为可控图像生成提供了原则性基础。

英文摘要

Despite their empirical success across a wide range of generative tasks, the fundamental principles underlying the ability of diffusion models to learn data distributions are poorly understood. In this work, we develop a new mathematical framework that explains how diffusion models can effectively learn low-dimensional distributions from a finite number of training samples without suffering from the curse of dimensionality. Specifically, motivated by the intrinsic low-dimensional structure of image data, we theoretically analyze a setting in which the data distribution is modeled as a mixture of low-rank Gaussians. Under suitable network parameterization, we show that optimizing the training objective of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples, where each subspace basis corresponds to the low-rank covariance of a Gaussian component. This equivalence allows us to show that the sample complexity for learning the underlying distribution scales linearly with the intrinsic dimension of the data, rather than exponentially with the ambient dimension. Our theoretical findings are further supported by empirical evidence that demonstrates phase transition phenomena in generalization on both synthetic and real-world image datasets. Moreover, we establish a correspondence between the learned subspace bases and semantic attributes of image data, providing a principled foundation for controllable image generation.
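
The low-rank Gaussian mixture setting described above can be written compactly. The following is a sketch under stated assumptions: zero-mean components and orthonormal subspace bases, with the symbols $K$, $\pi_k$, $U_k$, $n$, and $d_k$ being notation chosen here rather than taken from the abstract.

```latex
x \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(0,\; U_k U_k^{\top}\right),
\qquad U_k \in \mathbb{R}^{n \times d_k},
\quad U_k^{\top} U_k = I_{d_k},
\quad d_k \ll n .
```

Each covariance $U_k U_k^{\top}$ has rank $d_k$, so samples from component $k$ lie in a $d_k$-dimensional subspace of the $n$-dimensional ambient space. Recovering the bases $U_k$ from samples is the subspace clustering problem the abstract refers to, and the claimed sample complexity scales with the intrinsic dimensions $d_k$ rather than with $n$.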

2411.02817 2026-06-10 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

深度学习用于回声测深仪数据

Ketil Malde

发表机构 * Ketil Malde

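AI summary: Discusses the application of deep learning to acoustic data such as echograms, argues that the characteristics of acoustic data call for dedicated methods rather than simple reuse of image-processing models, and identifies the lack of standard datasets and formats as a major obstacle.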

连续神经重参数化作为鲁棒固定图表UV修复的深度几何先验

Mohammad Sadegh Salehi

发表机构 * Zero One Creative London, UK（伦敦零一创意公司）

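AI summary: Recasts fixed-chart UV unwrapping as continuous neural reparameterization, optimizing an untrained SIREN network for a geometric objective and combining Laplace-Beltrami spectral inputs, Tutte residual warm-up, and related strategies into a robust chart solver that yields zero flips on all compact charts tested.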

AI中文摘要

传统的UV展开依赖于几何畸变能量的直接优化，可能因无效初始化、局部最小值或拓扑翻转而失败。我们将固定图表UV展开重新定义为连续神经重参数化：一个未训练的SIREN将每个顶点的网格特征映射到UV坐标，其权重针对几何目标进行优化。实际贡献是一个鲁棒的图表求解器配方，结合了Laplace-Beltrami谱输入、Tutte残差预热、$C^2$行列式扩展、单射性屏障以及有效性检查的重试/回退路由，而非声称任何单一组件能保证有效性或应取代重切割方法。NTK-LBO诊断表明，谱条件改变更新几何，尤其在初始化和中秩子空间，但本身不能预测图表成功。在紧凑预切割图表和47图表分层Thingi10K/xatlas切割基准上，神经求解器在所有紧凑图表上产生零翻转，并在42/47个分层求解中有效零翻转。与BFF和OptCuts的比较明确了范围：允许时重切割可以更快且畸变更低，而神经求解器针对提供图表的有效性和验证优先的图集构建。在Amara Spatial生成的网格上，完整的图集构建路径在25个资产集上提供打包图集覆盖，并在大规模Rust图集运行中通过回退路由实现1000/1000严格局部有效且零UV翻转的图集。

英文摘要

Traditional UV unwrapping relies on direct optimization of geometric distortion energies and can fail through invalid initialization, local minima, or topological foldovers. We recast fixed-chart UV unwrapping as continuous neural reparameterization: an untrained SIREN maps per-vertex mesh features to UV coordinates, and its weights are optimized for a geometric objective. The practical contribution is a robust chart-solver recipe, combining Laplace-Beltrami spectral inputs, Tutte residual warm-up, a $C^2$ determinant extension, an injectivity barrier, and validity-checked retry/fallback routing, rather than a claim that any single component guarantees validity or that recutting methods should be replaced. NTK-LBO diagnostics show that spectral conditioning changes update geometry, especially at initialization and mid-rank subspaces, but does not by itself predict chart success. On compact pre-cut charts and a 47-chart stratified Thingi10K/xatlas-cut benchmark, the neural solver produces zero flips on all compact charts and 42/47 valid zero-flip stratified solves. BFF and OptCuts comparisons sharpen the scope: recutting can be faster and lower-distortion when allowed, while the neural solver targets supplied-chart validity and validation-first atlas construction. On Amara Spatial generated meshes, the full atlas construction path gives packed-atlas coverage on a 25-asset set and 1000/1000 strict locally valid atlases with zero UV flips in a large-scale Rust atlas run after fallback routing.
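
The "zero flips" criterion used throughout this abstract can be checked with a simple orientation test. The sketch below counts UV triangles whose signed area is not positive, assuming all triangles are listed counter-clockwise in a valid layout and treating degenerate (zero-area) triangles as invalid. It illustrates the validity check only, not the authors' solver; the function names are assumptions.

```python
def signed_area(a, b, c):
    """Signed area of triangle (a, b, c) in the UV plane; positive if counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def count_flips(uv, triangles):
    """Number of triangles that are flipped or degenerate in the UV layout."""
    return sum(
        1 for i, j, k in triangles
        if signed_area(uv[i], uv[j], uv[k]) <= 0.0
    )


# A unit square split into two counter-clockwise triangles: a valid layout.
triangles = [(0, 1, 2), (1, 3, 2)]
uv_valid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(count_flips(uv_valid, triangles))   # → 0

# Pulling vertex 3 across the shared edge folds the second triangle over.
uv_folded = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.2, 0.2)]
print(count_flips(uv_folded, triangles))  # → 1
```

A validity-checked retry or fallback loop of the kind the abstract mentions could use such a count as its acceptance test, accepting a chart only when it returns zero.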

2606.10223 2026-06-10 cs.SD cs.AI cs.CV 交叉投稿

Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

双分支门控融合用于开放集音频深度伪造源追踪

Awais Khan, Kutub Uddin, Khalid Malik

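AI summary: For open-set audio deepfake source tracing, proposes a dual-branch gated fusion framework that combines XLSR-53 with the CORES descriptor and weights the branches adaptively through an input-conditioned gate, achieving high in-domain accuracy and robust out-of-domain generalization.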

AI中文摘要

将合成语音归因于其原始系统仍然是一个开放挑战：闭集模型无法拒绝未见过的合成器并产生过度自信的预测。为了解决这个问题，我们提出了一个双分支门控融合框架，将XLSR-53与CORES配对，CORES是一个66维描述符，与之前仅使用线性滤波器组（LFB）的工作不同，它跨越倒谱、振荡、节奏、能量和频谱维度，以捕获互补的合成伪影。我们的分析表明，XLSR-53在域内（ID）保持判别性，而CORES在分布偏移（OOD）下稳定泛化，但由于SSL表示不平衡，它们的简单拼接失败。为了解决这个问题，一个输入条件门控在联合训练下自适应地加权每个分支，使用交叉熵、用于ID/OOD分离的能量边际损失和门控多样性项。在MLAAD基准上，我们的系统实现了97.6%的ID准确率、4.9%的EERc，并且相对于Interspeech 2025基线，FPR95相对降低了83.5%。

英文摘要

Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.
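
An input-conditioned gate, as described above, computes its mixing weight from the features themselves instead of using a fixed ratio. The abstract does not give the gate's exact form, so the sketch below assumes the simplest variant: a single scalar sigmoid gate over the concatenated branch features, with both branches already projected to the same dimension. All names and shapes are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(ssl_feat, cores_feat, gate_weights, gate_bias=0.0):
    """Fuse two equally sized feature vectors with one input-dependent gate.

    ssl_feat, cores_feat: branch features of the same length (assumed pre-projected).
    gate_weights: one weight per entry of the concatenated features.
    Returns (gate value in (0, 1), fused feature vector).
    """
    joint = list(ssl_feat) + list(cores_feat)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias)
    # Convex combination: g favours the SSL branch, (1 - g) the handcrafted branch.
    fused = [g * a + (1.0 - g) * b for a, b in zip(ssl_feat, cores_feat)]
    return g, fused


# With all-zero gate weights the gate is neutral and the fusion is a plain average.
print(gated_fusion([2.0, 0.0], [0.0, 4.0], [0.0, 0.0, 0.0, 0.0]))
# → (0.5, [1.0, 2.0])
```

Unlike naive concatenation, the gate can shift weight toward the branch that is more trustworthy for a given input, which is the behaviour the abstract relies on for out-of-domain utterances.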

2606.10407 2026-06-10 cs.SD cs.CV q-bio.QM 交叉投稿

深度树张量网络

Chang Nie

发表机构 * Nanjing University of Science and Technology（南京理工大学）

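AI summary: Proposes the Deep Tree Tensor Network (DTTN), which captures exponential-order feature interactions through multilinear operations and outperforms existing methods on multiple benchmarks.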

AI中文摘要

源自量子物理的张量网络（TNs）已被广泛用作指数机器和参数分解器用于识别任务。典型的TN模型，如矩阵乘积态（MPS），在自然图像识别中尚未取得成功应用。当它们被使用时，主要是在现有网络中压缩参数，从而失去了捕获指数阶特征交互的独特能力。本文提出了一种名为深度树张量网络（DTTN）的新架构，它通过多线性运算捕获跨特征的$2^L$阶乘法交互，同时本质上展开为具有参数共享属性的树状TN拓扑。DTTN由多个反对称交互模块（AIMs）堆叠而成，这种设计便于高效实现。此外，我们的理论分析证明了量子启发的TN模型与多项式/多线性网络在特定条件下的等价性。我们认为DTTN可以促进该领域内更具可解释性的研究。所提出的模型在多个基准和领域上进行了评估，显示出优于同行方法和最先进架构的性能。我们的代码公开于 https://github.com/NieCha/deep_tree_tensor_network。

英文摘要

Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parametric decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image recognition. When employed, they primarily serve to compress parameters within pre-existing networks, thereby losing their distinctive capability to capture exponential-order feature interactions. This paper introduces a novel architecture named Deep Tree Tensor Network (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a tree-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interaction modules (AIMs), and this design facilitates efficient implementation. Furthermore, our theoretical analysis demonstrates the equivalence between quantum-inspired TN models and polynomial/multilinear networks under specific conditions. We posit that the DTTN could catalyze more interpretable research within this field. The proposed model is evaluated across multiple benchmarks and domains, demonstrating superior performance compared to both peer methods and state-of-the-art architectures. Our code is publicly available at https://github.com/NieCha/deep_tree_tensor_network.
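
The $2^L$-order claim follows from a doubling argument: if each layer multiplies two linear maps of its input elementwise, the polynomial degree in the original features doubles per layer. The sketch below demonstrates this with a generic bias-free multiplicative layer; it is not the paper's antisymmetric interaction module, and the function names are assumptions.

```python
def linear(weights, x):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def interaction_layer(a, b, x):
    """Elementwise product of two linear maps of x: degree 2 in the entries of x."""
    return [p * q for p, q in zip(linear(a, x), linear(b, x))]


def forward(layers, x):
    """Stack L interaction layers; the output has degree 2**L in the original input."""
    for a, b in layers:
        x = interaction_layer(a, b, x)
    return x


identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [(identity, identity)] * 3          # L = 3, so degree 2**3 = 8

x = [1.0, 2.0]
print(forward(layers, x))                    # each entry raised to the 8th power
# Doubling the input scales the output by 2**8 = 256, confirming degree 8.
print(forward(layers, [2.0 * v for v in x]))
```

Because the layers carry no bias terms, the network is homogeneous of degree $2^L$, so scaling the input by $c$ scales the output by $c^{2^L}$. That makes the interaction order easy to verify numerically.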

2509.25017 2026-06-10 cs.LG cs.CV 版本更新

Uncertainty-Aware Deep Learning for Wildfire Danger Forecasting

不确定性感知的深度学习用于野火危险预测

Spyros Kondylatos, Nikolas Papadopoulos, Gustau Camps-Valls, Ioannis Papoutsis

发表机构 * Aix-Marseille University（艾克斯-马赛大学）； University of Cambridge（剑桥大学）； University of Malaga（马拉加大学）； University of Crete（希腊克里特大学）

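AI summary: Proposes an uncertainty-aware deep learning framework that jointly captures epistemic and aleatoric uncertainty, improving the accuracy and reliability of short-term wildfire danger forecasting, with a 2.3% higher F1 score and a 2.1% lower Expected Calibration Error.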

AI中文摘要

野火是最严重的自然灾害之一，对人类和自然生态系统构成重大威胁。日益增长的野火风险增加了对不仅准确而且可靠的预测模型的需求。深度学习在预测野火危险方面显示出潜力；然而，其采用受到对其预测可靠性的担忧的阻碍，部分源于缺乏不确定性量化。为应对这一挑战，我们提出了一个不确定性感知的深度学习框架，该框架联合捕获认知（模型）和偶然（数据）不确定性，以增强短期野火危险预测。在次日预测中，与确定性基线相比，我们表现最佳的模型将F1分数提高了2.3%，并将预期校准误差降低了2.1%，从而提升了预测技能和校准能力。我们的实验证实了不确定性估计的可靠性，并展示了它们在决策支持中的实际效用，包括识别拒绝低置信度预测的不确定性阈值，以及生成伴随不确定性层的良好校准的野火危险图。将预测范围延长至十天，我们观察到偶然不确定性随时间增加，表明环境条件的更大变异性，而认知不确定性保持稳定。最后，我们表明，尽管两种不确定性类型在低不确定性情况下可能是冗余的，但在更具挑战性的条件下它们提供互补的见解，强调了联合建模对稳健野火危险预测的价值。总之，我们的方法显著提高了野火危险预测的准确性和可靠性，推动了可信赖的野火深度学习系统的发展。

英文摘要

Wildfires are among the most severe natural hazards, posing a significant threat to both humans and natural ecosystems. The growing risk of wildfires increases the demand for forecasting models that are not only accurate but also reliable. Deep Learning (DL) has shown promise in predicting wildfire danger; however, its adoption is hindered by concerns over the reliability of its predictions, some of which stem from the lack of uncertainty quantification. To address this challenge, we present an uncertainty-aware DL framework that jointly captures epistemic (model) and aleatoric (data) uncertainty to enhance short-term wildfire danger forecasting. In the next-day forecasting, our best-performing model improves the F1 Score by 2.3% and reduces the Expected Calibration Error by 2.1% compared to a deterministic baseline, enhancing both predictive skill and calibration. Our experiments confirm the reliability of the uncertainty estimates and illustrate their practical utility for decision support, including the identification of uncertainty thresholds for rejecting low-confidence predictions and the generation of well-calibrated wildfire danger maps with accompanying uncertainty layers. Extending the forecast horizon up to ten days, we observe that aleatoric uncertainty increases with time, showing greater variability in environmental conditions, while epistemic uncertainty remains stable. Finally, we show that although the two uncertainty types may be redundant in low-uncertainty cases, they provide complementary insights under more challenging conditions, underscoring the value of their joint modeling for robust wildfire danger prediction. In summary, our approach significantly improves the accuracy and reliability of wildfire danger forecasting, advancing the development of trustworthy wildfire DL systems.
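
Two quantities in this abstract have standard definitions that are easy to state in code: the Expected Calibration Error, and the split of predictive uncertainty into aleatoric and epistemic parts. The sketch below uses equal-width confidence bins for ECE and an entropy-based decomposition over several stochastic predictions for the same input (for example, ensemble members). The abstract does not say which estimators the authors use, so both choices, and all function names, are assumptions.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(sample_probs):
    """Split uncertainty for one input given several predicted fire probabilities.

    total     = entropy of the mean prediction
    aleatoric = mean entropy of the individual predictions
    epistemic = total - aleatoric (disagreement between the predictions)
    """
    mean_p = sum(sample_probs) / len(sample_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in sample_probs) / len(sample_probs)
    return total, aleatoric, total - aleatoric


# Members agree on 0.5: the uncertainty is all aleatoric.
print(decompose_uncertainty([0.5, 0.5]))
# Members are individually certain but disagree: the uncertainty is all epistemic.
print(decompose_uncertainty([0.0, 1.0]))
```

The two toy cases show why the abstract treats the uncertainty types as complementary: the mean prediction is 0.5 in both, yet the source of the uncertainty, and therefore the appropriate response, differs.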

2603.04852 2026-06-10 cs.AI cs.CV updated version

Non-Parametric Structural Priors for Geometry Theorem Prediction

几何定理预测的非参数结构先验

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

Affiliations: School of Artificial Intelligence, Beijing Normal University, Beijing, China; Engineering Research Center of Intelligent Technology and Educational Application; Beijing Key Laboratory of Artificial Intelligence for Education, Beijing, China; Baidu, Beijing, China

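AI summary: To address the poor generalization of parametric models in geometry theorem prediction, this work proposes Theorem Precedence Graphs as a non-parametric structural prior and performs training-free theorem prediction via in-context learning, reaching 89.29% accuracy on FormalGeo7k.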

AI中文摘要

多步定理预测是几何问题求解中的核心挑战。现有的神经符号方法严重依赖有监督参数模型，这些模型对不断发展的定理库泛化能力有限。在这项工作中，我们通过上下文学习（ICL）的视角探索无训练定理预测。我们识别出一个关键的可扩展性瓶颈，称为结构漂移：随着推理深度的增加，普通ICL的性能急剧下降，通常降至接近零。我们将这种失败归因于LLM无法恢复潜在拓扑依赖关系，导致无结构探索。为解决此问题，我们提出定理前驱图，将历史解轨迹中的时间依赖关系编码为有向图，并施加显式拓扑约束，从而在推理过程中有效剪枝搜索空间。结合检索增强的图构建和逐步符号执行器，我们的方法使LLM能够在没有任何基于梯度的优化的情况下充当结构化规划器。在FormalGeo7k基准上的实验表明，我们的方法达到了89.29%的准确率，显著优于ICL基线，并与最先进的有监督模型相匹配。这些结果表明，显式结构先验为扩展基于LLM的符号推理提供了一个有前景的方向。

英文摘要

Multi-step theorem prediction is a central challenge in geometry problem solving. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
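
The idea of encoding precedence from past solution traces as a directed graph and using it to prune candidates can be sketched in a few lines. This is only an illustrative reading of the abstract: the function names, the rule "a theorem may follow only theorems it has directly followed before", and the theorem names in the usage note are all hypothetical and are not the authors' implementation.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Build a directed graph from ordered theorem traces.

    Assumption: an edge a -> b is added whenever b directly follows a
    in at least one historical solution trace.
    """
    successors = defaultdict(set)
    starts = set()
    for trace in traces:
        if trace:
            starts.add(trace[0])
        for a, b in zip(trace, trace[1:]):
            successors[a].add(b)
    return successors, starts


def allowed_next(successors, starts, applied):
    """Return the candidate theorems permitted after the steps applied so far."""
    if not applied:
        return set(starts)
    # .get avoids creating new keys in the defaultdict for unseen theorems.
    return set(successors.get(applied[-1], set()))
```

With traces such as `[["parallel_property", "triangle_sum", "sine_rule"], ["parallel_property", "similar_triangles"]]`, only `triangle_sum` and `similar_triangles` remain as candidates after `parallel_property`, so a planner would rank two theorems instead of the whole library.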

2605.30370 2026-06-10 cs.NE cs.AI cs.CV cs.LG updated version

Updating the standard neuron model in artificial neural networks

更新人工神经网络中的标准神经元模型

Raul Mohedano, Thomas Batard, Erik Velasco-Salido, Ramsses De Los Santos Mendoza, Jorge H. Martínez, Stacey Levine, Marcelo Bertalmío

Affiliations: Spanish National Research Council (CSIC); Center for Research in Mathematics (CIMAT); Universidad Autónoma de Madrid (UAM); National Science Foundation (NSF)

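AI summary: This paper replaces the standard point-neuron model with a more realistic cortical cell model and, without adding parameters, improves the expressivity, robustness, and learning speed of artificial neural networks while reducing memorization and the amount of training data required.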

Comments Acknowledgments included in the manuscript

2602.16898 2026-06-10 cs.RO cs.AI cs.CV cs.LG updated version

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

MALLVI：一种多智能体框架用于集成通用机器人操作

Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

Affiliations: Department of Electrical Engineering, Sharif University of Technology

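AI summary: MALLVI achieves closed-loop, feedback-driven robotic manipulation through multi-agent collaboration, improving generalization and zero-shot task success rates.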

Comments Some fundamental changes in text and codebase

2603.04056 2026-06-10 cs.CV cs.RO updated version

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

长期动态底栖环境中的视觉定位：一个数据集、基于足迹的地面真实信息以及视觉地点识别基准

Martin Kvisvik Larsen, Oscar Pizarro

Affiliations: Department of Marine Technology, Norwegian University of Science and Technology, Trondheim, Norway

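AI summary: This paper presents a curated dataset for long-term visual localization in benthic environments and a footprint-based ground-truthing method, benchmarks eight state-of-the-art visual place recognition methods, and finds that their Recall@K on this dataset is significantly lower than on established benchmarks.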

DOI: 10.3389/frobt.2026.1821019
Journal ref: Frontiers in Robotics and AI Volume 13 (2026) 1821019

AI中文摘要

长期视觉定位有潜力降低光学底栖监测中自主水下机器人（AUV）的成本并提高制图质量。尽管有这种潜力，底栖环境中长期视觉定位仍被低估，主要由于缺乏用于基准测试的curated数据集。此外，有限的地理参考精度和图像足迹需要精确的几何信息以实现准确的地面真实。在本文中，我们通过提出一个用于长期视觉定位的底栖环境curated数据集和一种新的方法来为近垂直水下影像的视觉定位结果进行地面真实，解决了这些差距。我们的数据集包括来自五个底栖参考站点的地理参考AUV影像，这些站点在长达六年的期间内被重新访问，包括原始和颜色校正的立体影像、相机校准和亚分米注册的相机姿态。据我们所知，这是首个涵盖多个站点和光层栖息地的长期视觉定位水下数据集。我们的地面真实方法估计3D海底图像足迹，并将具有重叠足迹的相机视图联系起来，确保地面真实链接反映共享的视觉内容。基于此数据集和地面真实，我们基准测试了八种最先进的视觉地点识别（VPR）方法，并发现Recall@K在我们的数据集上显著低于传统陆地和水下基准。最后，我们比较了基于足迹的地面真实与传统位置基于的地面真实，并表明距离阈值地面真实在地形崎岖和海拔变化的站点上会高估VPR Recall@K。共同，curated数据集、地面真实方法和VPR基准为在动态底栖环境中推进长期视觉定位提供了基础。

英文摘要

Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
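
To make the point about ground-truth definitions concrete, the following minimal sketch computes Recall@K for the same retrieval results under two different ground-truth sets. The function name, the dictionary layout, and the image IDs in the usage note are hypothetical; how the paper builds its footprint-based links is not reproduced here.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of queries whose top-k retrievals contain at least one true match.

    rankings: query id -> list of reference ids, best match first.
    ground_truth: query id -> set of reference ids counted as correct.
    Queries without any true match are skipped (an assumption of this sketch).
    """
    hits = 0
    total = 0
    for query, ranked in rankings.items():
        positives = ground_truth.get(query, set())
        if not positives:
            continue
        total += 1
        if positives & set(ranked[:k]):
            hits += 1
    return hits / total if total else 0.0
```

For example, with rankings `{"q1": ["r2", "r1"], "q2": ["r3", "r4"]}`, a footprint-style ground truth `{"q1": {"r1"}, "q2": {"r3"}}` gives Recall@1 of 0.5, while a looser distance-threshold ground truth that also accepts `r2` for `q1` gives 1.0. The retrieval results are identical; only the definition of a correct match changed.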

2510.15470 2026-06-10 cs.CV cs.IR updated version

MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval

MSAM：多语义自适应挖掘用于跨模态无人机视频-文本检索

Jinghao Huang, Yaxiong Chen, Ganchao Liu

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; School of Computer Science and Artificial Intelligence, Wuhan University of Technology; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University

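AI summary: This paper proposes MSAM, which improves cross-modal drone video-text retrieval through a multi-semantic adaptive learning mechanism, using fine-grained interaction and an adaptive semantic construction module to make feature representations more robust.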

DOI: 10.1109/TCSVT.2026.3701979

AI中文摘要

随着无人机技术的发展，视频数据量迅速增加，亟需高效的语义检索方法。本文首次系统提出并研究无人机视频-文本检索（DVTR）任务。无人机视频具有俯视视角、强结构同质性和目标组合的多义性，挑战了现有针对地面视角设计的跨模态方法。为此，我们提出名为多语义自适应挖掘（MSAM）的新方法。MSAM引入多语义自适应学习机制，整合帧间动态变化并从特定场景区域提取丰富的语义信息，从而增强对无人机视频内容的深度理解和推理。该方法依赖于词与无人机视频帧之间的细粒度交互，整合自适应语义构建模块、分布驱动的语义学习项和多样性语义项，加深文本与无人机视频模态的交互并提升特征表示的鲁棒性。为减少无人机视频复杂背景的干扰，我们引入了跨模态交互特征融合池化机制，专注于目标区域的特征提取和匹配，以最小化噪声影响。在两个自建的无人机视频-文本数据集上进行的广泛实验表明，MSAM在无人机视频-文本检索任务中优于其他现有方法。源代码和数据集将公开发布。

英文摘要

With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.

2508.05769 2026-06-10 cs.CV updated version

Improving Masked Style Transfer using Blended Partial Convolution

通过混合部分卷积改进遮蔽风格迁移

Seyed Hadi Seyed, Ayberk Cansever, David Hart

Affiliations: East Carolina University

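AI summary: This paper proposes a partial-convolution-based style transfer network that applies style precisely to the region of interest, and uses network-internal blending techniques to compensate for imperfect region selection, improving both visual and quantitative results.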

DOI: 10.1109/ACCESS.2026.3687089
Journal ref: IEEE ACCESS Vol. 14 2026

AI中文摘要

艺术风格迁移长期以来依赖于卷积和变压器神经网络的发展。大多数算法将艺术风格应用于整个图像，但个别用户可能只需要将风格应用于图像中的特定区域。标准做法是在风格化后简单地对图像进行遮蔽。本文表明这种做法倾向于不恰当地捕捉目标区域的风格特征。我们提出了一种基于部分卷积的风格迁移网络，能够准确地将风格特征仅应用于目标区域。此外，我们还提出了网络内部的混合技术，以弥补区域选择的不完美。我们通过SA-1B数据集中的示例展示了这种改进在视觉和量化上的提升。代码可在https://github.com/davidmhart/StyleTransferMasked公开获取。

英文摘要

Artistic style transfer has long been possible with the advancements of convolution- and transformer-based neural networks. Most algorithms apply the artistic style transfer to the whole image, but individual users may only need to apply a style transfer to a specific region in the image. The standard practice is to simply mask the image after the stylization. This work shows that this approach tends to improperly capture the style features in the region of interest. We propose a partial-convolution-based style transfer network that accurately applies the style features exclusively to the region of interest. Additionally, we present network-internal blending techniques that account for imperfections in the region selection. We show that this visually and quantitatively improves stylization using examples from the SA-1B dataset. Code is publicly available at https://github.com/davidmhart/StyleTransferMasked.
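
For readers unfamiliar with partial convolution, the sketch below shows the standard operation as it is usually defined in the image inpainting literature: only pixels inside the mask contribute, and the result is rescaled by the fraction of valid pixels in each window. It is a single-channel, no-bias, no-padding toy version written for illustration; the function name is hypothetical, and the paper's network and its blending techniques are not reproduced here.

```python
def partial_conv2d(image, mask, kernel):
    """Single-channel partial convolution with 'valid' padding and no bias.

    image, mask, kernel are 2D lists; mask holds 1 for pixels inside the
    region of interest and 0 elsewhere. Returns (output, updated_mask).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    window_size = kh * kw
    out = [[0.0] * out_w for _ in range(out_h)]
    new_mask = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            valid = 0
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    m = mask[i + di][j + dj]
                    valid += m
                    acc += kernel[di][dj] * image[i + di][j + dj] * m
            if valid > 0:
                # Rescale so windows with few valid pixels are not dimmed.
                out[i][j] = acc * window_size / valid
                new_mask[i][j] = 1
    return out, new_mask
```

With an averaging kernel, a window whose masked-out pixel holds an arbitrary value still produces the mean of the valid pixels, which is the property that keeps features outside the region of interest from leaking into the result.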

2407.09510 2026-06-10 cs.CV updated version

3DGS.zip: A survey on 3D Gaussian Splatting Compression Methods

3DGS.zip：3D高斯散射压缩方法综述

Milena T. Bagdasarian, Paul Knoll, Yi-Hsin Li, Florian Barthel, Anna Hilsmann, Peter Eisert, Wieland Morgenstern

Affiliations: Fraunhofer HHI; Humboldt-Universität zu Berlin; Technische Universität Berlin

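AI summary: This survey reviews 3DGS compression methods, covering both compression and compaction techniques that make 3DGS more efficient and practical by reducing file size and the number of Gaussians while maintaining quality.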

Comments 3D Gaussian Splatting compression survey; 3DGS compression; updated discussion; new approaches added; new illustrations

DOI: 10.1111/cgf.70078
Journal ref: Computer Graphics Forum, Volume 44, Issue 2 (2025)

AI中文摘要

3D高斯散射（3DGS）作为一种实时辐射场渲染技术，因其质量和速度的先进性能而崭露头角。3DGS将场景建模为三维高斯集合，并通过优化额外属性以符合场景的几何和视觉特性。尽管其在渲染速度和图像保真度方面具有优势，但其显著的存储和内存需求限制了其在移动设备或头显中的应用。为解决这些挑战并推动3DGS的实用性，本文提供了对压缩和紧缩技术的全面详细分析。我们将现有方法分为压缩（减少文件大小）和紧缩（减少高斯数量）两类。两种方法均旨在维持或提升质量，分别通过最小化其各自属性：压缩通过最小化文件大小，紧缩通过最小化高斯数量。我们介绍了所分析方法的基本数学概念，以及关键的实现细节和设计选择。本文详尽讨论了方法之间的相似性和差异性，以及各自的优势和劣势。我们建立了基于关键性能指标和数据集的统一框架，以比较这些方法。由于这些方法在短时间内并行发展，目前尚无全面的比较。本文首次提出一个统一的框架来评估3DGS压缩技术。我们维护一个网站，定期更新新兴方法：https://w-m.github.io/3dgs-compression-survey/。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a cutting-edge technique for real-time radiance field rendering, offering state-of-the-art performance in terms of both quality and speed. 3DGS models a scene as a collection of three-dimensional Gaussians, with additional attributes optimized to conform to the scene's geometric and visual properties. Despite its advantages in rendering speed and image fidelity, 3DGS is limited by its significant storage and memory demands. These high demands make 3DGS impractical for mobile devices or headsets, reducing its applicability in important areas of computer graphics. To address these challenges and advance the practicality of 3DGS, this survey provides a comprehensive and detailed examination of compression and compaction techniques developed to make 3DGS more efficient. We classify existing methods into two categories: compression, which focuses on reducing file size, and compaction, which aims to minimize the number of Gaussians. Both methods aim to maintain or improve quality, each by minimizing its respective attribute: file size for compression and Gaussian count for compaction. We introduce the basic mathematical concepts underlying the analyzed methods, as well as key implementation details and design choices. Our report thoroughly discusses similarities and differences among the methods, as well as their respective advantages and disadvantages. We establish a consistent framework for comparing the surveyed methods based on key performance metrics and datasets. Specifically, since these methods have been developed in parallel and over a short period of time, currently, no comprehensive comparison exists. This survey, for the first time, presents a unified framework to evaluate 3DGS compression techniques. We maintain a website that will be regularly updated with emerging methods: https://w-m.github.io/3dgs-compression-survey/ .

2408.07922 2026-06-10 cs.CV cs.LG updated version

A Deep Features-Based Approach Using Modified ResNet50 and Gradient Boosting for Visual Sentiments Classification

基于改进ResNet50和梯度提升的深度特征方法用于视觉情感分类

Arslan Bisharat, Muhammad Mubeen, Arslan Akram, Saadullah Farooq Abbasi, Muhammad Salman Ali, Muhammad Usman Tariq

Affiliations: Department of Computer Science; Loyola University Chicago; University of the People; The Superior University Lahore; University of Birmingham

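AI summary: This paper proposes a sentiment classification method that combines deep features extracted by a modified ResNet50 with a gradient boosting algorithm; evaluated on two benchmark datasets, it outperforms existing deep learning and machine learning models.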

Comments 4 pages, 4 figures, 3 tables, IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR) 2024

AI中文摘要

视觉情感分析（VSA）的多功能性是其日益受到关注的原因之一。由于以往研究主要集中在单一模态的情感分析上，如文本，因此难以高效管理包含视觉信息的社会媒体数据。此外，大多数视觉情感研究需要充分分类情感，因为它们主要关注简单合并模态属性而未深入研究其复杂关系。为此，提出了一种融合深度学习和机器学习算法的方法。本研究使用深度特征方法进行多类分类，从改进的ResNet50中提取深度特征，并使用梯度提升算法对包含情感内容的照片进行分类。该方法在两个基准数据集CrowdFlower和GAPED上进行了彻底评估。最后，使用最先进的深度学习和机器学习模型来比较所提出的方法。与现有最先进的方法相比，所提出的方法在所呈现的数据集上表现出色。

英文摘要

The versatile nature of Visual Sentiment Analysis (VSA) is one reason for its rising profile. It isn't easy to efficiently manage social media data with visual information since previous research has concentrated on Sentiment Analysis (SA) of single modalities, like textual. In addition, most visual sentiment studies need to adequately classify sentiment because they are mainly focused on simply merging modal attributes without investigating their intricate relationships. This prompted the suggestion of developing a fusion of deep learning and machine learning algorithms. In this research, a deep feature-based method for multiclass classification has been used to extract deep features from modified ResNet50. Furthermore, gradient boosting algorithm has been used to classify photos containing emotional content. The approach is thoroughly evaluated on two benchmarked datasets, CrowdFlower and GAPED. Finally, cutting-edge deep learning and machine learning models were used to compare the proposed strategy. When compared to state-of-the-art approaches, the proposed method demonstrates exceptional performance on the datasets presented.

2305.19369 2026-06-10 eess.IV cs.CV physics.med-ph updated version

The Brain Tumor Segmentation (BraTS) Challenge 2023: Glioma Segmentation in Sub-Saharan Africa Patient Population (BraTS-Africa)

2023年脑肿瘤分割（BraTS）挑战：撒哈拉以南非洲患者群体的胶质瘤分割（BraTS-Africa）

Maruf Adewole, Jeffrey D. Rudie, Anu Gbadamosi, Oluyemisi Toyobo, Confidence Raymond, Dong Zhang, Olubukola Omidiji, Rachel Akinola, Mohammad Abba Suwaid, Adaobi Emegoakor, Nancy Ojo, Kenneth Aguh, Chinasa Kalaiwo, Gabriel Babatunde, Afolabi Ogunleye, Yewande Gbadamosi, Kator Iorpagher, Evan Calabrese, Mariam Aboian, Marius Linguraru, Jake Albrecht, Benedikt Wiestler, Florian Kofler, Anastasia Janas, Dominic LaBella, Anahita Fathi Kzerooni, Hongwei Bran Li, Juan Eugenio Iglesias, Keyvan Farahani, James Eddy, Timothy Bergquist, Verena Chung, Russell Takeshi Shinohara, Walter Wiggins, Zachary Reitman, Chunhao Wang, Xinyang Liu, Zhifan Jiang, Ariana Familiar, Koen Van Leemput, Christina Bukas, Maire Piraud, Gian-Marco Conte, Elaine Johansson, Zeke Meier, Bjoern H Menze, Ujjwal Baid, Spyridon Bakas, Farouk Dako, Abiodun Fatade, Udunna C Anazodo

Affiliations: Medical Artificial Intelligence Laboratory (MAI Lab); Department of Radiation Biology, Radiotherapy and Radiodiagnosis, University of Lagos; Department of Radiology, University of California, San Diego; Crestview Radiology Limited; Lagos University Teaching Hospital; Lagos State University Teaching Hospital, Ikeja, Lagos, Nigeria; NSIA-Kano Diagnostic Center, Kano Nigeria; Nnamdi Azikiwe University Teaching Hospital, Nnewi, Anambra State, Nigeria; Federal Medical Centre, Abeokuta, Ogun State, Nigeria; Federal Medical Centre, Umahia, Abia State, Nigeria; National Hospital Abuja, FCT, Nigeria; Benue State University Teaching Hospital, Markurdi, Benue State, Nigeria; Duke University Medical Center, Department of Radiology, USA; University of California San Francisco, CA, USA; Yale University, New Haven, CT, USA; Children's National Hospital, Washington DC, USA; George Washington University, Washington DC, USA; Sage Bionetworks, USA; Department of Neuroradiology, Technical University of Munich, Munich, Germany; Helmholtz Research Center, Munich, Germany; Duke University Medical Center, Department of Radiation Oncology, USA; Children’s Hospital of Philadelphia, University of Pennsylvania, Philadelphia, PA, USA; Center for AI and Data Science for Integrated Diagnostics (AI2D) & Center for Biomedical Image Computing and Analytics (CBICA), University of Pennsylvania, Philadelphia, PA, USA; Athinoula A Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Boston, MA, USA; University of Zurich, Switzerland; Cancer Imaging Program, National Cancer Institute, National Institutes of Health, Bethesda, MD 20814, USA; Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania, Philadelphia, USA; Department of Applied Mathematics and Computer Science, Technical University of Denmark, Denmark; Mayo Clinic, MN, USA; Precision FDA, U.S. Food and Drug Administration, Silver Spring, MD, USA; Booz Allen Hamilton, McLean, VA, USA; Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Center for Global Health, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Montreal Neurological Institute, McGill University, Montreal, Canada; Department of Medicine, University of Cape Town, South Africa; Department of Radiation Medicine, University of Cape Town, South Africa

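AI summary: This study examines the feasibility of applying advanced machine learning methods to glioma segmentation in resource-limited Sub-Saharan Africa, with the aim of improving glioma diagnosis and treatment in the region.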

Comments arXiv admin note: text overlap with arXiv:2107.02314

DOI: 10.1093/noajnl/vdag082

AI中文摘要

胶质瘤是最常见的原发性脑肿瘤。尽管胶质瘤相对罕见，但它们是致命性最高的癌症之一，诊断后生存率低于2年。胶质瘤诊断困难、治疗困难且对传统疗法具有内在耐药性。多年来，大量研究改善了胶质瘤的诊断和治疗，降低了全球北方的死亡率，但低收入和中等收入国家（LMICs）患者生存机会未变，且在撒哈拉以南非洲（SSA）人群中的生存率更差。长期生存与识别适当的脑MRI病理特征及通过组织病理学确认有关。自2012年以来，脑肿瘤分割（BraTS）挑战已评估了最先进的机器学习方法以检测、表征和分类胶质瘤。然而，不清楚这些最先进的方法是否能在SSA广泛应用，因为广泛使用低质量MRI技术，产生较差的图像对比度和分辨率，更重要的是，疾病晚期出现的倾向以及SSA中胶质瘤的特殊特征（即疑似更高的脑膜瘤发病率）。因此，BraTS-Africa挑战为通过BraTS挑战将SSA的脑MRI胶质瘤病例纳入全球努力提供了独特机会，以开发和评估计算机辅助诊断（CAD）方法，用于资源有限环境中的胶质瘤检测和表征。

英文摘要

Gliomas are the most common type of primary brain tumors. Although gliomas are relatively rare, they are among the deadliest types of cancer, with a survival rate of less than 2 years after diagnosis. Gliomas are challenging to diagnose, hard to treat, and inherently resistant to conventional therapy. Years of extensive research to improve diagnosis and treatment of gliomas have decreased mortality rates across the Global North, while chances of survival among individuals in low- and middle-income countries (LMICs) remain unchanged and are significantly worse in Sub-Saharan Africa (SSA) populations. Long-term survival with glioma is associated with the identification of appropriate pathological features on brain MRI and confirmation by histopathology. Since 2012, the Brain Tumor Segmentation (BraTS) Challenge has evaluated state-of-the-art machine learning methods to detect, characterize, and classify gliomas. However, it is unclear if the state-of-the-art methods can be widely implemented in SSA given the extensive use of lower-quality MRI technology, which produces poor image contrast and resolution, and, more importantly, the propensity for late presentation of disease at advanced stages as well as the unique characteristics of gliomas in SSA (i.e., suspected higher rates of gliomatosis cerebri). Thus, the BraTS-Africa Challenge provides a unique opportunity to include brain MRI glioma cases from SSA in global efforts through the BraTS Challenge to develop and evaluate computer-aided-diagnostic (CAD) methods for the detection and characterization of glioma in resource-limited settings, where CAD tools are more likely to transform healthcare.

1. Multimodal and Vision-Language Models (15 papers)

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Geometric Coastline Localization using Vision-Language Models

GUI-AC: Enhancing Continual Learning in GUI Agents

Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

Kwai Keye-VL-2.0 Technical Report

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Listen, Look, and Learn: Learning Without Forgetting through SAM-Audio

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

A History-Aware Visually Grounded Critic for Computer Use Agents

Let ViT Speak: Generative Language-Image Pre-training

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

2. Embodied Intelligence, Robotics, and Autonomous Driving (12 papers)

FlexPath: Learned Semantic Path Priors for Image-Based Planning

LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

3. Image Recognition, Retrieval, and Classification (4 papers)

Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization

Advancing Wood Identification in the Philippines: Utilizing the Xylorix Platform for Efficient AI Model Development and Deployment for Five Key Species

CapStARE: Capsule-based Sequential Architecture for Robust and Efficient Gaze Estimation

Selective Disk Bispectrum: A Complete and Rotation Invariant Image Descriptor

4. Object Detection, Segmentation, and Localization (8 papers)

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

Don't waste SAM

Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

ZODS-RS -- Zero-training Oriented Detection & Segmentation for Remote Sensing

Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals

Automatic Labelling for Low-Light Pedestrian Detection

PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

5. Generative Vision and World Models (26 papers)

ABot-Earth 0.5: Generative 3D Earth Model

Making Time Editable in Video Diffusion Transformers

Few-step Generative Models as Lossy Compression

PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models

STEDiff: Strengthening Text Embedding for Text-to-Image Alignment in Diffusion Model

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding

Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

On the Controllability-Fidelity Frontier in Diffusion Editing

Cost-Aware Routing for Efficient Text-To-Image Generation

FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

GeoLoom: High-quality Geometric Diagram Generation from Textual Input

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

The Emergence of Reproducibility and Generalizability in Diffusion Models

MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

6. 3D Vision, Point Clouds, and Spatial Intelligence (13 papers)

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

Benchmarking stereo reconstruction for 3D printable Martian terrain models

Efficient RWKV-based Representation Learning for 3D Point Clouds

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

GRAR: Glass-induced Reflection Artifact Removal in LiDAR Point Clouds

Segment and Select: Vision-Language Segmentation in 3D Scenarios

Globally Localizing Lunar Rover in Pixels via Graph Alignment

AnimaSpark: A Feed-Forward Method for Animating Arbitrary 3D Objects

LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning