Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities
Irem Ulku, Ö. Özgür Tanrıöver, Erdem Akagündüz
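AI summary: This paper proposes a new training strategy that learns a scenario sampling distribution directly from the pretrained latent space to guide fine-tuning of multimodal segmentation under missing modalities, thereby improving performance.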
Comments: 14 pages, 4 figures, 9 tables
Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently, based on the distortion it induces in the shared latent representation. We then capture relations between scenarios using a radial basis function kernel and derive refined scenario scores through regularized kernel smoothing. These scores are then converted into a probability distribution from which scenarios are sampled during fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results across image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling
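The pipeline described in the abstract (per-scenario distortion, RBF kernel over scenarios, regularized smoothing, conversion to sampling probabilities) can be sketched roughly as follows. This is an illustrative sketch only and not the authors' implementation: the abstract does not give the formulas, so the distortion measure (mean L2 distance to the full-modality latent), the kernel inputs (binary availability masks), the smoother (a kernel ridge style operator), the softmax conversion, and all function and parameter names (`gamma`, `lam`, `temperature`) are assumptions made here. The latents are random toy data standing in for a pretrained encoder's output.

```python
import itertools
import numpy as np


def enumerate_scenarios(num_modalities):
    """All non-empty availability masks (1 = modality present)."""
    masks = [m for m in itertools.product([0, 1], repeat=num_modalities) if any(m)]
    return np.array(masks, dtype=float)


def distortion_scores(z_full, z_by_scenario):
    """Assumed distortion: mean L2 distance between each scenario's latent
    and the full-modality latent, computed independently per scenario."""
    return np.array(
        [np.linalg.norm(z - z_full, axis=-1).mean() for z in z_by_scenario]
    )


def rbf_kernel(features, gamma=0.5):
    """RBF kernel between scenarios; using the masks as features is an assumption."""
    diff = features[:, None, :] - features[None, :, :]
    return np.exp(-gamma * np.sum(diff ** 2, axis=-1))


def smooth_scores(kernel, scores, lam=0.1):
    """Assumed regularized smoother (kernel ridge form): K (K + lam*I)^-1 s."""
    n = len(scores)
    return kernel @ np.linalg.solve(kernel + lam * np.eye(n), scores)


def to_probabilities(scores, temperature=1.0):
    """Assumed conversion: softmax over the refined scores."""
    logits = scores / temperature
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()


# Toy data: 3 modalities, 16 samples, 8-dimensional shared latent.
rng = np.random.default_rng(0)
masks = enumerate_scenarios(3)
z_full = rng.normal(size=(16, 8))

# Hypothetical stand-in for encoder outputs: noise grows with missing modalities.
z_by_scenario = []
for mask in masks:
    missing = masks.shape[1] - mask.sum()
    z_by_scenario.append(z_full + 0.3 * missing * rng.normal(size=z_full.shape))

raw = distortion_scores(z_full, z_by_scenario)
K = rbf_kernel(masks)
smoothed = smooth_scores(K, raw)
probs = to_probabilities(smoothed)

# Draw one availability scenario for the next fine-tuning step.
scenario = masks[rng.choice(len(masks), p=probs)]
```

In a real setup, `z_by_scenario` would come from running the pretrained encoder with the corresponding modalities masked, and `scenario` would determine which inputs are dropped for the next batch. Under this softmax choice, scenarios with larger smoothed distortion are drawn more often than under uniform modality dropout; whether the paper weights scenarios in that direction is not stated in the abstract.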