SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining
Nassim Ait Ali Braham, Aaron Banze, Conrad M. Albrecht, Julien Mairal, Jocelyn Chanussot, Xiao Xiang Zhu
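AI summary: This paper proposes SpectralEarth-FM, a hierarchical transformer for multisensor Earth observation inputs, designed to jointly process hyperspectral imagery and lower-channel observations. The authors build the SpectralEarth-MM dataset and pretrain with a JEPA-style objective, reporting state-of-the-art performance on hyperspectral downstream tasks and standard EO benchmarks.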
Details
Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.
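To make the pretraining objective described above more concrete, the following is a minimal sketch of a JEPA-style matching loss between one global-view embedding and several single-sensor local-view embeddings from the same location. It is an illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `jepa_style_loss`), the use of cosine distance, and the toy 3-dimensional embeddings are all hypothetical, since the abstract does not specify the distance, the predictor, or the embedding size.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def jepa_style_loss(global_target, local_predictions):
    """Mean cosine distance between a global-view target embedding and the
    predicted embeddings of several single-sensor local views.

    Assumption: cosine distance is chosen only for illustration; the source
    does not state which distance its objective uses.
    """
    target = l2_normalize(global_target)
    total = 0.0
    for pred in local_predictions:
        p = l2_normalize(pred)
        cosine = sum(a * b for a, b in zip(p, target))
        total += 1.0 - cosine
    return total / len(local_predictions)


# Hypothetical embeddings for one location (values are made up).
global_embedding = [1.0, 0.0, 0.0]  # target from the multisensor global view
local_embeddings = {
    "hsi": [2.0, 0.0, 0.0],  # points in the same direction as the target
    "sar": [0.0, 3.0, 0.0],  # orthogonal to the target
}

loss = jepa_style_loss(global_embedding, list(local_embeddings.values()))
print(loss)  # 0.5: the HSI view contributes 0.0, the SAR view contributes 1.0
```

In JEPA-style training the target branch is typically excluded from gradient updates, so only the local-view branch is pushed toward the global representation; this sketch omits that machinery and shows only how the per-view distances would be averaged.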