OpenRFM: Dissecting Relational In-Context Learning
Zhikai Chen, Junyu Yin, Jialiang Gu, Siheng Xiong, Xiaoze Liu, Ruowang Zhang, Keren Zhou, Kai Guo
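AI summary: By analyzing both the model side and the data side of the Relational Transformer, this paper proposes a dual-stage in-context learning architecture and a homophily-aware pre-training mixture, and builds OpenRFM, which improves average task performance by about 30% over the RT backbone.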
- Comments: 25 pages, including appendix
Relational Foundation Models (RFMs) promise a single pre-trained predictor that, given any relational database, returns predictions in one forward pass via relational in-context learning (ICL). Yet a substantial gap separates open RFMs from their commercial counterparts, and the origin of this gap has not been systematically understood. We dissect a representative framework, the Relational Transformer (RT), from two perspectives. Model side: we show that RT performs relation-level ICL, and a kernel regression view shows it fails when sparse label-cell coverage yields an underdetermined regression. Data side: we ablate RT's pre-training source and find that existing synthetic-only pre-training and in-distribution pre-training drive the same architecture into different regimes, lazy vs. feature-learning. Probing this gap reveals that the missing ingredient is a support-identifiable relational latent in the label-generation process. These two diagnoses translate into (1) a dual-stage ICL architecture that combines the relational backbone with a batch-level ICL layer lifted from a pre-trained tabular foundation model to overcome relation-level label scarcity, and (2) a homophily-aware synthetic plus continual real-data pre-training mixture, augmented with a prototype-based regularization. These choices define OpenRFM, a simple yet effective RFM that improves average task performance by approximately 30% over the RT backbone and surpasses the commercial model KumoRFMv1 on a large set of evaluation tasks.
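The kernel regression view mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's derivation and is not OpenRFM or RT code. It assumes a single softmax-attention readout over labeled context cells, which is a standard way to write attention as a Nadaraya-Watson kernel smoother; the function name `kernel_regression_icl`, its parameters, and the embeddings are hypothetical.

```python
import numpy as np


def kernel_regression_icl(query, ctx_keys, ctx_labels, temperature=1.0):
    """Softmax-attention readout written as a Nadaraya-Watson kernel smoother.

    query      : (d,) embedding of the cell to predict (hypothetical).
    ctx_keys   : (n, d) embeddings of the labeled context cells.
    ctx_labels : (n,) labels attached to those cells.
    Returns the prediction and the attention (kernel) weights.
    """
    ctx_keys = np.asarray(ctx_keys, dtype=float)
    ctx_labels = np.asarray(ctx_labels, dtype=float)
    if ctx_labels.size == 0:
        # With no labeled cells there is nothing to regress on.
        raise ValueError("no labeled context cells: the estimate is undefined")

    # Exponential kernel between the query and every labeled cell.
    scores = ctx_keys @ np.asarray(query, dtype=float) / temperature
    scores -= scores.max()  # shift for numerical stability; softmax is unchanged
    weights = np.exp(scores)
    weights /= weights.sum()

    # Prediction is the kernel-weighted average of the context labels.
    return float(weights @ ctx_labels), weights


# Sparse coverage: a single labeled cell fixes the output for every query.
pred_a, _ = kernel_regression_icl([1.0, 0.0], [[0.0, 1.0]], [3.0])
pred_b, _ = kernel_regression_icl([-5.0, 2.0], [[0.0, 1.0]], [3.0])
```

In this toy setting both `pred_a` and `pred_b` equal the lone context label, whatever the query is. That is one simple way to picture an underdetermined regression under sparse label coverage: the labeled context is too small to constrain how the prediction should vary with the query.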