m6A-FORM: A Foundation Model for Decoding N6-methyladenosine Biology
Tinghe Zhang, Sumin Jo, Shou-Jiang Gao, Yufei Huang
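AI summary: Proposes m6A-FORM, a transformer-based foundation model that uses MeRIP-seq peaks as priors. After pretraining and fine-tuning, it predicts m6A sites with better performance than existing methods, and it also supports regulator binding-site prediction and analysis of tissue-conserved sites.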
N6-methyladenosine (m6A) is the most abundant internal modification in eukaryotic mRNA. However, most existing predictors use adenosine-centered formulations that are computationally inefficient and prone to false positives. Here we present m6A-FORM, a transformer-based foundation model for RNA methylation that uses MeRIP-seq peaks as methylation-enriched priors and is pretrained on approximately 22 million peak-derived sequences from 143 human MeRIP-seq studies. After fine-tuning with high-confidence single-nucleotide m6A annotations from m6A-Atlas v2.0 and GLORI, m6A-FORM-sites achieves state-of-the-art m6A site prediction performance, with a PR-AUC of 0.635 and ROC-AUC of 0.988, improving PR-AUC by at least 0.14 over existing methods while enabling substantially faster inference. Task-specific adaptation further supports prediction of binding sites for 19 m6A-associated regulators and identification of YTHDF2-bound m6A sites associated with mRNA degradation. Applying m6A-FORM across 67 datasets from 24 human tissues identifies 19,631 tissue-conserved sites with distinct localization, clustering, methylation, expression, RBP-interaction, and decay-associated signatures.
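The gap between the two reported metrics (PR-AUC 0.635 versus ROC-AUC 0.988) is typical of tasks where positives are rare, as is generally the case for m6A sites among candidate positions. The following is a minimal Python sketch, not the authors' evaluation code, that computes both metrics on a hypothetical toy ranking. The function names, labels, and scores are illustrative assumptions, and PR-AUC is approximated here by average precision, which may differ from the estimator used in the paper.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """ROC-AUC as the probability that a random positive outranks
    a random negative (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Step-wise PR-AUC: mean of the precision measured at the rank of
    each positive. Assumes no tied scores (an assumption of this sketch)."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy data: 2 true sites among 10 candidates.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print("ROC-AUC:", roc_auc(toy_labels, toy_scores))            # 0.875
print("PR-AUC :", average_precision(toy_labels, toy_scores))  # 0.75
```

In this toy ranking, two negatives scored above the second positive cost little ROC-AUC, because they are only 2 of the 16 positive-negative pairs, but they halve the precision at that recall step. This is one reason PR-AUC is often the more informative figure to compare across site predictors.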