Interpretable enzyme function prediction via sparse autoencoder features of ESMC across the microbial protein universe
Yue Hu, Wanyu Cheng, Junqing Wang, Yingchao Liu
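AI summary: Using sparse autoencoder features from the ESMC-6B protein language model, enzyme function can be predicted accurately without task-specific training, reaching 78.9% top-1 accuracy on a microbial enzyme benchmark and identifying about 169,000 dark enzyme candidates.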
Details
- Comments: 17 pages, 5 figures, 3 tables
Microbial genomes and metagenomes contain millions of proteins whose enzymatic functions remain unknown, the so-called enzyme dark matter. While deep learning has improved protein function prediction, most methods are black boxes that rely on sequence or structural similarity, which limits the discovery of novel catalytic activities. The ESMC-6B protein language model and its sparse autoencoder (SAE), which has a 16,384-dimensional codebook of interpretable biological concepts each annotated by GPT-5, create a new opportunity: using these features directly as semantic signatures for enzyme function. Here, we show that ESMC-SAE features enable accurate and interpretable Enzyme Commission (EC) number prediction without task-specific training or GPU-intensive computation. On a balanced benchmark of 4,868 microbial SwissProt enzymes across 161 EC3 subclasses, ESMC-SAE binary features achieve 78.9% top-1 and 88.5% top-5 accuracy, a 37.6% relative improvement over 3-mer baselines (57.3% top-1). In a leave-one-EC3-class-out evaluation that simulates the discovery of novel enzyme classes, SAE features recover the EC1 superclass in 47.7% of cases (3.3x the 14.3% random baseline), versus 26.6% for sequence methods. Discriminative features correspond to mechanistically interpretable concepts: catalytic triad geometry for hydrolases, NAD(P)H-binding Rossmann folds for oxidoreductases, and phosphate-binding P-loops for transferases. We also survey the ESM Atlas of 7.7 million clusters and identify 169,859 dark enzyme-like candidates across all major microbial phyla. Our results establish a paradigm for enzyme function discovery in microbial dark matter: interpretable by design, scalable without GPU clusters, and applicable to the billions of proteins in the ESM Atlas.
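To make "prediction without task-specific training" concrete, the following is a minimal sketch of one way binary SAE features could be used to rank EC labels. The abstract does not state which classifier is used, so the nearest-neighbour lookup with Jaccard similarity is an assumption. The 6-dimensional vectors stand in for the 16,384-dimensional codebook, and the function names, feature values, and EC labels are all hypothetical.

```python
import numpy as np


def jaccard_similarity(query: np.ndarray, refs: np.ndarray) -> np.ndarray:
    """Jaccard similarity between one binary vector and each row of a binary matrix."""
    query = query.astype(bool)
    refs = refs.astype(bool)
    intersection = (refs & query).sum(axis=1)
    union = (refs | query).sum(axis=1)
    # Similarity is defined as 0 when both vectors have no active features.
    return np.divide(intersection, union, out=np.zeros(len(refs)), where=union > 0)


def rank_ec_labels(query, refs, ref_labels, top_k=5):
    """Return up to top_k distinct labels, ordered by best-matching reference."""
    sims = jaccard_similarity(np.asarray(query), np.asarray(refs))
    order = np.argsort(-sims, kind="stable")  # most similar reference first
    ranked = []
    for idx in order:
        label = ref_labels[idx]
        if label not in ranked:
            ranked.append(label)
        if len(ranked) == top_k:
            break
    return ranked


# Toy reference set: rows are annotated enzymes, columns are binary SAE features.
# (Assumed illustration data, not taken from the paper.)
refs = np.array([
    [1, 1, 0, 0, 0, 0],  # labelled 3.4.21
    [1, 1, 1, 0, 0, 0],  # labelled 3.4.21
    [0, 0, 0, 1, 1, 0],  # labelled 1.1.1
    [0, 0, 0, 0, 1, 1],  # labelled 2.7.1
])
labels = ["3.4.21", "3.4.21", "1.1.1", "2.7.1"]

query = np.array([1, 1, 1, 0, 0, 1])
prediction = rank_ec_labels(query, refs, labels, top_k=3)
print(prediction)  # ['3.4.21', '2.7.1', '1.1.1']
```

Under this assumed scheme, top-1 accuracy is the fraction of queries whose first ranked label is correct, and top-5 accuracy is the fraction whose correct label appears among the first five. The features shared between a query and its best match are what would make such a prediction inspectable.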