arXivDaily: arXiv daily academic digest, updated Monday through Friday
q-bio.BM Biomolecules (2)
2606.11382 2026-06-11 cs.LG q-bio.BM New submission

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

Emily Nguyen, Yongchan Hong, Harsh Toshniwal, Yan Liu, Andreas Luttens

Affiliations: Department of Computer Science, University of Southern California; Department of Quantitative and Computational Biology, University of Southern California; Amazon; Department of Medical Biochemistry and Biophysics, Science for Life Laboratory, Karolinska Institutet

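AI summary: Proposes GLACIER, a student-teacher framework that fuses three modalities (molecular graphs, SMILES strings, and physicochemical descriptors) and uses distillation from large models to achieve efficient, accurate molecular property prediction.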

Abstract

Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden of developing and deploying state-of-the-art models keeps increasing, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors; (2) we fuse these student modalities using a novel Finsler geometry-aware module; and (3) we distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at this https URL.
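
The contrastive distillation in stage (3) can be illustrated with a small, hedged sketch. The abstract does not specify the loss, so the InfoNCE objective, the temperature value, and the function names below are assumptions for illustration, not the authors' implementation. The sketch also assumes that student and teacher embeddings already share a dimension; in practice a projection layer would usually be needed.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(student, teacher, temperature=0.1):
    """Mean InfoNCE loss (assumed objective, not confirmed by the abstract).

    Row i of `student` and row i of `teacher` are embeddings of the same
    molecule (the positive pair); all other teacher rows act as negatives.
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, si in enumerate(s):
        # Similarity of student i to every teacher embedding in the batch.
        logits = [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(s)

# Toy batch of three molecules with two-dimensional embeddings (made-up values).
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # deliberately mismatched pairs

loss_aligned = info_nce(aligned, teacher)
loss_shuffled = info_nce(shuffled, teacher)
```

In this toy batch the loss is lower when each student embedding points in the same direction as its own teacher embedding than when the pairs are mismatched, which is the behaviour a contrastive distillation objective rewards.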

2604.25701 2026-06-11 physics.bio-ph physics.data-an q-bio.BM q-bio.MN q-bio.PE Updated version

Bayesian Rate Inference for Sequence Motif Dynamics in Systems of Reactive Nucleic Acids

Johannes Harth-Kitzerow, Ulrich Gerland, Torsten A. Enßlin

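AI summary: Proposes a Bayesian inference framework that infers the parameters of motif rate equations from ligation count data produced by strand reactor simulations, providing a way to match simplified models to complex simulations and a step towards inferring reaction rate constants directly from experimental data.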

Comments
18 pages, 8 figures, pre-submission

Abstract

The RNA world hypothesis suggests a pathway by which life emerged on early Earth. It assumes that life started with RNA-based systems capable of storing, transmitting, and replicating information, envisioning that monomers and short RNA oligomers interact to form longer strands, eventually becoming catalytically active ribozymes. Key reactions in RNA pools are hybridization, dehybridization, templated ligation, and cleavage. These reactions depend on many environmental parameters and on the wide range of possible configurations among interacting strands. In order to scan such high-dimensional parameter spaces, efficient descriptions are needed. Motif rate equations project complex strand reactor dynamics onto sequence motif space. Here we present a Bayesian inference framework to infer their parameters from ligation count data produced by strand reactor simulations. This provides a framework to match the simpler motif rate equations to more complex simulations. Additionally, it is a step towards inferring reaction rate constants directly from experimental data, including rigorous uncertainty estimation. This could be an essential procedure to connect theory and experiment, and deepen our understanding of the essential features necessary for life to emerge.
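
The idea of inferring a rate constant from count data with an uncertainty estimate can be illustrated with a deliberately simple toy model. This is not the motif rate equation inference described in the abstract: it assumes a single rate constant k, Poisson-distributed ligation counts, and a conjugate Gamma prior, and the names `exposures`, `prior_shape`, and `prior_rate` are hypothetical.

```python
import math

def gamma_poisson_posterior(counts, exposures, prior_shape=1.0, prior_rate=1.0):
    """Conjugate update for a single rate constant k (toy assumption).

    Model: counts[i] ~ Poisson(k * exposures[i]), with k ~ Gamma(prior_shape, prior_rate).
    Here `exposures[i]` stands for whatever multiplies k in the expected count,
    for example reactant concentrations times observation time.
    Returns the (shape, rate) parameters of the Gamma posterior over k.
    """
    shape = prior_shape + sum(counts)
    rate = prior_rate + sum(exposures)
    return shape, rate

# Made-up ligation counts from four observation windows of equal exposure.
counts = [12, 9, 15, 11]
exposures = [5.0, 5.0, 5.0, 5.0]

shape, rate = gamma_poisson_posterior(counts, exposures)
posterior_mean = shape / rate            # point estimate of k
posterior_std = math.sqrt(shape) / rate  # uncertainty of the estimate
```

With these made-up numbers the posterior is Gamma(48, 21), giving a mean of about 2.29 and a standard deviation of about 0.33. Observing more counts increases both posterior parameters, so the relative uncertainty shrinks roughly as one over the square root of the total count.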