arXivDaily: arXiv daily academic digest, updated Monday to Friday
2604.25750 2026-04-29 q-bio.PE

From lab to outbreak: experimental mosquito extrinsic incubation period distributions shape dengue epidemic dynamics

Léa Loisel, Sandie Arnoux, Gaël Beaunée, Pauline Ezanno

Abstract

Dengue virus transmission models commonly assume an exponential distribution for the mosquito extrinsic incubation period (EIP), potentially oversimplifying biological variability. We developed a stochastic mechanistic dengue transmission model comparing epidemic dynamics under commonly assumed exponential (EXP) versus experimentally derived (ED) EIP distributions. Our results show that, compared with the exponential assumption, using an experimentally derived EIP distribution delays and flattens epidemic peaks, yielding lower but more prolonged peaks, and slightly prolongs crisis durations, while outbreak probability remains largely unaffected. These differences are modulated principally by mosquito mortality and human recovery. Incorporating experimentally informed EIP distributions enhances the biological realism of models and may improve predictions of dengue epidemic dynamics, informing more effective vector control strategies and public health responses.
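To make the role of the EIP distribution concrete, the sketch below computes the probability that an infected mosquito survives its EIP under a constant mortality rate, for three distributions with the same mean. It is an illustration only: the Erlang distribution is a stand-in chosen here for a narrower, more realistic shape and is not the experimentally derived distribution of the paper, and the mean EIP (10 days) and mortality rate (0.1 per day) are hypothetical values.

```python
import math

def survival_exponential(mean_eip, mortality):
    """P(mosquito outlives its EIP) when EIP ~ Exponential with the given mean."""
    sigma = 1.0 / mean_eip
    return sigma / (sigma + mortality)

def survival_erlang(mean_eip, mortality, shape):
    """Same probability when EIP ~ Erlang(shape) with the same mean.

    The Erlang is an assumed stand-in, not the paper's distribution.
    """
    stage_rate = shape / mean_eip
    return (stage_rate / (stage_rate + mortality)) ** shape

def survival_fixed(mean_eip, mortality):
    """Limit of a fixed (zero-variance) EIP equal to its mean."""
    return math.exp(-mortality * mean_eip)

# Hypothetical parameters: mean EIP of 10 days, mortality 0.1 per day.
MEAN_EIP, MORTALITY = 10.0, 0.1
p_exp = survival_exponential(MEAN_EIP, MORTALITY)
p_erlang = survival_erlang(MEAN_EIP, MORTALITY, shape=5)
p_fixed = survival_fixed(MEAN_EIP, MORTALITY)
```

For a fixed mean, the exponential case gives the largest surviving fraction and the fixed delay the smallest, which is one simple way in which the choice of EIP distribution can interact with mosquito mortality.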

2604.25592 2026-04-29 q-bio.NC eess.SP

A geometry aware framework enhances noninvasive mapping of whole human brain dynamics

Song Wang, Kexin Lou, Chen Wei, Zhiyuan Sheng, Jiahao Tang, Kaining Peng, Xinke Shen, Shuhao Mei, Liang Chen, Dongfeng Gu, Quanying Liu

Abstract

Non-invasive electrophysiology lacks methods that accurately reconstruct whole-brain spatiotemporal dynamics while incorporating individual cortical geometry, leaving current electroencephalography and magnetoencephalography source imaging limited by simplistic or biologically implausible priors. Here, we show that embedding participant-specific Geometric Basis Functions (GBFs), eigenmodes derived from each individual's cortical surface, provides a powerful anatomic constraint that resolves the inverse problem and improves reconstruction fidelity. The method reconstructs neural sources as linear combinations of geometric basis functions, thereby aligning source estimates with the geometric organization of neural dynamics. We validate GBF across the Meta-Source Benchmark, task-evoked data, resting-state networks, intracranial stimulation, and epilepsy data. The results demonstrate that GBF yields high localization accuracy and captures fast spatiotemporal dynamics consistent with anatomical pathways. These findings suggest that both spontaneous and evoked whole-brain activity can be described by hundreds of geometric modes, providing a compact yet accurate representation of neural sources. By linking cortical geometry to electrophysiological dynamics, GBF offers a versatile source imaging tool for both scientific and clinical applications.

2604.24796 2026-04-29 q-bio.OT cs.LG

A multi-stage soft computing framework for complex disease modelling and decision support: A liver cirrhosis case study

Xueyuan Huang, Yuheng Wang, Yuanzhi He, Siqi Gou, Lu Bai, Wenqian Wu, Peifeng Liu, Aijia Wang, Tianhui Fan, Ze Zhou, Jiayu Xu

Comments 20 pages, 8 figures

Abstract

Liver cirrhosis is a major global health problem causing millions of deaths annually, and timely detection with aggressive treatment can significantly improve patients' quality of life. Modelling complex diseases from biomedical data is computationally challenging due to high dimensionality, strong feature correlations, noise, and limited labelled samples. Conventional Machine Learning (ML) pipelines often struggle with robustness, interpretability, and generalisation under such conditions. In this study, we propose an ML-driven multi-stage decision framework for complex disease modelling and therapeutic exploration. The framework integrates single-cell transcriptomic profiling, high-dimensional network-based feature stabilisation, multi-model learning, deep representation construction, and post-hoc decision support. Specifically, single-cell sequencing data were analysed to identify key cellular subpopulations, followed by high-dimensional weighted gene co-expression network analysis (hdWGCNA) to stabilise gene modules under sparsity and noise. To enhance non-linear feature interaction modelling, tabular molecular features were restructured into two-dimensional disease maps and analysed using a CNN. Finally, molecular docking was incorporated as a decision-support module to evaluate candidate therapeutic compounds. Using liver cirrhosis as a representative case, the framework identified a disease-associated endothelial subpopulation and extracted seven robust signature genes (HSPB1, GADD45A, CLDN5, ATP1B3, C1QBP, ENPP2, and PARL). The CNN-based representation learning module outperformed conventional pipelines in classification. The framework is disease-agnostic and readily extends to other omics-driven biomedical applications involving uncertainty, heterogeneity, and limited samples.

2509.14118 2026-04-29 math.OC q-bio.NC

Multi-Source Neural Activity Indices for EEG/MEG Localization: A Two-Stage Spatial Filtering Framework and Extension to MNE-Python

Julia Jurkowska, Joanna Dreszer, Monika Lewandowska, Krzysztof Tołpa, Tomasz Piotrowski

Abstract

Accurate electroencephalography (EEG) and magnetoencephalography (MEG) source localization and reconstruction are essential for understanding brain function, yet remain challenging because the underlying EEG/MEG inverse problem is inherently ill-posed. Spatial filtering (beamforming) approaches, such as linearly constrained minimum variance (LCMV) spatial filters, are widely used and well supported by existing analysis software. In this work, we extend this framework by deriving a novel family of unbiased multi-source neural activity indices that form the localization stage of a two-stage spatial-filtering-based localization-reconstruction framework for the EEG/MEG inverse problem. In contrast to existing formulations, the proposed indices do not require knowledge of the target source covariance matrix, making them directly applicable in practical experimental settings. Their compact algebraic forms enable straightforward and numerically efficient implementation. The framework is validated on simulated EEG data and its applicability is illustrated through an example involving experimental EEG data from an oddball paradigm. To facilitate adoption, we provide a full open-source implementation extending MNE-Python, accompanied by a practical tutorial.
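For orientation, the classical single-source LCMV expressions that spatial-filtering indices of this kind build on can be written as below. This is the textbook formulation, not the new multi-source indices derived in the paper. The notation is assumed here: $L(r)$ is the lead field at candidate location $r$, $C$ the measurement covariance, $N$ the noise covariance, and $y(t)$ the sensor data.

```latex
\begin{aligned}
\hat{s}(r, t) &= W(r)\, y(t), \qquad
W(r) = \left[ L(r)^{\top} C^{-1} L(r) \right]^{-1} L(r)^{\top} C^{-1}, \\
\mathrm{NAI}(r) &= \frac{\operatorname{tr}\left\{ \left[ L(r)^{\top} C^{-1} L(r) \right]^{-1} \right\}}
                        {\operatorname{tr}\left\{ \left[ L(r)^{\top} N^{-1} L(r) \right]^{-1} \right\}} .
\end{aligned}
```

In a two-stage scheme, an index such as $\mathrm{NAI}(r)$ is scanned over candidate locations to localize sources, and the filter $W(r)$ is then used to reconstruct their time courses.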

2505.07638 2026-04-29 math.PR q-bio.MN

Identifiability of SDEs for reaction networks

Louis Faul, Linard Hoessly, Panqiu Xia

Comments 19 pages

Abstract

Biochemical reaction networks are widely applied across scientific disciplines to model complex dynamic systems. We investigate the diffusion approximation of reaction networks with mass-action kinetics, focusing on the identifiability of the stochastic differential equations associated with the reaction network. We derive conditions under which the law of the diffusion approximation is identifiable and provide theorems for verifying identifiability in practice. Notably, our results show that some reaction networks have non-identifiable reaction rates, even when the law of the corresponding stochastic process is completely known. Moreover, we show that reaction networks with distinct graphical structures can generate the same diffusion law under specific choices of reaction rates. Finally, we compare our framework with identifiability results in the deterministic ODE setting and the discrete continuous-time Markov chain models for reaction networks.
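A small example can show how non-identifiable rates arise. The sketch below assumes the standard chemical Langevin form of the diffusion approximation, in which the law is determined by the drift b(x) = sum_i nu_i k_i x and the squared diffusion coefficient a(x) = sum_i nu_i^2 k_i x. The one-species network (A -> 2A, A -> 3A, A -> 4A) and both rate vectors are constructed here for illustration and are not taken from the paper.

```python
import math

def drift_and_diffusion(rates, jumps, x):
    """Drift b(x) and squared diffusion a(x) for reactions that all have
    source complex A, so each mass-action propensity is k_i * x."""
    b = sum(k * nu * x for k, nu in zip(rates, jumps))
    a = sum(k * nu * nu * x for k, nu in zip(rates, jumps))
    return b, a

# Net changes in the copy number of A for A -> 2A, A -> 3A, A -> 4A.
JUMPS = (1, 2, 3)

# Two different (hypothetical) rate vectors satisfying
#   k1 + 2*k2 + 3*k3 = 6   and   k1 + 4*k2 + 9*k3 = 14.
RATES_1 = (1.0, 1.0, 1.0)
RATES_2 = (1.6, 0.4, 1.2)

def same_coefficients(x):
    """True if both rate vectors give the same drift and diffusion at state x."""
    b1, a1 = drift_and_diffusion(RATES_1, JUMPS, x)
    b2, a2 = drift_and_diffusion(RATES_2, JUMPS, x)
    return math.isclose(b1, b2) and math.isclose(a1, a2)
```

Because three rates enter the coefficients only through two linear combinations, a one-parameter family of rate vectors produces identical drift and diffusion, so the rates cannot be recovered from the law of the SDE in this example.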

2411.14721 2026-04-29 cs.CL cs.LG q-bio.QM

MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

Jiatong Li, Yunqing Liu, Wei Liu, Jingdi Le, Di Zhang, Wenqi Fan, Dongzhan Zhou, Yuqiang Li, Qing Li

Comments Accepted by TKDE, to appear. Code is available at: https://github.com/phenixace/MolReFlect

Abstract

Molecule discovery is a pivotal research field, impacting everything from medicine to materials. Recently, Large Language Models (LLMs) have been widely adopted in molecular understanding and generation, serving as a bridge between the molecular space and the natural language space, yet the alignment between molecules and their corresponding captions remains a significant challenge. Previous endeavors typically treat molecules as monolithic inputs, lacking an intermediate reasoning process and sacrificing explainability. In this work, we define fine-grained alignments as the precise correspondence between a molecule's sub-structures and the textual phrases that explain their properties. These alignments are crucial for LLMs to understand molecules in a more accurate and explainable manner. Normally, such fine-grained alignments require expert annotation, which is both costly and time-consuming. To allow LLMs to automatically label and learn the fine-grained alignments, we propose MolReFlect, a novel teacher-student framework, where a teacher LLM first generates and refines mappings between caption phrases and SMILES substructures and then explicitly teaches these detailed alignments to a student LLM. Experimental results demonstrate that MolReFlect enables LLMs to significantly outperform previous baselines, achieving state-of-the-art performance in the molecule-caption translation task. Our code is available at: https://github.com/phenixace/MolReFlect.

2604.25415 2026-04-29 q-bio.NC cs.AI cs.HC

One-shot emergency psychiatric triage across 15 frontier AI chatbots

Veith Weilnhammer, Lennart Luettgau, Christopher Summerfield, Viknesh Sounderajah, Elise Wilkinson, Virginia Corno, Matthew M Nour

Abstract

AI chatbots are increasingly used for health advice, but their performance in psychiatric triage remains undercharacterized. Psychiatric triage is particularly challenging because urgency must often be inferred from thoughts, behavior, and context rather than from objective findings. We evaluated the performance of 15 frontier AI chatbots on psychiatric triage from realistic single-message disclosures using 112 clinical vignettes, each paired with 1 of 4 original benchmark triage labels: A, routine; B, assessment within 1 week; C, assessment within 24 to 48 hours; and D, emergency care now. Vignettes covered 9 psychiatric presentation clusters and 9 focal risk dimensions, organized into 28 presentation-by-risk groups. Each group contributed 4 distinct vignettes, with 1 vignette at each triage level. Each vignette was rendered as a realistic human-authored conversational query, and the AI chatbots were tasked with assigning a triage label from that disclosure. Emergency under-triage occurred in 23 of 410 level D trials (5.6%), and all under-triaged emergencies were reassigned to level C urgency. Across target models, average accuracy ranged from 42.0% to 71.8%. Accuracy was highest for level D vignettes (94.3%) and lowest for level B vignettes (19.7%). Mean signed ordinal error was positive (+0.47 triage levels), indicating net over-triage. Dispersion was highest around the middle triage levels. All results were confirmed relative to clinician consensus labels from 50 medical doctors. When presented with user messages containing sufficient clinical information, frontier AI chatbots thus recognized psychiatric emergencies as requiring urgent medical assessment with near-zero error rates, yet showed marked over-triage for low and intermediate risk presentations.
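The three headline metrics (accuracy, mean signed ordinal error, and emergency under-triage rate) can be sketched as follows. This is a minimal illustration under the assumption that the four triage labels are mapped to the ordinal scale A=0 to D=3; the function name and the toy labels are hypothetical and do not reproduce the study's data or scoring code.

```python
# Ordinal encoding of the four triage labels (an assumption of this sketch).
LEVELS = {"A": 0, "B": 1, "C": 2, "D": 3}

def triage_metrics(reference, predicted):
    """Accuracy, mean signed ordinal error, and level-D under-triage rate."""
    pairs = [(LEVELS[r], LEVELS[p]) for r, p in zip(reference, predicted)]
    n = len(pairs)
    accuracy = sum(r == p for r, p in pairs) / n
    # Positive values mean the model assigned more urgent labels on average.
    mean_signed_error = sum(p - r for r, p in pairs) / n
    emergencies = [(r, p) for r, p in pairs if r == LEVELS["D"]]
    under = sum(p < r for r, p in emergencies)
    under_rate = under / len(emergencies) if emergencies else float("nan")
    return {
        "accuracy": accuracy,
        "mean_signed_error": mean_signed_error,
        "emergency_under_triage_rate": under_rate,
    }

# Toy example with hypothetical labels.
metrics = triage_metrics(
    reference=["A", "B", "C", "D", "D"],
    predicted=["B", "C", "C", "D", "C"],
)

# The rate quoted in the abstract follows from its own counts: 23 of 410.
reported_rate_percent = round(23 / 410 * 100, 1)
```

A positive mean signed error, as in the toy example, corresponds to net over-triage in the sense used in the abstract.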

2604.25244 2026-04-29 q-bio.BM cs.LG

Learning Structure, Energy, and Dynamics: A Survey of Artificial Intelligence for Protein Dynamics

Haocheng Tang, Liang Shi, Ya-Shi Zhang, Xixian Liu, Jian Tang, Jiarui Lu

Abstract

Protein dynamics underlie many biological functions, yet remain difficult to characterize due to the high computational cost of molecular dynamics simulations and the scarcity of dynamic structural data. This survey reviews recent advances in artificial intelligence for protein dynamics from three perspectives: learning from structural ensembles and trajectories, learning from physical energy signals, and learning to accelerate molecular simulations. We summarize representative methods for conformation ensemble generation, trajectory generation, Boltzmann generators, physics-aware adaptation, machine learning potentials, coarse-grained modeling, and collective variable discovery. We further discuss available datasets and key open challenges, such as scalability, thermodynamic consistency, kinetic fidelity, and integration with experimental constraints.

2604.25233 2026-04-29 math.OC q-bio.GN

A Combinatorial Optimisation Approach to Multi-factorial Gap-filling in Genome-scale Metabolic Models (GEMs)

Philip Kilby, Sevvandi Kandanaarachchi, Matthew J. Morgan, Amy M. Paten, Mariana Velasque, Andrew C. Warden, Juan P. Molina Ortiz

Abstract

Genome-Scale Metabolic Models (GEMs) describe the interactions between genes, proteins, and the biochemical reactions that underpin an organism's metabolism, aiming to computationally simulate functions at the cellular level. While many metabolic reactions can be inferred from genome analysis, constructing GEMs often involves incorporating reactions unsupported by genomic data to improve prediction accuracy. This is known as gap-filling, a process that can be performed manually (a time-consuming task) or computationally. Traditional computational gap-filling approaches aim to correct GEM predictions for a single environmental condition (medium) by solving a large Integer Linear Programming problem. Sequential application across multiple media can produce a more robust model, but often introduces unrealistic predictions in other media. Such approaches are also slow to run. In this paper, we study multi-factorial gap-filling, which aims to gap-fill GEMs across typically 10 or more input media simultaneously, while improving their overall predictive accuracy and minimising unrealistic behaviour. We view the selection of the set of reactions as a combinatorial optimisation problem, and describe a method based on classic metaheuristic approaches which requires the solution of continuous Linear Programming problems only. This paper introduces this problem to an audience whose speciality lies outside biology, and suggests a practical first-cut solution method. We demonstrate the method by gap-filling GEMs for three bacterial strains, selecting 3000 to 4000 reactions from a database of more than 11000 reactions, while attempting to match the empirically measured performance on 9 to 28 separate media conditions. We show that our method outperforms conventional approaches on multiple metrics, including Kendall's tau and RMS error, by an average of 7.3% and 13.3%, respectively.
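The two evaluation metrics named in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the tau-a variant of Kendall's tau (no tie correction); the abstract does not say which variant the authors use, and the measured and predicted values below are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

def rmse(x, y):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical growth measurements on four media and model predictions.
measured = [0.1, 0.4, 0.2, 0.8]
predicted = [0.2, 0.5, 0.1, 0.7]
tau = kendall_tau_a(measured, predicted)
error = rmse(measured, predicted)
```

Kendall's tau rewards a model for ranking media in the right order even when absolute values are off, while RMS error penalizes the magnitude of the mismatch, so the two metrics are complementary.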

2604.25062 2026-04-29 q-bio.MN cs.LG physics.bio-ph

Learning biophysical models of gene regulation with probability flow matching

Suryanarayana Maddu, Victor Chardès, Michael J. Shelley

Abstract

Cellular differentiation is governed by gene regulatory networks, the high-dimensional stochastic biochemical systems that determine the transcriptional landscape and mediate cellular responses to signals and perturbations. Although single-cell RNA sequencing provides quantitative snapshots of the transcriptome, current methods for inferring gene-regulatory dynamics often lack mechanistic interpretability and fail to generalize to unseen conditions. Here we introduce Probability Flow Matching (PFM), a scalable framework for learning biophysically consistent stochastic processes directly from time-resolved single-cell measurements. Applying PFM to three hematopoiesis datasets, we show that models with similar interpolation accuracy can encode fundamentally different dynamics, with only biophysically consistent formulations accurately capturing mechanisms of lineage transitions, fate specification, and gene perturbation responses. We further demonstrate that PFM accommodates unbalanced populations, enabling simultaneous inference of cellular proliferation and death dynamics. Together, these results establish PFM as a flexible, scalable framework for integrating mechanistic modeling with single-cell omics.

2604.25038 2026-04-29 q-bio.PE

Equation Learning for multiscale models of infectious diseases

James W. G. Doran, Cameron A. Smith, Christian A. Yates, Ruth Bowness

Abstract

Tuberculosis (TB) is an airborne disease caused by the pathogen Mycobacterium tuberculosis. In 2023, according to the World Health Organization, it "probably" replaced COVID-19 as the leading cause of death from an infectious agent globally; in the nineteenth century, one in seven of all human deaths resulted from tuberculosis. More than 10 million people are diagnosed with TB every year. The majority of cases in adults occur in males (62.5% of all global adult cases in 2023, compared to 37.5% in females). The main reasons for males suffering from a higher burden of global TB cases, compared to females, are likely to be a combination of within-host factors, such as differences in immune response, and population-scale factors, such as likelihood of completing treatment. To investigate the impact different scales have in determining this higher TB burden in males, we have developed a gender/sex-stratified multiscale framework. We have learnt ordinary differential equations (ODEs) to capture the average output of an agent-based within-host model, and used the resulting equations to describe the within-host scales of the multiscale framework. We evolve the population demographics at the between-host scale using ODEs, and link the scales with stochastic coupling functions. We have considered counterfactual scenarios to elucidate the impact of sex and gender on the infectious disease dynamics of TB. This paper is intended to provide a proof-of-concept for the development and implementation of the presented multiscale framework.

2604.24913 2026-04-29 cs.LG q-bio.PE

Generative diffusion models for spatiotemporal influenza forecasting

Joseph Lemaitre, Justin Lessler

Abstract

Forecasting infectious disease incidence can provide important information to guide public health planning, yet is difficult because epidemic dynamics are complex. Current mechanistic and statistical approaches often struggle to capture multimodal uncertainty or emergent trends. Influpaint adapts denoising diffusion probabilistic models to epidemic forecasting. By encoding influenza seasons as spatiotemporal images in which pixel intensity represents incidence, Influpaint learns a rich distribution of disease dynamics from a hybrid dataset of surveillance and simulated trajectories. Forecasting is formulated as a conditional generation (inpainting) task from partial observations. We show that Influpaint generates realistic, diverse epidemic trajectories and achieves forecast accuracy that is competitive with leading ensemble methods in retrospective evaluation. In real-time evaluation during the 2023–2025 U.S. CDC FluSight challenges, performance improved substantially across seasons, with highly accurate but somewhat overconfident projections in 2024–2025. The best performance was achieved with a training dataset containing 30% surveillance and 70% simulated trajectories. These results show that diffusion models can capture important spatiotemporal structure in influenza dynamics and provide a flexible framework for probabilistic infectious disease forecasting.

2604.24773 2026-04-29 q-bio.BM quant-ph

Simultaneous Fragment Docking for Geometrically Linkable Pose Pairs

Jiyun Lee, You Kyoung Chung, Joonsuk Huh

Comments 27 pages, 6 figures, 5 supplementary figures, 3 supplementary tables

Abstract

Computational molecular design requires binding arrangements that are not only energetically favorable but also chemically realizable. However, computational methods remain limited in directly recovering fragment pose pairs that can later be connected into a single molecule. To address this problem, we formulated the simultaneous placement of two fragments as a quadratic unconstrained binary optimization problem, Q-SFD, and introduced an explicit inter-fragment distance term to favor reconstruction-feasible arrangements. Relative to the formulation without this term, Q-SFD approximately doubled top-1 recovery of reconstruction-feasible pairs, and the top-5 solutions contained at least one feasible pair for more than 90% of benchmark cases without loss of fragment-level pose accuracy.
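The idea of adding an inter-fragment distance term to a QUBO can be illustrated with a toy problem. The sketch below is a generic formulation written for this digest, not the paper's Q-SFD: it assumes one binary variable per candidate pose, a one-hot penalty per fragment, and a quadratic penalty on the deviation of the pose-pair distance from a target linker length. All scores, distances, and weights are hypothetical, and the problem is solved by brute force rather than on a quantum or annealing backend.

```python
from itertools import product

def build_qubo(score_a, score_b, dist, d_target, lam, penalty):
    """Return (Q, offset) with Q as a dict {(u, v): coeff}, u <= v.

    Variables 0..n_a-1 select a pose of fragment A, the rest a pose of B.
    """
    n_a, n_b = len(score_a), len(score_b)
    scores = list(score_a) + list(score_b)
    groups = [list(range(n_a)), list(range(n_a, n_a + n_b))]
    q = {}

    def add(u, v, coeff):
        q[(u, v)] = q.get((u, v), 0.0) + coeff

    # Docking score plus one-hot penalty P*(sum(x) - 1)^2 for each fragment.
    for group in groups:
        for pos, u in enumerate(group):
            add(u, u, scores[u] - penalty)
            for v in group[pos + 1:]:
                add(u, v, 2.0 * penalty)
    # Distance term: penalize pose pairs far from the target separation.
    for i in range(n_a):
        for j in range(n_b):
            add(i, n_a + j, lam * (dist[i][j] - d_target) ** 2)
    return q, 2.0 * penalty  # constant offset from the two one-hot penalties

def energy(q, offset, bits):
    return offset + sum(c * bits[u] * bits[v] for (u, v), c in q.items())

def solve_brute_force(q, offset, n_vars):
    """Exhaustive minimization; only feasible for toy problem sizes."""
    return min(product((0, 1), repeat=n_vars),
               key=lambda bits: energy(q, offset, bits))

# Hypothetical inputs: two poses per fragment, distances in angstroms.
SCORE_A, SCORE_B = [-5.0, -4.5], [-4.0, -3.8]
DIST = [[9.0, 6.0], [7.0, 4.0]]

q0, off0 = build_qubo(SCORE_A, SCORE_B, DIST, d_target=4.0, lam=0.0, penalty=20.0)
q1, off1 = build_qubo(SCORE_A, SCORE_B, DIST, d_target=4.0, lam=0.5, penalty=20.0)
best_without = solve_brute_force(q0, off0, 4)
best_with = solve_brute_force(q1, off1, 4)
```

In this toy setting the distance-free objective picks the two individually best poses, which sit 9 angstroms apart, whereas the objective with the distance term picks a slightly worse-scoring pair at the target separation.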

2604.24772 2026-04-29 q-bio.NC q-bio.SC

Neuronal electricality founded in murburn-thermodynamic principles: 1. Background and basic theoretical formulation

Kelath Murali Manoj, Nagamani Sukumar

Comments 34 pages and 1 Figure

Abstract

Trans-membrane gradients and fluxes of cations (H+, Na+, K+, etc.) were deemed to be the rationale of electrical activities of aerobic cells/organelles, as per classical perceptions. The murburn concept (an umbrella of theorization based on stochastic redox processes) has afforded novel models for various metabolic, bioenergetic and electrophysiological outcomes. Herein, the foundational mechanistic formalisms for the electrical activities of neurons that lead to signal relay along the axonal length are provided. Electron holding potential (EHP), a dimensionless field/state variable (related logarithmically to electron chemical potential), is used to explain neuronal activity. By combining local redox relaxation dynamics with spatial transport driven by thermodynamic gradients, we derive a unified reaction-transport-relaxation equation that captures resting potential, excitability, waveform generation, and signal propagation within a single framework. Nonlinear local redox kinetics naturally give rise to threshold behavior, all-or-none responses, and stable spike waveforms. The framework accommodates known physiological variability and provides a direct bridge between metabolic/redox state and electrophysiological behavior. This work establishes a chemically grounded, non-circular alternative to ion-centric models and offers testable predictions for neuronal dynamics across biological systems. In the second part of this work, we compare the new theory with existing systems, provide further evidence and simulations, and describe detailed agendas for falsification and validation.

2507.14245 2026-04-29 cs.LG cond-mat.mtrl-sci cs.AI cs.CE q-bio.BM

Curriculum-guided multimodal representation learning enables generalizable prediction of nanomaterial-protein interactions

Hengjie Yu, Kenneth A. Dawson, Haiyun Yang, Shuya Liu, Yan Yan, Yaochu Jin

Comments 36 pages, 6 figures

Abstract

Nanomaterial-protein interactions (NPI) are pivotal to realizing the therapeutic and diagnostic potential of nanomaterials. Although AI promises to accelerate mechanistic understanding and enable rational nanomaterial design, robust generalization to unseen nanomaterials or proteins remains unresolved. Here, we present CuMMI (curriculum-guided multimodal interaction model), a generalizable, explainable, and transferable model designed to infer NPI across complex biological settings. CuMMI leverages a self-constructed million-scale NPI dataset and adopts a multi-stage curriculum centered on human plasma, with progressively broader biofluid exposure to enhance data coverage and generalizability. By integrating protein sequence, structure, and a text-encoded experimental context of 37 features, CuMMI captures complementary material-specific, biochemical, and environmental information. Sample-level quality weights are assigned to ensure full utilization of available data while mitigating low-confidence and sparsely recorded entries. Ablation studies highlight the most influential tabular features, clarifying their contribution to the prediction. Through rigorous external validation across independence-preserving temporal, nanomaterial-held-out, and protein-held-out evaluations, our framework consistently achieves good performance (mean of five classification metrics exceeding 0.75), highlighting its robustness and generalizability to unseen data. Furthermore, fine-tuning on independent gold-nanoparticle data and a held-out protein subset delivers better performance than training from scratch with substantially fewer samples. Together, our approach enables generalizable and transferable NPI prediction and may accelerate in vitro research and applications of nanomaterials.

2506.22842 2026-04-29 cond-mat.soft cond-mat.mes-hall cond-mat.stat-mech physics.bio-ph q-bio.BM

Actively induced supercoiling can slow down plasmid solutions by trapping the threading entanglements

Roman Staňo, Renáta Rusková, Dušan Račko, Jan Smrek

Abstract

Harnessing the topology of ring polymers as a design motif in functional nanomaterials is becoming a promising direction in the field of soft matter. For example, the ring topology of DNA plasmids prevents the relaxation of excess twist introduced to the polymer, instead resulting in helical supercoiled structures. In equilibrium semi-dilute solutions, tightly supercoiled rings relax faster than their torsionally relaxed counterparts, since the looser conformations of the latter allow for rings to thread through each other and entrain through entanglements. Here we use molecular simulations to explore a non-equilibrium scenario, in which a supercoiling agent, akin to gyrase enzymes, rapidly induces supercoiling in the suspensions of relaxed plasmids. The activity of the agent not only alters the conformational topology from open to branched, but also locks-in threaded rings into supramolecular clusters, which relax very slowly. Ultimately, our work shows how the polymer topology under non-equilibrium conditions can be leveraged to tune dynamic behavior of macromolecular systems, suggesting a method to create a class of driven materials glassified by activity.

2506.11272 2026-04-29 physics.bio-ph cond-mat.soft nlin.AO q-bio.CB q-bio.QM

Maximum-Entropy Model of Colored Noise in Superdiffusive Axonal Growth

Julian Sutaria, Cristian Staii

Comments 20 pages, 5 figures

Abstract

We develop a coarse-grained stochastic theory for axonal growth on micropatterned substrates using the Shannon–Jaynes maximum entropy principle. Starting from a Langevin description of growth cone motion, we infer the effective distribution of traction force relaxation rates from experimentally motivated constraints rather than postulating the colored noise directly. The resulting relaxation rate distribution generates a stationary colored acceleration process with power-law temporal correlations and yields analytical predictions for the axonal mean squared displacement and velocity autocorrelation. The long-time behavior is controlled by the slow-relaxation part of the inferred distribution, corresponding physically to broadly distributed clutch or adhesion engagement times. For biologically relevant parameters, the model predicts a negative correlation exponent $\alpha=-1/2$. This prediction is in close quantitative agreement with measurements on cortical neurons cultured on micropatterned poly-D-lysine-coated PDMS substrates, which are well described by $\alpha\simeq -0.6$ and exhibit superdiffusive mean squared displacement scaling with exponent $1.4$. The same framework accounts for the crossover from early diffusive behavior to long-time anomalous growth and for the corresponding power law decay of the velocity autocorrelation. These results show how entropy-constrained active fluctuations can connect microscopic force generation processes to emergent growth laws in neuronal systems and, more broadly, in active matter.
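The quoted exponents are consistent with a standard scaling relation, sketched below under an assumption that the abstract does not state explicitly: that the correlation exponent describes the long-time power-law decay of a stationary velocity autocorrelation $C_v(\tau)$, with $-1 < \alpha < 0$.

```latex
\begin{aligned}
C_v(\tau) &\sim \tau^{\alpha}, \qquad -1 < \alpha < 0, \\
\langle \Delta x^2(t) \rangle &= 2 \int_0^t (t - \tau)\, C_v(\tau)\, d\tau
  \;\propto\; \frac{2\, t^{\,2+\alpha}}{(1+\alpha)(2+\alpha)}, \\
\alpha = -\tfrac{1}{2} &\;\Rightarrow\; \text{MSD exponent } 1.5, \qquad
\alpha \simeq -0.6 \;\Rightarrow\; \text{MSD exponent } 1.4 .
\end{aligned}
```

Under this reading, the measured pair ($\alpha \simeq -0.6$, MSD exponent $1.4$) satisfies the relation $2 + \alpha$ exactly, and the predicted $\alpha = -1/2$ corresponds to a nearby MSD exponent of $1.5$.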

2411.17692 2026-04-29 q-bio.NC cs.IT math.IT physics.bio-ph

Quantifying information stored in synaptic connections rather than in firing activities of neural networks

Xinhao Fan, Shreesh P Mysore

Comments This version corresponds to the accepted manuscript for publication in Neural Computation. The accepted manuscript and full supplementary material are provided here

Abstract

A cornerstone of our understanding of both biological and artificial neural networks is that they store information in the strengths of synaptic connections among the neurons. However, in contrast to the well-established theory for quantifying information encoded by the firing activity of neural networks, there does not exist a framework for quantifying information stored in the network's connection distribution itself. Here, we develop a theoretical framework for synaptic information by using densely connected Hebbian networks performing autoassociative memory tasks and by modeling data patterns to be stored as log-normal distributions. Specifically, we derive analytical approximations for Shannon mutual information between the data and singletons, pairs, and arbitrary n-tuples of synaptic connections within the network. Our framework corroborates well-established insights regarding pattern storage capacity, supports the principle of distributed coding in neural firing activities, and formalizes the heterogeneity inherent in information encoding across synapses in a network. Notably, it discovers synergistic interactions among synapses, revealing that the information encoded jointly by all the synapses exceeds the 'sum of its parts'. Taken together, this study introduces a powerful, interpretable framework for quantitatively understanding information storage in the synapses of neural networks, one that illustrates the duality of synaptic connectivity and neural population activity in learning and memory.
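As background for the setting, the sketch below shows a classic densely connected Hebbian network performing autoassociative recall, in which the stored information resides entirely in the synaptic weights. It is a textbook binary Hopfield-style illustration and differs from the paper's setup, which models data patterns as log-normal distributions and quantifies mutual information analytically; the patterns and function names are hypothetical.

```python
def train_hebbian(patterns):
    """Hebbian outer-product rule with zero self-connections."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, max_steps=5):
    """Synchronous sign updates until the state stops changing."""
    state = list(state)
    for _ in range(max_steps):
        new_state = [
            1 if sum(wij * sj for wij, sj in zip(row, state)) >= 0 else -1
            for row in w
        ]
        if new_state == state:
            break
        state = new_state
    return state

# Two orthogonal +/-1 patterns stored in the same weight matrix.
PATTERN_1 = [1, 1, 1, 1, -1, -1, -1, -1]
PATTERN_2 = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hebbian([PATTERN_1, PATTERN_2])

# Corrupt one element of the first pattern and let the network restore it.
noisy = [-1] + PATTERN_1[1:]
restored = recall(weights, noisy)
```

Each weight here mixes contributions from both patterns, a simple instance of the distributed storage across synapses that the abstract sets out to quantify.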

2410.02082 2026-04-29 cs.LG q-bio.QM

FARM: Enhancing Molecular Representations with Functional Group Awareness

Thao Nguyen, Kuan-Hao Huang, Ge Liu, Martin D. Burke, Ying Diao, Heng Ji

Comments Preprint. The code is available at: https://github.com/thaonguyen217/farm_molecular_representation

Abstract

We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key idea behind FARM is the incorporation of functional group (FG) annotations at the atomic level, enabling both FG-enhanced SMILES and FG graphs. In this representation, SMILES strings are enriched with functional group information that identifies the group membership of each atom, while the FG graph captures molecular structure by representing how functional groups are connected. This tokenization injects chemical knowledge into SMILES and expands the effective molecular vocabulary, making the representation more suitable for Transformer-based models and more aligned with natural language structure. FARM learns molecular representations from two complementary perspectives to jointly encode functional and structural information. Masked language modeling on FG-enhanced SMILES captures atom-level features enriched with functional context, while graph neural networks model higher-level molecular topology through functional group connectivity. Contrastive learning is then used to align these two views into a unified embedding space, ensuring that both atom-level detail and functional group structure are jointly represented. We evaluate FARM on the MoleculeNet benchmark and achieve state-of-the-art performance on 8 out of 13 tasks. We further validate its generalization ability on a photostability dataset for quantum mechanical properties. These results demonstrate that FARM improves molecular representation learning, supports strong transfer learning across drug discovery and materials science, and enables broad applications in pharmaceutical research and functional material design.

2207.09264 2026-04-29 q-bio.QM

Flow Rate Independent Multiscale Liquid Biopsy for Precision Oncology

Jing Yan, Jie Wang, Robert Dallmann, Renquan Lu, Jérôme Charmet

Comments 19 pages, 5 figures (+ supplementary materials: 16 pages, 10 figures)

Abstract

Immunoaffinity-based liquid biopsies of circulating tumor cells (CTCs) hold great promise for cancer management, but typically suffer from low throughput, relative complexity and post-processing limitations. Here we address these issues simultaneously by decoupling and independently optimizing the nano-, micro- and macro-scales of an enrichment device that is simple to fabricate and operate. Unlike other affinity-based devices, our scalable mesh approach enables optimum capture conditions at any flow rate, as demonstrated by constant capture efficiencies above 75% between 50 and 200 µL/min. The device achieved 96% sensitivity and 100% specificity when used to detect CTCs in the blood of 79 cancer patients and 20 healthy controls. We demonstrate its post-processing capacity with the identification of potential responders to immune checkpoint inhibition therapy and the detection of HER2-positive breast cancer. The results compare well with other assays, including clinical standards. This suggests that our approach, which overcomes major limitations associated with affinity-based liquid biopsies, could help improve cancer management.