arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.02425 2026-02-03 cs.LG q-bio.QM

Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization

Amaru Caceres Arroyo, Lea Bogensperger, Ahmed Allam, Michael Krauthammer, Konrad Schindler, Dominik Narnhofer

Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.
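
The sampling idea in this abstract (classifier-free guidance applied at each ODE step, with no separate fitness predictor) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_velocity` is a hypothetical stand-in for a trained conditional flow-matching network, the guidance weight `w` and the Euler integrator are generic choices, and none of it is taken from the CHASE code.

```python
from typing import Callable, List

Vector = List[float]
# A velocity field takes (latent, time, conditioned?) and returns dz/dt.
Velocity = Callable[[Vector, float, bool], Vector]


def guided_velocity(v: Velocity, z: Vector, t: float, w: float) -> Vector:
    """Classifier-free guidance: v_uncond + w * (v_cond - v_uncond)."""
    v_cond = v(z, t, True)
    v_uncond = v(z, t, False)
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample(v: Velocity, z0: Vector, w: float = 2.0, steps: int = 50) -> Vector:
    """Integrate the guided ODE from t=0 to t=1 with explicit Euler steps."""
    z = list(z0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        dz = guided_velocity(v, z, t, w)
        z = [zi + dt * di for zi, di in zip(z, dz)]
    return z


def toy_velocity(z: Vector, t: float, conditioned: bool) -> Vector:
    """Hypothetical network: drifts toward 1.0 when conditioned, else toward 0.0."""
    target = 1.0 if conditioned else 0.0
    return [target - zi for zi in z]


if __name__ == "__main__":
    for w in (0.0, 1.0, 2.0):
        print(w, sample(toy_velocity, [0.0, 0.0], w=w))
```

With `w = 0` the sampler follows the unconditional field, `w = 1` follows the conditional field, and `w > 1` extrapolates past it, which is the usual way guidance strength is tuned.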

2602.02374 2026-02-03 q-bio.MN math.DS

Recurrent neural chemical reaction networks trained to switch dynamical behaviours through learned bifurcations

Alexander Dack, Tomislav Plesa, Thomas E. Ouldridge

Both natural and synthetic chemical systems not only exhibit a range of non-trivial dynamics, but also transition between qualitatively different dynamical behaviours as environmental parameters change. Such transitions are called bifurcations. Here, we show that recurrent neural chemical reaction networks (RNCRNs), a class of chemical reaction networks based on recurrent artificial neural networks that can be trained to reproduce a given dynamical behaviour, can also be trained to exhibit bifurcations. First, we show that RNCRNs can inherit some bifurcations defined by smooth ordinary differential equations (ODEs). Second, we demonstrate that the RNCRN can be trained to infer bifurcations that allow it to approximate different target behaviours within different regions of parameter space, without explicitly providing the bifurcation itself in the training. These behaviours can be specified using target ODEs that are discontinuous with respect to the parameters, or even simply by specifying certain desired dynamical features in certain regions of the parameter space. To achieve the latter, we introduce an ODE-free algorithm for training the RNCRN to display designer oscillations, such as a heart-shaped limit cycle or two coexisting limit cycles.

2602.01845 2026-02-03 cs.LG q-bio.QM

No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

Furkan Eris

Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman ρ = 0.390 on ProteinGym substitutions, competitive with MLMs requiring 50-200× the compute. On indels, Proust sets a new state-of-the-art, outperforming models up to 20× larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These powerful representations position Proust in a sweet spot as it also retains native generative capabilities that MLMs lack by design. Interpretability analysis reveals that per-position entropy variance predicts, to an extent, when retrieval augmentation helps and hurts. Such insights can grow in both quantity and quality at scale and inform capabilities such as test-time scaling. Code and weights are available at https://github.com/Furkan9015/proust-inference

2602.01772 2026-02-03 cs.LG cs.AI q-bio.QM

DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics

Yucheng Liao, Han Wen, Weinan E, Weijie Zhang

Comments 21 pages, 5 figures

Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.

2602.01492 2026-02-03 q-bio.PE cs.SY eess.SY

From Discrete to Continuous Mixed Populations of Conformists, Nonconformists, and Imitators

Azadeh Aghaeeyan, Pouria Ramazi

In two-strategy decision-making problems, individuals often imitate the highest earners or choose either the common or rare strategy. Individuals who benefit from the common strategy are conformists, whereas those who profit by choosing the less common one are called nonconformists. The population proportions of the two strategies may undergo perpetual fluctuations in finite, discrete, heterogeneous populations of imitators, conformists, and nonconformists. How these fluctuations evolve as population size increases was left as an open question and is addressed in this paper. We show that the family of Markov chains describing the discrete population dynamics forms a generalized stochastic approximation process for a differential inclusion--the continuous-time dynamics. Furthermore, we prove that the continuous-time dynamics always equilibrate. Then, by leveraging results from the stochastic approximation theory, we show that the amplitudes of fluctuations in the proportions of the two strategies in the population approach zero with probability one when the population size grows to infinity. Our results suggest that large-scale perpetual fluctuations are unlikely in large, well-mixed populations consisting of these three types, particularly when imitators follow the highest earners.

2602.01482 2026-02-03 q-bio.NC cs.AI cs.CV

Community-Level Modeling of Gyral Folding Patterns for Robust and Anatomically Informed Individualized Brain Mapping

Minheng Chen, Tong Chen, Yan Zhuang, Chao Cao, Jing Zhang, Tianming Liu, Lu Zhang, Dajiang Zhu

Cortical folding exhibits substantial inter-individual variability while preserving stable anatomical landmarks that enable fine-scale characterization of cortical organization. Among these, the three-hinge gyrus (3HG) serves as a key folding primitive, showing consistent topology yet meaningful variations in morphology, connectivity, and function. Existing landmark-based methods typically model each 3HG independently, ignoring that 3HGs form higher-order folding communities that capture mesoscale structure. This simplification weakens anatomical representation and makes one-to-one matching sensitive to positional variability and noise. We propose a spectral graph representation learning framework that models community-level folding units rather than isolated landmarks. Each 3HG is encoded using a dual-profile representation combining surface topology and structural connectivity. Subject-specific spectral clustering identifies coherent folding communities, followed by topological refinement to preserve anatomical continuity. For cross-subject correspondence, we introduce Joint Morphological-Geometric Matching, jointly optimizing geometric and morphometric similarity. Across over 1000 Human Connectome Project subjects, the resulting communities show reduced morphometric variance, stronger modular organization, improved hemispheric consistency, and superior alignment compared with atlas-based and landmark-based or embedding-based baselines. These findings demonstrate that community-level modeling provides a robust and anatomically grounded framework for individualized cortical characterization and reliable cross-subject correspondence.

2510.02568 2026-02-03 cs.SI cs.NE q-bio.PE

Identifying Asymptomatic Nodes in Network Epidemics using Graph Neural Networks

Conrado Catarcione Pinto, Amanda Camacho Novaes de Oliveira, Rodrigo Sapienza Luna, Daniel Ratton Figueiredo

Comments Paper presented in the 35th Brazilian Conference on Intelligent Systems (BRACIS)

Infected individuals in some epidemics can remain asymptomatic while still carrying and transmitting the infection. These individuals contribute to the spread of the epidemic and pose a significant challenge to public health policies. Identifying asymptomatic individuals is critical for measuring and controlling an epidemic, but periodic and widespread testing of healthy individuals is often too costly. This work tackles the problem of identifying asymptomatic individuals considering a classic SI (Susceptible-Infected) network epidemic model where a fraction of the infected nodes are not observed as infected (i.e., their observed state is identical to susceptible nodes). In order to classify healthy nodes as asymptomatic or susceptible, a Graph Neural Network (GNN) model with supervised learning is adopted where a set of node features are built from the network with observed infected nodes. The approach is evaluated across different network models, network sizes, and fraction of observed infections. Results indicate that the proposed methodology is robust across different scenarios, accurately identifying asymptomatic nodes while also generalizing to different network sizes and fraction of observed infections.
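
The setup described here (an SI process on a network in which a fraction of infected nodes look susceptible to the observer) can be sketched as follows. This is a simplified illustration under stated assumptions: the synchronous update rule, the parameter names `beta` and `hidden_fraction`, and the single neighbor-based feature are all hypothetical choices, not the paper's simulator or feature set.

```python
import random
from typing import Dict, List, Set, Tuple

Graph = Dict[int, List[int]]


def simulate_si(adj: Graph, seed_node: int, beta: float, steps: int,
                hidden_fraction: float,
                rng: random.Random) -> Tuple[Set[int], Set[int], Set[int]]:
    """Run a discrete-time SI epidemic, then hide a fraction of infected nodes."""
    infected = {seed_node}
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                # Each infected node infects each susceptible neighbor w.p. beta.
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly
    ordered = sorted(infected)
    # Asymptomatic nodes: infected, but observed as if they were susceptible.
    hidden = set(rng.sample(ordered, int(hidden_fraction * len(ordered))))
    observed = infected - hidden
    return infected, observed, hidden


def observed_neighbor_fraction(adj: Graph, node: int, observed: Set[int]) -> float:
    """Example node feature: share of neighbors that are observed as infected."""
    nbrs = adj[node]
    return sum(1 for v in nbrs if v in observed) / len(nbrs) if nbrs else 0.0


if __name__ == "__main__":
    ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
    infected, observed, hidden = simulate_si(ring, 0, 0.6, 3, 0.3, random.Random(0))
    healthy_looking = [n for n in ring if n not in observed]
    print({n: observed_neighbor_fraction(ring, n, observed) for n in healthy_looking})
```

A classifier such as a GNN would then be trained to separate the `hidden` nodes from truly susceptible ones among the healthy-looking nodes, using features of this kind.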

2509.15748 2026-02-03 cs.CV q-bio.NC

Hybrid Lie semi-group and cascade structures for the generalized Gaussian derivative model for visual receptive fields

Tony Lindeberg

Comments 27 pages, 9 figures

Because of the variabilities of real-world image structures under the natural image transformations that arise when observing similar objects or spatio-temporal events under different viewing conditions, the receptive field responses computed in the earliest layers of the visual hierarchy may be strongly influenced by such geometric image transformations. One way of handling this variability is by basing the vision system on covariant receptive field families, which expand the receptive field shapes over the degrees of freedom in the image transformations. This paper addresses the problem of deriving relationships between spatial and spatio-temporal receptive field responses obtained for different values of the shape parameters in the resulting multi-parameter families of receptive fields. For this purpose, we derive both (i) infinitesimal relationships, roughly corresponding to a combination of notions from semi-groups and Lie groups, as well as (ii) macroscopic cascade smoothing properties, which describe how receptive field responses at coarser spatial and temporal scales can be computed by applying smaller support incremental filters to the output from corresponding receptive fields at finer spatial and temporal scales, structurally related to the notion of Lie algebras, although with directional preferences. The presented results provide (i) a deeper understanding of the relationships between spatial and spatio-temporal receptive field responses for different values of the filter parameters, which can be used for both (ii) designing more efficient schemes for computing receptive field responses over populations of multi-parameter families of receptive fields, as well as (iii) formulating idealized theoretical models of the computations of simple cells in biological vision.

2409.13669 2026-02-03 q-bio.NC cs.NE

A Spatiotemporal Perspective on Dynamical Computation in Neural Information Processing Systems

T. Anderson Keller, Lyle Muller, Terrence J. Sejnowski, Max Welling

Spatiotemporal flows of neural activity, such as traveling waves, have been observed throughout the brain since the earliest recordings; yet there is still little consensus on their functional role. Recent experiments and models have linked traveling waves to visual and physical motion, but these observations have been difficult to reconcile with standard accounts of topographically organized selectivity and feedforward receptive fields. Here, we introduce a theoretical framework that formalizes and generalizes the connection between 'motion' and flowing neural dynamics in the language of equivariant neural network theory. We consider 'motion' not only in physical or visual spaces, but also in more abstract representational spaces, and we argue that recurrent traveling-wave-like dynamics are not just useful but necessary for accurate and stable processing of any signal undergoing such motion. Formally, we show that for any non-trivial recurrent neural network to process a sequence undergoing a flow transformation (such as visual motion) in a structured equivariant manner, its hidden state dynamics must actively realize a homomorphic representation of the same flow through recurrent connectivity. In this "spatiotemporal perspective on dynamical computation", traveling waves and related flows are best understood as faithful dynamic representations of stimulus flows; and consequently the natural inclination of biological systems towards such dynamics may be viewed as an innate inductive bias towards efficiency and generalization in the spatiotemporally-structured dynamical world they inhabit.

2407.11498 2026-02-03 cond-mat.stat-mech q-bio.MN

Thermodynamic Space of Chemical Reaction Networks

Shiling Liang, Paolo De Los Rios, Daniel Maria Busiello

Living systems operate out of equilibrium, continuously consuming energy to sustain organised, functional states. Their emergent behaviour usually relies on a set of interconnected chemical reaction networks (CRNs) driven by external fluxes that keep some species at fixed concentrations. Hence, uncovering the principles governing the functioning of these CRNs is crucial to understand how living systems generate and regulate complexity. While kinetics plays a key role in shaping detailed dynamical phenomena, the range of operations of a CRN is fundamentally constrained by thermodynamics. Here, we introduce and analytically derive the "thermodynamic space" of a CRN, i.e., the range of accessible stationary concentrations that can be realized under a given energetic budget. We establish analogous bounds for reaction affinities, shedding light on how global thermodynamic properties, such as the total non-equilibrium driving, can limit local non-equilibrium quantities. We illustrate our results in various paradigmatic examples, demonstrating how the onset of complex behaviors is intimately tangled with the presence of non-equilibrium conditions. By providing a general tool for analysing CRNs, the presented framework constitutes a stepping stone to deepen our ability to predict complex out-of-equilibrium phenomena and design artificial chemical systems, starting from the sole knowledge of the underlying thermodynamic properties.

2302.13268 2026-02-03 q-bio.GN cs.AI cs.LG q-bio.QM

Revolutionizing Genomics with Reinforcement Learning Techniques

Mohsen Karami, Khadijeh Jahanian, Roohallah Alizadehsani, Iman Dehzangi, Juan M Gorriz, Yudong Zhang, Jia Wang, Farshid Hajati, Min Yang, Thantrira Porntaveetus, Hamid Alinejad-Rokny

In recent years, Reinforcement Learning (RL) has emerged as a powerful tool for solving a wide range of problems, including decision-making and genomics. The exponential growth of raw genomic data over the past two decades has exceeded the capacity of manual analysis, leading to a growing interest in automatic data analysis and processing. RL algorithms are capable of learning from experience with minimal human supervision, making them well-suited for genomic data analysis and interpretation. One of the key benefits of using RL is the reduced cost associated with collecting labeled training data, which is required for supervised learning. While there have been numerous studies examining the applications of Machine Learning (ML) in genomics, this survey focuses exclusively on the use of RL in various genomics research fields, including gene regulatory networks (GRNs), genome assembly, and sequence alignment. We present a comprehensive technical overview of existing studies on the application of RL in genomics, highlighting the strengths and limitations of these approaches. We then discuss potential research directions that are worthy of future exploration, including the development of more sophisticated reward functions as RL heavily depends on the accuracy of the reward function, the integration of RL with other machine learning techniques, and the application of RL to new and emerging areas in genomics research. Finally, we present our findings and conclude by summarizing the current state of the field and the future outlook for RL in genomics.

2602.01230 2026-02-03 q-bio.GN

Toward Interpretable and Generalizable AI in Regulatory Genomics

Masayuki Nagai, Alan E. Murphy, Kaeli Rizzo, Peter K. Koo

Comments 5 figures, 1 table

Deciphering how DNA sequence encodes gene regulation remains a central challenge in biology. Advances in machine learning and functional genomics have enabled sequence-to-function (seq2func) models that predict molecular regulatory readouts directly from DNA sequence. These models are now widely used for variant effect prediction, mechanistic interpretation, and regulatory sequence design. Despite strong performance on held-out genomic regions, their ability to generalize across genetic variation and cellular contexts remains inconsistent. Here we examine how architectural choices, training data, and prediction tasks shape the behavior of seq2func models. We synthesize how interpretability methods and evaluation practices have probed learned cis-regulatory organization and highlighted systematic failure modes, clarifying why strong predictive accuracy can fail to translate into robust regulatory understanding. We argue that progress will require reframing seq2func models as continually refined systems, in which targeted perturbation experiments, systematic evaluation, and iterative model updates are tightly coupled through AI-experiment feedback loops. Under this framework, seq2func models become self-improving tools that progressively deepen their mechanistic grounding and more reliably support biological discovery.

2602.01088 2026-02-03 q-bio.QM

INDIGENA: inductive prediction of disease-gene associations using phenotype ontologies

Fernando Zhapa-Camacho, Robert Hoehndorf

Motivation: Predicting gene-disease associations (GDAs) is the problem of determining which genes are associated with a disease. GDA prediction can be framed as a ranking problem where genes are ranked for a query disease, based on features such as phenotypic similarity. By describing phenotypes using phenotype ontologies, ontology-based semantic similarity measures can be used. However, traditional semantic similarity measures use only the ontology taxonomy. Recent methods based on ontology embeddings compare phenotypes in latent space; these methods can use all ontology axioms as well as a supervised signal, but are inherently transductive, i.e., query entities must already be known at the time of learning embeddings, and therefore these methods do not generalize to novel diseases (sets of phenotypes) at inference time. Results: We developed INDIGENA, an inductive disease-gene association method for ranking genes based on a set of phenotypes. Our method first uses a graph projection to map axioms from phenotype ontologies to a graph structure, and then uses graph embeddings to create latent representations of phenotypes. We use an explicit aggregation strategy to combine phenotype embeddings into representations of genes or diseases, allowing us to generalize to novel sets of phenotypes. We also develop a method to make the phenotype embeddings and the similarity measure task-specific by including a supervised signal from known gene-disease associations. We apply our method to mouse models of human disease and demonstrate that we can significantly improve over the inductive semantic similarity baseline measures, and reach a performance similar to transductive methods for predicting gene-disease associations while being more general. Availability and Implementation: https://github.com/bio-ontology-research-group/indigena
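
The inductive step described above (aggregate phenotype embeddings into a disease or gene representation, then rank genes by similarity) can be sketched in a few lines. The abstract does not name the aggregation or similarity function, so mean pooling and cosine similarity are assumptions here, and the phenotype identifiers and 2-dimensional embeddings are hypothetical toy values.

```python
import math
from typing import Dict, List

Embedding = List[float]


def mean_embedding(phenotypes: List[str], emb: Dict[str, Embedding]) -> Embedding:
    """Aggregate a set of phenotype embeddings by averaging (assumed strategy)."""
    vecs = [emb[p] for p in phenotypes]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_genes(disease_phenotypes: List[str],
               gene_phenotypes: Dict[str, List[str]],
               emb: Dict[str, Embedding]) -> List[str]:
    """Rank genes by similarity of their aggregated phenotypes to the query."""
    query = mean_embedding(disease_phenotypes, emb)
    scores = {g: cosine(query, mean_embedding(ps, emb))
              for g, ps in gene_phenotypes.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    emb = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
    genes = {"g1": ["P1"], "g2": ["P2"]}
    # The query disease is a new phenotype set; no disease embedding is needed.
    print(rank_genes(["P1", "P3"], genes, emb))
```

Because the query is built from phenotype embeddings at inference time, a disease never seen during training can still be scored, which is the inductive property the abstract emphasizes.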

2602.01019 2026-02-03 q-bio.NC cs.AI cs.HC

Inter- and Intra-Subject Variability in EEG: A Systematic Survey

Xuan-The Tran, Thien-Nhan Vo, Son-Tung Vu, Thoa-Thi Tran, Manh-Dat Nguyen, Thomas Do, Chin-Teng Lin

Electroencephalography (EEG) underpins neuroscience, clinical neurophysiology, and brain-computer interfaces (BCIs), yet pronounced inter- and intra-subject variability limits reliability, reproducibility, and translation. This systematic review synthesizes studies that quantified or modeled EEG variability across resting-state, event-related potentials (ERPs), and task-related/BCI paradigms (including motor imagery and SSVEP) in healthy and clinical cohorts. Across paradigms, inter-subject differences are typically larger than within-subject fluctuations, but both affect inference and model generalization. Stability is feature-dependent: alpha-band measures and individual alpha peak frequency are often relatively reliable, whereas higher-frequency and many connectivity-derived metrics show more heterogeneous reliability; ERP reliability varies by component, with P300 measures frequently showing moderate-to-good stability. We summarize major sources of variability (biological, state-related, technical, and analytical), review common quantification and modeling approaches (e.g., ICC, CV, SNR, generalizability theory, and multivariate/learning-based methods), and provide recommendations for study design, reporting, and harmonization. Overall, EEG variability should be treated as both a practical constraint to manage and a meaningful signal to leverage for precision neuroscience and robust neurotechnology.
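
Two of the quantification measures named above, the intraclass correlation coefficient (ICC) and the coefficient of variation (CV), can be computed as in the sketch below. It assumes a balanced subjects-by-sessions table and uses the one-way random-effects form ICC(1,1); the survey covers several ICC variants, and which one a given study uses is not specified here. The sample values are made up.

```python
from statistics import mean, stdev
from typing import List


def icc_oneway(data: List[List[float]]) -> float:
    """ICC(1,1) for a balanced design: rows are subjects, columns are sessions."""
    n, k = len(data), len(data[0])
    subject_means = [mean(row) for row in data]
    grand = mean(subject_means)
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(data, subject_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def coefficient_of_variation(values: List[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


if __name__ == "__main__":
    # Hypothetical alpha-peak frequencies (Hz): 3 subjects, 2 sessions each.
    alpha_peak = [[9.8, 10.0], [10.9, 11.1], [9.1, 9.0]]
    print(icc_oneway(alpha_peak))
    print(coefficient_of_variation([row[0] for row in alpha_peak]))
```

An ICC near 1 means between-subject differences dominate session-to-session fluctuations, which matches the abstract's observation that inter-subject variability is typically the larger of the two.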

2602.00978 2026-02-03 cs.NE q-bio.PE

Organismal Agency and Rapid Adaptation: The Phenopoiesis Algorithm for Phenotype-First Evolution

Nam H. Le

Comments 22 pages, 2 figures

Evolutionary success depends on the capacity to adapt: organisms must respond to environmental challenges through both genetic innovation and lifetime learning. The gene-centric paradigm attributes evolutionary causality exclusively to genes, while Denis Noble's phenotype-first framework argues that organisms are active agents capable of interpreting genetic resources, learning from experience, and shaping their own development. However, this framework has remained philosophically intuitive but algorithmically opaque. We show for the first time that organismal agency can be implemented as a concrete computational process through heritable phenotypic patterns. We introduce the Phenopoiesis Algorithm, where organisms inherit not just genes but also successful phenotypic patterns discovered during lifetime learning. Through experiments in changing environments, these pattern-inheriting organisms achieve 3.4 times faster adaptation compared to gene-centric models. Critically, these gains require cross-generational inheritance of learned patterns rather than within-lifetime learning alone. We conclude that organismal agency is not a philosophical abstraction but an algorithmic mechanism with measurable adaptive value. The mechanism works through compositional reuse: organisms discover how to compose primitive elements into solutions, encode those compositional recipes, and transmit them to offspring. Evolution operates across multiple timescales -- fast, reversible phenotypic inheritance and slow, permanent genetic inheritance -- providing adaptive flexibility that single-channel mechanisms cannot achieve.

2602.00832 2026-02-03 physics.bio-ph q-bio.PE

Size and shape of terrestrial animals

Neelima Sharma, Madhusudhan Venkadesan

Comments 19 pages, 3 main figures, 7 supplementary figures, 2 extended data tables (ancillary file)

Natural selection for terrestrial locomotion has yielded unifying patterns in the body shape of legged animals, often manifesting as scaling laws. One such pattern appears in the frontal aspect ratio. Smaller animals like insects typically adopt a landscape frontal aspect ratio, with a wider side-to-side base of support than center of mass height. Larger animals like elephants, however, are taller than wide with a portrait aspect ratio. Known explanations for postural scaling are restricted to animal groups with similar anatomical and behavioural motifs, but the trend in frontal aspect ratio transcends such commonalities. Here we show that vertebrates and invertebrates with diverse body plans, ranging in mass from 28 mg to 22000 kg, exhibit size-dependent scaling of the frontal aspect ratio driven by the need for lateral stability on uneven natural terrain. Because natural terrain exhibits scale-dependent unevenness, and the frontal aspect ratio is important for lateral stability during locomotion, smaller animals need a wider aspect ratio for stability. This prediction is based on the fractal property of natural terrain unevenness, requires no anatomical or behavioural parameters, and agrees with the measured scaling despite vast anatomical and behavioural differences. Furthermore, a statistical phylogenetic comparative analysis found that shared ancestry and random trait evolution cannot explain the measured scaling. Thus, our findings reveal that terrain roughness, acting through natural selection for stability, likely drove the macroevolution of frontal shape in terrestrial animals.

2602.00782 2026-02-03 q-bio.BM cs.AI

Controlling Repetition in Protein Language Models

Jiahao Zhang, Zeqing Zhang, Di Wang, Lijie Hu

Comments Published as a conference paper at ICLR 2026

Protein language models (PLMs) have enabled advances in structure prediction and de novo protein design, yet they frequently collapse into pathological repetition during generation. Unlike in text, where repetition merely reduces readability, in proteins it undermines structural confidence and functional viability. To unify this problem, we present the first systematic study of repetition in PLMs. We first propose quantitative metrics to characterize motif-level and homopolymer repetition and then demonstrate their negative impact on folding reliability. To address this challenge, we propose UCCS (Utility-Controlled Contrastive Steering), which steers protein generation with a constrained dataset. Instead of naively contrasting high- vs. low-repetition sequences, we construct contrastive sets that maximize differences in repetition while tightly controlling for structural utility. This disentanglement yields steering vectors that specifically target repetition without degrading foldability. Injected at inference, these vectors consistently reduce repetition without retraining or heuristic decoding. Experiments with ESM-3 and ProtGPT2 in CATH, UniRef50, and SCOP show that our method outperforms decoding penalties and other baselines, substantially lowering repetition while preserving AlphaFold confidence scores. Our results establish repetition control as a central challenge for PLMs and highlight dataset-guided steering as a principled approach for reliable protein generation.
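
To make the two kinds of repetition mentioned above concrete, here is a rough sketch of one homopolymer metric and one motif-level metric. The abstract does not give the paper's exact definitions, so both functions (and the default motif length `k`) are illustrative assumptions rather than the UCCS metrics.

```python
from collections import Counter


def longest_homopolymer_fraction(seq: str) -> float:
    """Length of the longest single-residue run, as a fraction of sequence length."""
    if not seq:
        return 0.0
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest / len(seq)


def repeated_kmer_fraction(seq: str, k: int = 3) -> float:
    """Fraction of k-mer positions whose k-mer occurs more than once in the sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for km in kmers if counts[km] > 1) / len(kmers)


if __name__ == "__main__":
    for s in ("MKTAYIAKQR", "GSGSGSGSGS", "AAAAAAAAKL"):
        print(s, longest_homopolymer_fraction(s), repeated_kmer_fraction(s, k=2))
```

Scores like these could be used to build the high- versus low-repetition contrast sets the abstract describes, provided structural utility is held roughly constant across the two sets.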

2602.00660 2026-02-03 q-bio.BM physics.data-an

Phase Transitions in Unsupervised Feature Selection

Jonathan Fiorentino, Michele Monti, Dimitrios Miltiadis-Vrachnos, Vittorio Del Tatto, Alessandro Laio, Gian Gaetano Tartaglia

Comments 15 pages, 4 figures in main text, 7 figures in supplemental material

Identifying minimal and informative feature sets is a central challenge in data analysis, particularly when few data points are available. Here we present a theoretical analysis of an unsupervised feature selection pipeline based on the Differentiable Information Imbalance (DII). We consider the specific case of structural and physico-chemical features describing a set of proteins. We show that if one considers the features as coordinates of a (hypothetical) statistical physics model, this model undergoes a phase transition as a function of the number of retained features. For physico-chemical descriptors, this transition is between a glass-like phase when the features are few and a liquid-like phase. The glass-like phase exhibits bimodal order-parameter distributions and Binder cumulant minima. In contrast, for structural descriptors the transition is less sharp. Remarkably, for physico-chemical descriptors the critical number of features identified from the DII coincides with the saturation of downstream binary classification performance. These results provide a principled, unsupervised criterion for minimal feature sets in protein classification and reveal distinct mechanisms of criticality across different feature types.
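
For readers unfamiliar with the Binder cumulant mentioned above, the standard definition is U = 1 - <m^4> / (3 <m^2>^2) for an order parameter m. The sketch below computes it from samples; how the paper defines the order parameter for its feature-selection model is not stated in the abstract, so the inputs here are toy values.

```python
from typing import Sequence


def binder_cumulant(samples: Sequence[float]) -> float:
    """Binder cumulant U = 1 - <m^4> / (3 <m^2>^2) of order-parameter samples."""
    n = len(samples)
    m2 = sum(m ** 2 for m in samples) / n
    m4 = sum(m ** 4 for m in samples) / n
    return 1.0 - m4 / (3.0 * m2 ** 2)


if __name__ == "__main__":
    # A sharply bimodal distribution (m = +1 or -1) gives U = 2/3,
    # whereas a zero-mean Gaussian gives U -> 0 in the large-sample limit.
    print(binder_cumulant([1.0, -1.0, 1.0, -1.0]))
```

Tracking U as a function of the control parameter (here, the number of retained features) is the usual way to locate a transition from such data.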

2601.19149 2026-02-03 cs.LG q-bio.QM

GPCR-Filter: a deep learning framework for efficient and precise GPCR modulator discovery

Jingjie Ning, Xiangzhen Shen, Li Hou, Shiyi Shen, Jiahao Yang, Junrui Li, Hong Shan, Sanan Wu, Sihan Gao, H. Eric Xu, Xinheng He

G protein-coupled receptors (GPCRs) govern diverse physiological processes and are central to modern pharmacology. Yet discovering GPCR modulators remains challenging because receptor activation often arises from complex allosteric effects rather than direct binding affinity, and conventional assays are slow, costly, and not optimized for capturing these dynamics. Here we present GPCR-Filter, a deep learning framework specifically developed for GPCR modulator discovery. We assembled a high-quality dataset of over 90,000 experimentally validated GPCR-ligand pairs, providing a robust foundation for training and evaluation. GPCR-Filter integrates the ESM-3 protein language model for high-fidelity GPCR sequence representations with graph neural networks that encode ligand structures, coupled through an attention-based fusion mechanism that learns receptor-ligand functional relationships. Across multiple evaluation settings, GPCR-Filter consistently outperforms state-of-the-art compound-protein interaction models and exhibits strong generalization to unseen receptors and ligands. Notably, the model successfully identified micromolar-level agonists of the 5-HT1A receptor with distinct chemical frameworks. These results establish GPCR-Filter as a scalable and effective computational approach for GPCR modulator discovery, advancing AI-assisted drug development for complex signaling systems.

2511.06356 2026-02-03 cs.LG cs.AI q-bio.BM

Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets

Runhan Shi, Letian Chen, Gufeng Yu, Yang Yang

Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios. To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which extends the differential reaction fingerprint (DRFP) by representing shingles as continuous high-dimensional embeddings, capturing structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76% on R² under permutation perturbations.
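
The symmetric-difference idea behind DRFP-style encodings can be shown with a toy example: collect shingles from reactants and products separately, then keep only those that appear on one side. As a loudly simplified assumption, the sketch uses character trigrams of SMILES strings as shingles; real fingerprints extract chemically meaningful substructures with a cheminformatics toolkit, and ReaDISH further maps shingles to learned embeddings, which is not shown.

```python
from typing import Iterable, Set


def shingles(smiles: str, n: int = 3) -> Set[str]:
    """Toy shingles: all character n-grams of a SMILES string (an assumption)."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def reaction_shingles(reactants: Iterable[str], products: Iterable[str],
                      n: int = 3) -> Set[str]:
    """Shingles present on exactly one side of the reaction (symmetric difference)."""
    r = set().union(*(shingles(s, n) for s in reactants))
    p = set().union(*(shingles(s, n) for s in products))
    return r ^ p


if __name__ == "__main__":
    # Esterification of acetic acid with ethanol, written as SMILES strings.
    a = reaction_shingles(["CCO", "CC(=O)O"], ["CC(=O)OCC"])
    b = reaction_shingles(["CC(=O)O", "CCO"], ["CC(=O)OCC"])
    print(sorted(a), a == b)
```

Because the result is a set built from per-side unions, reordering the reactant molecules cannot change it. Note that this toy version is still sensitive to how each SMILES string is written, which substructure-based shingles avoid.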

2506.12277 2026-02-03 q-bio.QM

Evaluation of machine-learning models to measure individualized treatment effects from randomized clinical trial data with time-to-event outcomes

Elvire Roblin, Paul-Henry Cournède, Stefan Michiels

Comments 20 pages, 8 figures

English abstract

Objective: In randomized clinical trials, prediction models can be used to explore the relationships between patients' variables (e.g., clinical, pathological, or lifestyle variables, and also biomarker or genomic data) and treatment effect magnitude. Our aim was to evaluate flexible machine learning models capable of incorporating interactions and nonlinear effects from high-dimensional data to estimate individualized treatment recommendations in trials with time-to-event outcomes. Methods: We compared survival models based on neural networks (CoxCC and CoxTime) and random survival forests (Interaction Forests, IF) against a Cox proportional hazards model with an adaptive LASSO (ALASSO) penalty as a benchmark. For individualized treatment recommendations in the survival setting, we adapted metrics originally designed for binary outcomes to accommodate time-to-event data with censoring. These adapted metrics included the C-for-Benefit, the E50-for-Benefit, and the root mean squared error (RMSE) for treatment benefit. An extensive simulation study was conducted using two different data generation processes incorporating nonlinearity and interactions. The models were applied to gene expression and clinical data from three cancer clinical trial data sets. Results: In the first data generation process, neural networks outperformed ALASSO in terms of calibration while IF showed superior C-for-Benefit performance. In the second data generation process, both machine learning methods outperformed the benchmark linear ALASSO method across discrimination, calibration, and RMSE metrics. In the cancer trial data sets, the machine learning methods often performed better than ALASSO, particularly IF in terms of C-for-Benefit, and either a neural network or IF for calibration measures addressing treatment benefit.

2506.04056 2026-02-03 q-bio.PE

Generalized Lotka-Volterra systems with quenched random interactions and saturating nonlinear response

Marco Zenari, Francesco Ferraro, Sandro Azaele, Amos Maritan, Samir Suweis

Journal ref Physical Review E 00, 004200 (2026)

English abstract

The generalized Lotka-Volterra (GLV) equations with quenched random interactions have been extensively used to investigate the stability and dynamics of complex ecosystems. However, the standard linear interaction model suffers from pathological unbounded growth, especially under strong cooperation or heterogeneity. This work addresses that limitation by introducing a Monod-type saturating nonlinear response into the GLV framework. Using Dynamical Mean Field Theory, we derive analytical expressions for the species abundance distribution in the Unique Fixed Point phase and show the suppression of unbounded dynamics. Numerical simulations reveal a rich dynamical structure in the Multiple Attractor phase, including a transition between high-dimensional chaotic and low-volatility regimes, governed by interaction symmetry. These findings offer a more ecologically realistic foundation for disordered ecosystem models and highlight the role of nonlinearity and symmetry in shaping the diversity and resilience of large ecological communities.
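
The effect of a saturating response on unbounded growth can be seen in a minimal numerical sketch. Everything specific here is an assumption made for illustration, not the paper's model: the equation form dx_i/dt = x_i (1 - x_i + sum_j a_ij J(x_j)), the Monod function J(x) = x / (K + x) with K = 1, a uniform cooperative coupling a_ij = 0.5 in place of quenched random interactions, and the helper names `monod` and `simulate_glv`.

```python
import math


def monod(x, k=1.0):
    """Monod-type saturating response; stays below 1 for x >= 0 and k > 0."""
    return x / (k + x)


def simulate_glv(x0, a, saturating=True, dt=0.01, steps=5000, cap=1e3):
    """Forward-Euler integration of a toy GLV system (assumed form).

    dx_i/dt = x_i * (1 - x_i + sum_{j != i} a[i][j] * r(x_j)),
    where r is the Monod response if `saturating` else the identity.
    Integration stops early once any abundance exceeds `cap`.
    """
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        resp = [monod(v) if saturating else v for v in x]
        growth = [
            1.0 - x[i] + sum(a[i][j] * resp[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        x = [x[i] + dt * x[i] * growth[i] for i in range(n)]
        if max(x) > cap:
            break
    return x


n = 5
a = [[0.5] * n for _ in range(n)]       # uniform cooperation (illustrative)
x0 = [0.1, 0.5, 1.0, 2.0, 3.0]

x_sat = simulate_glv(x0, a, saturating=True)
x_lin = simulate_glv(x0, a, saturating=False)
print("saturating:", [round(v, 4) for v in x_sat])
print("linear exceeded cap:", max(x_lin) > 1e3)
```

In this symmetric toy case the saturating model settles at the fixed point x = 1 + sqrt(2), the positive root of 1 - x + 2x/(1 + x) = 0, whereas the linear response reduces to dx/dt = x(1 + x) and diverges in finite time.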

2503.00008 2026-02-03 q-bio.MN

Determining the Equivalence of Small Zero-one Reaction Networks

Yue Jiao, Xiaoxian Tang

English abstract

Zero-one reaction networks are pivotal to cellular signaling, and establishing the equivalence of such networks represents a foundational computational challenge in the realm of chemical reaction network research. Herein, we propose a high-efficiency approach for identifying the equivalence of zero-one networks. Its efficiency stems from a set of criteria tailored to judge the equivalence of steady-state ideals derived from zero-one networks, which effectively reduces the computational cost associated with Gröbner basis calculations. Experimental results demonstrate that our proposed method can successfully categorize more than three million networks by their equivalence within a feasible timeframe. Also, our computational results for two important classes of quadratic zero-one networks (3-dimensional with 3 species, 6 reactions; 4-dimensional with 4 species, 5 reactions) show that they have no positive steady states for a generic choice of rate constants, implying these small networks generically exhibit neither multistability nor periodic orbits.

2502.06914 2026-02-03 q-bio.QM cs.AI cs.LG

UniZyme: A Unified Protein Cleavage Site Predictor Enhanced with Enzyme Active-Site Knowledge

Chenao Li, Shuo Yan, Enyan Dai

Comments 22 pages, 9 figures

English abstract

Enzyme-catalyzed protein cleavage is essential for many biological functions. Accurate prediction of cleavage sites can facilitate various applications such as drug development, enzyme design, and a deeper understanding of biological mechanisms. However, most existing models are restricted to an individual enzyme, which neglects knowledge shared across enzymes and fails to generalize to novel enzymes. Thus, we introduce a unified protein cleavage site predictor named UniZyme, which can generalize across diverse enzymes. To enhance the enzyme encoding for protein cleavage site prediction, UniZyme employs a novel biochemically informed model architecture along with active-site knowledge of proteolytic enzymes. Extensive experiments demonstrate that UniZyme achieves high accuracy in predicting cleavage sites across a range of proteolytic enzymes, including unseen enzymes. The code is available at https://github.com/Ao-LiChen/UniZyme

2602.00464 2026-02-03 q-bio.QM cs.CV

A 30-item Test for Assessing Chinese Character Amnesia in Child Handwriters

Zebo Xu, Steven Langsford, Zhuang Qiu, Zhenguang Cai

English abstract

Handwriting literacy is an important skill for learning and communication in school-age children. In the digital age, handwriting has been largely replaced by typing, leading to a decline in handwriting proficiency, particularly in non-alphabetic writing systems. Among children learning Chinese, a growing number have reported experiencing character amnesia: difficulty in correctly handwriting a character despite being able to recognize it. Given that there is currently no standardized diagnostic tool for assessing character amnesia in children, we developed an assessment to measure Chinese character amnesia in the Mandarin-speaking school-age population. We utilised a large-scale handwriting dataset in which 40 children handwrote 800 characters from dictation prompts. Character amnesia and correct handwriting responses were analysed using a two-parameter Item Response Theory model. Four item-selection schemes were compared: random baseline, maximum discrimination, diverse difficulty, and an upper-and-lower-thirds discrimination score. Candidate item subsets were evaluated using out-of-sample prediction. Among these selection schemes, the upper-and-lower-thirds discrimination procedure yields a compact 30-item test that preserves individual-difference structure and generalizes to unseen test-takers (cross-validated mean r = .74 with the full 800-item test; within-sample r = .93). This short-form test provides a reliable and efficient tool for assessing Chinese character amnesia in children and can be used to identify early handwriting and orthographic learning difficulties, contributing to the early detection of developmental dysgraphia and related literacy challenges.
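
For readers unfamiliar with the two-parameter Item Response Theory model named above, the sketch below shows its standard logistic form. The abstract does not give the authors' parametrization or fitting procedure, so the function names `p_correct` and `item_information` and the example parameter values are illustrative assumptions only.

```python
import math


def p_correct(theta, a, b):
    """Standard 2PL item response function.

    theta: test-taker ability, a: item discrimination, b: item difficulty.
    Returns the modelled probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Two hypothetical items with equal difficulty but different discrimination.
weak_item = item_information(theta=0.0, a=0.5, b=0.0)
sharp_item = item_information(theta=0.0, a=1.5, b=0.0)
print(p_correct(0.0, 1.5, 0.0), weak_item, sharp_item)
```

An item is most informative near its own difficulty, and the information grows with the square of its discrimination, which is one reason discrimination-based item selection can keep a short test informative.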

2602.00259 2026-02-03 cs.HC cs.AI q-bio.OT

Intelligent Reasoning Cues: A Framework and Case Study of the Roles of AI Information in Complex Decisions

Venkatesh Sivaraman, Eric P. Mason, Mengfan Ellen Li, Jessica Tong, Andrew J. King, Jeremy M. Kahn, Adam Perer

Comments Accepted at CHI 2026

English abstract

Artificial intelligence (AI)-based decision support systems can be highly accurate yet still fail to support users or improve decisions. Existing theories of AI-assisted decision-making focus on calibrating reliance on AI advice, leaving it unclear how different system designs might influence the reasoning processes underneath. We address this gap by reconsidering AI interfaces as collections of intelligent reasoning cues: discrete pieces of AI information that can individually influence decision-making. We then explore the roles of eight types of reasoning cues in a high-stakes clinical decision (treating patients with sepsis in intensive care). Through contextual inquiries with six teams and a think-aloud study with 25 physicians, we find that reasoning cues have distinct patterns of influence that can directly inform design. Our results also suggest that reasoning cues should prioritize tasks with high variability and discretion, adapt to ensure compatibility with evolving decision needs, and provide complementary, rigorous insights on complex cases.

2602.00143 2026-02-03 q-bio.QM cs.LG stat.ML

Early warning prediction: Onsager-Machlup vs Schrödinger

Xiaoai Xu, Yixuan Zhou, Xiang Zhou, Jingqiao Duan, Ting Gao

Comments 20 pages

English abstract

Predicting critical transitions in complex systems, such as epileptic seizures in the brain, represents a major challenge in scientific research. The high-dimensional characteristics and hidden critical signals further complicate early-warning tasks. This study proposes a novel early-warning framework that integrates manifold learning with stochastic dynamical system modeling. Through systematic comparison, six methods including diffusion maps (DM) are selected to construct low-dimensional representations. Based on these, a data-driven stochastic differential equation model is established to robustly estimate the probability evolution scoring function of the system. Building on this, a new Score Function (SF) indicator is defined by incorporating Schrödinger bridge theory to quantify the likelihood of significant state transitions in the system. Experiments demonstrate that this indicator exhibits higher sensitivity and robustness in epilepsy prediction, enables earlier identification of critical points, and clearly captures dynamic features across various stages before and after seizure onset. This work provides a systematic theoretical framework and practical methodology for extracting early-warning signals from high-dimensional data.

2602.00057 2026-02-03 q-bio.NC cs.AI cs.LG

Explore Brain-Inspired Machine Intelligence for Connecting Dots on Graphs Through Holographic Blueprint of Oscillatory Synchronization

Tingting Dan, Jiaqi Ding, Guorong Wu

Comments Published in Nature Communications

Journal ref Nature Communications 16, 9425 (2025)

English abstract

Neural coupling in both neuroscience and artificial intelligence emerges as dynamic oscillatory patterns that encode abstract concepts. To this end, we hypothesize that a deeper understanding of the neural mechanisms governing brain rhythms can inspire next-generation design principles for machine learning algorithms, leading to improved efficiency and robustness. Building on this idea, we first model evolving brain rhythms through the interference of spontaneously synchronized neural oscillations, termed HoloBrain. The success of modeling brain rhythms using an artificial dynamical system of coupled oscillations motivates a "first principle" for brain-inspired machine intelligence based on a shared synchronization mechanism, termed HoloGraph. This principle enables graph neural networks to move beyond conventional heat diffusion paradigms toward modeling oscillatory synchronization. Our HoloGraph framework not only effectively mitigates the over-smoothing problem in graph neural networks but also demonstrates strong potential for reasoning and solving challenging problems on graphs.

2602.00019 2026-02-03 q-bio.BM cs.AI

AutoBinder Agent: An MCP-Based Agent for End-to-End Protein Binder Design

Fukang Ge, Jiarui Zhu, Linjie Zhang, Haowen Xiao, Xiangcheng Bao, Fangnan Xie, Danyang Chen, Yanrui Lu, Yuting Wang, Ziqian Guan, Lin Gu, Jinhao Bi, Yingying Zhu

Comments 4 pages, 3 figures

English abstract

Modern AI technologies for drug discovery are distributed across heterogeneous platforms (including web applications, desktop environments, and code libraries), leading to fragmented workflows, inconsistent interfaces, and high integration overhead. We present an agentic end-to-end drug design framework that leverages a Large Language Model (LLM) in conjunction with the Model Context Protocol (MCP) to dynamically coordinate access to biochemical databases, modular toolchains, and task-specific AI models. The system integrates four state-of-the-art components: MaSIF (MaSIF-site and MaSIF-seed-search) for geometric deep learning-based identification of protein-protein interaction (PPI) sites, Rosetta for grafting protein fragments onto protein backbones to form mini proteins, ProteinMPNN for amino acid sequence redesign, and AlphaFold3 for near-experimental accuracy in complex structure prediction. Starting from a target structure, the framework supports de novo binder generation via surface analysis, scaffold grafting and pose construction, sequence optimization, and structure prediction. Additionally, by replacing rigid, script-based workflows with a protocol-driven, LLM-coordinated architecture, the framework improves reproducibility, reduces manual overhead, and ensures extensibility, portability, and auditability across the entire drug design process.

2502.07272 2026-02-03 cs.CL q-bio.GN

GENERator: A Long-Context Generative Genomic Foundation Model

Wei Wu, Qiuyi Li, Yuanyuan Zhang, Zhihao Zhan, Ruipu Chen, Mingyang Li, Kun Fu, Junyan Qi, Yongzhou Bao, Chao Wang, Yiheng Zhu, Zhiyun Zhang, Jian Tang, Fuli Feng, Jieping Ye, Yuwen Liu, Hui Xiong, Zheng Wang

English abstract

The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but existing approaches are often limited by restricted training scope, constrained generative capability, or prohibitive computational cost. We introduce GENERator, a generative genomic foundation model for long-context DNA modeling, with a context length of 98k nucleotides, pre-trained on 386 billion nucleotides of eukaryotic DNA. Without task-specific fine-tuning, GENERator exhibits strong intrinsic capabilities: unsupervised embedding analyses reveal phylogenetically coherent structure, and sequence recovery benchmarks demonstrate generative accuracy comparable to or exceeding state-of-the-art models with substantially improved computational efficiency. In a zero-shot setting, GENERator achieves competitive variant effect prediction performance relative to alignment-based methods, while remaining fully alignment-free and broadly applicable across species. With task-specific fine-tuning, the model attains leading performance on established genomic benchmarks. We further demonstrate practical generative applications. GENERator can generate protein-coding DNA sequences that translate into structurally plausible proteins and, through a prompt-guided design framework, design cis-regulatory elements with targeted activity profiles, including synthetic super-enhancers validated by high-throughput UMI-STARR-seq assays. Together, these results establish GENERator as an efficient and biologically grounded framework for genomic interpretation and programmable sequence design. Code and supplementary resources are available at https://github.com/GenerTeam/GENERator.