arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.02165 2026-03-03 physics.ao-ph q-bio.QM

Investigating the short-term effects of particulate matter (PM) chemical components on mortality and the potential modifying effect of extreme temperature: A time-series analysis in London

Xiaolu Zhang, Anna Font, Anja Tremper, Max Priestman, Shawn Y. Lee, David C. Green, Dimitris Evangelopoulos, Gang I. Chen

Abstract

Particulate matter (PM) is linked to adverse health outcomes, yet the roles of specific PM components and their modification by extreme temperature remain unclear. We examined short-term associations between ten PM chemical components and daily mortality in Greater London (2015-2018). PM components include inorganic aerosols (black carbon from wood burning (BCwb) and traffic exhaust (BCtr), SO4, NO3, and NH4) and organic aerosols (hydrocarbon-like organic aerosol (HOA), biomass burning OA (BBOA), cooking-like OA (COA), more and less oxidized oxygenated OA (MO-OOA and LO-OOA)). We applied quasi-Poisson generalized additive models and weighted quantile sum (WQS) regression to estimate single-pollutant, multi-pollutant, and mixture effects, respectively, and included interaction terms to test effect modification by heat waves and cold spells. All ten components showed positive associations with all-cause mortality in single-pollutant models, with stronger estimated risks for respiratory mortality, particularly for NH4, NO3, and SO4. In mixture analyses, the WQS index was significantly associated with all-cause mortality (RR = 1.015, 95% CI: 1.006-1.024 per 25th-percentile increase) and showed a marginally significant association with respiratory mortality (RR = 1.018, 95% CI: 0.994-1.042). MO-OOA and COA contributed most to all-cause mortality, while BBOA and BCwb dominated respiratory effects. Heat waves consistently amplified respiratory risks in both single-pollutant and mixture models, with little evidence for cardiovascular mortality. Overall, MO-OOA demonstrated harmful associations across outcomes, suggesting a potential toxicity link to secondary atmospheric oxidation processes. These findings support source-specific control strategies and highlight the importance of accounting for extreme temperature in air pollution mitigation policies.
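
Relative risks from log-linear models such as quasi-Poisson regression are obtained by exponentiating a coefficient, RR = exp(β·Δ) for an exposure increment Δ. The following is a minimal sketch of that conversion, assuming a Wald-type interval on the log scale. The function names and the back-calculated standard error are illustrative and are not taken from the paper.

```python
import math

def rr_from_beta(beta, se, delta=1.0, z=1.96):
    """Convert a log-linear coefficient into a relative risk with a Wald CI."""
    rr = math.exp(beta * delta)
    lower = math.exp((beta - z * se) * delta)
    upper = math.exp((beta + z * se) * delta)
    return rr, lower, upper

def beta_from_rr(rr, lower, upper, z=1.96):
    """Back out the log-scale coefficient and standard error from an RR and its CI."""
    beta = math.log(rr)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# All-cause mortality estimate quoted in the abstract: RR = 1.015 (1.006-1.024).
beta, se = beta_from_rr(1.015, 1.006, 1.024)
rr, lower, upper = rr_from_beta(beta, se)
```

An interval that excludes 1 on the RR scale corresponds to a log-scale interval that excludes 0, which is why the respiratory estimate (0.994-1.042) is described as only marginal.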

2603.02036 2026-03-03 q-bio.PE math.DS

Lag-Induced Critical Transitions to Extinction in Replicating Systems

Edward A. Turner, Francisco Crespo, Joan Gimeno, Ernest Fontich, Santiago F. Elena, Josep Sardanyés

Comments 6 pages, 5 figures

Abstract

Replicating systems sustained by error-prone enzymatic amplification can undergo critical transitions between persistence and extinction. In RNA viruses, such transitions are classically governed by mutation rates and fitness landscapes, giving rise to error thresholds and lethal mutagenesis. Motivated by experimental evidence that polymerase-targeting antivirals constrain replication, we analyze replicating systems with explicit delays in replication-enzyme availability. We identify a lag-induced (dynamical) critical transition driven by the loss of temporal coordination between genome translation and replication. At a fixed mutation rate and replicative fitness landscape, populations cross an extinction threshold solely due to time delays. Within the quasispecies framework, replication-translation timing emerges as an independent control parameter, defining a distinct dynamical route to extinction and suggesting new antiviral strategies based on modulating replicase availability. More generally, we propose that the pathway to collapse described in this article can be understood as lag-time-induced tipping (τ-tipping).

2603.01965 2026-03-03 cs.LG q-bio.QM

CoVAE: correlated multimodal generative modeling

Federico Caretti, Guido Sanguinetti

Abstract

Multimodal Variational Autoencoders have emerged as a popular tool to extract effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic data sets demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.

2603.01873 2026-03-03 q-bio.BM

Bi-TEAM: A Unified Cross-Scale Representation Learning Framework for Chemically Modified Biomolecules

Chunbin Gu, Zijun Gao, Mutian He, Jingjie Zhang, Haipeng Wen, Zihao Luo, Xiaorui Wang, Hanqun Cao, Jiajun Bu, Chang-Yu Hsieh, Pheng Ann Heng

Comments 57 pages, 16 figures

Abstract

Representation learning for protein biochemical space faces a difficult trade-off: protein language models excel at capturing long-range biological semantics but often miss fine-grained chemical details. Conversely, chemical language models encode atomic information but lack broader sequence context. To address this, we introduce Bi-TEAM (Bi-gated Residual Space Modification), a general framework that injects localized chemical variation into global protein contexts. By ensuring robustness against perturbations such as non-canonical amino acids, post-translational modifications (PTMs), and topological constraints, Bi-TEAM uncovers functional chemical dependencies often missed by evolutionary baselines. Mechanistically, Bi-TEAM maps non-canonical residues to their natural counterparts and injects atomic-level data via a bi-gated residual fusion mechanism. Crucially, this process uses modification-aware prompts to ensure that local structural changes influence global functional representations without requiring alphabet expansion. We evaluated Bi-TEAM on ten datasets spanning chemically modified peptides, PTMs, and natural proteins. The model consistently outperformed state-of-the-art baselines, achieving up to a 66 percent improvement in Matthews correlation coefficient (MCC) on scaffold-similarity splits and a 350 percent increase in hemolysis prediction accuracy. Furthermore, when deployed as an oracle for generative modeling, Bi-TEAM nearly quadrupled the success rate for designing cell-penetrating cyclic peptides. By unifying biological semantics with chemical precision, Bi-TEAM provides a versatile foundation for machine learning driven exploration of peptide and protein biochemical space.
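
The abstract reports gains in Matthews correlation coefficient (MCC). For reference, a minimal sketch of the standard binary MCC computed from confusion counts follows. The labels are made-up illustration data, not results from the paper.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Binary MCC for 0/1 labels; returns 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical labels: 3 positives, 5 negatives, one error of each kind.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = matthews_corrcoef(y_true, y_pred)
```

MCC ranges from -1 to 1 and stays informative under class imbalance, which is one reason it is often preferred over plain accuracy for peptide property datasets.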

2603.01849 2026-03-03 q-bio.QM

Characterization of the novel transposon Tn7722 harboring blaNDM-1: Insights into the evolutionary dynamics of resistance in Klebsiella pneumoniae

Tram Vo, Aïcha Hamieh, Marc Levy, Pierre Pontarotti, Jean-Marc Rolain, Vicky Merhej

Abstract

Background: Klebsiella pneumoniae is a major opportunistic pathogen responsible for various invasive infections. The rise of carbapenem-resistant K. pneumoniae, primarily due to acquisition of blaNDM genes, presents a serious global health threat. In French Polynesia, where international travel is frequent, sporadic cases of NDM-producing Enterobacterales have emerged. This study aims to characterize the genomic features of NDM-producing K. pneumoniae isolates collected in French Polynesia and evaluate the roles of clonal expansion and horizontal gene transfer mediated by mobile genetic elements (MGEs) in blaNDM spread. Materials and Methods: Between July 2006 and September 2021, 17 carbapenemase-producing K. pneumoniae isolates were identified from 715 clinical samples in Tahiti. Whole-genome sequencing using Illumina MiSeq and Oxford Nanopore technologies was performed. Results: Seven NDM-producing K. pneumoniae strains were identified, five carrying blaNDM-1 and two carrying blaNDM-9. All isolates were resistant to ertapenem (MICs 1 to >32 mg/L), with three resistant to imipenem (MICs 8 to >32 mg/L) and six to meropenem (MICs 2 to >8 mg/L). A novel IS26-mediated composite transposon, Tn7722 (16,246 bp), was detected in four isolates on IncF and IncR plasmids. This transposon also carried qnrS1 and aph(3')-VI genes, conferring resistance to fluoroquinolones and aminoglycosides. Tn7722-like elements were found in diverse bacterial genomes worldwide, suggesting that the transposon facilitates blaNDM transmission across multiple species and regions. Conclusion: NDM-producing K. pneumoniae in French Polynesia remain sporadic but genetically diverse, without evidence of local outbreak. These findings underscore the role of plasmid- and Tn7722-driven evolution and adaptation. Ongoing genomic surveillance is vital to track the evolution of high-risk clones and MGEs and to guide effective containment.

2602.23797 2026-03-03 physics.soc-ph q-bio.PE

Co-spreading dynamics of smoking behavior and awareness on social contact networks

Saicharan Ritwik Chinni, Anupama Sharma

Abstract

Smoking behavior and awareness co-spread through social interactions, giving rise to coupled contagion processes on social contact networks. In addition to initiation and cessation, awareness of the harmful effects of smoking plays an important role in shaping individual behavior and population-level outcomes. In this work, we develop a mathematical model to study the coupled dynamics of smoking behavior, quitting, and awareness in a population. A deterministic framework based on ordinary differential equations is first formulated to capture the interplay between social influence and awareness-driven behavioral change. Analysis of the model reveals the existence of smoking-free and smoking-endemic steady states, and identifies conditions under which awareness can reduce or suppress the persistence of smoking. Since social interactions are often localized rather than well mixed, the mean-field description is then extended to a network-based model that incorporates structured contact patterns. Numerical simulations performed on empirical social networks indicate that contact heterogeneity and localized awareness spreading can influence the effectiveness of interventions. Our findings underscore the importance of population structure when devising awareness-based intervention strategies for smoking cessation.

2512.18114 2026-03-03 q-bio.QM

Greater than the Sum of Its Parts: Building Substructure into Protein Encoding Models

Robert Calef, Arthur Liang, Manolis Kellis, Marinka Zitnik

Abstract

Protein representation learning has advanced rapidly with the scale-up of sequence and structure supervision, but most models still encode proteins either as per-residue token sequences or as single global embeddings. This overlooks a defining property of protein organization: proteins are built from recurrent, evolutionarily conserved substructures that concentrate biochemical activity and mediate core molecular functions. Although substructures such as domains and functional sites are systematically cataloged, they are rarely used as training signals or representation units in protein models. We introduce Magneton, an environment for developing substructure-aware protein models. Magneton provides (1) a dataset of 530,601 proteins annotated with over 1.7 million substructures spanning 13,075 types, (2) a training framework for incorporating substructures into existing protein models, and (3) a benchmark suite of 13 tasks probing representations at the residue, substructural, and protein levels. Using Magneton, we develop substructure-tuning, a supervised fine-tuning method that distills substructural knowledge into pretrained protein models. Across state-of-the-art sequence- and structure-based models, substructure-tuning improves function prediction, yields more consistent representations of substructure types never observed during tuning, and shows that substructural supervision provides information that is complementary to global structure inputs. The Magneton environment, datasets, and substructure-tuned models are all openly available (https://github.com/rcalef/magneton/).

2506.09007 2026-03-03 cs.LG q-bio.QM

Branched Schrödinger Bridge Matching

Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee

Comments Published at ICLR 2026. (Proceedings of the 14th International Conference on Learning Representations, Rio de Janeiro, Brazil)

Abstract

Predicting the intermediate trajectories between an initial and target distribution is a central problem in generative modeling. Existing approaches, such as flow matching and Schrödinger bridge matching, effectively learn mappings between two distributions by modeling a single stochastic path. However, these methods are inherently limited to unimodal transitions and cannot capture branched or divergent evolution from a common origin to multiple distinct modes. To address this, we introduce Branched Schrödinger Bridge Matching (BranchSBM), a novel framework that learns branched Schrödinger bridges. BranchSBM parameterizes multiple time-dependent velocity fields and growth processes, enabling the representation of population-level divergence into multiple terminal distributions. We show that BranchSBM is not only more expressive but also essential for tasks involving multi-path surface navigation, modeling cell fate bifurcations from homogeneous progenitor states, and simulating diverging cellular responses to perturbations.

2501.06762 2026-03-03 q-bio.NC cs.LG cs.NE

Improving the adaptive and continuous learning capabilities of artificial neural networks: Lessons from multi-neuromodulatory dynamics

Jie Mei, Alejandro Rodriguez-Garcia, Daigo Takeuchi, Gabriel Wainstein, Nina Hubig, Yalda Mohsenzadeh, Srikanth Ramaswamy

Abstract

Continuous, adaptive learning, the ability to adapt to the environment and keep improving performance, is a hallmark of natural intelligence. Biological organisms excel in acquiring, transferring, and retaining knowledge while adapting to volatile environments, making them a source of inspiration for artificial neural networks (ANNs). This study explores how neuromodulation, a building block of learning in biological systems, can help address catastrophic forgetting and enhance the robustness of ANNs in continual learning. Driven by neuromodulators including dopamine (DA), acetylcholine (ACh), serotonin (5-HT) and noradrenaline (NA), neuromodulatory processes in the brain operate at multiple scales, facilitating dynamic responses to environmental changes through mechanisms ranging from local synaptic plasticity to global network-wide adaptability. Importantly, the relationship between neuromodulators and their interplay in modulating sensory and cognitive processes is more complex than previously expected, demonstrating a "many-to-one" neuromodulator-to-task mapping. To inspire neuromodulation-aware learning rules, we highlight (i) how multi-neuromodulatory interactions enrich single-neuromodulator-driven learning, (ii) the impact of neuromodulators across multiple spatio-temporal scales, and correspondingly, (iii) strategies for approximating and integrating neuromodulated learning processes in ANNs. To illustrate these principles, we present a conceptual study to showcase how neuromodulation-inspired mechanisms, such as DA-driven reward processing and NA-based cognitive flexibility, can enhance ANN performance in a Go/No-Go task. Through multi-scale neuromodulation, we aim to bridge the gap between biological and artificial learning, paving the way for ANNs with greater flexibility, robustness, and adaptability.

2307.14025 2026-03-03 cs.LG cs.CV eess.IV q-bio.QM stat.ML

Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios

Salome Kazeminia, Carsten Marr, Bastian Rieck

Journal ref: Transactions on Machine Learning Research, 2026
Abstract

Multiple instance learning (MIL) is a framework for weakly supervised classification, where labels are assigned to sets of instances, i.e., bags, rather than to individual data points. This paradigm has proven effective in tasks where fine-grained annotations are unavailable or costly to obtain. However, the effectiveness of MIL drops sharply when training data are scarce, such as for rare disease classification. To address this challenge, we propose incorporating topological inductive biases into the data representation space within the MIL framework. This bias introduces a topology-preserving constraint that encourages the instance encoder to maintain the topological structure of the instance distribution within each bag when mapping them to MIL latent space. As a result, our Topology Guided MIL (TG-MIL) method enhances the performance and generalizability of MIL classifiers across different aggregation functions, especially under scarce-data regimes. Our evaluations show average performance improvements of 15.3% for synthetic MIL datasets, 2.8% for MIL benchmarks, and 5.5% for rare anemia classification compared to current state-of-the-art MIL models, where only 17-120 samples per class are available. We make our code publicly available.
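
To make the bag-level setup concrete, here is a minimal sketch of MIL prediction with two common aggregation functions (max and mean pooling). The linear instance scorer, the weights, and the toy bags are hypothetical, and the topology-preserving constraint of TG-MIL is not implemented here.

```python
def instance_score(x, w):
    """Hypothetical linear instance encoder: dot product of features and weights."""
    return sum(xi * wi for xi, wi in zip(x, w))

def bag_score(bag, w, pooling="max"):
    """Aggregate instance scores into a single bag-level score."""
    scores = [instance_score(x, w) for x in bag]
    if pooling == "max":
        return max(scores)
    if pooling == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown pooling: {pooling}")

w = [1.0, -0.5]
# Only the bag carries a label; individual instances are unlabeled.
positive_bag = [[0.1, 0.2], [2.0, 0.0], [0.0, 1.0]]
negative_bag = [[0.1, 0.2], [0.0, 1.0]]

max_pos = bag_score(positive_bag, w, "max")
mean_pos = bag_score(positive_bag, w, "mean")
max_neg = bag_score(negative_bag, w, "max")
```

Max pooling lets a single strongly scoring instance decide the bag, while mean pooling dilutes it, which is why results are usually reported across several aggregation functions.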

2603.01780 2026-03-03 cs.LG q-bio.GN

D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation

Zhao Yang, Hengchang Liu, Chuan Cao, Bing Su

Comments Accepted as a workshop paper at MLGenX 2026

Abstract

Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at https://huggingface.co/collections/Hengchang-Liu/d3lm.
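
The forward (corruption) step of masked discrete diffusion can be sketched in a few lines: at noise level t, each token is independently replaced by a mask symbol with probability t, and a model is trained to recover the originals. This is the generic absorbing-state formulation. Single-nucleotide tokens, the `[MASK]` symbol, and the function name are assumptions for illustration, since the abstract does not specify D3LM's tokenization or noise schedule.

```python
import random

MASK = "[MASK]"  # assumed mask symbol

def mask_sequence(tokens, t, rng):
    """Forward corruption: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
dna = list("ACGTACGTACGTACGTACGT")

lightly_masked = mask_sequence(dna, 0.1, rng)  # early diffusion step
fully_masked = mask_sequence(dna, 1.0, rng)    # terminal state: all masked
untouched = mask_sequence(dna, 0.0, rng)       # t = 0 leaves the data intact
```

Because unmasked positions on both sides of a masked token remain visible, the denoiser can condition on context in both directions, which is the contrast with left-to-right autoregressive generation drawn in the abstract.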

2603.01774 2026-03-03 q-bio.PE math.PR

Approximate message passing for block-structured ecological systems

Maxime Clenet, Mohammed-Younes Gueddari

Abstract

Ecological interaction networks are rarely homogeneous: species naturally form communities with distinct interaction structures, resulting in block-structured variance and correlation profiles in the interaction matrix. We study the equilibrium properties of generalized Lotka-Volterra systems whose interaction matrices are random and non-symmetric with variance and correlation profiles. Based on recent advances in approximate message passing (AMP) for heterogeneous and correlated random matrices, we derive a set of self-consistent fixed-point equations that, in the large-$n$ limit, characterize the equilibrium abundance distribution. In particular, we show that this limiting distribution is an explicit mixture of truncated Gaussians, driven by the variance and correlation profiles. We then illustrate the ecological implications of this result through three applications involving two interacting communities. First, we show that local changes in the correlation profile within a single community induce system-wide responses in species persistence, revealing the non-local nature of persistence dynamics. Second, we find that communities dominated by mutualistic or competitive interactions are more robust to increasing inter-community coupling, whereas communities structured by predator-prey interactions are more prone to collapse. Third, we demonstrate that asymmetric interaction variance alone, in the complete absence of correlation, can generate a feedback loop between communities.
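
As a rough numerical companion to the setting described, the sketch below simulates a two-community generalized Lotka-Volterra system with a block-structured variance profile. It is a direct simulation, not the AMP analysis from the paper. Unit growth rates, unit self-regulation, independent Gaussian entries (no correlation profile), Euler integration, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def block_interaction_matrix(sizes, sigma, rng):
    """Entry (i, j) is Gaussian with std sigma[block(i)][block(j)] / sqrt(n)."""
    n = sum(sizes)
    block = [b for b, size in enumerate(sizes) for _ in range(size)]
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i][j] = rng.gauss(0.0, sigma[block[i]][block[j]] / math.sqrt(n))
    return A

def simulate_glv(A, steps=2000, dt=0.01):
    """Euler integration of dx_i/dt = x_i * (1 - x_i + sum_j A_ij x_j)."""
    n = len(A)
    x = [1.0] * n
    for _ in range(steps):
        growth = [1.0 - x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [max(0.0, x[i] + dt * x[i] * growth[i]) for i in range(n)]
    return x

rng = random.Random(1)
sizes = [15, 15]
# Asymmetric variance profile: community 1 is affected more strongly by
# community 2 (0.5) than the other way around (0.1).
sigma = [[0.2, 0.5], [0.1, 0.2]]
A = block_interaction_matrix(sizes, sigma, rng)
x_final = simulate_glv(A)
```

A histogram of `x_final` per community, taken over many random draws and larger n, is the kind of empirical abundance distribution that the truncated-Gaussian mixture in the abstract is meant to describe.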

2603.01701 2026-03-03 q-bio.PE cs.MA

A speciation simulation that partly passes open-endedness tests

Théo de Pinho, Lana Sinapayen

Comments 12 pages, 4 figures

Abstract

One of the main goals of artificial life research is to recreate in artificial systems the trends for ever more complex and novel entities, interactions and processes that we see in Earth's biosphere, that is, to create open-ended systems. In this paper, we test for Tokyo type 1 open-ended evolution (OEE) of the Tree of Life Simulation (ToLSim), an artificial life software created by Lana Sinapayen. To do so, we conducted an experiment to measure evolutionary activity statistics. These require us to define the notion of components. Here, we define components as the agent's genes. The results show that ToLSim is capable of exhibiting unbounded total cumulative evolutionary activity. However, total and median normalized cumulative evolutionary activity appear bounded and new evolutionary activity is persistently null, suggesting that ToLSim is not open-ended. Further studies on ToLSim could repeat this experiment with individuals or even species, rather than genes, to test whether the present results are valid.

2603.01568 2026-03-03 cs.LG cs.CV cs.IT math.IT q-bio.NC

Rate-Distortion Signatures of Generalization and Information Trade-offs

Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin

Abstract

Generalization to novel visual conditions remains a central challenge for both human and machine vision, yet standard robustness metrics offer limited insight into how systems trade accuracy for robustness. We introduce a rate-distortion-theoretic framework that treats stimulus-response behavior as an effective communication channel, derives rate-distortion (RD) frontiers from confusion matrices, and summarizes each system with two interpretable geometric signatures, slope ($β$) and curvature ($κ$), which capture the marginal cost and abruptness of accuracy-robustness trade-offs. Applying this framework to human psychophysics and 18 deep vision models under controlled image perturbations, we compare generalization geometry across model architectures and training regimes. We find that both biological and artificial systems follow a common lossy-compression principle but occupy systematically different regions of RD space. In particular, humans exhibit smoother, more flexible trade-offs, whereas modern deep networks operate in steeper and more brittle regimes even at matched accuracy. Across training regimes, robustness training induces systematic but dissociable shifts in $β$/$κ$, revealing cases where improved robustness or accuracy does not translate into more human-like generalization geometry. These results demonstrate that RD geometry provides a compact, model-agnostic lens for comparing generalization behavior across systems beyond standard accuracy-based metrics.
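
Treating a confusion matrix as a communication channel can be illustrated with a small sketch that computes one operating point: the rate as the mutual information between stimulus and response (in bits) and the distortion as the error rate. The 0-1 distortion measure and the function name are assumptions here. The paper's construction of full RD frontiers and of the slope and curvature signatures is not reproduced.

```python
import math

def rate_and_distortion(confusion):
    """confusion[i][j] = count of stimulus class i answered as class j."""
    total = sum(sum(row) for row in confusion)
    p = [[c / total for c in row] for row in confusion]
    px = [sum(row) for row in p]                        # stimulus marginal
    py = [sum(row[j] for row in p) for j in range(len(p[0]))]  # response marginal
    rate = sum(
        p[i][j] * math.log2(p[i][j] / (px[i] * py[j]))
        for i in range(len(p))
        for j in range(len(p[0]))
        if p[i][j] > 0
    )
    distortion = 1.0 - sum(p[i][i] for i in range(len(p)))  # 0-1 loss
    return rate, distortion

perfect = [[50, 0], [0, 50]]    # every stimulus identified correctly
chance = [[25, 25], [25, 25]]   # responses independent of the stimulus

rate_perfect, dist_perfect = rate_and_distortion(perfect)
rate_chance, dist_chance = rate_and_distortion(chance)
```

Sweeping a perturbation strength and plotting the resulting (rate, distortion) pairs would trace an empirical curve whose steepness and bend are the kind of quantities the slope and curvature signatures summarize.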

2603.01537 2026-03-03 cs.AI q-bio.BM q-bio.QM

Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing?

Youssef Abo-Dahab, Ruby Hernandez, Ismael Caleb Arechiga Duran

Comments 34 pages, 5 figures. Under review at Discover Artificial Intelligence

Abstract

The contributions of model complexity, data volume, and feature modalities to knowledge graph-based drug repurposing remain poorly quantified under rigorous temporal validation. We constructed a pharmacology knowledge graph from ChEMBL 36 comprising 5,348 entities including 3,127 drugs, 1,156 proteins, and 1,065 indications. A strict temporal split was enforced with training data up to 2022 and testing data from 2023 to 2025, together with biologically verified hard negatives mined from failed assays and clinical trials. We benchmarked five knowledge graph embedding models and a standard graph neural network with 3.44 million parameters that incorporates drug chemical structure using a graph attention encoder and ESM-2 protein embeddings. Scaling experiments ranging from 0.78 to 9.75 million parameters and from 25 to 100 percent of the data, together with feature ablation studies, were used to isolate the contributions of model capacity, graph density, and node feature modalities. Removing the graph-attention-based drug structure encoder and retaining only topological embeddings combined with ESM-2 protein features improved drug-protein PR-AUC from 0.5631 to 0.5785 while reducing VRAM usage from 5.30 GB to 353 MB. Replacing the drug encoder with Morgan fingerprints further degraded performance, indicating that explicit chemical structure representations can be detrimental for predicting pharmacological network interactions. Increasing model size beyond 2.44 million parameters yielded diminishing returns, whereas increasing training data consistently improved performance. External validation confirmed 6 of the top 14 novel predictions as established therapeutic indications. These results show that drug pharmacological behavior can be accurately predicted using target-centric information and drug network topology alone, without requiring explicit chemical structure representations.
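
The strict temporal split described above (train up to 2022, test on 2023 to 2025) can be sketched as a simple filter on edge timestamps. The record layout, the field names, and the example edges are hypothetical and are not the ChEMBL schema.

```python
def temporal_split(edges, train_until=2022, test_from=2023, test_until=2025):
    """Split edges by year so that no test edge predates the training cutoff."""
    train = [e for e in edges if e["year"] <= train_until]
    test = [e for e in edges if test_from <= e["year"] <= test_until]
    return train, test

# Hypothetical drug-target edges with the year each was first reported.
edges = [
    {"drug": "drug_A", "target": "protein_1", "year": 2019},
    {"drug": "drug_B", "target": "protein_2", "year": 2022},
    {"drug": "drug_A", "target": "protein_3", "year": 2023},
    {"drug": "drug_C", "target": "protein_1", "year": 2025},
    {"drug": "drug_D", "target": "protein_4", "year": 2026},  # outside both windows
]

train_edges, test_edges = temporal_split(edges)
```

Compared with a random split, this keeps future interactions out of training, so the evaluation resembles the prospective use case of predicting links that have not yet been reported.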

2603.01184 2026-03-03 cs.LG cs.AI q-bio.NC stat.CO

Scaling of learning time for high dimensional inputs

Carlos Stein Brito

Comments 14 pages, 5 figures

Abstract

Representation learning from complex data typically involves models with a large number of parameters, which in turn require large amounts of data samples. In neural network models, model complexity grows with the number of inputs to each neuron, with a trade-off between model expressivity and learning time. A precise characterization of this trade-off would help explain the connectivity and learning times observed in artificial and biological networks. We present a theoretical analysis of how learning time depends on input dimensionality for a Hebbian learning model performing independent component analysis. Based on the geometry of high-dimensional spaces, we show that the learning dynamics reduce to a unidimensional problem, with learning times dependent only on initial conditions. For higher input dimensions, initial parameters have smaller learning gradients and larger learning times. We find that learning times have supralinear scaling, becoming quickly prohibitive for high input dimensions. These results reveal a fundamental limitation for learning in high dimensions and help elucidate how the optimal design of neural networks depends on data complexity. Our approach outlines a new framework for analyzing learning dynamics and model complexity in neural network models.

2603.01054 2026-03-03 q-bio.OT

Topological analysis of bladder filling

Arturo Tozzi

Comments 8 pages, 1 figure

Abstract

Bladder function is typically assessed through pressure-volume relations, compliance indices and flow measurements, whereas structural evaluation relies largely on qualitative imaging findings. These approaches do not formally quantify how bladder geometry changes during filling. To distinguish structural reorganization from pure mechanical stiffness, we developed a simulation-based topological analysis of bladder filling grounded in mechanical parameters derived from the literature. Progressive filling was modeled under quasi-static conditions, generating multi-volume geometries from which spatial descriptors were computed. Drawing on the Freudenthal suspension theorem, filling was interpreted as a dimensional expansion process and structural stability was evaluated by testing whether geometric invariants remain preserved across increasing volumes. Simulated smooth expansion and controlled structural perturbations were compared under identical loading conditions. Pressure trajectories and wall stress estimates were similar across configurations when compliance was matched, whereas geometric descriptors showed divergent volume-indexed stability profiles in the presence of remodeling. Computable instability measures identified progressive spatial heterogeneity despite preserved global pressure behavior. By providing a quantitative measure of geometric continuity across successive filling states, our approach indicates that structural remodeling may become detectable before conventional functional impairment appears. Progressive surface irregularity can arise even when compliance, detrusor pressure and flow parameters remain within reference limits. Serial imaging over time may support identification of individuals at greater risk of diverticula formation, decompensation or structural complications despite stable pressure measurements.

2512.11582 2026-03-03 cs.LG cs.CV q-bio.NC

Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

Sam Gijsen, Marc-Andre Schulz, Kerstin Ritter

Comments Accepted at ICLR 2026. Code and pretrained models available at https://github.com/SamGijsen/Brain-Semantoks

Abstract

The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.

2511.11758 2026-03-03 q-bio.QM cs.AI

Protein Structure Tokenization via Geometric Byte Pair Encoding

Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Marinka Zitnik

Comments ICLR 2026

Abstract

Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an SE(3) end-frame loss. GeoBPE offers compression (>10x reduction in bits-per-residue at similar distortion rate), data efficiency (>10x less training data), and generalization (maintains test/train distortion ratio of 1.0-1.1). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across 12 tasks and 24 test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs. Code is available at https://github.com/shiningsunnyday/PT-BPE/.

2510.25976 2026-03-03 cs.CV cs.AI q-bio.NC

Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, Michal Irani

Comments Accepted at ICLR 2026

Abstract

Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally similar brain voxels. These functional clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.

2509.26560 2026-03-03 stat.ML cs.LG q-bio.NC

Estimating Dimensionality of Neural Representations from Finite Samples

Chanwoo Chun, Abdulkadir Canatar, SueYeon Chung, Daniel Lee

Abstract

The global dimensionality of a neural representation manifold provides rich insight into the computational process underlying both artificial and biological neural networks. However, all existing measures of global dimensionality are sensitive to the number of samples, i.e., the number of rows and columns of the sample matrix. We show that, in particular, the participation ratio of eigenvalues, a popular measure of global dimensionality, is highly biased with small sample sizes, and propose a bias-corrected estimator that is more accurate with finite samples and with noise. On synthetic data examples, we demonstrate that our estimator can recover the true known dimensionality. We apply our estimator to neural brain recordings, including calcium imaging, electrophysiological recordings, and fMRI data, and to the neural activations in a large language model and show our estimator is invariant to the sample size. Finally, our estimators can additionally be used to measure the local dimensionalities of curved neural manifolds by weighting the finite samples appropriately.
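
For readers unfamiliar with the participation ratio mentioned above, the following is a minimal sketch of the standard (uncorrected) estimator, PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues), computed from a sample covariance matrix. It is not the paper's bias-corrected estimator; the function name `participation_ratio` and the isotropic Gaussian test data are assumptions chosen purely for illustration.

```python
import numpy as np

def participation_ratio(X):
    """Naive participation ratio of the sample covariance of X (samples x features).

    Uses the identity PR = tr(C)^2 / tr(C^2), which equals
    (sum of eigenvalues)^2 / (sum of squared eigenvalues).
    """
    Xc = X - X.mean(axis=0, keepdims=True)   # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance
    # cov is symmetric, so tr(cov @ cov) is the sum of its squared entries
    return np.trace(cov) ** 2 / np.sum(cov * cov)

# Illustration (assumed setup): isotropic Gaussian data whose true dimensionality is 50.
rng = np.random.default_rng(0)
true_dim = 50
pr_small = participation_ratio(rng.standard_normal((20, true_dim)))
pr_large = participation_ratio(rng.standard_normal((5000, true_dim)))
print(f"PR with 20 samples:   {pr_small:.2f}")
print(f"PR with 5000 samples: {pr_large:.2f}")
```

With only 20 samples the centered covariance has rank at most 19, so the naive estimate cannot reach the true value of 50. This is the kind of sample-size dependence the abstract describes.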

2509.20719 2026-03-03 cs.LG q-bio.QM

A Genetic Algorithm for Navigating Synthesizable Molecular Spaces

Alston Lo, Connor W. Coley, Wojciech Matusik

Comments ICLR 2026

Abstract

Inspired by the effectiveness of genetic algorithms and the importance of synthesizability in molecular design, we present SynGA, a simple genetic algorithm that operates directly over synthesis routes. Our method features custom crossover and mutation operators that explicitly constrain it to synthesizable molecular space. By modifying the fitness function, we demonstrate the effectiveness of SynGA on a variety of design tasks, including synthesizable analog search and sample-efficient property optimization, for both 2D and 3D objectives. Furthermore, by coupling SynGA with a machine learning-based filter that focuses the building block set, we boost SynGA to state-of-the-art performance. For property optimization, this manifests as a model-based variant SynGBO, which employs SynGA and block filtering in the inner loop of Bayesian optimization. Since SynGA is lightweight and enforces synthesizability by construction, our hope is that SynGA can not only serve as a strong standalone baseline but also as a versatile module that can be incorporated into larger synthesis-aware workflows in the future.

2508.14492 2026-03-03 q-bio.NC cs.AI nlin.AO

Synaptic bundle theory for spike-driven sensor-motor system: More than eight independent synaptic bundles collapse reward-STDP learning

Takeshi Kobayashi, Shogo Yonekura, Yasuo Kuniyoshi

Comments 5 pages, 4 figures

Abstract

Neuronal spikes directly drive muscles and endow animals with agile movements, but applying the spike-based control signals to actuators in artificial sensor-motor systems inevitably causes a collapse of learning. We developed a system that can vary the number of independent synaptic bundles in sensor-to-motor connections. This paper demonstrates the following four findings: (i) Learning collapses once the number of motor neurons or the number of independent synaptic bundles exceeds a critical limit. (ii) The probability of learning failure is increased by a smaller number of motor neurons, while (iii) if learning succeeds, a smaller number of motor neurons leads to faster learning. (iv) The number of weight updates that move in the opposite direction of the optimal weight can quantitatively explain these results. The functions of spikes remain largely unknown. Identifying the parameter range in which learning systems using spikes can be constructed will make it possible to study the functions of spikes that were previously inaccessible due to the difficulty of learning.

2508.11674 2026-03-03 cs.NE cs.AI q-bio.NC

Learning Internal Biological Neuron Parameters and Complexity-Based Encoding for Improved Spiking Neural Networks Performance

Zofia Rudnicka, Janusz Szczepanski, Agnieszka Pregowska

Abstract

This study proposes a novel learning paradigm for spiking neural networks (SNNs) that replaces the perceptron-inspired abstraction with biologically grounded neuron models, jointly optimizing synaptic weights and intrinsic neuronal parameters. We evaluate two architectures, leaky integrate-and-fire (LIF) and a meta-neuron model, under fixed and learnable intrinsic dynamics. Additionally, we introduce a biologically inspired classification framework that combines SNN dynamics with Lempel-Ziv complexity (LZC), enabling efficient and interpretable classification of spatiotemporal spike data. Training is conducted using surrogate-gradient backpropagation, spike-timing-dependent plasticity (STDP), and the Tempotron rule on spike trains generated from Poisson processes, widely adopted in computational neuroscience as a standard stochastic model of neuronal spike generation due to their analytical tractability and empirical relevance. Learning intrinsic parameters improves classification accuracy by up to 13.50 percentage points for LIF networks and 8.50 for meta-neuron models compared to baselines tuning only network size and learning rate. The proposed SNN-LZC classifier achieves up to 99.50% accuracy with sub-millisecond inference latency and competitive energy consumption. We further provide theoretical justification by formalizing how optimizing intrinsic dynamics enlarges the hypothesis class and proving descent guarantees for intrinsic-parameter updates under standard smoothness assumptions, linking intrinsic optimization to provable improvements in the surrogate objective.
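
To make the terms above concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron driven by a Poisson spike train. It is not the authors' implementation: the function names, the Euler discretization, and all parameter values are assumptions. The membrane time constant `tau_m` and threshold `v_thresh` stand in for the kind of intrinsic neuronal parameters the abstract proposes to learn alongside synaptic weights.

```python
import numpy as np

def poisson_spike_train(rate_hz, duration_s, dt, rng):
    """Bernoulli approximation of a Poisson process: one spike per bin with prob. rate*dt."""
    n_steps = int(round(duration_s / dt))
    return rng.random(n_steps) < rate_hz * dt

def simulate_lif(input_spikes, weight, tau_m=0.02, v_thresh=1.0, v_reset=0.0, dt=0.001):
    """Euler-integrated LIF neuron; returns a boolean array of output spikes."""
    v = 0.0
    out = np.zeros(len(input_spikes), dtype=bool)
    for t, spike in enumerate(input_spikes):
        v += dt * (-v / tau_m)        # leak toward the resting potential (0)
        v += weight * float(spike)    # instantaneous synaptic jump on an input spike
        if v >= v_thresh:             # threshold crossing emits a spike and resets
            out[t] = True
            v = v_reset
    return out

rng = np.random.default_rng(0)
inputs = poisson_spike_train(rate_hz=100.0, duration_s=1.0, dt=0.001, rng=rng)
# Same input, two hypothetical values of the intrinsic time constant
for tau in (0.005, 0.05):
    n_out = simulate_lif(inputs, weight=0.4, tau_m=tau).sum()
    print(f"tau_m={tau}: {n_out} output spikes from {inputs.sum()} input spikes")
```

Changing `tau_m` alters how long input spikes are integrated before leaking away, which is one simple way an intrinsic parameter shapes the output spike train independently of the synaptic weight.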

2508.10760 2026-03-03 q-bio.BM cs.AI

FROGENT: An End-to-End Full-process Drug Design Multi-Agent System

Qihua Pan, Dong Xu, Qianwei Yang, Jenna Xinyi Yao, Sisi Yuan, Zexuan Zhu, Jianqiang Li, Junkai Ji

Comments 37 pages, 20 figures

Abstract

Drug discovery is a complex, multi-step pipeline that remains heavily dependent on manual, experience-driven operations; meanwhile, existing customized artificial intelligence tools are fragmented across web applications, desktop software, and code libraries, resulting in incompatible interfaces and inefficient, burdensome workflows. To overcome these challenges, we propose FROGENT, a full-process drug design multi-agent system that leverages the planning, reasoning, and tool-use capabilities of large language models (LLMs) to unify drug discovery within a closed-loop and autonomous framework. FROGENT is a collaborative multi-agent system comprising a central Orchestrate Agent for strategic workflow coordination and three distributed agents, Retrieve, Forge, and Gauge, that employ dynamic biochemical databases, extensible tool libraries, and task-specific computational models via the Model Context Protocol. This architecture enables end-to-end execution of complex drug discovery pipelines, covering target identification, small-molecule generation, peptide optimization, and retrosynthetic planning. Across eight benchmarks spanning core drug discovery tasks, FROGENT consistently outperforms six increasingly advanced ReAct-style agents. Case studies further demonstrate its practicality and generalization across real-world small-molecule and peptide design scenarios. Overall, FROGENT not only achieves substantial gains in efficiency and accuracy, but also demonstrates the potential of LLM-based agentic systems to autonomously orchestrate drug development pipelines, reducing, or even replacing, reliance on manual, experience-driven human intervention.

2506.07459 2026-03-03 cs.LG q-bio.QM

ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning

Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu

Abstract

Protein generative models have shown remarkable promise in protein design, yet their success rates remain constrained by reliance on curated sequence-structure datasets and by misalignment between supervised objectives and real design goals. We present ProteinZero, an online reinforcement learning framework for inverse folding models that enables scalable, automated, and continuous self-improvement with computationally efficient feedback. ProteinZero employs a reward pipeline that combines structural guidance from ESMFold with a novel self-derived ddG predictor, providing stable multi-objective signals while avoiding the prohibitive cost of physics-based methods. To ensure robustness in online RL, we further introduce a novel embedding-level diversity regularizer that mitigates mode collapse and promotes functionally meaningful sequence variation. Within a general RL formulation balancing multi-reward optimization, KL-divergence from a reference model, and diversity regularization, ProteinZero achieves robust improvements across designability, stability, recovery, and diversity. On the CATH-4.3 benchmark, it consistently outperforms state-of-the-art baselines including ProteinMPNN, ESM-IF, and InstructPLM, reducing design failure rates by 36-48% and achieving success rates above 90% across diverse folds. Importantly, a complete RL run can be executed on a single 8-GPU node within three days, including reward computation and data generation. These results indicate that efficient online RL fine-tuning can complement supervised pretraining by allowing protein generative models to evolve continuously from their own outputs and optimize multiple design objectives without labeled data, opening new possibilities for exploring the vast protein design space. Full source code and model checkpoints will be released upon publication.

2506.06750 2026-03-03 cs.AI cs.LG q-bio.NC

Accuracy-Efficiency Trade-Offs in Spiking Neural Networks: A Lempel-Ziv Complexity Perspective on Learning Rules

Zofia Rudnicka, Janusz Szczepanski, Agnieszka Pregowska

Abstract

Training spiking neural networks (SNNs) remains challenging due to temporal dynamics, non-differentiability of spike events, and sparse event-driven activations. This paper studies how the choice of learning paradigm (unsupervised, supervised, and hybrid) affects classification performance and computational cost in temporal pattern recognition. Building on our earlier study [Rudnicka et al., 2026], we use Lempel-Ziv complexity (LZC) as a compact, decision-relevant descriptor of spike-train temporal organization to quantify how different learning rules reshape class-conditional temporal structure. The pipeline combines a leaky integrate-and-fire (LIF) SNN with an LZC-based decision rule. We evaluate learning rules on synthetic sources with controlled temporal statistics (Bernoulli, two-state Markov, and Poisson spike processes) and on two-class subsets of MNIST and N-MNIST. Across datasets, gradient-based learning achieves the highest accuracy but at high computational cost, whereas bio-inspired rules (e.g., Tempotron and SpikeProp) offer favorable accuracy-efficiency trade-offs. These results highlight that selecting a learning rule should be guided by application constraints and the desired balance between separability and computational overhead.
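
As a rough illustration of the Lempel-Ziv complexity used above as a spike-train descriptor, the sketch below counts phrases in the classic Lempel-Ziv (1976) parsing of a binary string. The abstract does not say which LZC variant or normalization the authors use, so the choice of the 1976 parsing, the function name `lz76_complexity`, and the example spike trains are all assumptions.

```python
import random

def lz76_complexity(s):
    """Number of phrases in the Lempel-Ziv (1976) parsing of the string s.

    Each phrase is the shortest substring starting at position i that cannot
    be copied from the text preceding its last symbol (overlap allowed).
    A final phrase that is still copyable at the end of s is counted as well.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 1
        # grow the candidate phrase while it already occurs earlier in the text
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

# Hypothetical spike trains encoded as binary strings (1 = spike in that time bin)
regular = "10" * 32                                     # strictly periodic firing
random.seed(0)
irregular = "".join(random.choice("01") for _ in range(64))

print("periodic :", lz76_complexity(regular))
print("irregular:", lz76_complexity(irregular))
```

A periodic train parses into very few phrases, while an irregular one of the same length needs many more, which is why a count like this can serve as a compact feature for separating classes of temporal patterns. The nested substring search is cubic in the worst case, so this version is only suitable for short sequences.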

2506.02052 2026-03-03 q-bio.BM cs.AI cs.LG q-bio.QM

General Protein Pretraining or Domain-Specific Designs? Benchmarking Protein Modeling on Realistic Applications

Shuo Yan, Yuliang Yan, Bin Ma, Chenao Li, Haochun Tang, Jiahua Lu, Minhua Lin, Yuyuan Feng, Enyan Dai

Abstract

Recently, extensive deep learning architectures and pretraining strategies have been explored to support downstream protein applications. Additionally, domain-specific models incorporating biological knowledge have been developed to enhance performance in specialized tasks. In this work, we introduce Protap, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications. Specifically, Protap covers five applications: three general tasks and two novel specialized tasks, i.e., enzyme-catalyzed protein cleavage site prediction and targeted protein degradation, which are industrially relevant yet missing from existing benchmarks. For each application, Protap compares various domain-specific models and general architectures under multiple pretraining settings. Our empirical studies imply that: (i) Though large-scale pretraining encoders achieve great results, they often underperform supervised encoders trained on small downstream training sets. (ii) Incorporating structural information during downstream fine-tuning can match or even outperform protein language models pretrained on large-scale sequence corpora. (iii) Domain-specific biological priors can enhance performance on specialized downstream tasks. Code and datasets are publicly available at https://github.com/Trust-App-AI-Lab/protap.

2505.12565 2026-03-03 cs.AI cs.CL cs.LG q-bio.QM

mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Sara Szymkuć, Chetan Kumar Prasad, Bowen Jin, Jiawei Han, Ying Diao, Ge Liu, Hao Peng, Bartosz A. Grzybowski, Martin D. Burke, Heng Ji

Comments Accepted to ICLR 2026 (Oral). Code: https://github.com/blender-nlp/mCLM Data and Model: https://huggingface.co/collections/language-plus-molecules/mclm

Abstract

Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that comprises a bilingual language model that understands both natural language descriptions of functions and molecular blocks. mCLM front-loads synthesizability considerations while improving the predicted functions of molecules in a principled manner. Experiments on FDA-approved drugs showed that mCLM is capable of significantly improving chemical functions. mCLM, with only 3B parameters, also achieves improvements in synthetic accessibility relative to 7 other leading generative AI methods including GPT-5. When tested on 122 out-of-distribution medicines using only building blocks/tokens that are compatible with automated modular synthesis, mCLM outperforms all baselines in property scores and synthetic accessibility. mCLM can also reason on multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials ("fallen angels").

2503.04490 2026-03-03 cs.CL q-bio.GN

Large Language Models in Bioinformatics: A Survey

Zhenyu Wang, Zikang Wang, Jiyue Jiang, Pengan Chen, Xiangyu Shi, Yu Li

Comments Accepted by ACL 2025

Abstract

Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. Meanwhile, we also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.