arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.03028 2026-04-06 q-bio.PE q-bio.GN

Synonymous Codon Usage Bias Overrides Phylogeny to Reflect Convergent Frond Architecture in a Rapidly Radiating Fern Family Thelypteridaceae

Kerui Huang, Wenyan Zhao, Huan Li, Ningyun Zhang, Lixuan Xiang, Xuan Tang, Yulong Xiao, Yi Liu, Zui Yao, Jun Yan, Hanbin Yin, Rongjie Huang, Yulong Xiao, Peng Xie, Haoliang Hu, Jiangping Shu, Hui Shang, Yun Wang

Comments 23 pages, 8 figures, 4 tables

Abstract

Convergent evolution provides powerful evidence for natural selection, yet its molecular basis is typically sought in protein-coding amino acid substitutions. Whether adaptive pressures can drive the convergent evolution of synonymous codon usage bias (CUB) to override phylogenetic history remains a fundamental question. Here, we investigate this within the rapidly radiating fern family Thelypteridaceae by establishing a comparative framework that integrates chloroplast phylogenomics with dimensionality reduction of codon usage, morphological data, and divergence time estimation. Our results reveal that chloroplast CUB patterns are strikingly incongruent with the phylogeny of this family. Instead, they partition species into distinct clusters that strongly correlate with a convergently evolved morphological trait, lamina base architecture, a key adaptation whose radiation we date to the early Neogene. This convergent molecular signal is driven by a specific subset of photosynthesis-related genes (ndhJ, psaA, and psbD), which exhibit a high density of type-specific, third-position codon substitutions. These findings demonstrate that CUB can serve as a powerful, quantifiable indicator of adaptive history, revealing a cryptic layer of molecular convergence linked to the regulation of protein synthesis. Our work provides a new framework for uncovering adaptive histories obscured by complex evolutionary processes.
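The abstract does not say which codon usage statistic feeds the dimensionality reduction. As an illustration only, the sketch below computes relative synonymous codon usage (RSCU), one common CUB measure, for a single in-frame coding sequence. The function name `rscu`, the choice of RSCU itself, and the use of the standard genetic code are assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds: str) -> dict[str, float]:
    """Relative synonymous codon usage for one in-frame coding sequence.

    RSCU(codon) = observed count / mean count over its synonymous family,
    so a value of 1.0 means no bias within that amino acid.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group codons by the amino acid they encode.
    families: dict[str, list[str]] = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    out: dict[str, float] = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # skip stop codons and single-codon amino acids (Met, Trp)
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for codon in family:
            out[codon] = counts[codon] * len(family) / total
    return out


# Hypothetical toy sequence: four alanine codons, three GCT and one GCC.
toy = rscu("GCTGCTGCTGCC")
```

Stacking such per-gene or per-species RSCU vectors into a matrix would give the kind of input that a dimensionality reduction method could then cluster.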

2604.03021 2026-04-06 q-bio.NC

Temporal structure of the language hierarchy within small cortical patches

Julien Gadonneix, Mingfang Zhang, Jérémy Rapin, Linnea Evanson, Pierre Bourdillon, Jean-Rémi King

Abstract

Speech production requires the rapid coordination of a complex hierarchy of linguistic units, transforming a semantic representation into a precise sequence of articulatory movements. To unravel the neural mechanisms underlying this feat, we leverage recordings from eight 3.2 x 3.2 mm 64-microelectrode arrays implanted in the motor cortex and inferior frontal gyrus of two patients tasked with producing twenty thousand sentences. We show that a hierarchy of linguistic features is robustly encoded in most of these small cortical patches. Contrary to our expectations, instead of a clear macroscopic organization between patches, we observe a multiplexing of phonetic, syllabic and lexical representations within each cortical patch. Critically, this coding scheme dynamically changes over time to allow successive phonemes, syllables and words to be simultaneously represented without interference. Overall, these results, reminiscent of position encoding in transformers, show how small cortical patches organize the unfolding of the speech hierarchy during language production.

2604.02886 2026-04-06 stat.ME q-bio.GN q-bio.QM stat.AP stat.ML

High-dimensional Many-to-many-to-many Mediation Analysis

Tien Dat Nguyen, Trung Khang Tran, Cong Khanh Truong, Duy-Cat Can, Binh T. Nguyen, Oliver Y. Chén

Abstract

We study high-dimensional mediation analysis in which exposures, mediators, and outcomes are all multivariate, and both exposures and mediators may be high-dimensional. We formalize this as a many (exposures)-to-many (mediators)-to-many (outcomes) (MMM) mediation analysis problem. Methodologically, MMM mediation analysis simultaneously performs variable selection for high-dimensional exposures and mediators, estimates the indirect effect matrix (i.e., the coefficient matrices linking exposure-to-mediator and mediator-to-outcome pathways), and enables prediction of multivariate outcomes. Theoretically, we show that the estimated indirect effect matrices are consistent and element-wise asymptotically normal, and we derive error bounds for the estimators. To evaluate the efficacy of the MMM mediation framework, we first investigate its finite-sample performance, including convergence properties, the behavior of the asymptotic approximations, and robustness to noise, via simulation studies. We then apply MMM mediation analysis to data from the Alzheimer's Disease Neuroimaging Initiative to study how cortical thickness of 202 brain regions may mediate the effects of 688 genome-wide significant single nucleotide polymorphisms (SNPs) (selected from approximately 1.5 million SNPs) on eleven cognitive-behavioral and diagnostic outcomes. The MMM mediation framework identifies biologically interpretable, many-to-many-to-many genetic-neural-cognitive pathways and improves downstream out-of-sample classification and prediction performance. Taken together, our results demonstrate the potential of MMM mediation analysis and highlight the value of statistical methodology for investigating complex, high-dimensional multi-layer pathways in science. The MMM package is available at https://github.com/THELabTop/MMM-Mediation.
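To make the notion of an indirect effect matrix concrete, here is a minimal low-dimensional sketch using the classical product-of-coefficients idea: regress mediators on exposures, regress outcomes on exposures and mediators, then multiply the two path matrices. It uses plain least squares with no variable selection, so it is not the MMM estimator and does not use the MMM package. The function name `indirect_effects` and the simulated data are hypothetical.

```python
import numpy as np


def indirect_effects(X, M, Y):
    """Product-of-coefficients estimate of exposure -> mediator -> outcome effects.

    X: (n, p) exposures, M: (n, k) mediators, Y: (n, q) outcomes.
    Returns a (p, q) matrix of estimated indirect effects.
    """
    # Exposure-to-mediator paths: least-squares fit of M on X.
    A, *_ = np.linalg.lstsq(X, M, rcond=None)        # shape (p, k)
    # Mediator-to-outcome paths: fit Y on [X, M] and keep the mediator block,
    # so the direct effect of X on Y is adjusted for.
    coef, *_ = np.linalg.lstsq(np.hstack([X, M]), Y, rcond=None)
    B = coef[X.shape[1]:]                            # shape (k, q)
    return A @ B


# Simulated data (assumption for illustration): no direct effect, known paths.
rng = np.random.default_rng(0)
n, p, k, q = 5000, 3, 4, 2
A_true = rng.normal(size=(p, k))
B_true = rng.normal(size=(k, q))
X = rng.normal(size=(n, p))
M = X @ A_true + 0.5 * rng.normal(size=(n, k))
Y = M @ B_true + 0.1 * rng.normal(size=(n, q))

estimate = indirect_effects(X, M, Y)
```

In the high-dimensional setting the abstract describes, ordinary least squares breaks down once exposures or mediators outnumber samples, which is why a method of this kind needs built-in variable selection.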

2604.02842 2026-04-06 q-bio.BM

ViraHinter: a dual-modal artificial intelligence framework for predicting virus-host interactions

Weiqiang Bai, Fei Wang, Jialin Wang, Sheng Xu, Lifeng Qiao, Juan Li, Zhuyi Guo, Xiangyun Hou, Lei Bai, Bowen Zhou, Edward C. Holmes, Weifeng Shi, Siqi Sun

Abstract

Protein-protein interactions (PPIs) between a virus and its host govern infection, replication, and pathogenesis. While high-throughput mapping has identified thousands of virus-host associations, much of the virus-host interactome remains uncharacterized due to the labor-intensive nature of experimental screens, the inherent difficulty in capturing transient interactions, and the limited sequence homology across divergent viral families. Here, we introduce ViraHinter, a dual-modal deep learning framework for the precise prediction of virus-host interactions and large-scale inference of interaction landscapes. ViraHinter couples a structure-generation branch with a sequence-representation branch, integrating structure-informed pair representations with ESM-derived embeddings to learn generalizable interaction rules across unseen viruses. We benchmark ViraHinter on pathogenic coronaviruses and influenza A viruses and show that it consistently outperforms RoseTTAFold2-PPI, AlphaFold 3 and RoseTTAFold2-Lite in prioritizing high-confidence candidates even under severe class imbalance and across diverse interface regimes. Notably, it successfully identifies novel functionally relevant host factors and recapitulates the structural plasticity of the complex interfaces. By intersecting predictions across multiple influenza subtypes, ViraHinter reveals 33 shared host factors, offering a roadmap for broad-spectrum antiviral discovery. ViraHinter therefore serves as a robust computational approach for studying virus-host interactions, enabling systematic screening of host factors for all known human-infecting viruses, providing new insights into the shared mechanisms of viral pathogenesis, and accelerating the discovery of novel therapeutic targets and the development of broad-spectrum antivirals.

2604.02394 2026-04-06 q-bio.GN stat.ME

Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muhammad Muneeb, David B. Ascher

Abstract

Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, and SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) being negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h^2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2, with both being non-significant. Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.
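The two headline numbers above are simple to reproduce in form. The sketch below checks the reported share of negative estimates from the counts given in the abstract and shows how a pooled Pearson correlation between heritability and test AUC would be computed. The `h2` and `auc` lists are made-up placeholder values, not data from the study, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Counts reported in the abstract: 133 negative estimates out of 844.
negative_share = 133 / 844

# Placeholder values for illustration only (NOT from the paper).
h2 = [0.10, 0.20, 0.30, 0.40]
auc = [0.60, 0.61, 0.59, 0.60]
r = pearson_r(h2, auc)
```

A correlation near zero, as the abstract reports for both PRS frameworks, means that knowing the heritability input tells you almost nothing about the resulting test AUC.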

2604.02380 2026-04-06 q-bio.GN math.MG stat.ME

VeloTree: Inferring single-cell trajectories from RNA velocity fields with varifold distances

Elodie Maignant, Tim Conrad, Christoph von Tycowicz

Comments arXiv admin note: text overlap with arXiv:2507.11313

Abstract

Trajectory inference is a critical problem in single-cell transcriptomics, which aims to reconstruct the dynamic process underlying a population of cells from sequencing data. Of particular interest is the reconstruction of differentiation trees. One way of doing this is by estimating the path distance between nodes (labeled by cells) based on cell similarities observed in the sequencing data. Recent sequencing techniques make it possible to measure two types of data: gene expression levels, and RNA velocity, a vector that quantifies variation in gene expression. The sequencing data then consist of a discrete vector field whose dimension is the number of genes of interest. In this article, we present a novel method for inferring differentiation trees from RNA velocity fields using a distance-based approach. In particular, we introduce a cell dissimilarity measure defined as the squared varifold distance between the integral curves of the RNA velocity field, which we show is a robust estimate of the path distance on the target differentiation tree. Upstream of the dissimilarity measure calculation, we also implement comprehensive routines for the preprocessing and integration of the RNA velocity field. Finally, we illustrate the ability of our method to recover differentiation trees with high accuracy on several simulated and real datasets, and compare these results with the state of the art.

2604.02346 2026-04-06 cs.LG cs.AI cs.SE q-bio.BM

DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

Tianyu Liu, Sihan Jiang, Fan Zhang, Kunyang Sun, Teresa Head-Gordon, Hongyu Zhao

Comments 29 pages, 6 figures

Abstract

Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations relative to traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing the chemical and biological reasoning capabilities of LLMs and encouraging their wider use at every stage of drug discovery.

2604.01949 2026-04-06 cs.LG q-bio.GN

annbatch unlocks terabyte-scale training of biological data in anndata

Ilan Gold, Felix Fischer, Lucas Arnoldt, F. Alexander Wolf, Fabian J. Theis

Abstract

The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. GitHub: https://github.com/scverse/annbatch

2510.20847 2026-04-06 q-bio.NC cs.AI

Integrated representational signatures strengthen specificity in brains and models

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

Abstract

The extent to which different neural or artificial neural networks (models) rely on equivalent representations to support similar tasks remains a central question in neuroscience and machine learning. Prior work has typically compared systems using a single representational similarity metric, yet each captures only one facet of representational structure. To address this, we leverage a suite of representational similarity metrics, each capturing a distinct facet of representational correspondence (such as geometry, unit-level tuning, or linear decodability), and assess brain region or model separability using multiple complementary measures. Metrics that preserve geometric or tuning structure (e.g., RSA, Soft Matching) yield stronger region-based discrimination, whereas more flexible mappings such as Linear Predictivity show weaker separation. These findings suggest that geometry and tuning encode brain-region- or model-family-specific signatures, while linearly decodable information tends to be more globally shared across regions or models. To integrate these complementary representational facets, we adapt Similarity Network Fusion (SNF), a framework originally developed for multi-omics data integration. SNF produces substantially sharper regional and model family-level separation than any single metric and yields robust composite similarity profiles. Moreover, clustering cortical regions using SNF-derived similarity scores reveals a clearer hierarchical organization that aligns closely with established anatomical and functional hierarchies of the visual cortex, surpassing the correspondence achieved by individual metrics.
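Of the metrics named above, RSA is the simplest to sketch. The example below builds a representational dissimilarity matrix (RDM) per system and compares the two RDMs by rank correlation. The specific choices here (correlation distance for the RDM, Spearman correlation for the comparison, the function names) are common conventions assumed for illustration and may differ from the paper's implementation.

```python
import numpy as np


def rdm(responses):
    """Representational dissimilarity matrix for a (stimuli, units) array.

    Entry (i, j) is 1 minus the Pearson correlation between the response
    patterns evoked by stimulus i and stimulus j.
    """
    return 1.0 - np.corrcoef(responses)


def rsa_similarity(resp_a, resp_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    upper = np.triu_indices(resp_a.shape[0], k=1)
    a, b = rdm(resp_a)[upper], rdm(resp_b)[upper]
    # Spearman is Pearson on ranks; ties are not averaged in this sketch.
    ranks_a, ranks_b = a.argsort().argsort(), b.argsort().argsort()
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Hypothetical responses: 10 stimuli measured on 50 units.
rng = np.random.default_rng(1)
region = rng.normal(size=(10, 50))
shuffled_units = region[:, ::-1]          # same geometry, units reordered
unrelated = rng.normal(size=(10, 50))     # independent system

same_geometry = rsa_similarity(region, shuffled_units)
different_geometry = rsa_similarity(region, unrelated)
```

Because the RDM depends only on relationships between stimuli, reordering the units leaves the score unchanged. That is the sense in which RSA captures geometry rather than unit-level tuning.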

2510.08736 2026-04-06 q-bio.NC

Neural correlates of perceptual consciousness from within: a narrative review of human intracranial research

Francois Stockart, Alexis Robin, Hal Blumenfeld, Milan Brazdil, Philippe Kahane, Liad Mudrik, Jasmine Thum, Michael Pereira, Nathan Faivre

Abstract

Despite many years of research, the quest to identify neural correlates of perceptual consciousness (NCC) remains unresolved. One major obstacle lies in methodological limitations: most studies rely on non-invasive neural measures with limited spatial or temporal resolution, making it difficult to disentangle proper NCCs from concurrent cognitive processes. Additionally, the relatively low sensitivity of non-invasive neural measures limits the interpretation of null findings in studies targeting proper NCCs. In this review, we discuss how human intracranial recordings can advance the search for NCCs by offering high spatiotemporal resolution, improved signal sensitivity, and broad cortical and subcortical coverage. We review studies that have examined NCCs at the level of single neurons and populations of neurons, and evaluate their implications for the debates between cognitive and sensory theories of consciousness. Finally, we highlight the limits of current intracranial human recordings and propose future directions based on emerging technologies and novel experimental paradigms.

2508.14680 2026-04-06 cond-mat.stat-mech physics.data-an q-bio.PE

Size-structured populations with growth fluctuations: Feynman-Kac formula and decoupling

Ethan Levien, Yaïr Hein, Farshid Jafarpour

Comments 29 pages, 4 figures

Abstract

We study a size-structured population model in which individual cells grow at a rate determined by a fluctuating internal variable (e.g., gene expression levels). Many previous models of phenotypically heterogeneous populations can be viewed as special cases of this model, and it has previously been observed that the internal variable decouples from cell size under certain conditions. In this work, we generalize these results and connect them to the Feynman-Kac formula, which yields relationships between the lineage dynamics and population distribution in branching processes. To this end, we derive conditions for decoupling in both the lineage and population ensembles. When decoupling occurs in both ensembles, the size dynamics can be transformed, via a random time change, into a growth-homogeneous process, and expectations can be evaluated through an exponential tilting procedure that follows from the Feynman-Kac formula. We further characterize weaker, ensemble-specific forms of decoupling that hold in either the lineage or the population ensemble, but not both. We provide a more general interpretation of tilted expectations in terms of the mass-weighted phenotype distribution.

2505.08671 2026-04-06 q-bio.PE nlin.AO

How spatial patterns can lead to less resilient ecosystems

David Pinto-Ramos, Ricardo Martinez-Garcia

Abstract

Several theoretical models predict that spatial patterning increases ecosystem resilience. However, these predictions rely on simplifying assumptions, such as isotropic and infinitely large ecosystems, and empirical evidence directly linking spatial patterning to enhanced resilience remains scarce. We introduce a unifying framework, encompassing existing models for vegetation pattern formation in water-stressed ecosystems, that relaxes these assumptions. This framework incorporates finite vegetated areas surrounded by desert and anisotropic environmental conditions that lead to non-reciprocal plant interactions. Under these more realistic conditions, we identify a novel desertification mechanism, known as nonlinear convective instability in physics but largely overlooked in ecology. These instabilities form when non-reciprocal interactions destabilize the vegetation-desert interface and can trigger desertification fronts even under stress levels where isotropic models predict stability. Importantly, ecosystems exhibiting periodic vegetation patterns are more susceptible to nonlinear convective instabilities than those with homogeneous vegetation, suggesting that spatial patterning may reduce, rather than enhance, resilience. These findings challenge the prevailing view that self-organized patterning enhances ecosystem resilience and provide a new framework for investigating how spatial dynamics shape the stability and resilience of ecological systems under changing environmental conditions.