arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.26544 2026-03-30 cs.CL q-bio.QM

Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa

Comments 4 Figures and 2 Tables

Abstract

Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata. Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC. Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs. Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.
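The restriction to pre-confirmation periods described in this abstract can be sketched as a simple date filter. This is a minimal illustration, not the authors' pipeline: the function name, the tuple layout, and the toy drug and event names are all hypothetical.

```python
from datetime import date


def restrict_to_pre_confirmation(reports, label_dates):
    """Keep only reports received before the AE was added to the product label.

    reports: iterable of (drug, adverse_event, received_date) tuples.
    label_dates: dict mapping (drug, adverse_event) -> date of SmPC inclusion.
    Pairs missing from label_dates were never labelled, so their reports are kept.
    """
    kept = []
    for drug, event, received in reports:
        confirmed = label_dates.get((drug, event))
        if confirmed is None or received < confirmed:
            kept.append((drug, event, received))
    return kept


# Hypothetical toy data (not taken from the paper).
label_dates = {("drug_a", "nausea"): date(2012, 6, 1)}
reports = [
    ("drug_a", "nausea", date(2011, 3, 15)),  # before labelling: kept
    ("drug_a", "nausea", date(2013, 1, 10)),  # after labelling: dropped
    ("drug_a", "rash", date(2014, 5, 2)),     # never labelled: kept
]
pre_confirmation = restrict_to_pre_confirmation(reports, label_dates)
```

A signal detection method evaluated only on `pre_confirmation` cannot benefit from reporting that was stimulated by the label change itself, which is the point of time indexing.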

2603.26370 2026-03-30 q-bio.QM math.DS math.OC

Multi-scale Metabolic Modeling and Simulation

Peter E. Carstensen, Teddy Groves, Lars K. Nielsen, Ulrich Krühne, Krist V. Gernaey, John B. Jørgensen

Comments To be presented at ESCAPE36, 7 pages, 6 figures

Abstract

Biological systems are governed by coupled interactions between intracellular metabolism and bioreactor operation that span multiple time scales. Constraint-based metabolic models are widely used to describe intracellular metabolism, but repeatedly solving the optimization problem at each time step in dynamic models introduces numerical challenges related to infeasibility and computational efficiency. This work presents a multi-scale modeling framework that integrates genome-scale, constraint-based metabolic models with dynamic bioreactor simulations. Intracellular metabolism is described using positive flux variables in a parsimonious flux balance analysis, and the resulting embedded optimization problem is replaced by a neural network surrogate. The surrogate provides a smooth approximation of the embedded optimization mapping and eliminates repeated linear program solves during simulation. The approach is demonstrated for fed-batch fermentation of Escherichia coli, in which the surrogate model yields intracellular fluxes under substrate-limited conditions, whereas the underlying linear program would otherwise be infeasible. The framework provides a continuous representation of intracellular metabolism suitable for dynamic simulation of genome-scale models in bioreactor configurations.

2603.26267 2026-03-30 q-bio.NC

On the RAID dataset of perceptual responses: analysis and statistical causes

Paula Daudén-Oliver, David Agost-Beltran, Emilio Sansano-Sansano, Raul Montoliu, Valero Laparra, Jesús Malo, Marina Martínez-Garcia

Abstract

This work analyzes the RAID dataset to evaluate human responses to affine image distortions, including rotation, translation, scaling, and Gaussian noise. Using Mean Squared Error (MSE), the study establishes human detection thresholds for these distortions, enabling comparison across types. Statistical analysis with ANOVA and Tukey-Kramer tests reveals that observers are significantly more sensitive to Gaussian noise, which consistently produced the lowest detection thresholds. Fourier analysis further shows that high-frequency components act as a visual mask for Gaussian noise, demonstrating a strong correlation between high-frequency energy and detection thresholds. Additionally, spectral orientation influences the perception of rotation. Finally, the study employs the PixelCNN model to show that image probability significantly correlates with detection thresholds for most distortions, suggesting that statistical likelihood affects human visual tolerance.
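Because the abstract expresses thresholds for every distortion type in MSE, a short sketch of that metric may help. The images here are tiny nested lists with made-up pixel values, and the helper name `mse` is an assumption for illustration only.

```python
def mse(reference, distorted):
    """Mean squared error between two equally sized grayscale images.

    Each image is a list of rows, and each row is a list of pixel values.
    """
    if len(reference) != len(distorted):
        raise ValueError("images must have the same number of rows")
    total = 0.0
    count = 0
    for row_ref, row_dist in zip(reference, distorted):
        if len(row_ref) != len(row_dist):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_ref, row_dist):
            total += (a - b) ** 2
            count += 1
    return total / count


# Hypothetical 2x2 images: two of the four pixels differ by 2 grey levels.
reference = [[0, 0], [0, 0]]
distorted = [[0, 2], [0, 2]]
error = mse(reference, distorted)  # (0 + 4 + 0 + 4) / 4 = 2.0
```

Expressing a rotation threshold and a noise threshold on this common scale is what makes them comparable across distortion types.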

2603.26110 2026-03-30 q-bio.QM

TurboESM: Ultra-Efficient 3-Bit KV Cache Quantization for Protein Language Models with Orthogonal Rotation and QJL Correction

Yue Hu, Junqing Wang, Yingchao Liu

Comments 16 pages, 7 tables

Abstract

The rapid scaling of Protein Language Models (PLMs) has unlocked unprecedented accuracy in protein structure prediction and design, but the quadratic memory growth of the Key-Value (KV) cache during inference remains a prohibitive barrier for single-GPU deployment and high-throughput generation. While 8-bit quantization is now standard, 3-bit quantization remains elusive due to severe numerical outliers in activations. This paper presents TurboESM, an adaptation of Google's TurboQuant to the PLM domain. We solve the fundamental incompatibility between Rotary Position Embeddings (RoPE) and orthogonal transformations by deriving a RoPE-first rotation pipeline. We introduce a head-wise SVD calibration method tailored to the amino acid activation manifold, a dual look-up table (LUT) strategy for asymmetric K/V distributions, and a 1-bit Quantized Johnson-Lindenstrauss (QJL) residual correction. All experiments are conducted on ESM-2 650M, where our implementation achieves a 7.1x memory reduction (330 MB to 47 MB) while maintaining cosine similarity > 0.96 in autoregressive decoding across diverse protein families, including short peptides, transmembrane helices, enzyme active site fragments, and intrinsically disordered regions. We further implement a Triton-based fused decode attention kernel that eliminates intermediate dequantization memory allocations, achieving a 1.96x speedup over the PyTorch two-step path for the KV fetch operation alone; however, TurboESM incurs a prefill overhead of 21-27 ms relative to the original model due to KV quantization and packing, making it most suitable for memory-bound scenarios rather than latency-critical short-sequence workloads. Analysis reveals that PLMs exhibit sharper outlier profiles than large language models (LLMs) due to amino acid vocabulary sparsity, and our method effectively addresses these distributions.
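Two ingredients named in this abstract can be illustrated in isolation: an orthogonal rotation leaves query-key dot products unchanged, and a 3-bit code stores each value as one of 8 levels. The sketch below is a toy under stated assumptions (a 2-D rotation and a plain uniform quantizer with hypothetical helper names). It is not the TurboESM pipeline, which per the abstract uses SVD calibration, dual look-up tables, and a QJL residual correction.

```python
import math


def rotate(vec, theta):
    """Apply a 2-D rotation (an orthogonal transform) to a length-2 vector."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize_3bit(vec):
    """Uniformly quantize values to 8 levels; return codes plus (lo, step)."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 7 or 1.0  # avoid a zero step for constant vectors
    codes = [round((x - lo) / step) for x in vec]
    return codes, lo, step


def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]


# Rotating both query and key by the same angle preserves their dot product.
q, k, theta = [1.0, 2.0], [3.0, -1.0], 0.7
q_rot, k_rot = rotate(q, theta), rotate(k, theta)

# Hypothetical activation values, quantized to 3-bit codes and restored.
values = [0.1, -0.4, 0.9, 1.3, -1.2]
codes, lo, step = quantize_3bit(values)
restored = dequantize(codes, lo, step)
```

With a uniform quantizer the reconstruction error of each value is at most half a step, so a single large outlier widens the step for every other value. That is one common motivation for rotating before quantizing.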

2509.24779 2026-03-30 cs.LG q-bio.BM

MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models

Kacper Kapuśniak, Cristian Gabellini, Michael Bronstein, Prudencio Tossou, Francesco Di Giovanni

Abstract

Molecular Dynamics (MD) is a powerful computational microscope for probing protein functions. However, the need for fine-grained integration and the long timescales of biomolecular events make MD computationally expensive. To address this, several generative models have been proposed to generate surrogate trajectories at lower cost. Yet, these models typically learn a fixed-lag transition density, causing the training signal to be dominated by frequent but uninformative transitions. We introduce a new class of generative models, MSM Emulators, which instead learn to sample transitions across discrete states defined by an underlying Markov State Model (MSM). We instantiate this class with Markov Space Flow Matching (MarS-FM), whose sampling offers more than two orders of magnitude speedup compared to implicit- or explicit-solvent MD simulations. We benchmark MarS-FM's ability to reproduce MD statistics through structural observables such as RMSD, radius of gyration, and secondary structure content. Our evaluation spans protein domains (up to 500 residues) with significant chemical and structural diversity, including unfolding events, and enforces strict sequence dissimilarity between training and test sets to assess generalization. Across all metrics, MarS-FM outperforms existing methods, often by a substantial margin.
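For readers unfamiliar with Markov State Models, the central object is a row-stochastic transition matrix between discrete states, estimated from a state-labelled trajectory at a chosen lag. The sketch below shows only that counting step on a made-up trajectory. The function name is hypothetical and nothing here reflects how MarS-FM itself is trained.

```python
def estimate_transition_matrix(trajectory, n_states, lag=1):
    """Row-normalized transition counts between discrete states at a given lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i in range(len(trajectory) - lag):
        counts[trajectory[i]][trajectory[i + lag]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # States never left in the data get a uniform row so that
            # every row is still a valid probability distribution.
            matrix.append([1.0 / n_states] * n_states)
    return matrix


# Hypothetical trajectory of discrete state labels (e.g. cluster indices).
trajectory = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]
T = estimate_transition_matrix(trajectory, n_states=3)
# From state 0 the data contain 3 self-transitions, 1 jump to state 1,
# and 1 jump to state 2, so T[0] is [0.6, 0.2, 0.2].
```

Most counted transitions stay within the same state, which is the kind of frequent but uninformative transition the abstract says dominates fixed-lag training.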

2507.03005 2026-03-30 cs.CL q-bio.PE

Beyond cognacy

Gerhard Jäger

Comments 9 pages, 2 figures

Abstract

Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal, suggesting it as a promising, scalable alternative to traditional cognate-based methods. This opens new avenues for global-scale language phylogenies beyond expert annotation bottlenecks.

2410.03757 2026-03-30 math.OC math-ph math.CA math.MP q-bio.QM

Framing structural identifiability in terms of parameter symmetries

Johannes G Borgqvist, Alexander P Browning, Fredrik Ohlsson, Ruth E Baker

Comments 45 pages, 2 figures

Abstract

A key step in mechanistic modelling of dynamical systems is to conduct a structural identifiability analysis. This entails deducing which parameter combinations can be estimated from a given set of observed outputs. The standard differential algebra approach answers this question by re-writing the model as a higher-order system of ordinary differential equations that depends solely on the observed outputs. Over the last decades, alternative approaches for analysing structural identifiability, based on Lie symmetries acting on independent and dependent variables as well as parameters, have been proposed. However, the link between the standard differential algebra approach and that using full symmetries remains elusive. In this work, we establish this link by introducing the notion of parameter symmetries, which are a special type of full symmetry that alter parameters while preserving the observed outputs. Our main result states that a parameter combination is locally structurally identifiable if and only if it is a differential invariant of all parameter symmetries of a given model. We show that the standard differential algebra approach is consistent with the concept of structural identifiability in terms of parameter symmetries. We present an alternative symmetry-based approach for analysing structural identifiability using parameter symmetries. Lastly, we demonstrate our approach on two well-known models in mathematical biology.
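A textbook toy case, chosen here for illustration and not taken from the paper, makes the idea of a parameter symmetry concrete. For dx/dt = -(k1 + k2) x with observed output y = x, shifting k1 up and k2 down by the same amount changes both parameters but leaves the output untouched, so only the sum k1 + k2 can be estimated. All names below are hypothetical.

```python
import math


def output(k1, k2, t, x0=1.0):
    """Observed output y(t) = x(t) for dx/dt = -(k1 + k2) * x with x(0) = x0."""
    return x0 * math.exp(-(k1 + k2) * t)


def shift_parameters(k1, k2, eps):
    """A parameter symmetry of this toy model: it leaves k1 + k2 unchanged."""
    return k1 + eps, k2 - eps


times = [0.0, 0.5, 1.0, 2.0]
original = [output(0.3, 0.7, t) for t in times]
k1_new, k2_new = shift_parameters(0.3, 0.7, 0.2)
shifted = [output(k1_new, k2_new, t) for t in times]
# `original` and `shifted` agree at every time point, so no amount of
# output data can separate k1 from k2; their sum is invariant under the shift.
```

In the abstract's terms, k1 + k2 is an invariant of the symmetry and is identifiable, while k1 and k2 individually are not.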

2405.16885 2026-03-30 stat.ME q-bio.PE

Hidden Markov modelling of spatio-temporal dynamics of measles in 1750-1850 Finland

Tiia-Maria Pasanen, Jouni Helske, Tarmo Ketola

Journal ref: Journal of Applied Statistics, 1-25 (2026)
Abstract

Real-world spatio-temporal datasets, and phenomena related to them, are often challenging to visualise or to summarise in a general overview. To summarise the information contained in such data, we combine two well-known statistical modelling methods. To account for the spatial dimension, we use the intrinsic modification of the conditional autoregression and combine it with the hidden Markov model, allowing the spatial patterns to vary over time. We apply our method to parish register data on deaths caused by measles in Finland in 1750-1850, and gain novel insight into previously undiscovered infection dynamics. Five distinctive, recurring states, describing spatially and temporally differing infection burden and potential routes of spread, are identified. We also find that there is a change in the occurrences of the most typical spatial patterns circa 1812, possibly due to changes in communication networks after major administrative transformations in Finland.
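The hidden Markov component of such a model can be illustrated with the standard forward algorithm, which computes the likelihood of an observation sequence by summing over all hidden state paths. This sketch covers only a plain discrete HMM with made-up probabilities. The spatial conditional autoregressive part described in the abstract is not represented.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Likelihood of an observation sequence under a discrete HMM.

    initial[s]: probability of starting in hidden state s.
    transition[r][s]: probability of moving from state r to state s.
    emission[s][o]: probability of observing symbol o in state s.
    """
    n_states = len(initial)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs]
            * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model, e.g. "low burden" vs "high burden" years,
# with observed symbols 0 (few deaths) and 1 (many deaths).
initial = [0.6, 0.4]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]

likelihood = forward_likelihood([0, 0, 1], initial, transition, emission)
```

A useful sanity check is that the likelihoods of all possible sequences of a fixed length sum to one.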

2603.26007 2026-03-30 q-bio.NC cs.AI cs.CV

Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimer's Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort

Ishaan Cherukuri

Abstract

Predicting whether someone with mild cognitive impairment (MCI) will progress to Alzheimer's disease (AD) is crucial in the early stages of neurodegeneration. This uncertainty limits enrollment in clinical trials and delays urgent treatment. The Boundary Sharpness Coefficient (BSC) measures how well-defined the gray-white matter boundary looks on structural MRI. This study shows that measuring how BSC changes over time, namely how fast the boundary degrades each year, works much better than looking at a single baseline scan for predicting MCI-to-AD conversion. This study analyzed 1,824 T1-weighted MRI scans from 450 ADNI subjects (95 converters, 355 stable; mean follow-up: 4.84 years). BSC voxel-wise maps were computed using tissue segmentation at the gray-white matter cortical ribbon. Previous studies have used CNN and RNN models that reached 96.0% accuracy for AD classification and 84.2% for MCI conversion, but those approaches disregard specific regions within the brain. This study focused specifically on the gray-white matter interface. The approach uses temporal slope features capturing boundary degradation rates, feeding them into a Random Survival Forest, a non-parametric ensemble method for right-censored survival data. The Random Survival Forest trained on BSC slopes achieved a test C-index of 0.63, a 163% improvement over baseline parametric models (test C-index: 0.24). Structural MRI costs a fraction of PET imaging ($800-$1,500 vs. $5,000-$7,000) and does not require CSF collection. These temporal biomarkers could help with patient-centered safety screening as well as risk assessment.
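The C-index reported in this abstract is a pairwise ranking measure for right-censored data. A minimal sketch of Harrell's concordance index follows, together with the relative-improvement arithmetic behind the quoted percentage. The helper names and toy subjects are hypothetical, and this is not the evaluation code used in the study.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event and
    did so before subject j's event or censoring time. It is concordant
    when the subject with the earlier event has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def relative_improvement(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Hypothetical subjects: follow-up in years, event flag (1 = converted), risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.4, 0.1]
c_perfect = concordance_index(times, events, risks)        # risk ordering matches
c_reversed = concordance_index(times, events, risks[::-1])  # risk ordering inverted
gain = relative_improvement(0.63, 0.24)  # about 162.5, quoted as 163% in the abstract
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the 0.63 and 0.24 figures should be read.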

2603.25991 2026-03-30 eess.SY cs.SY q-bio.NC

Passivity-Based Control of Electrographic Seizures in a Neural Mass Model of Epilepsy

Gagan Acharya, Erfan Nozari

Abstract

Recent advances in neurotechnologies and decades of scientific and clinical research have made closed-loop electrical neuromodulation one of the most promising avenues for the treatment of drug-resistant epilepsy (DRE), a condition that affects over 15 million individuals globally. Yet, with the existing clinical state of the art, only 18% of patients with DRE who undergo closed-loop neuromodulation become seizure-free. In a recent study, we demonstrated that a simple proportional feedback policy based on the framework of passivity-based control (PBC) can significantly outperform the clinical state of the art. However, this study was purely numerical and lacked rigorous mathematical analysis. The present study addresses this gap and provides the first rigorous analysis of PBC for the closed-loop control of epileptic seizures. Using the celebrated Epileptor neural mass model of epilepsy, we analytically demonstrate that (i) seizure dynamics are, in their standard form, neither passive nor passivatable, (ii) epileptic dynamics, despite their lack of passivity, can be stabilized by sufficiently strong passive feedback, and (iii) seizure dynamics can be passivated via proper output redesign. To our knowledge, our results provide the first rigorous passivity-based analysis of epileptic seizure dynamics, as well as a theoretically-grounded framework for sensor placement and feedback design for a new form of closed-loop neuromodulation with the potential to transform seizure management in DRE.

2603.25986 2026-03-30 q-bio.PE q-bio.QM

Evaluating Phylogenetic Comparative Methods under Reticulate Evolutionary Scenarios

Lydia Morley, Emma Lehmberg, Sungsik Kong

Comments 28 pages, 10 figures, 4 tables

Abstract

Phylogenetic comparative methods (PCMs) are widely used to study trait evolution. However, many evolutionary histories involve reticulate evolutionary scenarios, such as hybridization, that violate core assumptions of these methods. In this study, we evaluate how such violations affect the performance of PCMs. In particular, we focus on ancestral character estimation, evolutionary rate estimation, and model selection. We simulate continuous trait evolution on various phylogenetic network topologies and assess the performance of PCMs that assume a bifurcating tree (i.e., major tree of the network) as the underlying model of evolution. We found that the performance of the tested PCMs was suboptimal. Using random forest, generalized linear models, and model-based clustering, we identified key factors contributing to these inaccuracies. Our results show that frequent and/or recent hybridization accompanied by one or more transgressive events and rapidly evolving traits (i.e., high evolutionary rate) lead to significant estimation error, especially with respect to rate estimation and model choice. These factors substantially shift trait values away from tree-based model expectations, leading to overall increased error in parameter estimates. Our study demonstrates cases in which PCMs that rely on trees are likely to misinterpret biological histories and offers recommendations for researchers studying systems with complex evolutionary histories.

2603.25880 2026-03-30 q-bio.QM cs.AI cs.LG

Spectral Coherence Index: A Model-Free Metric for Protein Structural Ensemble Quality Assessment

Yuda Bi, Huaiwen Zhang, Jingnan Sun, Vince D Calhoun

Abstract

Protein structural ensembles from NMR spectroscopy capture biologically important conformational heterogeneity, but it remains difficult to determine whether observed variation reflects coordinated motion or noise-like artifacts. We evaluate the Spectral Coherence Index (SCI), a model-free, rotation-invariant summary derived from the participation-ratio effective rank of the inter-model pairwise distance-variance matrix. Under grouped primary analysis of a Main110 cohort of 110 NMR ensembles (30--403 residues; 10--30 models per entry), SCI separated experimental ensembles from matched synthetic incoherent controls with AUC-ROC $= 0.973$ and Cliff's $δ= -0.945$. Relative to an internal 27-protein pilot, discrimination softened modestly, showing that pilot-era thresholds do not transfer perfectly to a larger, more heterogeneous cohort: the primary operating point $τ= 0.811$ yielded 95.5\% sensitivity and 89.1\% specificity. PDB-level sensitivity remained nearly unchanged (AUC $= 0.972$), and an independent 11-protein holdout reached AUC $= 0.983$. Across 5-fold grouped stratified cross-validation and leave-one-function-class-out testing, SCI remained strong (AUC $= 0.968$ and $0.971$), although $σ_{R_g}$ was the stronger single-feature discriminator and a QC-augmented multifeature model generalized best (AUC $= 0.989$ and $0.990$). Residue-level validation linked SCI-derived contributions to experimental RMSF across 110 proteins and showed broad concordance with GNM-based flexibility patterns. Rescue analyses showed that Main110 softening arose mainly from size and ensemble normalization rather than from loss of spectral signal. Together, these results establish SCI as an interpretable, bounded coherence summary that is most useful when embedded in a multimetric QC workflow for heterogeneous protein ensembles.
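The participation-ratio effective rank mentioned in this abstract has a compact closed form: for eigenvalues l of a symmetric matrix it is (sum l)^2 / sum(l^2). The sketch below computes only that quantity on toy matrices. How the SCI normalizes or bounds it is not stated in the abstract and is not reproduced here, and the function name is an assumption.

```python
def participation_ratio(matrix):
    """Effective rank (sum(l))**2 / sum(l**2) over eigenvalues l.

    For a symmetric matrix, sum(l) equals the trace and sum(l**2) equals
    the squared Frobenius norm, so no eigendecomposition is needed.
    """
    size = len(matrix)
    trace = sum(matrix[i][i] for i in range(size))
    frobenius_sq = sum(value * value for row in matrix for value in row)
    return trace * trace / frobenius_sq


# Variance spread evenly over 3 independent directions: effective rank 3.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# All variance along a single shared direction: effective rank 1.
rank_one = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

pr_spread = participation_ratio(identity)
pr_coherent = participation_ratio(rank_one)
```

A low effective rank indicates that variation concentrates in a few coordinated modes, while a value near the matrix dimension indicates noise-like spread across many modes.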

2603.25762 2026-03-30 q-bio.GN quant-ph

QHap: Quantum-Inspired Haplotype Phasing

Rui Zhang, Xian-Zhe Tao, Yibo Chen, Jiawei Zhang, Lei He, Dongming Fang, Lin Yang, Yuhui Sun, Qinyuan Zheng, Xinmeng Shi, Yang Zhou, Wanyi Chen, Chentao Yang, Man-Hong Yung, Jun-Han Huang

Comments 19 pages, 7 figures

Abstract

Haplotype phasing, the process of resolving parental allele inheritance patterns in diploid genomes, is critical for precision medicine and population genetics, yet the underlying optimization is NP-hard, posing a scalability challenge. To address this, we introduce QHap, a haplotype phasing tool that leverages quantum-inspired optimization. By reformulating haplotype phasing as a Max-Cut problem and deploying a GPU-accelerated ballistic simulated bifurcation solver, QHap accelerates phasing while maintaining accuracy comparable to established phasing tools. On the highly polymorphic human major histocompatibility complex region, QHap demonstrates 4- to 20-fold acceleration with zero switch error across multiple long read sequencing platforms. The framework implements two strategies: a read-based method for regional phasing, and a single nucleotide polymorphism-based method that, through quality-weighted probabilistic edge construction, efficiently scales to chromosome-scale tasks. Integration of chromatin conformation capture data extends phase block contiguity by up to 15-fold, enabling near-chromosome-spanning haplotype reconstruction. QHap demonstrates that quantum-inspired algorithms operating on classical hardware offer a promising approach to addressing the growing computational demands of sequencing data, establishing a new paradigm for applying physics-inspired optimization to fundamental challenges in computational genomics.
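The Max-Cut formulation can be made concrete on a toy graph: nodes are split into two sides (standing in for the two haplotypes) so that the total weight of edges crossing the split is as large as possible. The edge weights below are invented, the mapping from sequencing reads to edges is not shown, and the solver is an exhaustive search for illustration only. The paper instead uses a GPU-accelerated ballistic simulated bifurcation solver.

```python
from itertools import product


def cut_value(edges, assignment):
    """Total weight of edges whose endpoints fall on opposite sides."""
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


def brute_force_max_cut(n_nodes, edges):
    """Exhaustively search all two-way partitions (tiny graphs only)."""
    best_value, best_assignment = float("-inf"), None
    for bits in product((0, 1), repeat=n_nodes - 1):
        assignment = (0,) + bits  # fix node 0 to break the 0/1 label symmetry
        value = cut_value(edges, assignment)
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment


# Hypothetical graph: a positive weight is evidence that two nodes belong on
# opposite sides, a negative weight is evidence that they belong together.
edges = [(0, 1, 3.0), (1, 2, 2.0), (2, 3, 3.0), (0, 3, 2.0), (0, 2, -1.0)]
best_value, best_assignment = brute_force_max_cut(4, edges)
```

The search space doubles with every added node, which is why heuristic solvers are needed beyond toy sizes.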

2603.25755 2026-03-30 physics.chem-ph cs.LG q-bio.QM stat.ML

KANEL: Kolmogorov-Arnold Network Ensemble Learning Enables Early Hit Enrichment in High-Throughput Virtual Screening

Pavel Koptev, Nikita Krainov, Konstantin Malkov, Alexander Tropsha

Comments 8 Pages

Abstract

Machine learning models of chemical bioactivity are increasingly used for prioritizing a small number of compounds in virtual screening libraries for experimental follow-up. In these applications, assessing model accuracy by early hit enrichment such as Positive Predicted Value (PPV) calculated for top N hits (PPV@N) is more appropriate and actionable than traditional global metrics such as AUC. We present KANEL, an ensemble workflow that combines interpretable Kolmogorov-Arnold Networks (KANs) with XGBoost, random forest, and multilayer perceptron models trained on complementary molecular representations (LillyMol descriptors, RDKit-derived descriptors, and Morgan fingerprints).
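PPV@N, the early-enrichment metric this abstract favours over AUC, is simply the fraction of true actives among the N top-ranked compounds. A minimal sketch with made-up scores and labels follows. The function name is hypothetical and no part of the KANEL ensemble is reproduced.

```python
def ppv_at_n(scores, labels, n):
    """Fraction of true actives (label 1) among the n highest-scoring compounds."""
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of compounds")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n]) / n


# Hypothetical model scores and experimental activity labels (1 = active).
scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.10]
labels = [1, 0, 1, 0, 1, 0]

ppv_top1 = ppv_at_n(scores, labels, 1)  # the single top hit is active
ppv_top3 = ppv_at_n(scores, labels, 3)  # 2 of the top 3 are active
```

Unlike AUC, this value depends only on the compounds that would actually be sent for experimental follow-up.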