arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12404 2026-05-13 q-bio.NC

Empirical scaling laws in balanced networks with conductance-based synapses

Vicky Zhu, Gabriel Ocker, Robert Rosenbaum

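AI summary: This paper studies how conductance-based synapse models affect membrane potential fluctuations in balanced networks. Using computer simulations, the authors find that conductance-based synapses on their own produce membrane potential fluctuations that are too small, and that introducing spike time correlations into current-based synapse models produces fluctuations that are too large, but that combining the two yields moderate, more realistic fluctuation levels. The study shows that the joint effect of multiple realistic assumptions is essential when building more realistic neural network models.

Details
Abstract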

Strongly coupled, recurrent, balanced network models have been successful in describing and predicting many phenomena observed in cortical neural recordings. However, most balanced network models use current-based synapse models in place of more realistic, conductance-based models. Conductance-based synapse models predict unrealistically small membrane potential variability. On the other hand, introducing realistic levels of spike time correlations to models with current-based synapses predicts unrealistically large membrane potential variability. We use computer simulations to show that these two effects can cancel: Recurrent network models with conductance-based synapses and spike time correlations produce more realistic, moderate levels of membrane potential variability. Consistent with recent work on feedforward networks, our results show that including more realistic modeling assumptions produces more realistic dynamics, but only when the two modeling assumptions are included together.

2605.12286 2026-05-13 q-bio.GN cs.AI

Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

Younhun Kim, Georg K. Gerber, Travis E. Gibson

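AI summary: This study asks whether community-level abundance features of a microbial community can be predicted from the raw DNA sequences of its members alone. It proposes a method based on set-aggregated genome embeddings (SAGE) that draws on the few-shot learning capabilities of genomic language models (GLMs) to predict the abundance profiles of microbial communities. Experiments show that the method generalizes better to novel genomes than classical bioinformatics approaches and confirm that community-level latent representations play a key role in the performance gain.

Details
Comments
11 pages, 7 figures
Abstract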

Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and show how GLM embedding choices differ.

2605.10818 2026-05-13 cs.LG q-bio.NC

On periodic distributed representations using Fourier embeddings

Jakeb Chouinard

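AI summary: This paper studies how to build periodic distributed representations with Fourier embeddings in order to better handle periodic signals such as angles. The author proposes high-dimensional, real-valued periodic embeddings to resolve the difficulty that conventional scalar angle representations have with nearby angles, and uses dot-product similarity to control the shape of different kernel functions. The work focuses on Spatial Semantic Pointers, a neurally plausible representation scheme, to formalize Dirichlet and periodic Gaussian kernels, offering a new approach to modeling periodic signals.

Details
Abstract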

Periodic signals are critical for representing physical and perceptual phenomena. Scalar, real angular measures, e.g., radians and degrees, result in difficulty processing and distinguishing nearby angles, especially when their absolute difference exceeds pi. We can avoid this problem by using real-valued, periodic embeddings in high-dimensional space. These representations also allow us to control the nature of their dot product similarities, allowing us to construct a variety of different kernel shapes. In this work, we aim to highlight how these representations can be constructed and focus on the formalization of Dirichlet and periodic Gaussian kernels using the neurally-plausible representation scheme of Spatial Semantic Pointers.
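To make the link between periodic embeddings and kernel shape concrete, here is a minimal sketch, not the paper's Spatial Semantic Pointer construction. It assumes a plain real Fourier embedding with unit weights on harmonics 1 to n; the function names and the choice of weights are illustrative assumptions. With these weights the dot product of two embedded angles equals the Dirichlet kernel of their difference, so angles just either side of the 0/2π wrap-around come out as similar.

```python
import math

def fourier_embedding(theta, n_harmonics):
    """Map an angle in radians to a real vector with 2 * n_harmonics + 1 components."""
    vec = [1.0]
    for k in range(1, n_harmonics + 1):
        # The sqrt(2) scaling makes each harmonic contribute 2 * cos(k * delta) to the dot product.
        vec.append(math.sqrt(2.0) * math.cos(k * theta))
        vec.append(math.sqrt(2.0) * math.sin(k * theta))
    return vec

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dirichlet_kernel(x, n_harmonics):
    """Closed form D_n(x) = sin((n + 1/2) x) / sin(x / 2), with limit 2n + 1 where sin(x / 2) = 0."""
    s = math.sin(x / 2.0)
    if abs(s) < 1e-12:
        return 2.0 * n_harmonics + 1.0
    return math.sin((n_harmonics + 0.5) * x) / s

# Two angles that are far apart as scalars but adjacent on the circle.
a, b = 0.05, 2.0 * math.pi - 0.05
similarity = dot(fourier_embedding(a, 4), fourier_embedding(b, 4))
print(similarity, dirichlet_kernel(a - b, 4))
```

Other kernel shapes follow from reweighting the harmonics; for example, weights that decay like a Gaussian in k would give a periodic Gaussian-like kernel, though the exact weighting used in the paper is not stated in the abstract.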

2604.16642 2026-05-13 q-bio.QM q-bio.CB q-bio.GN stat.AP

Geometric coherence of single-cell CRISPR perturbations reveals regulatory architecture and predicts cellular stress

Prashant C. Raju

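AI summary: This study proposes a new geometric stability metric, Shesha, for assessing the directional coherence of single-cell CRISPR perturbation responses; it reveals gene regulatory architecture and predicts cellular stress states. Analyzing multiple CRISPR datasets, the study finds that stability correlates strongly with perturbation effect magnitude, but that the two decouple in some cases, exposing the biological characteristics of different regulators. The method offers a new perspective for target prioritization in screens, phenotypic quality control in cell manufacturing, and evaluation of computational perturbation predictions.

Details
Abstract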

Genome engineering has achieved remarkable sequence-level precision, yet predicting the transcriptomic state that a cell will occupy after perturbation remains an open problem. Single-cell CRISPR screens measure how far cells move from their unperturbed state, but this effect magnitude ignores a fundamental question: do the cells move together? Two perturbations with identical magnitude can produce qualitatively different outcomes if one drives cells coherently along a shared trajectory while the other scatters them across expression space. We introduce a geometric stability metric, Shesha, that quantifies the directional coherence of single-cell perturbation responses as the mean cosine similarity between individual cell shift vectors and the mean perturbation direction. Across five CRISPR datasets (2,200+ perturbations spanning CRISPRa, CRISPRi, and pooled screens), stability correlates strongly with effect magnitude (Spearman $ρ=0.75-0.97$), with a calibrated cross-dataset correlation of 0.97. Crucially, discordant cases where the two metrics decouple expose regulatory architecture: pleiotropic master regulators such as CEBPA and GATA1 pay a "geometric tax," producing large but incoherent shifts, while lineage-specific factors such as KLF1 produce tightly coordinated responses. After controlling for magnitude, geometric instability is independently associated with elevated chaperone activation (HSPA5/BiP; $ρ_{partial}=-0.34$ and $-0.21$ across datasets), and the high-stability/high-stress quadrant is systematically depleted. The magnitude-stability relationship persists in scGPT foundation model embeddings, confirming it is a property of biological state space rather than linear projection. Perturbation stability provides a complementary axis for hit prioritization in screens, phenotypic quality control in cell manufacturing, and evaluation of in silico perturbation predictions.
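The stability metric described above can be sketched in a few lines. This is an illustrative reading of the abstract's definition (mean cosine similarity between each cell's shift vector and the mean perturbation direction), not the authors' implementation. The function names, the `eps` guard, and the choice to measure each shift from the control centroid are assumptions.

```python
import numpy as np

def directional_coherence(perturbed, control, eps=1e-12):
    """Mean cosine similarity between per-cell shift vectors and their mean direction.

    perturbed: (n_cells, n_features) array of perturbed-cell profiles.
    control:   (m_cells, n_features) array of unperturbed-cell profiles.
    """
    perturbed = np.asarray(perturbed, dtype=float)
    control = np.asarray(control, dtype=float)
    # Assumption: each cell's shift is measured from the control centroid.
    shifts = perturbed - control.mean(axis=0)
    mean_shift = shifts.mean(axis=0)
    shift_norms = np.linalg.norm(shifts, axis=1)
    mean_norm = np.linalg.norm(mean_shift)
    # eps avoids division by zero when a shift or the mean shift vanishes.
    cosines = (shifts @ mean_shift) / (shift_norms * mean_norm + eps)
    return float(cosines.mean())

def effect_magnitude(perturbed, control):
    """Distance between the perturbed centroid and the control centroid."""
    perturbed = np.asarray(perturbed, dtype=float)
    control = np.asarray(control, dtype=float)
    return float(np.linalg.norm(perturbed.mean(axis=0) - control.mean(axis=0)))

control = np.zeros((3, 2))
coherent = np.array([[2.0, 0.0], [3.0, 0.0], [1.0, 0.0]])   # all cells move the same way
scattered = np.array([[1.0, 0.0], [0.0, 1.0]])               # cells move in different directions
print(directional_coherence(coherent, control), directional_coherence(scattered, control))
```

Keeping magnitude and coherence as separate functions mirrors the abstract's point that the two can decouple: a perturbation can move the centroid far while individual cells scatter.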

2603.21919 2026-05-13 cond-mat.soft q-bio.SC

Mechanical stress induced by the polymerisation of an active gel near a surface

Kristiana Mihali, Dennis Wörthmüller, Pierre Sens

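AI summary: This study examines how mechanical stress generated by the polymerization of an active gel near the cell membrane affects membrane deformation. Using a hydrodynamic model of a compressible active gel, it studies how actin flow, density relaxation, and friction with the membrane induce normal and tangential stresses on the membrane in the linear deformation regime. Combining analytical solutions with finite element methods, the study shows how compressibility, interfacial friction, and actin turnover rate affect membrane stability, and determines the conditions that lead to linear instability of the membrane.

Details
Comments
12 pages, 9 figures
Abstract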

Actin flow in the cortical cytoskeleton underneath the cell membrane generates mechanical stresses that shape the cell surface. We study this mechanism using a hydrodynamic model of a compressible active gel polymerizing at the membrane and undergoing turnover. We determine how actin flow, density relaxation and friction of actin with the membrane generate stress on a corrugated membrane at the linear order in deformation. Analytical solutions in limiting regimes, combined with finite element methods in the general case, provide a map of normal and tangential stresses as functions of compressibility, interfacial friction and actin turnover, and determine the conditions under which actin polymerization can render the membrane linearly unstable. The non-linear regime is also briefly discussed.

2510.00733 2026-05-13 cs.LG cs.AI q-bio.QM

Neural Diffusion Processes for Physically Interpretable Survival Prediction

Alessio Cristofoletto, Cesare Rollo, Giovanni Birolo, Piero Fariselli

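AI summary: This paper proposes DeepFHT, a survival-analysis framework that combines deep neural networks with first hitting time (FHT) distributions from stochastic process theory, modeling the time to event as the time at which a latent diffusion process first reaches an absorbing boundary. A neural network maps input variables to physically meaningful parameters such as initial condition, drift, and diffusion coefficient, which yields closed-form survival and hazard functions without assuming proportional hazards. Experiments show that DeepFHT matches existing state-of-the-art methods in predictive performance while retaining a physically interpretable parameterization that helps reveal the relation between input features and risk.

Details
Comments
12 pages, 5 figures
Abstract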

We introduce DeepFHT, a survival-analysis framework that couples deep neural networks with first hitting time (FHT) distributions from stochastic process theory. Time to event is represented as the first passage of a latent diffusion process to an absorbing boundary. A neural network maps input variables to physically meaningful parameters including initial condition, drift, and diffusion, within a chosen FHT process such as Brownian motion, both with drift and driftless. This yields closed-form survival and hazard functions and captures time-varying risk without assuming proportional hazards. We compare DeepFHT with Cox regression using synthetic and real-world datasets. The method achieves predictive accuracy on par with the state-of-the-art approach, while maintaining a physics-based interpretable parameterization that elucidates the relation between input features and risk. This combination of stochastic process theory and deep learning provides a principled avenue for modeling survival phenomena in complex systems.
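As an illustration of what a closed-form survival function looks like in the Brownian-motion case, the sketch below evaluates the standard first-passage result for a process dX = drift dt + sigma dW started at x0 > 0 and absorbed at 0. It is not DeepFHT itself: in the framework described above a neural network would supply x0, drift, and sigma per subject, whereas here they are passed in by hand, and the function names are assumptions.

```python
import math

def _std_normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def fht_survival(t, x0, drift, sigma):
    """P(no hit of the boundary at 0 before time t) for Brownian motion with drift.

    x0:    initial distance from the boundary (must be > 0).
    drift: negative values push the process toward the boundary.
    sigma: diffusion scale (must be > 0).
    Note: the exponential term can overflow for extreme drift * x0 / sigma**2.
    """
    if t <= 0.0:
        return 1.0
    scale = sigma * math.sqrt(t)
    direct = _std_normal_cdf((x0 + drift * t) / scale)
    reflected = math.exp(-2.0 * drift * x0 / sigma ** 2) * _std_normal_cdf((-x0 + drift * t) / scale)
    return direct - reflected

# Survival curve for a subject drifting toward the boundary.
for t in (0.5, 1.0, 2.0, 5.0):
    print(t, round(fht_survival(t, x0=1.0, drift=-0.5, sigma=1.0), 4))
```

With positive drift the survival curve levels off at 1 - exp(-2 * drift * x0 / sigma**2) instead of decaying to zero, which is one way such a model can represent a fraction of subjects who never experience the event.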

2507.16179 2026-05-13 cond-mat.soft q-bio.BM

Cooperation and competition of basepairing and electrostatic interactions in mixtures of DNA nanostars and polylysine

Gabrielle R. Abraham, Tianhao Li, Anna Nguyen, William M. Jacobs, Omar A. Saleh

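AI summary: This study examines the cooperative and competitive effects of base pairing and electrostatic interactions in mixtures of DNA nanostars and polylysine. Combining experiment and theory, it investigates how temperature, ionic strength, and component ratio affect phase separation behavior, and finds that the two interactions cooperate at high salt and high temperature to stabilize the coacervate phase and produce multiphase coexistence. The study also reveals the kinetic pathways of phase separation and nonequilibrium aggregation behavior at different salt concentrations, showing that multiple interaction modes markedly increase the complexity of phase behavior in biomolecular systems.

Details
Journal ref
J. Am. Chem. Soc. 2025, 147, 46, 42452-42461
Comments
Include supplementary information
Abstract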

Phase separation in biomolecular mixtures can result from multiple physical interactions, which may act either complementarily or antagonistically. In the case of protein-nucleic acid mixtures, charge plays a key role but can have contrasting effects on phase behavior. Attractive electrostatic interactions between oppositely charged macromolecules are screened by added salt, reducing the driving force for coacervation. By contrast, base pairing interactions between nucleic acids are diminished by charge repulsion and thus enhanced by added salt, promoting associative phase separation. To explore this interplay, we combine experiment and theory to map the complex phase behavior of a model solution of poly-L-lysine (PLL) and self-complementary DNA nanostars (NS) as a function of temperature, ionic strength, and macromolecular composition. Despite having opposite salt dependences, we find that electrostatics and base pairing cooperate to stabilize NS-PLL coacervation at high ionic strengths and temperatures, leading to two- or three-phase coexistence under various conditions. We further observe a variety of kinetic pathways to phase separation at different salt concentrations, resulting in the formation of nonequilibrium aggregates or droplets whose compositions evolve on long timescales. Finally, we show that the cooperativity between electrostatics and base pairing can be used to create immiscible coacervates that partition various NS species at intermediate salt concentrations. Our results illustrate how the interplay between distinct interaction modes can greatly increase the complexity of the phase behavior relative to systems with a single type of interaction.

2412.04172 2026-05-13 q-bio.NC math.DS

Activity-dependent neuromodulation and calcium homeostasis cooperate to produce robust and modulable neuronal function

Arthur Fyon, Guillaume Drion

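AI summary: This study examines how activity-dependent neuromodulation and calcium homeostasis work together to keep neuronal function both stable and modulable. Using conductance-based computational models, it finds that a biologically inspired neuromodulation controller can operate alongside the calcium homeostasis mechanism, preserving neuronal firing patterns while maintaining intracellular calcium concentration. The study also shows that this cooperation depends on an intersection region in conductance space, notes that increasing neuronal degeneracy supports more reliable modulation, and reports that the mechanism also applies at the network level.

Details
Journal ref
PLOS Computational Biology 22(4): e1014177 (2026)
Abstract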

Neurons rely on two interdependent mechanisms, homeostasis and neuromodulation, to maintain robust and adaptable functionality. Calcium homeostasis stabilizes neuronal activity by adjusting ionic conductances, whereas neuromodulation dynamically modifies ionic properties in response to external signals carried by neuromodulators. Combining these mechanisms in conductance-based models often produces unreliable outcomes, particularly when sharp neuromodulation interferes with calcium-homeostatic tuning. This study explores how a biologically inspired neuromodulation controller can harmonize with calcium homeostasis to ensure reliable neuronal function. Using computational models of stomatogastric ganglion and dopaminergic neurons, we demonstrate that controlled neuromodulation preserves neuronal firing patterns while calcium homeostasis simultaneously maintains target intracellular calcium levels. Unlike sharp neuromodulation, the neuromodulation controller integrates activity-dependent feedback through mechanisms mimicking G-protein-coupled receptor cascades. The interaction between these controllers critically depends on the existence of an intersection in conductance space, representing a balance between target calcium levels and neuromodulated firing patterns. Maximizing neuronal degeneracy enhances the likelihood of such intersections, enabling robust modulation and compensation for channel blockades. We further show that this controller pairing extends to network-level activity, reliably modulating the rhythmic activity of central pattern generators. This study highlights the complementary roles of calcium homeostasis and neuromodulation, proposing a unified control framework for maintaining robust and adaptive neural activity under physiological and pathological conditions.

2410.00532 2026-05-13 q-bio.QM

smICA: Open-Source Software for Quantitative, Lifetime-Resolved Mapping of Absolute Fluorophore Concentrations in Living Cells

Tomasz Kalwarczyk, Grzegorz Bubak, Jarosław Michalski, Antoni Lis, Karina Kwapiszewska, Marta Pilz, Adam Mamot, Olga Perzanowska, Joanna Kowalska, Jacek Jemielity, Robert Hołyst

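AI summary: This study presents smICA, an open-source software tool for quantitatively resolving the absolute concentrations of fluorescent molecules in living cells together with their lifetime information. The method performs highly sensitive concentration mapping from single-molecule imaging data, needs only a few photons for cell segmentation and signal filtering, and markedly improves measurement efficiency. The study validates the method's reliability with in vitro and in vivo experiments and demonstrates its use in monitoring the time evolution of fluorescently labelled mRNA concentration in living cells, providing a useful tool for quantitative single-cell biology.

Details
Comments
19 pages, 8 figures, 31 references
Abstract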

Advanced microscopy techniques are essential in biomedical research for visualising and tracking biomolecules within living cells and their compartments. Conventional fluorescence microscopy methods, however, often struggle with accurately measuring the absolute concentrations of fluorescent probes in living cells. To overcome these limitations, we introduce an open-source analysis tool, smICA (Single-Molecule Image to Concentration Analyser). The smICA method offers quantitative mapping of absolute fluorophore concentrations, lifetime-resolved filtering methods of the signal, intensity-based cell segmentation, and requires only a few photons per pixel. Our approach also reduces the time required to determine the mean concentration per cell compared to the standard FCS measurement performed in multiple posts. To highlight the robustness of the method, we validated it against standard fluorescence correlation spectroscopy (FCS) measurements by performing in vitro (polymers in aqueous solution) and in vivo (polymers and EGFP in living cells) experiments. Finally, we present exemplary studies on the time evolution of fluorescently labelled mRNA concentration in living cells. The presented methodology, along with the software, is a promising tool for quantitative single-cell studies, including, but not limited to, protein expression, biomolecule degradation (such as proteins and mRNA), and monitoring enzymatic reactions.

2405.02038 2026-05-13 q-bio.NC math-ph math.MP q-bio.CB

Dimensionality reduction of neuronal degeneracy reveals two interfering physiological mechanisms

Arthur Fyon, Alessio Franci, Pierre Sacré, Guillaume Drion

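AI summary: This study examines how neurons maintain stable function despite highly variable ion channel composition. Using dimensionality reduction, it finds two principal dimensions in channel conductance space, revealing two interfering physiological mechanisms that can be explained by feedback regulation. The study provides quantitative insight into the relation between ion channel composition and neuronal electrophysiological activity and proposes a model-independent, reliable neuromodulation rule.

Details
Journal ref
PNAS Nexus, Volume 3, Issue 10, October 2024, pgae415
Abstract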

Neuronal systems maintain stable functions despite large variability in their physiological components. Ion channel expression, in particular, is highly variable in neurons exhibiting similar electrophysiological phenotypes, which poses questions regarding how specific ion channel subsets reliably shape neuron intrinsic properties. Here, we use detailed conductance-based modeling to explore the origin of stable neuronal function from variable channel composition. Using dimensionality reduction, we uncover two principal dimensions in the channel conductance space that capture most of the variance of the observed variability. Those two dimensions correspond to two physiologically relevant sources of variability that can be explained by feedback mechanisms underlying regulation of neuronal activity, providing quantitative insights into how channel composition links to neuronal electrophysiological activity. These insights allowed us to understand and design a model-independent, reliable neuromodulation rule for variable neuronal populations.

2605.11764 2026-05-13 cs.LG q-bio.BM

Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

Thor Klamt, Wolfgang Nejdl, Ming Tang

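AI summary: This study examines the generalization gap that arises when machine learning predicts the biological activity of PROTACs (proteolysis-targeting chimeras) and identifies inter-laboratory measurement variance as the main contributor to that gap. By analyzing how several models perform under different evaluation protocols, the study shows the substantial effect of cross-laboratory data differences on predictive performance and proposes a framework for decomposing the gap. It also provides the PROTAC-Bench dataset and associated evaluation tools as resources for follow-up research.

Details
Comments
32 pages, 11 figures, 11 tables. Dataset: https://huggingface.co/datasets/ThorKl/protac-bench (CC-BY-4.0). Code: https://github.com/ThorKlm/PROTAC-Bench (MIT)
Abstract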

Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade bounding the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication; a 21-dimensional 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot k=5 stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.7050, and post-hoc Platt scaling recovers raw output to within the 0.05 well-calibrated threshold. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
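The difference between a random split and leave-one-target-out evaluation comes down to how indices are grouped. The sketch below is a generic LOTO fold generator, not the PROTAC-Bench evaluation code; the function name and the toy target labels are assumptions, and the benchmark's own rules for which targets qualify as folds (65 of 173 targets, per the abstract) are not reproduced here.

```python
from collections import defaultdict

def leave_one_target_out(targets):
    """Yield (held_out_target, train_indices, test_indices) for each distinct target.

    targets: sequence with one target label per measurement.
    Every measurement of the held-out target goes to the test set, so the model
    never sees that target during training.
    """
    by_target = defaultdict(list)
    for index, target in enumerate(targets):
        by_target[target].append(index)
    for target, test_indices in by_target.items():
        train_indices = [i for i, t in enumerate(targets) if t != target]
        yield target, train_indices, test_indices

# Toy labels for illustration only.
labels = ["BRD4", "BRD4", "AR", "BTK", "AR"]
for held_out, train_idx, test_idx in leave_one_target_out(labels):
    print(held_out, train_idx, test_idx)
```

A random split of the same five measurements could place one BRD4 row in training and the other in test, which is the within-target interpolation that the abstract says random splits reward.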

2605.11718 2026-05-13 q-bio.NC cs.AI cs.NE

Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization

Zhaotian Gu, Molan Li, Jie Su, Chang Liu, Tianyi Qian, Dahui Wang

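AI summary: This study examines the computational origin of direction-selective maps in the dorsal stream of primate visual cortex, such as those in area MT. By introducing a spatiotemporal Topographic Deep Artificial Neural Network (TDANN) that combines self-supervised contrastive learning with a biologically inspired spatial loss, the model spontaneously develops brain-like motion direction maps and topological pinwheel structures when trained on naturalistic videos. The study shows that MT direction selectivity arises from an optimization trade-off between task-driven discriminative pressure and spatial regularization, that the representations quantitatively match macaque MT physiological baselines, and that the results offer new insight toward unifying the computational mechanisms of the dorsal and ventral visual streams.

Details
Abstract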

The spatial and functional organization of the primate visual cortex is a fundamental problem in neuroscience. While recent computational frameworks like the Topographic Deep Artificial Neural Network (TDANN) have successfully modeled spatial organization in the ventral stream, the computational origins of the dorsal stream's distinct topographies, such as direction-selective maps in the middle temporal (MT) area, remain largely unresolved. In this work, we present a spatiotemporal TDANN to investigate whether MT topography is governed by the same universal principles. By training a 3D ResNet on naturalistic videos via a Momentum Contrast (MoCo) self-supervised paradigm alongside a biologically inspired spatial loss, we demonstrate the spontaneous emergence of brain-like direction maps and topological pinwheel structures. Crucially, we reveal that MT tuning properties, characterized by strong direction selectivity paired with a residual axial component, arise from a strict optimization trade-off between task-driven discriminative pressure and spatial regularization. The model's representations quantitatively match in vivo macaque MT physiological baselines, including direction selectivity index, circular variance, and pinwheel density. These findings unify the computational origins of the ventral and dorsal streams, establishing a general mechanism for cortical self-organization.

2605.11675 2026-05-13 q-bio.QM q-bio.NC

Accounting for Missed Events in the Bayesian Modeling of IP3R Multimodal Gating

Schayma Ben Marzougui, Audrey Denizot, Hugues Berry

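AI summary: Addressing the modeling of multimodal gating in the IP3R channel, this study proposes an improved Bayesian approach that corrects the bias introduced when patch clamp recordings miss short-lived events because of limited temporal resolution. By using hierarchical Markov chain models and integrating missed-event correction directly into the likelihood function, the method markedly improves the accuracy of parameter estimation and model evaluation. The study finds that, once missed events are accounted for, the Park and Drive modes of the IP3R channel both rest on the same three-state Markov model but with different kinetic parameters, and that intermediate calcium concentrations strongly depress the Drive-to-Park transition, revealing differences in IP3R gating across calcium concentrations.

Details
Abstract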

The inositol 1,4,5-trisphosphate receptor channel (IP3R) is an important calcium channel involved in calcium-induced calcium release, playing a prominent role in intracellular calcium signaling. However, accurately characterizing its gating behavior remains a challenge, particularly because the temporal resolution of patch clamp techniques is not high enough to detect all short-lived events. This limitation can significantly bias the inference of kinetic models describing the receptor activity. To address this issue, we focused on the quantitative analysis of IP3R gating behavior using patch clamp data, with particular attention to missed events. We modeled IP3R channel gating using hierarchical Markov chains and used a Bayesian approach that integrates missed event correction directly into the likelihood function, enabling more accurate parameter inference and model evaluation. We show that accounting for missed events deeply clarifies the multi-modal model that emerges from model selection. In this new model, the Park and Drive modes both consist of the same 3-state Markov model, with mode-dependent kinetic parameters: the Drive mode stabilizes the closed state directly connected to the open one, whereas the Park mode stabilizes the other closed state, which is not connected to the open one. Intermediate Ca2+ concentrations are found to strongly depress the Drive to Park transition rate, so that the IP3R channel undergoes frequent transitions to the Park mode only for __ 50 nM or micromolar Ca2+ concentrations. Overall, our approach provides a refined perspective on IP3R channel modeling and highlights the critical importance of accounting for missed events upon model selection based on single-channel recordings.

2605.11648 2026-05-13 q-bio.QM

NORI: Fast probabilistic inference for ambiguous observation-entity mappings

Simon Van de Vyver, Tibo Vande Moortele, Ben-Björn Binke, Pieter Verschaffelt, Peter Dawyndt, Bart Mesuere

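AI summary: NORI is a fast probabilistic inference method for resolving ambiguous mappings between experimental observations and biological entities, orders of magnitude faster than existing methods. It supports large-scale data analysis and extensive hyperparameter optimization and can be applied to bioinformatics tasks such as protein inference and taxonomic and functional analysis in omics fields, markedly improving the efficiency and scope of such research.

Details
Comments
8 pages, 1 table
Abstract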

NORI performs probabilistic inference to resolve ambiguous mappings between experimental observations and biological entities, and does so orders of magnitude faster than state-of-the-art methods. This makes large-scale analysis and extensive hyperparameter optimization possible, and supports a broader range of bioinformatics applications, including protein inference and taxonomic and functional analysis in omics fields.

2605.11598 2026-05-13 cs.LG cs.AI cs.DB q-bio.QM

EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

Madhurima Panja, Danny D'Agostino, Huitao Li, Tanujit Chakraborty, Nan Liu

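AI summary: With the wide adoption of data-driven methods in public health decision-making, epidemic forecasting has become an important research area. To address the lack of high-quality multivariate forecasting benchmarks, this paper presents EpiCastBench, a large-scale benchmarking framework of 40 curated multivariate epidemic datasets that cover a range of infectious diseases and geographical regions and differ in temporal granularity, series length, and sparsity. The study systematically compares 15 multivariate forecasting models under a unified evaluation setting; all data and code are publicly available, supporting the development and validation of epidemic forecasting methods.

Details
Abstract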

The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).

2605.11450 2026-05-13 q-bio.MN

Scalable vertex guided filtrations identify structurally relevant genes in cancer networks

Edmara Viana, Rodrigo Henrique Ramos, Flávia Raquel Gonçalves Carneiro, Cynthia de Oliveira Lage Ferreira

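AI summary: This study proposes a vertex function-based (VFB) filtration method for analyzing topological structure in cancer-associated protein networks in order to identify structurally relevant genes. Compared with the conventional Vietoris-Rips (VR) filtration, VFB is computationally more efficient and captures second- and third-order topological structures (Betti-2 and Betti-3), allowing it to find new driver genes and confirm their biological relevance. The method provides a scalable and biologically interpretable tool for large-scale network analysis.

Details
Comments
13 pages and 3 figures
Abstract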

Topological data analysis (TDA) has established itself as a useful tool for capturing multiscale structures in complex networks, such as connected components, cycles, and cavities. Although the Vietoris-Rips (VR) filtration is widely used in network analysis, it tends to be computationally expensive, especially for large networks. This work explores vertex function-based (VFB) filtrations built from network measures, applying persistent homology to identify relevant topological structures in cancer-associated protein networks, and compares their effectiveness with the VR approach. The results show that VFB reproduces the second-order structures (Betti-2) identified by VR, recovering previously reported essential genes. In addition, VFB detected new driver genes, confirmed in databases such as IntOGen and NCG, and allowed analysis of third-order structures (Betti-3) that was not feasible with VR. Thus, VFB represents a scalable alternative to VR, preserving biological interpretability and complementing classical network metrics.
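To show what a vertex-function filtration does mechanically, the sketch below grows a graph by admitting vertices in increasing order of a vertex value and records the number of connected components (Betti-0) at each threshold. It is a deliberately reduced illustration: the Betti-2 and Betti-3 structures discussed above need higher-dimensional simplices and a persistent homology library, and the sublevel direction, the function name, and the toy values are all assumptions.

```python
def betti0_along_filtration(vertex_values, edges):
    """Return [(threshold, n_components)] for the sublevel filtration of a graph.

    vertex_values: dict mapping vertex -> filtration value (e.g. a network measure).
    edges:         iterable of (u, v) pairs; an edge is present once both ends are.
    """
    parent = {}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    neighbors = {v: [] for v in vertex_values}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    components = 0
    curve = []
    for threshold in sorted(set(vertex_values.values())):
        entering = [v for v, value in vertex_values.items() if value == threshold]
        for v in entering:
            parent[v] = v
            components += 1
        for v in entering:
            for w in neighbors[v]:
                if w in parent:  # neighbor already admitted, so the edge appears
                    root_v, root_w = find(v), find(w)
                    if root_v != root_w:
                        parent[root_v] = root_w
                        components -= 1
        curve.append((threshold, components))
    return curve

# Path graph a - b - c where the middle vertex enters last.
print(betti0_along_filtration({"a": 0, "b": 2, "c": 1}, [("a", "b"), ("b", "c")]))
```

Each vertex and edge is touched once, which hints at why a vertex-driven filtration can scale better than one built from all pairwise distances.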

2605.11389 2026-05-13 math.DS q-bio.MN

Bistability, Absolute Concentration Robustness, and Hysteresis in Dual-Site Futile Cycles with Bifunctional Enzymes

Badal Joshi, Tung D. Nguyen, Matthew D. Johnston

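AI summary: This paper studies dual-site futile cycles catalyzed by bifunctional enzymes, examining their dynamics in terms of the number and stability of steady states and the bifurcation structure. Mathematical analysis reveals how the four networks differ in boundary steady states, bistability, and absolute concentration robustness, and finds that one network exhibits bistability and absolute concentration robustness simultaneously: the system can reach either of two stable states with different intermediate concentrations but the same final product concentration.

Details
Abstract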

Bifunctional enzymes, which catalyze both the forward and reverse steps of a substrate modification reaction, arise naturally in bacterial two-component signaling systems and metabolic regulation. Beyond their well-known role in conferring absolute concentration robustness (ACR) on substrate species, bifunctional enzymes profoundly shape the dynamical landscape of the networks in which they appear. We study a class of dual-site futile cycles in which the reverse modification steps are carried out by bifunctional enzyme-substrate compounds, and provide a complete mathematical analysis of all four such networks, characterizing the existence, number, and stability of steady states, as well as the bifurcation structure as total substrate is varied. All four networks admit boundary steady states, in contrast to the non-bifunctional case. The networks differ in the number and stability of boundary steady states, in the maximum number of positive steady states (ranging from two to four), and in whether bistability is present. In two networks, a transcritical bifurcation connects the boundary and positive steady state branches; in one case this is a backward bifurcation, producing hysteresis. Perhaps the most striking phenomenon occurs in one of the four networks, which simultaneously exhibits bistability and ACR in the final modification state, where the system can settle into either of two stable steady states with different intermediate concentrations yet identical final product concentration.

2605.11368 2026-05-13 cs.LG cs.AI q-bio.GN

LPDP: Inference-Time Reward Control for Variable-Length DNA Generation with Edit Flows

Jeongchan Kim, Yunkyung Ko, Jong Chul Ye

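AI summary: This paper studies how to use Edit Flows for inference-time reward control during DNA sequence generation. It proposes LPDP, a training-free local re-solving operator that is aware of intermediate states and actions and enables efficient edit operations when generating variable-length DNA sequences. At each inference step, LPDP scores one-step root edits, retains a set of near-best root edits, and solves a bounded local discrete optimization problem, improving the quality and biological plausibility of generated sequences for tasks such as enhancer optimization and splice-boundary inpainting.

Details
Comments
22 pages, 5 figures
Abstract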

We study the application of recent Edit Flows to inference-time reward control in DNA sequence generation. Unlike most reward-guided DNA generation frameworks, which operate on fixed-length sequence spaces, Edit Flows have the potential to generate variable-length DNA through biologically plausible insertion, deletion, and substitution operations. In particular, we propose Local Perturbation Discrete Programming (LPDP), a training-free, intermediate-state and action-aware local re-solving operator for variable-length DNA edit-action generators at inference time. More specifically, at each guided rollout step, LPDP scores one-step root edits, retains a near-best root band, and re-ranks each retained root by solving a bounded local discrete program around its child sequence. This local program uses the typed geometry of edit actions to focus on coherent substitution, insertion, or deletion subgraphs, and aggregates local continuations with either a hard Max backup or a soft log-sum-exponential (LSE) backup. We instantiate LPDP in two regimes: front-loaded reward tilting for enhancer optimization, where early edits are critical for establishing global regulatory sequence structure, and back-loaded reward tilting for exon-intron-exon inpainting, where late edits fine-tune splice-boundary contexts.
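The two aggregation rules named above differ only in how they combine the scores of local continuations. The sketch below shows a hard Max backup next to a numerically stable log-sum-exponential backup; it illustrates the two operators in isolation and is not LPDP. The `temperature` parameter and function names are assumptions, since the abstract does not say how the soft backup is scaled.

```python
import math

def max_backup(values):
    """Hard backup: keep only the best continuation score."""
    return max(values)

def lse_backup(values, temperature=1.0):
    """Soft backup: temperature * log(sum(exp(v / temperature))).

    Subtracting the maximum before exponentiating avoids overflow.
    The result is never below max(values) and approaches it as temperature -> 0.
    """
    best = max(values)
    total = sum(math.exp((v - best) / temperature) for v in values)
    return best + temperature * math.log(total)

scores = [1.0, 0.9, -2.0]
print(max_backup(scores), lse_backup(scores), lse_backup(scores, temperature=0.01))
```

The soft backup credits a root edit that has several good continuations, whereas the hard backup ranks roots by their single best continuation.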

2605.11258 2026-05-13 cs.AI cs.CL q-bio.QM

Unlocking LLM Creativity in Science through Analogical Reasoning

Andrew Shen, Shaul Druckmann, James Zou

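AI summary: This paper studies how analogical reasoning (AR) can increase the creativity of large language models (LLMs) on scientific problems, particularly in complex fields such as biomedicine. The authors find that existing LLMs tend toward mode collapse in open-ended problem solving, generating solutions with little diversity, and propose AR, which generates novel solutions from the analogical structure of cross-domain problems. Experiments show that AR markedly increases the diversity and novelty of generated solutions and outperforms existing methods on several biomedical tasks, supporting its effectiveness in practical applications.

Details
Abstract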

Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation ($ρ$=0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.

2605.11221 2026-05-13 q-bio.QM cs.LG

Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows

Yaochen Rao, Farzaneh Jalalypour, N. M. Anoop Krishnan, Rocío Mercado

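AI summary: This study addresses the lack of structured experimental data in targeted protein degradation (TPD) by proposing an expert-in-the-loop large language model (LLM) workflow that automates the extraction of key experimental information from the scientific literature. The method refines prompt instructions from a small number of expert-annotated samples and achieves high-accuracy data extraction and expansion for databases of two classes of TPD compounds, molecular glues and PROTACs, substantially increasing database size and the completeness of experimental information. The results provide reusable tools and data resources for TPD research and for scientific literature curation more broadly.

Details
English abstract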

Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert-annotated ground truth. A lightweight cross-validated prompt-refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record-level $F_1 = 0.98$ and transferred to PROTACs by terminology substitution alone, maintaining record-level $F_1 > 0.93$. Applied at scale, it expanded molecular glue and PROTAC databases by 81% and 92% records, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay-context information essential for cross-study potency comparison and condition-aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.

2605.11189 2026-05-13 cs.LG q-bio.BM

Deep Learning for Protein Complex Prediction and Design

Ziwei Xie

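AI summary: This thesis studies how deep learning can accurately model and design protein complex structures, a central problem in computational structural biology with important implications for understanding cellular function and developing drugs. It proposes deep learning architectures tailored to the hierarchical nature of protein structures and designs efficient search algorithms that find interacting homologs in vast sequence spaces, improving the accuracy of complex structure prediction and protein sequence design.

Details
Comments
PhD thesis
English abstract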

Accurately modeling and designing protein complex structures is a central problem in computational structural biology, with broad implications for understanding cellular function and developing therapeutics. This thesis investigates two fundamental aspects of this problem using deep learning: domain-specific architectures that capture the hierarchical nature of protein structures, and search algorithms that efficiently navigate the vast sequence spaces of protein complexes to identify interacting homologs for improving complex structure prediction and to design protein sequences.

2605.11028 2026-05-13 q-bio.OT

Morpho-Physiological and Genetic Diversity of Crataegus Taxa (Rosaceae) in Selected Locations of Iraqi Kurdistan-Region

Karzan Ezzalddin Mohammed

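AI summary: This paper studies the morphological, physiological-biochemical, and genetic diversity of sixty-one hawthorn (Crataegus spp.) accessions from the Kurdistan region of Iraq. Morphological and molecular marker analyses identified seven hawthorn taxa, comprising five species and two hybrids, and found significant differences among ecotypes in plant type, reproductive stage, and fruit morphology. The results show that fruit morphology and physicochemical properties vary widely among accessions, providing an important basis for the conservation and use of hawthorn resources.

Details
Comments
96 pages
English abstract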

One of the great phytogeographic zones of the world's semi-arid lands is the Kurdistan region of Iraq, which hosts many important fruit species owing to its geographical location and ecology. Mountain hawthorn (Crataegus spp.) is a vital wild edible deciduous fruit tree of the region, highly valued for ornamental, economic, industrial, and medicinal uses. In the present study, morphological, phytochemical, and molecular marker systems were applied to sixty-one hawthorn accessions from different locations in the Iraqi Kurdistan region from April 2022 to September 2023. Phenotypic markers have proven extremely useful in studies of genetic diversity in hawthorn genotypes. The morphological results identified seven taxa (five species and two hybrids): Crataegus azarolus, Crataegus meyeri, Crataegus monogyna, Crataegus orientalis, Crataegus pentagyna, Crataegus azarolus x Crataegus meyeri, and Crataegus azarolus x Crataegus pentagyna. There was significant variation among ecotypes in plant type, reproductive stage, fruit morphology, and production uses. Fruit physio-morphological data revealed a high level of significant variability (P 0.01) among accessions based on the analysis of variance. The most important characteristics for explaining fruit morphological variability were 11 variables: fruit weight (FW), fruit length (FL), fruit width (FW), seed length (SL), seed width (SW), number of seeds per fruit (NSF), volume solution (VS), fruit fresh weight (WOF), seed weight (WS), potential of hydrogen (pH), and moisture content (MC). All of these traits differed significantly among the studied accessions.

2605.11022 2026-05-13 q-bio.GN cs.AI cs.ET cs.LG

SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification

Akarsh Gupta, Kenneth Rodrigues, Sagnik Chatterjee

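AI summary: This study proposes SCOPE, a Siamese contrastive operon pair embedding method for functional sequence representation and classification. By classifying over a fused embedding space, the method performs well on operonic pair identification, reaching a ROC-AUC of 0.71, comparable to current state-of-the-art models. The study finds that protein language model embeddings already capture functional relationships effectively, offering a feasible and scalable solution for operon identification across large-scale microbial genomes.

Details
English abstract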

Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
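
The contrast drawn in the abstract, between scoring a pair by cosine similarity and classifying over a fused pair representation, can be sketched in a few lines of Python. The abstract does not say how the two embeddings are fused, so the `fuse` function below (concatenation plus element-wise absolute difference and product) is a common but purely hypothetical choice, and the toy vectors stand in for real protein language model embeddings.

```python
import math

def cosine_similarity(u, v):
    """Unsupervised pair score: cosine of the angle between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fuse(u, v):
    """Hypothetical fused pair features for a learned classifier head.

    Assumption: [u, v, |u - v|, u * v]; the paper's actual fusion is not stated.
    """
    return (list(u) + list(v)
            + [abs(a - b) for a, b in zip(u, v)]
            + [a * b for a, b in zip(u, v)])

gene_a = [1.0, 2.0]   # toy embedding of the first gene product
gene_b = [2.0, 4.0]   # same direction, twice the magnitude
print(cosine_similarity(gene_a, gene_b))  # direction only, magnitude is ignored
print(fuse(gene_a, gene_b))               # keeps both vectors for a classifier to use
```

Cosine similarity collapses each pair to one number that ignores vector magnitude, whereas a classifier over fused features can in principle learn which dimensions matter for the label.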

2605.10994 2026-05-13 q-bio.NC q-bio.OT

Internally triggered retrospective learning in neural networks

Arturo Tozzi

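AI summary: This paper proposes an internally triggered retrospective learning method for neural networks. Unlike conventional continuous parameter updates driven by external input, parameter updates are triggered by events that the network itself generates. During network operation, synaptic interactions are accumulated as latent traces encoding recent coactivation patterns, while an internal predictive mechanism continuously computes the discrepancy between predicted and actual states. When the discrepancy exceeds an adaptive threshold, a learning event is triggered, selectively integrating past activity. The method can reduce unnecessary parameter drift and suits applications that require selective adaptation to rare or important inputs.

Details
Comments
13 pages, 2 figures
English abstract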

Learning in artificial neural networks usually relies on continuous, externally driven weight updates, in which parameters are modified at every step in response to incoming data, error signals or reward feedback. In this setting, routine and informative inputs contribute similarly to parameter adjustment. We introduce a learning approach in which parameter updates are governed by internally generated events arising from the network's own representational dynamics. During ongoing activity, synaptic interactions are accumulated as latent traces encoding recent coactivation patterns, without immediately modifying the underlying parameters. In parallel, an internal predictive process estimates the evolving latent state, while a scalar measure of discrepancy between predicted and observed states is continuously computed. When discrepancy exceeds an adaptive threshold derived from recent error statistics, a learning event is triggered, inducing a retrospective update selectively integrating past activity into the current configuration. We performed simulations using a minimal neural network exposed to structured sequential inputs with transient perturbations. We found that learning occurs through sparse, temporally localized events associated with increases in prediction error, leading to stepwise changes in synaptic efficacy and discrete transitions in latent state organization. By selectively reorganizing parameters in response to internally detected discrepancies, our episodic updating may reduce unnecessary parameter drift while preserving informative patterns. Potential applications include systems requiring selective adaptation to rare or informative inputs such as physiological, industrial or environmental monitoring, edge computing under limited energy budgets, autonomous systems operating in dynamic conditions and sequential computational data processing.
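
The triggering rule described in the abstract can be sketched for a single scalar weight. This is a minimal illustration under stated assumptions, not the author's model: the threshold form (mean plus `k` standard deviations over a sliding window), the window length, the learning rate, and the additive trace are all hypothetical choices.

```python
from collections import deque
from statistics import mean, pstdev

def run_episodic_learning(errors, coactivations, window=5, k=2.0, lr=0.1):
    """Return (final_weight, steps at which learning events fired)."""
    recent = deque(maxlen=window)  # recent discrepancy values
    trace = 0.0                    # latent coactivation trace, not yet applied
    weight = 0.0
    events = []
    for t, (err, coact) in enumerate(zip(errors, coactivations)):
        trace += coact             # accumulate without touching the weight
        if len(recent) == window:
            # Assumed adaptive threshold from recent error statistics.
            threshold = mean(recent) + k * pstdev(recent)
            if err > threshold:
                weight += lr * trace  # retrospective update from the stored trace
                trace = 0.0
                events.append(t)
        recent.append(err)
    return weight, events

# Toy discrepancy signal with one transient perturbation at step 6.
errors = [0.1, 0.12, 0.09, 0.11, 0.1, 0.1, 0.9, 0.1, 0.11, 0.1]
coacts = [1.0] * len(errors)
weight, events = run_episodic_learning(errors, coacts)
print(events, weight)
```

In this toy run the weight changes only at the perturbation, and the update integrates the coactivation accumulated since the start, which mirrors the sparse, stepwise learning the abstract reports.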

2605.10985 2026-05-13 cs.LG cs.AI q-bio.BM

Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning

Siddhant Dutta, Edward Tan Beng Wai, Soumick Sarker, Pasan Gunawardane, Jagath C. Rajapakse

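AI summary: This study proposes an interpretable approach to protein language model representations that uses differentiable graph partitioning to map ESM-2 representations onto protein contact graphs and uses the SoftBlobGIN network to learn functional substructures, improving both performance and interpretability on prediction tasks. Without retraining the language model and with only a small number of added parameters, the method performs strongly on tasks such as enzyme classification and function prediction and automatically identifies biologically meaningful functional regions, such as active-site residues and catalytic contact patterns. Experiments show that the framework significantly improves the accuracy and auditability of structural explanations, giving protein language models structure-level transparency.

Details
Comments
19 pages, 8 figures, 11 tables, submitted to NeurIPS 2026
English abstract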

Protein language models such as ESM-2 learn rich residue representations that achieve strong performance on protein function prediction, but their features remain difficult to interpret as structural and evolutionary signals are encoded in dense latent spaces. We propose a plug-and-play framework that projects ESM-2 representations onto protein contact graphs and applies SoftBlobGIN, a lightweight Graph Isomorphism Network with differentiable Gumbel-softmax substructure pooling, to perform structure-aware message passing and learn coarse functional substructures for downstream prediction tasks. Across enzyme classification, SoftBlobGIN achieves 92.8% accuracy and 0.898 macro-F1. Unlike post hoc analysis of protein language models alone, our method produces directly auditable structural explanations: GNNExplainer recovers biologically meaningful active-site residues, spatially localized functional clusters, and catalytic contact patterns. On binding-site detection, SoftBlobGIN improves residue AUROC from $0.885$ using an ESM-2 linear probe to $0.983$, indicating that these structural explanations are not recoverable from language-model features alone. Learned blob partitions provide an additional layer of interpretability by automatically grouping residues into functional substructures, with blobs containing annotated active-site residues showing $1.85\times$ higher importance than other blobs ($\rho = 0.339$, $p = 0.009$), without any active-site supervision. Our framework requires no retraining of the language model, adds only $\sim$1.1M parameters, and generalises across ProteinShake tasks, achieving $F_{\max}$ of $0.733$ on Gene Ontology prediction and AUROC of $0.969$ on binding-site detection. We position this as an interpretable structural companion to protein language models that makes their predictions more transparent and auditable.

2605.10979 2026-05-13 q-bio.OT

Statin Recommendations among US Adults with the 2026 Dyslipidemia Guidelines

James A. Diao, Thomas A. Buckley, Andrew Z. Zhou, Smaraki Dash, Rishi K. Wadhera, Arjun K. Manrai

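AI summary: This study analyzes the effect of the 2026 US dyslipidemia guideline on statin recommendations for middle-aged and older adults. Compared with the 2018 guideline, the new guideline reduces statin recommendations by about 3 million people under the class 1 recommendation threshold, but increases them by about 20.8 million under the class 2 threshold, which introduces 30-year risk assessment. The study notes that the new guideline affects population groups very differently, with a large increase in recommendations for young and middle-aged adults in particular, highlighting the key role of 30-year risk assessment in expanding eligibility.

Details
English abstract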

Importance: The 2026 multisociety dyslipidemia guideline recommended the PREVENT equations in place of the PCE equations, introduced 30-year risk assessment as a new treatment pathway, and lowered risk-based treatment thresholds. The net population impact of these concurrent changes on statin recommendations is unknown. Objective: To estimate changes in statin recommendations under 2026 PREVENT-based dyslipidemia guidelines compared with 2018 PCE-based guidelines. Design and Participants: Cross-sectional analysis of pooled data from NHANES, spanning 2011-2023 and comprising 24,199 participants aged 30-79 years. Main Outcomes and Measures: Number and proportion of US adults receiving or recommended for statin therapy. Results: At the class 1 threshold, the number of US adults receiving or recommended for statin therapy decreased by an estimated 3.0 million (95% CI, 2.3 million to 3.6 million), with larger reductions among Black adults (-4.2 percentage points [pp]), men (-4.0pp), and adults aged 50-69 years (-5.6pp). At the class 2 threshold--which additionally recommends statins for adults aged 30-59 years based on 30-year risk--the number of adults recommended increased by an estimated 20.8 million (95% CI, 19.6 million to 22.0 million), or +11.6pp. The increase was largest among adults aged 50-59 years (+19.7pp) and 40-49 years (+14.8pp). Conclusions: The net population impact of the 2026 dyslipidemia guidelines depends critically on which recommendation class is applied. At the class 1 threshold, statin recommendations decreased modestly; at the class 2 threshold, inclusion of 30-year risk assessment substantially expanded recommendations, particularly among younger adults. These divergent effects underscore the importance of the 30-year risk criterion as a major driver of new eligibility and the need for outcomes and equity monitoring during guideline implementation.

2605.10840 2026-05-13 cs.LG cs.AI q-bio.QM

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

Yixuan Yang, Mehak Arora, Ryan Zhang, Baraa Abed, Junseob Kim, Tilendra Choudhary, Md Hassanuzzaman, Kevin Zhu, Ayman Ali, Chengkun Yang, Alasdair Edward Gent, Victor Moas, Rishikesan Kamaleswaran

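AI summary: This paper proposes Clin-JEPA, a multi-phase co-training framework for electronic health record (EHR) patient trajectories that uses joint-embedding predictive (JEPA) pretraining to unify patient trajectory forecasting and multiple downstream risk-prediction tasks in one model. Through a five-phase pretraining curriculum, the method stably co-trains a Qwen3-8B-based encoder and a large latent trajectory predictor, addressing the inability of conventional JEPA methods to co-train the predictor and encoder effectively. Experiments show that Clin-JEPA significantly outperforms existing methods on the MIMIC-IV dataset and achieves superior performance on multiple risk-prediction tasks.

Details
Comments
17 pages, 4 figures, 8 tables. Code: https://github.com/YeungYathin/Clin-JEPA
English abstract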

We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data -- to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning -- remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum -- predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization -- addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent $\ell_1$ rollout drift uniquely converges ($-$15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83$\times$ further than stable patients in latent space, vs $\leq$2.62$\times$ for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).

2605.09964 2026-05-13 cs.AI q-bio.QM

Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach

Ziqi Gao, Chenyi Zi, Zijing Liu, Ziqiao Meng, Yu Li, Jia Li

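AI summary: Protein-protein interactions (PPIs) play a key role in cellular function and disease mechanisms. Current learning-based PPI prediction methods focus on learning protein representations but neglect the design of specialized classification heads, usually relying on generic aggregation methods that lack biological grounding. This paper proposes L3-PPI, a model-agnostic PPI classifier based on the biological "L3 rule". By introducing an L3-path-regularized graph prompt learning method, it recasts the classification of protein embedding pairs as a graph-level classification task and effectively improves prediction performance.

Details
Comments
Accepted at ICML 2026
English abstract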

Protein-protein interactions (PPIs) are fundamental to cellular function and disease mechanisms. Current learning-based PPI predictors focus on learning powerful protein representations but neglect designing specialized classification heads. They mainly rely on generic aggregating methods like concatenation or dot products, which lack biological insight. Motivated by the biological "L3 rule", where multiple length-3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3-path-regularized graph prompt learning method called L3-PPI, which can generate a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3-PPI reformulates the classification of protein embedding pairs into a graph-level classification task over the generated prompt graph. This lightweight module seamlessly integrates with PPI predictors as a plug-and-play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3-PPI achieves superior performance enhancements over advanced competitors.
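
The L3 rule mentioned in the abstract rests on counting length-3 paths between two proteins in an interaction network. The sketch below shows only that counting step on a toy undirected graph. The protein names and edges are made up, the function name is hypothetical, and the sketch does not reproduce L3-PPI's prompt-graph generation or any degree normalization used in the L3 literature.

```python
def count_l3_paths(adj, u, v):
    """Count simple paths u - a - b - v of length 3 in an undirected graph.

    adj maps each node to the set of its neighbours (no self-loops assumed).
    """
    count = 0
    for a in adj[u]:
        if a == v:
            continue  # a path may not pass through its own endpoint
        for b in adj[a]:
            if b == u or b == v:
                continue
            if v in adj[b]:
                count += 1
    return count

def build_adjacency(edges):
    """Build a symmetric neighbour-set dictionary from an edge list."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj

# Toy network: P1 and P4 are not directly linked but share three L3 paths.
edges = [("P1", "A1"), ("P1", "A2"), ("A1", "B1"), ("A2", "B1"),
         ("A2", "B2"), ("B1", "P4"), ("B2", "P4")]
adj = build_adjacency(edges)
print(count_l3_paths(adj, "P1", "P4"))
```

A higher count for a candidate pair is the kind of signal the L3 rule treats as evidence of a likely interaction.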

2602.17739 2026-05-13 q-bio.GN cs.AI cs.LG

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang

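AI summary: GeneZip is a region-aware compression framework for long-context DNA modeling that aims to address the shortcomings of existing methods in compression budget allocation and computational cost. The method combines a dynamic routing mechanism with a Region-Aware Ratio (RAR) objective and uses gene-structure annotations to guide compression, so that at inference time it can compress raw DNA sequences efficiently without annotations. GeneZip performs well in compression quality, redundancy identification, and training efficiency, significantly improving the performance and scalability of long-sequence DNA models.

Details
Comments
Preprint, work in progress
English abstract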

Long-context DNA models are limited by token-mixing cost and by how compression allocates representational budget across the genome. Existing approaches operate close to base-pair resolution, apply fixed downsampling, or learn content-dependent chunks without an explicit genomic budget, making long-context pretraining expensive and difficult to control. We introduce GeneZip, a region-aware DNA compression framework that combines H-Net-style dynamic routing with a Region-Aware Ratio (RAR) objective and bounded routing. GeneZip uses static gene-structure annotations during compression training to specify region-wise base-pairs-per-token (BPT) targets; at inference time, it compresses raw unseen DNA without annotations. GeneZip provides three main benefits. First, it is effective: GeneZip variants achieve the best validation PPL among encoder-based compressors, with GeneZip-70M operating at 137.6 BPT, and across four reproducible DNALongBench tasks--contact map prediction, eQTL prediction, enhancer-target gene prediction, and transcription-initiation signal prediction--GeneZip obtains the best average rank among compared sequence models. Second, it is redundancy-aware: a post-hoc RepeatMasker/TRF analysis shows that, without repeat supervision, GeneZip assigns higher local BPT to TE-derived interspersed repeats and tandem repeats, two major classes of repetitive DNA sequence redundancy. Third, it is efficient: by reducing the effective token-mixing length, GeneZip enables longer-context and larger-capacity pretraining, including 128K-context and 636M-parameter variants on a single A100 80GB GPU, and fine-tunes the eQTL task 50.4x faster than JanusDNA (50 vs. 2520 minutes). These results establish GeneZip as an effective, redundancy-aware, and efficient compression interface for long-context DNA modeling.

2602.15451 2026-05-13 q-bio.QM cs.AI cs.LG quant-ph

Molecular Design beyond Training Data with Novel Extended Objective Functionals of Generative AI Models Driven by Quantum Annealing Computer

Hayato Kunugi, Mohsen Rahmani, Yosuke Iyama, Yutaro Hirono, Akira Suma, Matthew Woolway, Vladimir Vargas-Calderón, William Kim, Kevin Chern, Mohammad Amin, Masaru Tateno

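AI summary: This study proposes an optimization framework for deep generative models that incorporates a quantum annealing computer for small-molecule drug design, addressing the low frequency with which conventional generative models produce drug-like compounds. The study introduces a Neural Hash Function (NHF) that serves as both a regularization and a binarization scheme, used for signal conversion between classical and quantum neural networks and for constructing the error function. Experiments show that the quantum-annealing-based generative models outperform conventional models in molecular validity and drug-likeness and, without additional constraints, surpass the training data, demonstrating the potential advantage of quantum computing in drug design.

Details
Comments
28 pages, 4 figures
English abstract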

Deep generative modeling to stochastically design small molecules is an emerging technology for accelerating drug discovery and development. However, one major issue in molecular generative models is the low frequency with which they produce drug-like compounds. To resolve this problem, we developed a novel framework for optimization of deep generative models integrated with a D-Wave quantum annealing computer, where our Neural Hash Function (NHF) presented herein is used simultaneously as both the regularization and binarization schemes, of which the latter is for transformation between continuous and discrete signals of the classical and quantum neural networks, respectively, in the error evaluation (i.e., objective) function. The compounds generated via the quantum-annealing generative models exhibited higher quality in both validity and drug-likeness than those generated via the fully-classical models, and were further indicated to exceed even the training data in terms of drug-likeness features, without any restraints or conditions to deliberately induce such an optimization. These results indicated an advantage of quantum annealing to aim at a stochastic generator integrated with our novel neural network architectures, for the extended performance of feature space sampling and extraction of characteristic features in drug design.