arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08019 2026-05-11 cs.AI q-bio.NC

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

Botos Csaba, Sreejan Kumar, Austin Tudor David Andrews, Laurence Hunt, Chris Summerfield, Joshua B. Tenenbaum, Rui Ponte Costa, Marcelo G. Mattar, Momchil Tomov

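AI summary This study asks whether modern AI systems can, like humans, rapidly learn abstract knowledge in novel environments and act efficiently. Using human behavioral and brain-activity data from complex game tasks, it compares frontier Large Reasoning Models (LRMs) with deep reinforcement learning agents. LRMs substantially outperform conventional reinforcement learning models at matching game-learning behavior and predicting brain activity, with an order-of-magnitude advantage in prediction across cortical and subcortical regions. The results suggest that LRMs capture human learning and decision making in complex, naturalistic environments more effectively.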

English abstract

Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model's in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: https://botcs.github.io/reason-to-play/

2605.08014 2026-05-11 q-bio.NC math.DS

Dynamical mechanisms of flexible phase-locking in cortical theta oscillators

Yangyang Wang, Benjamin R. Pittman-Polletta

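AI summary This study investigates how cortical theta oscillators flexibly phase-lock to input signals on different timescales. Using dynamical systems analysis, it finds that interactions among intrinsic inhibitory currents operating on multiple timescales produce delayed Hopf bifurcation phenomena, substantially widening the frequency range over which the oscillator can synchronize. The model shows that theta- and delta-timescale inhibitory currents act together to form a three-timescale structure that strengthens phase locking under external input, offering a candidate neural mechanism for cognitive functions such as speech segmentation.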

Comments 30 pages, 15 figures

English abstract

Oscillatory activity in auditory cortex is thought to play a central role in auditory and speech processing by synchronizing neural rhythms to external acoustic features of the speech stream. To support this function, cortical oscillators must flexibly phase-lock to inputs spanning a wide range of timescales, including rhythms substantially slower than their intrinsic frequency. Here we identify a general dynamical mechanism by which intrinsic inhibitory currents operating on multiple timescales enable such flexible phase-locking. Using tools from dynamical systems theory, we show that interactions between slow and superslow inhibitory processes generate prolonged post-input recovery delays through delayed Hopf phenomena, thereby substantially expanding the frequency range over which entrainment can occur. We demonstrate this mechanism in a biophysically grounded cortical theta oscillator model for speech segmentation. Specifically, we show that both a theta-timescale (4-8 Hz) inhibitory current $I_m$ and a slower delta-timescale (1-4 Hz) inhibitory potassium current $I_{\rm K_{SS}}$ are crucial for entrainment flexibility. Their interaction creates a three-timescale structure that gives rise to pronounced delay phenomena associated with a delayed Hopf bifurcation (DHB). Interestingly, the superslow $I_{\rm K_{SS}}$ and the associated DHB play little role in the unforced oscillatory dynamics, but are recruited to support phase locking under external forcing. Moreover, the intermediate-timescale current $I_m$, rather than being redundant, further expands the phase-locking range by prolonging delayed recovery along the superslow manifold. Together, these results suggest that coordination among intrinsic inhibitory currents operating on multiple timescales may represent a key mechanism supporting flexible phase locking to rhythmic inputs in the brain.

2605.07838 2026-05-11 q-bio.QM cs.AI cs.LG

PPI-Net connects molecular protein interactions to functional processes in disease

Kyle Higgins, Guadalupe Gonzalez, Dennis Veselkov, Ivan Laponogov, Kirill Veselkov

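AI summary This study proposes PPI-Net, a hierarchical graph neural network that integrates protein-protein interaction networks with pathway-level hierarchical representations to reveal how molecular interactions drive functional processes in disease. Using graph attention, the model propagates patient-specific molecular features through a shared biological interaction network, aggregating signals from genes into higher-order biological processes. Experiments show that PPI-Net achieves strong predictive performance on multiple cancer datasets and that integrating multi-omics data improves interpretability, revealing key cancer-related signaling pathways and biological mechanisms.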

Comments 17 pages, 3 figures, 2 tables

English abstract

Understanding how molecular alterations propagate across biological systems to drive disease remains a central challenge. Although high-throughput profiling enables comprehensive characterization of tumor states, most models neglect structured biological relationships or lack interpretability across scales. Here we present PPI-Net, a hierarchical graph neural network that integrates protein-protein interaction (PPI) networks with pathway-level representations to model disease from molecular interactions to functional processes. Patient-specific molecular profiles are embedded within a shared interaction network from STRING and propagated through a multi-layer Reactome hierarchy using graph attention, enabling aggregation of gene-level signals into higher-order biological programs. Across RNA-seq data from ten cancer types from The Cancer Genome Atlas, PPI-Net achieves robust predictive performance, with balanced accuracy exceeding 90% in multiple cohorts. Comparative analysis on RNA-Seq data from breast cancer demonstrated that PPI-Net's integration of the Reactome hierarchy improved balanced accuracy by 6.7% relative to a PPI-only model, while hierarchical multi-level supervision improved balanced accuracy by 12.3% relative to using only a single top-level prediction head. Applying a multi-omics approach using RNA-seq and methylation data improves model interpretation, recovering canonical oncogenic modules, including TP53-AKT signaling and stress response pathways, while revealing convergence onto coherent programs such as ion signaling and cellular responses to stimuli. These results demonstrate that integrating interaction networks with pathway hierarchies enables accurate prediction while providing mechanistic insight into cancer biology.

2605.07746 2026-05-11 stat.ML cs.LG q-bio.QM

Flow Matching for Count Data

Ganchao Wei, John Pearson

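AI summary This paper studies generative modeling of high-dimensional count data (such as single-cell RNA sequencing and neural spike trains) and proposes count-FM, a flow-matching framework based on a continuous-time birth-death process. The method learns marginal transition rates in count space through simulation-free training, enabling efficient generation and transport between arbitrary count-distributed source and target populations. Experiments show that count-FM outperforms existing methods in sample quality, modeling efficiency, and path interpretability, and that it applies to unconditional generation, transport, and conditional generation.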

English abstract

High-dimensional count data arise in applications such as single-cell RNA sequencing and neural spike trains, where mappings between distributions across successive batches or time points form critical components of data analysis. The recent success of diffusion- and flow-based deep generative models for images, video, and text motivates extending these ideas to count-valued settings, but many existing methods either treat each count as a categorical state or transform counts into a continuous space, neither of which is natural or efficient when the count range is large. We propose count-FM, a flow-matching framework for count data based on a continuous-time birth-death process with local unit jumps. Count-FM learns marginal transitions efficiently in count space through simulation-free training of conditional transition rates, allowing transport between arbitrary count-distributed source and target populations. In simulation, count-FM achieves better sample quality than representative baselines while using substantially fewer parameters. We further apply count-FM to scRNA-seq and neural spike-train data for unconditional generation, transport, and conditional generation. Across these tasks, count-FM yields improved sample quality, greater modeling efficiency, and interpretable transport paths.
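
To make the phrase "continuous-time birth-death process with local unit jumps" concrete, here is a minimal sketch of simulating such a process with the Gillespie algorithm. It is not count-FM itself: the rate functions are fixed by hand rather than learned, and the function name `simulate_birth_death` and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_birth_death(x0, birth_rate, death_rate, t_end, seed=0):
    """Simulate a birth-death chain on the non-negative integers.

    birth_rate(x) and death_rate(x) are hypothetical, hand-picked rate
    functions (a learned model would supply them in a real system).
    Returns the list of (time, count) pairs visited before t_end.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        b = birth_rate(x)
        d = death_rate(x) if x > 0 else 0.0  # counts cannot go below zero
        total = b + d
        if total <= 0.0:
            break  # absorbing state: no further jumps possible
        t += rng.expovariate(total)  # exponential waiting time to next jump
        if t >= t_end:
            break
        # Unit jump: +1 with probability b / total, otherwise -1.
        x += 1 if rng.random() < b / total else -1
        path.append((t, x))
    return path


if __name__ == "__main__":
    # Constant birth rate and count-proportional death rate (toy values).
    trajectory = simulate_birth_death(3, lambda x: 5.0, lambda x: 0.5 * x, t_end=20.0)
    print(len(trajectory), trajectory[-1])
```

Every transition changes the count by exactly one, which is the property that keeps the state space tractable even when the count range is large.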

2605.07614 2026-05-11 math.DS q-bio.MN

Predictive-Switching Control of Stochastic Gene Regulatory Networks: A Contractive PIDE Framework

Christian Fernández, Manuel Pájaro, Gábor Szederkényi, Irene Otero-Muras

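AI summary This paper proposes a predictive switching control algorithm based on a partial integro-differential equation (PIDE) model for shaping the probability density function of stochastic gene regulatory networks. Control inputs are selected from a finite candidate set to minimize a given cost functional, and a neural network approximation of the control policy yields a hybrid control framework suited to higher-dimensional systems. The core theoretical contribution is a contraction-based stability analysis of the closed-loop PIDE dynamics, which shows that the evolution of the probability density becomes progressively independent of the initial condition and converges exponentially under strictly positive leakage terms. Numerical simulations illustrate the effectiveness and flexibility of the method.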

English abstract

This paper develops a predictive switching control algorithm for stochastic gene regulatory networks described by a Partial Integro-Differential Equation (PIDE) model, which enables direct shape control of the probability density function. Control inputs are selected from a finite candidate set to minimize a prescribed cost functional. A hybrid framework is proposed for scalability in higher-dimensional systems, using neural networks to approximate the control policy. A central theoretical contribution is a contraction-based analysis of the closed-loop PIDE dynamics. The paper establishes $L^1$-contractivity under the proposed control scheme, yielding formal stability guarantees and showing that the evolution of the probability density becomes progressively independent of the initial condition. Moreover, under strictly positive leakage terms, exponential convergence is obtained. The effectiveness and flexibility of the approach, together with the theoretical contractivity results, are illustrated through numerical simulations on three representative examples of increasing dimensionality.

2605.07608 2026-05-11 q-bio.QM q-bio.BM

GoForth: Language Models for RNA Design under Structure, Sequence, and Coding Constraints

Michael Lindsey

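AI summary This paper proposes GoForth, a language model for RNA design under structure, sequence, and coding constraints. The model treats complex inverse-design problems as conditional generation and separates three ingredients that are usually entangled: a sequence prior, a forward folding sampler, and a reward or likelihood oracle. Experiments show that GoForth generates high-quality RNA sequence candidates quickly and provides semantic embeddings of design tasks together with a learned notion of designability.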

English abstract

RNA inverse sequence design has broad biological and engineering applications, but computational methods for practical design queries remain limited. Such queries may impose several constraints at once, including target folds or motifs, fixed bases, and coding restrictions, while leaving arbitrary sequence and structure in unspecified regions. Because these constraints may permit many acceptable sequences, we study RNA design as a conditional generative modeling problem. The basic object is a conditional law over RNA sequences given a user-specified condition, with full inverse folding as a special case. We introduce GoForth, a forward-trained RNA language model that conditions on structure, sequence, and coding targets. The formulation separates three ingredients that are often entangled in RNA design: a sequence prior, a forward folding sampler, and a reward or likelihood oracle. We train encoder-decoder models on witnessed folds rather than on outputs from an inverse-design teacher and validate our methodology on full inverse-folding benchmarks, as well as tasks involving constraints on structure, sequence, and coding. The resulting models achieve fast and high-quality candidate generation for mixed RNA design specifications. Moreover they furnish useful semantic embeddings of design tasks and a robust learned notion of designability.

2605.07554 2026-05-11 cs.LG cs.AI q-bio.BM stat.ML

ProteinJEPA: Latent prediction complements protein language models

Dan Ofer, Dafna Shahaf, Michal Linial

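AI summary This paper examines whether adding latent-space prediction (JEPA) to protein language models improves performance, comparing it with conventional masked language modeling (MLM) under matched training-time budgets. Across pretrained and randomly initialized protein sequence encoders, the best design predicts latent targets only at masked positions while retaining the MLM cross-entropy loss (called masked-position MLM+JEPA). On pretrained encoders this recipe wins more downstream tasks than MLM-only continuation, whereas JEPA-only training collapses in nearly every experiment. Gains appear on tasks including protein stability prediction, enzyme classification, and fold retrieval.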

English abstract

Protein language models are trained primarily with masked language modeling (MLM), which predicts amino-acid identities at masked positions. We ask whether latent-space prediction can complement these token-level objectives under matched wall-clock budget. Across pretrained and random-init protein sequence encoders at 35-150M parameters, we find that the best protein-JEPA design is not all-position latent prediction but a variant: predicting latent targets only at masked positions, and retaining the MLM cross-entropy. We call this recipe masked-position MLM+JEPA. On a 16-task downstream suite (15 frozen linear probes plus SCOPe-40 zero-shot fold retrieval), under matched wall-clock budgets, this recipe wins more tasks than it loses against MLM-only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2-35M and 11/2/3 on ESM2-150M, while results in pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, β-lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe-40 fold retrieval. Tasks with more losses than wins are Fluorescence (TAPE) and Peptide-HLA Binding. All-position MLM+JEPA matches MLM-only overall but does not reproduce the masked-position gains. JEPA-only (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive and can outperform pure MLM in pretraining and continued training, even under matched wall-clock budgets.

2605.03169 2026-05-11 q-bio.NC

NeuralSet: A High-Performing Python Package for Neuro-AI

Jean-Rémi King, Corentin Bel, Linnea Evanson, Julien Gadonneix, Sophia Houhamdi, Jarod Lévy, Josephine Raugel, Andrea Santos Revilla, Mingfang Zhang, Julie Bonnaire, Charlotte Caucheteux, Alexandre Défossez, Théo Desbordes, Pablo Diego-Simón, Shubh Khanna, Juliette Millet, Pierre Orhan, Saarang Panchavati, Antoine Ratouchniak, Alexis Thual, Teon L. Brooks, Katelyn Begany, Yohann Benchetrit, Marlène Careil, Hubert Banville, Stéphane d'Ascoli, Simon Dahan, Jérémy Rapin

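AI summary This paper introduces NeuralSet, an efficient Python package that addresses the fragmented software ecosystem at the intersection of neuroscience and modern AI. The framework unifies the processing of diverse neural recordings (such as fMRI, EEG, and spikes) and complex experimental stimuli (such as text, audio, and video); by decoupling metadata from efficient data extraction, it integrates with pretrained deep learning models. NeuralSet provides a single PyTorch interface that scales from local development to high-performance clusters, offering scalable infrastructure for the next generation of neuro-AI research.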

English abstract

Artificial intelligence (AI) is increasingly central to understanding how the brain processes information. However, the integration of neuroscience and modern AI is bottlenecked by a fragmented software ecosystem. Current tools are siloed by recording modality and optimized for small-scale, in-memory workflows, limiting the use of massive, naturalistic datasets. Here, we introduce NeuralSet, a Python framework that efficiently unifies the processing of diverse neural recordings (including fMRI, M/EEG, and spikes) and complex experimental stimuli (such as text, audio, and video). By decoupling experimental metadata from lazy, memory-efficient data extraction, NeuralSet harmonizes standard neuroscientific preprocessing pipelines with pretrained deep learning embeddings. This approach provides a single PyTorch-ready interface that scales seamlessly from local prototyping to high-performance cluster execution. By eliminating manual data wrangling and ensuring full computational provenance, NeuralSet establishes a scalable, unified infrastructure for the next generation of neuro-AI research.

2602.03490 2026-05-11 cs.LG q-bio.NC

Path Integration and Object-Location Binding Emerge in an Action-Conditioned Predictive Sequence Network

Linda Ariel Ventura, Victoria Bosch, Tim C Kietzmann, Sushrut Thorat

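AI summary This study investigates how path integration and object-location binding arise in an action-conditioned predictive sequence network. A recurrent neural network sequentially samples tokens from continuous 2D scenes and learns a model of the environment by predicting the next token. Prediction accuracy improves over the course of a sequence, and decoding analyses reveal path integration and dynamic binding. The results show how structured representations support prediction through flexible binding, offering a mechanistic account of sequential world modeling for cognitive science.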

Comments 8 pages, 4 figures; accepted at CogSci 2026

English abstract

Adaptive cognition requires structured internal models of objects and their relations. Predictive neural networks are often proposed to learn such world models, but how these are instantiated and how they support prediction remain unclear. We investigate this in a minimal in-silico setting. A recurrent neural network samples tokens sequentially from 2D continuous token scenes and is trained to predict the upcoming token from the current input and a saccade-like displacement. On novel scenes, prediction accuracy improves across the sequence, indicating in-context learning. Decoding analyses reveal path integration and dynamic binding of token identity to position. Interventional analyses show that new bindings can be learned late in sequence and that out-of-distribution bindings can be learned as well. Together, these findings show how structured representations relying on flexible binding emerge to support prediction, offering a mechanistic account of sequential world modeling relevant to cognitive science.

2510.01808 2026-05-11 q-bio.PE physics.bio-ph

Optimization of sequential therapies to maximize extinction of resistant bacteria through collateral sensitivity

Javier Molina-Hernández, José A. Cuesta, Beatriz Pascual-Escudero, Saúl Ares, Pablo Catalán

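AI summary This study explores how to optimize sequential antibiotic therapies that exploit collateral sensitivity (CS) to maximize the extinction of resistant bacteria. Using a stochastic birth-death model with four genotypes, it reveals a nonlinear relationship between the antibiotic switching period and the probability of bacterial extinction, and proposes a predictive framework based on the geometric distribution. It also analyzes the nonmonotonic effects of antibiotic dose and mutation rate on extinction, identifies a trade-off, and provides quantitative design principles for in vitro and clinical sequential therapies.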

Comments 17 pages, 15 figures, 2 tables

Journal ref
Phys. Rev. E 113, 054404 (2026)
English abstract

Antimicrobial resistance (AMR) threatens global health. A promising and underexplored strategy to tackle this problem is sequential therapies exploiting collateral sensitivity (CS), whereby resistance to one drug increases sensitivity to another. Here, we develop a four-genotype stochastic birth-death model with two bacteriostatic antibiotics to identify switching periods that maximize bacterial extinction under subinhibitory concentrations. We show that extinction probability depends nonlinearly on switching period, with stepwise increases aligned to discrete switch events: fast sequential therapies are suboptimal as they do not allow for the evolution of resistance, a key ingredient in these therapies. A geometric distribution framework accurately predicts cumulative extinction probabilities, where the per-switch extinction probability rises with switching period. We further derive a heuristic approximation for the extinction probability based on times to fixation of single-resistant mutants. Sensitivity analyses reveal that strong reciprocal CS is required for this strategy to work, and we explore how increasing antibiotic doses and higher mutation rates modulate extinction in a nonmonotonic manner. Finally, we discuss how longer therapies maximize extinction but also cause higher resistance, leading to a Pareto front of optimal switching periods. Our results provide quantitative design principles for in vitro and clinical sequential antibiotic therapies, underscoring the potential of CS-guided regimens to suppress resistance evolution and eradicate infections.
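
The geometric-distribution idea in this abstract can be illustrated with a short calculation. The sketch below assumes, for a fixed switching period, that each switch independently leads to extinction with the same probability `p_switch`; the function name and the numerical values are hypothetical and are not taken from the paper.

```python
def cumulative_extinction(p_switch: float, n_switches: int) -> float:
    """Probability that extinction has occurred by the n-th switch.

    Under a geometric model with constant per-switch extinction
    probability p_switch, survival through n switches has probability
    (1 - p_switch) ** n, so the cumulative extinction probability is
    the complement of that.
    """
    if not 0.0 <= p_switch <= 1.0:
        raise ValueError("p_switch must lie in [0, 1]")
    if n_switches < 0:
        raise ValueError("n_switches must be non-negative")
    return 1.0 - (1.0 - p_switch) ** n_switches


if __name__ == "__main__":
    # Illustrative value only: a 10% chance of extinction at each switch.
    for n in (1, 5, 10, 20):
        print(n, round(cumulative_extinction(0.10, n), 3))
```

In this picture the stepwise increases in extinction probability line up with discrete switch events, and a longer switching period would enter through a larger `p_switch` paired with fewer switches in a therapy of fixed duration.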

2508.04056 2026-05-11 cs.RO q-bio.QM

SCOUT: Closed-Loop in-vivo System for Continuous Methane Concentration Monitoring in Cattle

Yuelin Deng, Hinayah Rojas de Oliveira, Richard M. Voyles, Upinder Kaur

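AI summary This study presents SCOUT, a closed-loop in-vivo monitoring system for continuously measuring methane concentration in the cattle rumen, addressing the trade-off between accuracy and operational feasibility in existing methods. SCOUT maintains the anaerobic ruminal environment through closed-loop gas recirculation, enabling methane concentration monitoring at high temporal resolution and revealing rapid concentration changes associated with changes in animal behavior. The system provides a reliable data basis for building concentration-to-flux models, supporting precision phenotyping, emission proxy calibration, and evaluation of mitigation strategies.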

English abstract

Enteric methane measurement from ruminant livestock faces fundamental trade-offs between accuracy and operational feasibility. Existing methods quantify methane after eructation and atmospheric dilution, limiting temporal resolution and confounding biological signals with environmental variables. We present SCOUT (Smart Cannula-mounted Optical Unit for Trace-methane), the first autonomous system for continuous in-vivo monitoring of ruminal headspace methane concentrations. The system addresses a critical engineering barrier through closed-loop gas recirculation that maintains anaerobic ruminal conditions during persistent headspace sampling. SCOUT was deployed on cannulated Simmental heifers under contrasting dietary treatments. Headspace concentrations were 100 to 1000 times higher than concurrent ambient sniffer readings, providing substantially greater signal resolution for characterizing methane dynamics. High-frequency monitoring revealed behavior-production coupling previously inaccessible, including rapid concentration changes ($14.5 \pm 11.3k$ ppm) associated with postural transitions within 15-minute intervals. Cross-platform comparison with ambient sniffers showed scale-dependent correspondence between production and release measurements, with an optimal correlation (r = -0.564) at 40-minute averaging windows consistent with eructation cycles. These results demonstrate that the rumen headspace contains continuous, biologically interpretable methane signals that SCOUT can reliably access, establishing the measurement infrastructure necessary for developing concentration-to-flux models that would support precision phenotyping, emission proxy calibration, and mitigation strategy evaluation.

2505.22134 2026-05-11 q-bio.PE

Infection dynamics for fluctuating infection or removal rates regarding the number of infected and susceptible individuals

Seong Jun Park, M. Y. Choi

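AI summary This paper studies epidemic dynamics when infection and removal rates depend nonlinearly on the numbers of infected and susceptible individuals. The authors present an analytic approach for computing how the number of infected individuals changes over time under nonlinear infection and removal rates, extending the scope of the classical SIR model. The work offers a new quantitative tool for understanding complex infection dynamics.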

English abstract

In general, the rates of infection and removal (whether through recovery or death) are nonlinear functions of the number of infected and susceptible individuals. One of the simplest models for the spread of infectious diseases is the SIR model, which categorizes individuals as susceptible, infectious, recovered or deceased. In this model, the infection rate, governing the transition from susceptible to infected individuals, is given by a linear function of both susceptible and infected populations. Similarly, the removal rate, representing the transition from infected to removed individuals, is a linear function of the number of infected individuals. While nonlinear infection and removal rates have been extensively studied in deterministic epidemiological models, analytic results for stochastic dynamics with general nonlinear rates remain limited. This work presents an analytic expression for the number of infected individuals considering nonlinear infection and removal rates. In particular, we examine how the number of infected individuals varies as cases emerge and obtain the expression accounting for the number of infected individuals at each moment. This work paves the way for new quantitative approaches to understanding infection dynamics.
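
For reference, the baseline that this abstract generalizes can be sketched as a stochastic SIR simulation. The code below uses the standard mass-action form, with infection rate `beta * S * I` and removal rate `gamma * I`; it does not implement the paper's analytic result for nonlinear rates, and the function name and parameter values are assumptions for illustration.

```python
import random


def gillespie_sir(s, i, beta, gamma, seed=0):
    """Simulate a stochastic SIR epidemic until no infected remain.

    Infection events occur at rate beta * S * I and removal events at
    rate gamma * I (the classical linear-rate SIR assumptions).
    Returns a list of (time, S, I, R) tuples, one per event.
    """
    rng = random.Random(seed)
    t, r = 0.0, 0
    path = [(t, s, i, r)]
    while i > 0:
        infection_rate = beta * s * i
        removal_rate = gamma * i
        total = infection_rate + removal_rate
        t += rng.expovariate(total)  # waiting time to the next event
        if rng.random() < infection_rate / total:
            s, i = s - 1, i + 1      # susceptible becomes infected
        else:
            i, r = i - 1, r + 1      # infected is removed
        path.append((t, s, i, r))
    return path


if __name__ == "__main__":
    events = gillespie_sir(s=50, i=1, beta=0.02, gamma=0.5)
    print("final state:", events[-1])
```

Replacing the two rate expressions with nonlinear functions of `s` and `i` gives the general setting the abstract describes, for which closed-form results are harder to obtain.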

2504.16559 2026-05-11 cs.LG q-bio.QM

Synergistic Benefits of Joint Molecule Generation and Property Prediction

Adam Izdebski, Jan Olszewski, Pankhil Gawade, Krzysztof Koras, Serra Korkmaz, Valentin Rauscher, Jakub M. Tomczak, Ewa Szczurek

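AI summary This study examines the synergistic benefits of joint molecule generation and property prediction and proposes Hyformer, a transformer-based joint model. Using an alternating attention mechanism and a joint pre-training scheme, the model blends molecule generation and property prediction and shows synergistic benefits in conditional sampling, out-of-distribution property prediction, and representation learning. Experiments demonstrate the benefits of joint learning in drug-design tasks such as antimicrobial peptide design.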

Comments 17 pages, 4 figures

Journal ref
Transactions on Machine Learning Research (TMLR), 2026
English abstract

Modeling the joint distribution of data samples and their properties makes it possible to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial peptides.

2311.08433 2026-05-11 q-bio.QM cs.LG stat.AP

Clinical Characteristics and Laboratory Biomarkers in ICU-admitted Septic Patients with and without Bacteremia

Sangwon Baek, Seung Jun Lee

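AI summary This study evaluates the predictive value of clinical characteristics and laboratory biomarkers for bacteremia among septic patients in the intensive care unit. In a retrospective analysis of clinical data from 218 patients, C-reactive protein (CRP) and procalcitonin (PCT) each showed good predictive ability for bacteremia, while a multivariable logistic regression model combining PCT, bilirubin, neutrophil-to-lymphocyte ratio (NLR), platelets, lactic acid, erythrocyte sedimentation rate (ESR), and Glasgow Coma Scale (GCS) score substantially improved predictive accuracy, reaching an AUC of 0.907. The study also found a significant association between bacteremia and mortality, suggesting that these biomarkers have value for clinical diagnosis and prognosis.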

Comments This research is not complete

English abstract

Few studies have investigated the diagnostic utility of biomarkers for predicting bacteremia among septic patients admitted to intensive care units (ICU). This study therefore evaluated the predictive power of laboratory biomarkers in order to select high-performing markers and optimize a predictive model for bacteremia. This retrospective cross-sectional study was conducted at the ICU department of Gyeongsang National University Changwon Hospital in 2019. Adult patients meeting the Sepsis-3 criteria (an increase in sequential organ failure score greater than or equal to 2) with at least two sets of blood cultures were selected. The collected data were first analyzed independently to identify significant predictors, which were then used to build the multivariable logistic regression (MLR) model. A total of 218 patients with 48 cases of true bacteremia were analyzed in this research. Both CRP and PCT showed a substantial area under the curve (AUC) value for discriminating bacteremia among septic patients (0.757 and 0.845, respectively). To further enhance the predictive accuracy, we combined PCT, bilirubin, neutrophil lymphocyte ratio (NLR), platelets, lactic acid, erythrocyte sedimentation rate (ESR), and Glasgow Coma Scale (GCS) score to build the predictive model with an AUC of 0.907 (95% CI, 0.843 to 0.956). In addition, a high association between bacteremia and mortality rate was discovered through the survival analysis (0.004). While PCT is certainly a useful index for distinguishing patients with and without bacteremia by itself, our MLR model indicates that the accuracy of bacteremia prediction substantially improves by the combined use of PCT, bilirubin, NLR, platelets, lactic acid, ESR, and GCS score.
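
The AUC values quoted in this abstract have a simple interpretation: the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The scores are invented toy numbers, not data from the study, and the function name `roc_auc` is an assumption.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    case scores higher; a tie contributes one half.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    # Toy biomarker scores (invented): patients with and without bacteremia.
    with_bacteremia = [0.9, 0.4]
    without_bacteremia = [0.5, 0.1]
    print(roc_auc(with_bacteremia, without_bacteremia))
```

An AUC of 0.5 corresponds to a marker that does no better than chance, and 1.0 to one that separates the two groups perfectly.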

2605.07498 2026-05-11 q-bio.PE cs.CY

Modeling the Impact of Exposed Cases in a Hantavirus Outbreak on a Cruise Ship

Jiaming Cui

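AI summary This paper studies how hidden exposed cases affected transmission in a hantavirus outbreak aboard a cruise ship, building a discrete-time stochastic SEIRD model to estimate transmission dynamics, hidden exposed infections, and outbreak risk. Using a Kalman filter method calibrated to outbreak data from WHO and ECDC, it estimates a basic reproduction number of 2.76, indicating potential for sustained transmission before strict quarantine measures. The study also notes that undetected exposed individuals may be present early in the outbreak and are hard to identify through symptom-based surveillance alone, underscoring the importance of rapid surveillance, widespread testing, and targeted quarantine in confined travel settings.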

English abstract

The emergence of a hantavirus variant aboard a commercial cruise ship presents a significant public health concern. This study develops a discrete-time stochastic Susceptible-Exposed-Infectious-Recovered-Dead model to estimate transmission dynamics, hidden exposed infections, and outbreak risk among passengers and crew. Epidemiological parameters and latent disease states were inferred using an Ensemble Adjustment Kalman Filter calibrated to reported case data from WHO and ECDC situation reports. The estimated basic reproduction number was 2.76, with a 95% confidence interval of 2.52-2.99, indicating substantial potential for sustained onboard transmission before strict quarantine measures. Simulations further suggest that several exposed individuals may remain unidentified during the early outbreak phase, creating a hidden reservoir that symptom-based surveillance alone may fail to detect. These findings highlight the importance of rapid surveillance, widespread testing, targeted quarantine, and active monitoring of exposed individuals in confined travel settings. The proposed modeling framework can support timely outbreak assessment and intervention planning for infectious-disease events in similarly dense and spatially constrained populations.
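
A discrete-time stochastic SEIRD model of the kind named in this abstract can be sketched with binomial transitions between compartments. The version below is a generic textbook-style construction, not the paper's calibrated model: it omits the Ensemble Adjustment Kalman Filter entirely, and the function names, the daily time step, and every parameter value are assumptions for illustration.

```python
import math
import random


def binomial(rng, n, p):
    """Draw from Binomial(n, p) as a sum of Bernoulli trials (stdlib only)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def seird_step(state, beta, sigma, gamma, mu, rng):
    """Advance (S, E, I, R, D) by one day with binomial transitions.

    beta: transmission rate, sigma: E->I rate, gamma: rate of leaving I,
    mu: probability that a case leaving I dies. All are hypothetical.
    """
    s, e, i, r, d = state
    alive = s + e + i + r
    p_exposed = 1.0 - math.exp(-beta * i / alive) if alive > 0 else 0.0
    new_e = binomial(rng, s, p_exposed)                # S -> E
    new_i = binomial(rng, e, 1.0 - math.exp(-sigma))   # E -> I
    exits = binomial(rng, i, 1.0 - math.exp(-gamma))   # I -> R or D
    deaths = binomial(rng, exits, mu)
    return (s - new_e, e + new_e - new_i, i + new_i - exits,
            r + exits - deaths, d + deaths)


def simulate(state, days, beta, sigma, gamma, mu, seed=0):
    """Run the model for a number of days and return the state history."""
    rng = random.Random(seed)
    history = [state]
    for _ in range(days):
        state = seird_step(state, beta, sigma, gamma, mu, rng)
        history.append(state)
    return history


if __name__ == "__main__":
    # Toy initial condition and parameters, not estimates from the paper.
    run = simulate((140, 5, 2, 0, 0), days=30, beta=0.6, sigma=0.2, gamma=0.15, mu=0.3)
    print(run[-1])
```

The exposed compartment `E` is the "hidden reservoir" in this picture: individuals there are infected but not yet symptomatic, so symptom-based case counts observe only the flow out of `E`.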

2605.07439 2026-05-11 q-bio.BM

CA-DEL: An Open Multi-Target, Multi-Modal Benchmark for Learning from DNA-Encoded Library Screens

Mutian He, Hanqun Cao, Cheng Tan, Zijun Gao, Xiaojun Yao, Chunbin Gu, Pheng-Ann Heng

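AI summary This paper introduces CA-DEL, an open multi-target, multi-modal benchmark dataset for learning molecule-target relationships from DNA-encoded library screens. The dataset focuses on the selectivity problem among homologous carbonic anhydrase isoforms (CAII, CAIX, CAXII) and, by integrating experimentally determined binding affinities ($K_i$), establishes a Sim-to-Real evaluation paradigm that goes from noisy screening data to high-fidelity biophysical data, supporting the development of robust drug discovery models.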

English abstract

The success of machine learning in drug discovery hinges on learning the relationship between a chemical structure and its biological activity. While DNA-Encoded Library (DEL) technology can generate the massive datasets required for this task, its primary signal -- sequencing read counts -- is an indirect and often noisy proxy for true molecular binding affinity. To address the scarcity of public benchmarks for developing robust models that can overcome this data challenge, we introduce CA-DEL, a multi-dimensional public benchmark featuring screens against three homologous carbonic anhydrase isoforms. While recent benchmarks like KinDEL have introduced 3D poses for kinase targets, CA-DEL distinguishes itself by focusing on the selectivity challenge among homologous Carbonic Anhydrase isoforms (CAII, CAIX, CAXII). Unlike benchmarks relying solely on noisy enrichment scores, CA-DEL integrates a rigorous validation set of experimentally determined binding affinities ($K_i$) from ChEMBL, establishing a critical Sim-to-Real evaluation paradigm: training on noisy DEL screens and testing on high-fidelity biophysical data.

2605.07035 2026-05-11 q-bio.OT cs.ET

Genetic Information as a "Chord" of Chemical Oscillations: Emergence of Catalyst-RNA Systems Driven by Superposed Rhythms

Takeshi Ishida

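AI summary This paper addresses a central question in the origin of life, namely how catalytic peptides and information-bearing nucleic acids came to form an interdependent system, and proposes a primordial cognitive model built on two internal Lotka-Volterra chemical oscillators. By simulating interactions among polymers represented as binary sequences, the study shows a mechanism by which catalytic loops, primordial tRNAs, and nucleic acids that record and amplify them could form. The model indicates that internal oscillations can provide a temporal bias for sequence selection during polymer elongation and can promote the accumulation of functional molecules and the joint emergence of catalytic function and information storage.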

English abstract

A central challenge in the origin of life is understanding how catalytic peptide-like polymers and information-bearing nucleic acid-like polymers emerged as an interdependent system. This study constructs a primordial cognitive model incorporating two internal Lotka-Volterra chemical oscillators to investigate, through simulation, whether a catalytic loop, primordial tRNAs, and nucleic acids that record and amplify them, can form through the interaction of polymers represented by binary (0/1) sequences. In this model, a mechanism was introduced where the synthesis of internal oscillations provides a temporal bias for 0/1 selection during polymer elongation, while generated functional sequences are protected, recorded, and re-amplified. Simulation results demonstrated that the proposed cognitive model significantly outperformed a contrast model based on random 0/1 selection in terms of the establishment rate of catalytic loops, the accumulation of functional molecules, polymer elongation, and the reduction of Shannon entropy in sequence distribution. Furthermore, this superiority was generally maintained across sensitivity analyses, including batch calculations with different random seeds. While this study is a computational model based on abstract binary sequences and simplified translation/replication rules rather than a direct reconstruction of life's origin, it provides a working hypothesis for the interdependent emergence of catalytic function and information retention by demonstrating that internal oscillations can bias sequence exploration within a framework linking autocatalytic networks, recording, and group selection. Future research must verify the generality and empirical validity of this framework by expanding monomer types, evolving into multi-oscillator systems, and establishing correspondences with compartmentalized experimental systems.

2605.07028 2026-05-11 q-bio.PE physics.soc-ph

Quo nomine vis vocari? A random-copying model explains the temporal sequence of papal names

Egor Lappo, Noah A. Rosenberg

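AI summary This paper studies the choice of papal names, a cultural-evolutionary process spanning millennia, and finds that it is consistent with a random-copying mechanism like those used in population genetics. Although each pope weighs many factors in choosing a name, the long-term usage frequencies of papal names accord with the predictions of models such as Ewens sampling theory and the Chinese restaurant process: names are copied at random in proportion to their frequency, with new names allowed to appear. The finding suggests that complex human cultural behavior may follow simple, general laws.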

Comments 12 pages, 3 figures, 1 table

English abstract

The study of cultural evolution seeks to understand the processes by which behavioral variants are chosen in cultures over time, often as the result of large numbers of individual human choices. The selection of new popes, each of whom chooses a papal name -- typically reusing previous names in reference to previous popes -- is among the longest ongoing cultural processes taking place in a single human institution. Here, we use the record of papal names as a setting for long-term analysis of human cultural behavior. Although papal name choices are careful individual decisions, we find that the long-term sequence of papal names accords with predictions of a family of models developed in population genetics and stochastic processes -- Ewens sampling theory and the Chinese restaurant process -- which in the case of papal names amounts to randomly copying an existing name in proportion to its frequency, with the possibility of innovation of new names (mutations). Hence, despite the consideration that enters into choices of individual papal names, aggregate cultural behavior in a 2000-year-old human process can potentially be described with simple laws. We discuss instances in which particular historical events might have caused temporary deviations from the random-copying model.
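
The random-copying mechanism described here corresponds to the Chinese restaurant process, which can be simulated in a few lines. In the sketch below, the innovation parameter `theta` and the sequence length are arbitrary illustrative values, not estimates from the papal record, and the function name `simulate_names` is an assumption.

```python
import random
from collections import Counter


def simulate_names(n_choices, theta, seed=0):
    """Generate a sequence of name labels by random copying with innovation.

    The (i+1)-th chooser invents a new name with probability
    theta / (theta + i); otherwise they copy the name of a uniformly
    chosen predecessor, which copies each existing name in proportion
    to its current frequency. Requires theta > 0.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    rng = random.Random(seed)
    names = []
    next_label = 0
    for i in range(n_choices):
        if rng.random() < theta / (theta + i):
            names.append(next_label)  # innovation: a name never used before
            next_label += 1
        else:
            names.append(rng.choice(names))  # frequency-proportional copy
    return names


if __name__ == "__main__":
    sequence = simulate_names(n_choices=266, theta=20.0)
    counts = Counter(sequence)
    print("distinct names:", len(counts))
    print("most common:", counts.most_common(3))
```

Larger values of `theta` produce more distinct names and a flatter frequency distribution, while smaller values concentrate choices on a few popular names.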

2605.07026 2026-05-11 q-bio.NC cs.AI cs.LG

Learning Cross-Atlas Consistent Brain Disorder Representations via Disentangled Multi-Atlas Functional Connectivity Learning

Minheng Chen, Chao Cao, Jing Zhang, Tianming Liu, Dajiang Zhu

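AI summary: This study addresses the inconsistency of functional connectivity (FC) representations across brain atlases and proposes a multi-atlas disentangled connectivity learning framework (MADCLE). The method jointly encodes FC matrices from different atlases, learns atlas-specific disease-related representations, and promotes cross-atlas consistency through distributional alignment. Covariate supervision, atlas-specific reconstruction, and decorrelation constraints separate out covariate-related and atlas-dependent residual factors, reducing the leakage of non-disease information into the disease embeddings. Experiments on the ADNI and ADHD-200 datasets suggest that MADCLE matches or improves on several existing methods, supporting its use for FC-based disorder identification under heterogeneous atlas schemes.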

Details
English abstract

Functional connectivity (FC) derived from resting-state fMRI is widely used to characterize large-scale brain network alterations in neurological and psychiatric disorders. However, FC construction critically depends on the choice of brain atlas, and different parcellations may emphasize distinct organizational features, leading to heterogeneous and sometimes inconsistent representations. Existing multi-atlas approaches partially alleviate this issue but often fuse atlas-derived features or predictions at a relatively shallow level, while single-atlas disentanglement methods do not explicitly address cross-atlas heterogeneity. We propose Multi-Atlas Disentangled Connectivity LEarning (MADCLE), a multi-branch representation learning framework that jointly encodes FC matrices derived from different brain atlases. Rather than introducing a single explicitly shared latent variable across parcellations, MADCLE learns atlas-wise disease-related representations and encourages them to be cross-atlas consistent through distributional alignment. Meanwhile, covariate-related and atlas-dependent residual factors are modeled separately using covariate similarity supervision, atlas-specific reconstruction, and decorrelation constraints, thereby reducing the leakage of non-disease and parcellation-dependent information into the disease-related embeddings. Experiments on the ADNI and ADHD-200 datasets suggest that MADCLE achieves competitive or improved performance compared with single-atlas baselines, multi-atlas GNN/Transformer models, and recent multi-atlas consistency frameworks. These results support the potential value of structured disentanglement for FC-based disorder identification under heterogeneous parcellation schemes.

2605.07007 2026-05-11 q-bio.CB physics.bio-ph q-bio.PE

Essential Role of Extrinsic Noise in Models of E. coli Division Control

Mattia Corigliano, Kuheli Biswas, Matteo Bocchiola, Daniele Montagnani, Ariel Amir, Marco Cosentino Lagomarsino

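AI summary: This paper examines the key role of extrinsic noise in E. coli division control. By analytically solving a stochastic threshold-accumulation model, it describes a mechanism in which a divisor protein triggers division on reaching a noisy threshold. The study quantifies the combined effects of intrinsic and extrinsic noise and of key mechanistic parameters, showing that including these elements yields division strategies richer than the conventional adder model and can account for experimentally observed cell-size fluctuations. The work offers a unified theoretical framework for understanding bacterial division laws.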

Comments 20 pages, 5 figures, 1 table

Details
English abstract

Our understanding of cell division control in bacteria still relies largely on interpreting correlations between phenomenological variables, with limited connection to the underlying molecular mechanisms. Here, we analytically solve a stochastic threshold-accumulation model in which a size-dependent divisor protein triggers division upon reaching a noisy, autocorrelated threshold, quantifying within a unified framework the combined effects of intrinsic and extrinsic noise and key mechanistic parameters such as protein reset and threshold memory. We show that incorporating these elements yields behavior far richer than the commonly assumed adder, spanning a continuum of division strategies from timer to sizer while modulating size fluctuations in a nontrivial fashion. Comparison with single-cell E. coli data shows that extrinsic noise and additional mechanistic ingredients are required to account for the observed size fluctuations. The adder emerges when threshold correlations balance protein reset, generalizing the hypothesis that full reset is necessary to maintain adder control. Our results establish a unified analytical framework linking stochastic molecular processes to emergent division laws, to be used in more complex bacterial cell-cycle models.

2605.06995 2026-05-11 q-bio.QM q-bio.NC

Partitioning Neural Co-Variability

Skyler Thomas, Brandon J. Zhu, Kathleen E. Cullen, Adam S. Charles

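AI summary: This study addresses trial-to-trial variability in neural population responses and proposes a new model, the Poisson matrix-normal latent variable (PMNLV) model, to capture structured gain co-variability across a population. Using a matrix-normal prior and a quadratic soft-rectifying link function, the model extends single-neuron overdispersion models and estimates both inter-neuron covariance and temporal correlations. The authors also develop two complementary estimation algorithms and validate the model on simulated data and on recordings from mouse visual cortex, revealing pronounced population co-variability in primary visual cortex.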

Details
English abstract

Trial-to-trial variability of neural responses has been linked to important aspects of neural computation and is essential for understanding how neuronal populations respond. While current overdispersion models treat each neuron's gain as independent of the others, this assumption fails to capture the network statistics of neuronal populations. As no existing model can capture overdispersed structured spiking gain-modulation across a neural population, network-level gain covariance remains largely unstudied. We thus present the Poisson matrix-normal latent variable (PMNLV) model, which extends single-neuron overdispersion to neural populations by placing a matrix-normal prior over the latent gain with a Kronecker-factored covariance. Spike counts are Poisson-distributed with a rate equal to the sum of a per-neuron stimulus tuning term and a matrix-normal gain, passed through a quadratic soft-rectifying link. We derive two complementary estimation algorithms: a variational EM (VEM) with a matrix-normal posterior that recovers dense Kronecker factors without structural assumptions, and a Kernel Tournament Method (KTM) that performs data-driven selection over a biologically motivated kernel dictionary and composite likelihood. On simulated data, both algorithms recover the inter-neuron and temporal covariance factors and accurate tuning curves. Applying VEM to Neuropixel recordings across four cortical regions of mouse visual hierarchy, we replicate a previous finding that single-neuron marginal variability changes little across cortical areas. We then show that shared population co-variability, invisible to scalar summaries such as the Fano factor, peaks in primary visual cortex and declines in higher visual areas. The PMNLV framework is applicable to any simultaneously recorded population where structured gain covariance is of scientific interest.
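
To see why a per-neuron scalar summary can miss shared co-variability, consider a small hand-made example. It is not the PMNLV model and uses no real data: the spike counts and variable names below are invented for illustration. One neuron's counts are reordered across trials, which leaves its Fano factor unchanged but removes its covariance with a second neuron.

```python
from statistics import mean, variance


def fano_factor(counts):
    """Sample variance of the spike counts divided by their mean."""
    return variance(counts) / mean(counts)


def covariance(x, y):
    """Unbiased sample covariance of two equal-length count lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


# Invented per-trial spike counts (four trials).
neuron_a = [1, 3, 5, 7]
neuron_b_shared = [2, 4, 6, 8]    # rises and falls together with neuron_a
neuron_b_shuffled = [6, 2, 8, 4]  # same counts, different trial order

# The marginal summary is identical for both versions of neuron b ...
fano_shared = fano_factor(neuron_b_shared)
fano_shuffled = fano_factor(neuron_b_shuffled)

# ... but the trial-by-trial covariance with neuron_a is not.
cov_shared = covariance(neuron_a, neuron_b_shared)
cov_shuffled = covariance(neuron_a, neuron_b_shuffled)

print(fano_shared, fano_shuffled, cov_shared, cov_shuffled)
```

Any statistic computed one neuron at a time is invariant to this kind of reordering, which is why population-level structure calls for a joint model.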

2605.06879 2026-05-11 cs.LG q-bio.QM

Better Protein Function Prediction by Modeling Survivorship Bias

Zhongmou Chao, Poompol Buathong, Ekaterina Selivanovitch, Susan Daniel, Peter I. Frazier

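AI summary: This study targets the survivorship bias that natural selection introduces into protein function prediction and proposes Evo-PU, an evolution-informed positive-unlabeled learning framework. By modeling how observable a sequence is under evolution, the method distinguishes sequences that are unobserved because they are non-functional from those that are absent because their mutational paths are rare, improving prediction accuracy. Evo-PU outperforms standard PU learning, one-class classification, and protein language models on three single-organism prediction tasks; on multi-organism ProteinGym datasets, the authors identify opportunities to generalize the approach.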

Comments 29 pages, 12 figures, 3 tables

Details
English abstract

Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non-functional protein mutations are eliminated by natural selection. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone. While positive-unlabeled (PU) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias. Consider a sequence that is one mutation away from a commonly observed protein variant in a well-surveilled organism. If the sequence were functional, it would likely be observed. If it is not observed, this suggests non-functionality. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose. Thus, these two kinds of missing sequences should be treated differently when training models. In this work, we propose Evo-PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well-surveilled single-organism sequence data. On three prediction tasks using single-organism uniform-coverage surveillance data -- predicting results from held-out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS-CoV-2 variants -- Evo-PU outperforms standard PU learning, one-class classification (OCC), and protein language models (PLMs). On prediction tasks from multi-organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach.

2605.06762 2026-05-11 q-bio.GN cs.AI

A Linear-Transformer Hybrid for SNP-Based Genotype-to-Phenotype Prediction in Grapevine

Yibin Wang, Murukarthick Jayakodi, Silvas Kirubakaran, Ambika Chandra, Azlan Zahid

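AI summary: This paper proposes LiT-G2P, a hybrid of a linear model and a Transformer architecture, for SNP-based genotype-to-phenotype prediction in grapevine. By combining additive genetic effects with nonlinear gene interactions, the method improves the stability and accuracy of complex-trait prediction across years. LiT-G2P outperforms baseline models in both single-year and cross-year prediction, notably for leaf hair density and trichome density, and uses attention weights to extract key SNP loci, providing interpretable candidate markers for downstream validation.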

Comments 15 pages, 4 Figures

Details
English abstract

Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide polymorphism (SNP) data. We evaluated LiT-G2P on a panel of diverse grape accessions, genotyped with SNP markers and measured for phenotypes across two consecutive years. Target phenotypic traits include leaf hair density and trichome density of grapevines. Across both single-year and cross-year testing scenarios, LiT-G2P consistently improves prediction performance compared with baseline models. For hair density, LiT-G2P achieves the lowest error in both single-year and cross-year evaluations, with RMSEs of 0.469 and 0.454, respectively, while maintaining strong tolerance accuracies of 79.2% and 74.6%, respectively. For trichome density, LiT-G2P also presents the best overall G2P performance. In addition, we extract model-prioritized SNPs from attention weights and apply genotype-stratified analysis to provide interpretable candidate markers for downstream validation. These results demonstrate that integrating stable additive effects with learned interaction patterns can enhance cross-year robustness and support practical SNP-based predictive modeling for genomic selection.

2605.06728 2026-05-11 q-bio.GN cs.AI q-bio.CB

OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning

Maciej Sypetkowski, Joanna Krawczyk, Łukasz Smoliński, Remigiusz Kinas, Przemysław Pietrzak, Tomasz Jetka, Rafał Powalski

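AI summary: OmicsLM is a multimodal large language model for multi-sample omics reasoning that connects quantitative omics data with natural-language biological tasks. It represents each transcriptomic profile as a compact continuous vector and processes natural-language instructions, gene names, and multiple samples in a single context, enabling language-guided multi-sample reasoning. The study also introduces the GEO-OmicsQA benchmark for evaluating multi-sample biological question answering over real expression profiles, and shows that OmicsLM outperforms existing specialized models and general-purpose LLMs on language-guided biological reasoning tasks.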

Comments 13 pages (main text), 14 pages (appendix), 1 figure, 10 tables

Details
English abstract

Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context. We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.

2602.09034 2026-05-11 q-bio.NC cs.AI

Latent-Space Causal Discovery from Indirect Neuroimaging Observations

Sangyoon Bae, Miruna Oprescu, David Keetae Park, Shinjae Yoo, Jiook Cha

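AI summary: This study aims to discover causal relationships in latent space from indirect neuroimaging observations, where hemodynamics and volume conduction distort the signals. It formalizes a conditional setting based on modality physics and nonstationary latent dynamics and derives a bound on inversion-error propagation. Building on this, the authors design INCAMA, which couples physics-aware inversion with a delay-aware Mamba encoder and uses mechanism shifts to improve directed-graph estimation. In TVB simulations, INCAMA improves directed-structure recovery by 2-3x in F1 over baselines; on HCP motor-task fMRI, it yields sparse directed estimates concentrated in canonical visuo-motor pathways.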

Comments 9 pages, 2 figures

Details
English abstract

Neuroimaging does not observe causal variables directly: hemodynamics and volume conduction distort signals so that statistical dependence need not reflect latent neural influence. Before estimating graphs, one must specify under what assumptions delayed directed structure can be studied from such indirect observations. We formalize a conditional setting (recoverable inversion under modality physics together with nonstationary latent dynamics) and derive an inversion-error propagation bound under explicit assumptions. Building on this framing, we propose INCAMA (INdirect CAusal MAmba): physics-aware inversion coupled with a delay-aware Mamba encoder that uses mechanism shifts as informative variation for directed graph scoring. We use controlled simulations for quantitative validation and HCP motor-task fMRI as a zero-shot external transfer check based on anatomical and task-network consistency. Across TVB simulations, INCAMA improves directed-structure recovery by 2-3x in F1 over observation-space and two-stage baselines, and on HCP motor-task fMRI it produces sparse directed estimates concentrated in canonical visuo-motor pathways.

2602.02320 2026-05-11 cs.CL cs.AI q-bio.BM

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai, Guijuan He, Yi Hu, Jingjing Wang, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo

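AI summary: This study proposes a rule-based automated annotation framework that generates natural-language descriptions preserving complete molecular structure information, addressing the difficulty of building large-scale, high-quality datasets of molecular structure-language pairs. It extends a chemical nomenclature parser to produce structured XML metadata that guides a large language model in generating precise descriptions, yielding a dataset of about 163k molecule-description pairs with a validated description precision of 98.6%. The dataset provides a reliable foundation for molecule-language alignment and applies to a range of chemical tasks.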

Details
English abstract

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular descriptions that preserve complete structural details at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structural XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%. The proposed annotation framework is readily beneficial to broader chemical tasks that rely on structural descriptions, with the resulting dataset providing a reliable foundation for molecule-language alignment. The source code and dataset are hosted at https://github.com/TheLuoFengLab/MolLangData and https://huggingface.co/datasets/ChemFM/MolLangData, respectively.

2601.17061 2026-05-11 q-bio.PE cs.NE nlin.AO

How Information Evolves: Stability-Driven Assembly and the Emergence of a Natural Genetic Algorithm

Dan Adler

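AI summary: This study examines how information evolves under non-equilibrium dynamics and proposes the Stability-Driven Assembly (SDA) framework, in which stochastic assembly and differential persistence bias a system toward longer-lived structural motifs, realizing a natural evolutionary process akin to a genetic algorithm. Applying SDA to chemical symbol space, the study shows hallmark features of evolutionary search, such as scaffold-level dominance, sustained novelty, and entropy reduction, yielding open-ended dynamics absent from equilibrium models with fixed transition rates. It also proposes an "evolutionary ladder" hypothesis in which persistence-driven selection precedes genetic replication.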

Comments 39 pages, 13 figures

Details
English abstract

Information can evolve as a physical consequence of non-equilibrium dynamics, even in the absence of genes, replication, or predefined fitness functions. We present Stability-Driven Assembly (SDA), a framework in which stochastic assembly combined with differential persistence biases populations toward longer-lived motifs. Assemblies that persist longer become more frequent and are therefore more likely to participate in subsequent interactions, generating feedback that reshapes the population distribution and implements fitness-proportional sampling, realizing evolution as a natural, emergent genetic algorithm (SDA/GA) driven solely by stability. We apply SDA/GA to chemical symbol space using SMILES fragments with recombination, mutation, and a heuristic stability function. Simulations show hallmark features of evolutionary search, including scaffold-level dominance, sustained novelty, and entropy reduction, yielding open-ended dynamics absent from equilibrium models with fixed transition rates. These results motivate an evolutionary ladder hypothesis where persistence-driven selection precedes genetic replication.
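
The feedback described in this abstract, where longer-lived assemblies become more frequent and are therefore sampled more often, can be sketched as a deterministic mean-field toy. This is not the paper's SDA/GA simulation: it has no SMILES fragments, recombination, or mutation, and the motif names, persistence values, and the `sda_step` function are assumptions made for illustration.

```python
import math


def sda_step(freqs, persistence):
    """One mean-field update: weight each motif by its persistence, then renormalize.

    This is fitness-proportional reweighting with persistence playing the
    role of fitness (an illustrative simplification).
    """
    weighted = {m: freqs[m] * persistence[m] for m in freqs}
    total = sum(weighted.values())
    return {m: w / total for m, w in weighted.items()}


def shannon_entropy(freqs):
    """Shannon entropy in bits of a frequency distribution."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


# Hypothetical per-step survival probabilities for three motifs.
persistence = {"motif_a": 0.9, "motif_b": 0.6, "motif_c": 0.3}
freqs = {m: 1 / 3 for m in persistence}  # start from a uniform population

initial_entropy = shannon_entropy(freqs)
for _ in range(20):
    freqs = sda_step(freqs, persistence)
final_entropy = shannon_entropy(freqs)

print(freqs, initial_entropy, final_entropy)
```

In this toy the most persistent motif takes over and the entropy of the distribution falls. Because there is no source of new motifs, it cannot show the sustained novelty the abstract reports.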

2512.17129 2026-05-11 cs.LG cs.MA cs.RO q-bio.QM

DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

Seong Ho Pahng, Guoye Guan, Benjamin Fefferman, Sahand Hormoz

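AI summary: This paper presents DiffeoMorph, an end-to-end differentiable framework for learning a morphogenesis protocol that guides a population of agents from an initial state to a target 3D shape. Built on an SE(3)-equivariant graph neural network, it lets each agent update its position and internal state from its own state and the signals it exchanges with other agents. The study introduces a shape-matching loss based on 3D Zernike polynomials that compares predicted and target shapes as continuous spatial distributions and is invariant to agent ordering, agent count, and global orientation while remaining sensitive to reflection. Experiments show that DiffeoMorph can produce complex 3D structures from simple initial conditions, offering a general framework for learning distributed control strategies in morphogenesis, swarm robotics, and programmable self-assembly.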

Details
English abstract

Biological systems can form complex three-dimensional structures through the collective behavior of agents that share a common update rule and operate without central control. How such distributed control gives rise to precise global patterns remains a central question not only in developmental biology but also in distributed robotics, programmable matter, and multi-agent learning. Here, we introduce DiffeoMorph, an end-to-end differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape. Each agent updates its position and internal state using an SE(3)-equivariant graph neural network, based on its own internal state and signals received from other agents. To train this system, we introduce a new shape-matching loss based on 3D Zernike polynomials, which compares the predicted and target shapes as continuous spatial distributions, not as discrete point clouds, and is invariant to agent ordering, number of agents, and global orientation. To achieve rotation invariance while preserving reflection sensitivity, we include an alignment step that optimally rotates the predicted Zernike spectrum to match the target before computing the loss. We perform benchmarking to establish the advantages of our shape-matching loss over other standard distance metrics for shape comparison tasks. We then demonstrate that DiffeoMorph can form a range of complex shapes from minimally patterned initial conditions. DiffeoMorph provides a general framework for learning distributed control strategies for morphogenesis, swarm robotics, and programmable self-assembly.

2510.24736 2026-05-11 q-bio.QM cs.LG q-bio.BM

RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics

Danqi Liao, Chen Liu, Xingzhi Sun, Dié Tang, Haochen Wang, Scott Youlten, Srikar Krishna Gopinath, Haejeong Lee, Ethan C. Strayer, Antonio J. Giraldez, Smita Krishnaswamy

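AI summary: RNAGenScape is an mRNA sequence generation framework based on manifold Langevin dynamics that aims to generate optimized mRNA sequences with specified biological properties. It learns a latent manifold of real data and performs constrained optimization on that manifold to keep generated sequences biologically viable and functional. Combining an autoencoder, a property predictor, and property-guided optimization, it substantially improves the performance metrics of generated sequences while maintaining competitive generation efficiency.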

Comments ICML 2025 Generative AI and Biology (GenBio) Workshop, Oral presentation

Details
English abstract

Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.
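
The loop this abstract describes (take a property-guided Langevin step, then project back onto the data manifold) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the unit circle stands in for the learned manifold, normalization stands in for the denoising autoencoder, and the property is simply the first coordinate. None of it is the RNAGenScape implementation.

```python
import math
import random


def project_to_circle(z):
    """Toy stand-in for the denoising projection: snap a 2-D point onto the unit circle."""
    norm = math.hypot(z[0], z[1])
    if norm == 0.0:
        raise ValueError("cannot project the origin onto the circle")
    return (z[0] / norm, z[1] / norm)


def property_gradient(z):
    """Gradient of the toy property f(z) = z[0] (stand-in for a property predictor)."""
    return (1.0, 0.0)


def guided_langevin(z, n_steps, step_size, noise_scale, seed=0):
    """Alternate a noisy gradient-ascent step with projection back onto the manifold."""
    rng = random.Random(seed)
    sigma = noise_scale * math.sqrt(2.0 * step_size)
    for _ in range(n_steps):
        g = property_gradient(z)
        proposal = (
            z[0] + step_size * g[0] + sigma * rng.gauss(0.0, 1.0),
            z[1] + step_size * g[1] + sigma * rng.gauss(0.0, 1.0),
        )
        z = project_to_circle(proposal)
    return z


# Noise-free run: the point should slide along the circle toward (1, 0).
z_det = guided_langevin((0.0, 1.0), n_steps=200, step_size=0.1, noise_scale=0.0)
# Noisy run: the point wanders, but every iterate stays on the circle.
z_noisy = guided_langevin((0.0, 1.0), n_steps=200, step_size=0.1, noise_scale=0.5, seed=3)
print(z_det, z_noisy)
```

The projection after every update is what keeps the iterates from drifting into regions the toy manifold does not cover, which mirrors the stated role of the denoising component.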

2510.18516 2026-05-11 q-bio.NC cs.LG

Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware Pretraining

Sangyoon Bae, Mehdi Azabou, Blake Richards, Jiook Cha

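AI summary: This study addresses heterogeneity in neural recordings arising from cell-type differences, circuit dynamics, and stochastic stimulus responses, and proposes POYO-CAP, a biologically grounded pretraining method. It identifies statistically regular neurons, pretrains on them with masked reconstruction and auxiliary supervision, and then fine-tunes on more stochastic populations. On the Allen Brain Observatory dataset, the method yields a 12-13% relative improvement over training from scratch and scales smoothly with model size, turning neural heterogeneity into a scalable learning advantage.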

Details
English abstract

Neural recordings exhibit a distinctive form of heterogeneity rooted in differences in cell types, intrinsic circuit dynamics, and stochastic stimulus-response variability that goes beyond ordinary dataset variability, mixing statistically regular neurons with highly stochastic, stimulus-contingent ones within the same dataset. This heterogeneity poses a challenge for self-supervised learning (SSL), which depends on learnable statistical regularity, thereby destabilizing representation learning and limiting reliable scaling. We introduce POYO-CAP (Cell-pattern Aware Pretraining), a biologically grounded hybrid pretraining strategy that first trains with masked reconstruction plus lightweight auxiliary supervision on statistically regular neurons (identified via skewness and kurtosis) and then fine-tunes on more stochastic populations. On the Allen Brain Observatory dataset, this curriculum yields 12-13% relative improvements over from-scratch training and enables smooth, monotonic scaling with model size, whereas baselines trained on mixed populations plateau or destabilize. By making statistical predictability an explicit data-selection criterion, POYO-CAP turns neural heterogeneity into a scalable learning advantage for robust neural decoding.
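
A selection step based on skewness and kurtosis might look like the following sketch. The abstract does not give the thresholds or say which direction counts as "regular", so the rule used here (small absolute skewness and low excess kurtosis) is an assumption, as are the function names, the thresholds, and the two invented traces.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(trace):
    """Population skewness and excess kurtosis of one activity trace."""
    n = len(trace)
    mu = fmean(trace)
    m2 = sum((x - mu) ** 2 for x in trace) / n
    if m2 == 0.0:
        raise ValueError("constant trace has undefined skewness and kurtosis")
    m3 = sum((x - mu) ** 3 for x in trace) / n
    m4 = sum((x - mu) ** 4 for x in trace) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def select_regular(traces, max_abs_skew=1.0, max_excess_kurtosis=3.0):
    """Return names of traces under both (hypothetical) thresholds."""
    selected = []
    for name, trace in traces.items():
        skew, kurt = skewness_and_excess_kurtosis(trace)
        if abs(skew) <= max_abs_skew and kurt <= max_excess_kurtosis:
            selected.append(name)
    return selected


# Invented traces: one symmetric, one dominated by a single large event.
traces = {
    "cell_0": [1, 2, 3, 4, 5],
    "cell_1": [0, 0, 0, 0, 0, 0, 0, 0, 0, 10],
}
regular = select_regular(traces)
print(regular)
```

Under these assumed thresholds only the symmetric trace is kept; the sparse, event-driven trace is set aside for the later fine-tuning stage.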