arXivDaily: arXiv daily academic digest, updated Monday to Friday
Q-BIO Quantitative Biology (57)
2606.09770 2026-06-09 q-bio.NC cs.LG new submission

Discovering Functionally Selective Brain Regions with a Deep Topographic Multimodal Model

Badr AlKhamissi, Johannes Mehrer, Lara Marinov, Ahmed Abdelaal, Abdulkadir Gokce, Martin Schrimpf

AI summary: Proposes Topo-Omni, a model that fine-tunes a pretrained foundation model with a spatial smoothness objective to integrate visual, auditory, and language/cognitive processing on a single contiguous in-silico cortical sheet, yielding multimodal clusters consistent with human neuroimaging and enabling the discovery of new brain regions.

Comments
Preprint. First two authors contributed equally
Abstract

Nearby neurons in cortex share similar response profiles, producing systematic spatial organization across sensory and cognitive systems. Recent topographic models reproduce aspects of this structure but remain unimodal and spatially constrain each layer separately, yielding fragmented maps that capture neither the contiguity of cortical processing streams nor their integration across modalities. We introduce Topo-Omni, a topographic multimodal model in which visual, auditory, and language/cognitive processing share a single contiguous in-silico sheet. Built by fine-tuning a pretrained foundation model with a spatial smoothness objective, this architecture develops clusters across modalities that are consistent with human neuroimaging, from sensory to cognitive systems. Driving or suppressing a cluster selectively biases or impairs perception, paralleling human intervention studies. Finally, we use our model to screen for novel clusters in-silico and discover new natural landscape and animal networks which we validate in human data. A single spatial principle thus organizes representations across modalities and processing stages, yielding testable hypotheses about cortical organization.
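
The abstract does not give the form of the spatial smoothness objective. As a rough illustration only, one common way to express such a constraint is to penalize response differences between neighboring units on a 2D sheet. The function name, grid size, and toy responses below are assumptions, not the Topo-Omni implementation.

```python
import numpy as np

def spatial_smoothness_loss(activations, height, width):
    """Mean squared response difference between grid-adjacent units.

    activations: array of shape (n_stimuli, height * width), one column per unit.
    """
    grid = activations.reshape(activations.shape[0], height, width)
    dx = grid[:, :, 1:] - grid[:, :, :-1]   # horizontal neighbors
    dy = grid[:, 1:, :] - grid[:, :-1, :]   # vertical neighbors
    return float((dx ** 2).mean() + (dy ** 2).mean())

height = width = 16
n_stimuli = 10
rows = np.arange(height)[:, None] / height
cols = np.arange(width)[None, :] / width

# Toy responses that vary slowly across the sheet (made up for illustration).
smooth = np.stack([
    np.sin(2 * np.pi * (k / n_stimuli + cols)) + np.cos(2 * np.pi * rows)
    for k in range(n_stimuli)
]).reshape(n_stimuli, -1)

# Same responses with unit positions shuffled, which destroys the topography.
rng = np.random.default_rng(0)
shuffled = smooth[:, rng.permutation(height * width)]

loss_smooth = spatial_smoothness_loss(smooth, height, width)
loss_shuffled = spatial_smoothness_loss(shuffled, height, width)
```

In this toy setup the topographically ordered sheet gets a lower penalty than the shuffled one, which is the property a smoothness term of this kind rewards during fine-tuning.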

2606.09675 2026-06-09 q-bio.OT new submission

The Challenge of Cell Segmentation in Spatially Resolved Transcriptomics

Naveed Ishaque, Peter Kharchenko, Daria Lazic, Jieran Sun, Jean Yee Hwa Yang, Martin Emons, Florian Heyl, Wolfgang Huber, Daniel Jones, Louis B. Kuemmerle, Alex R. Lederer, Malte D. Luecken, Vinicius Maracaja-Coutinho, Matthias Meyer-Bender, Andrew Moorman, Evan W. Newell, Quan Nguyen, Shyam Prabhakar, John Randell, Daria Romanovskaia, Oliver Stegle, Gary D. Bader, Raphael Gottardo

AI summary: Argues that cell segmentation is a central unresolved problem in spatially resolved transcriptomics, analyzes challenges such as sparse signals and transcript displacement, and calls for shared evaluation frameworks and benchmark datasets.

Abstract

Spatially resolved transcriptomics (SRT) is transforming how we study tissues by measuring gene expression in cells in their spatial context. However, the field lacks robust methodological guidance on one of its most fundamental analytical steps: how to accurately segment cells and assign spatially localized transcripts to them. Major technical challenges include sparse molecular signals, transcript displacement, complex cellular morphologies, and the projection of three-dimensional tissue architecture onto two-dimensional imaging planes. These challenges make segmentation a major source of uncertainty, with errors that can propagate through downstream analyses and ultimately lead to misleading biological interpretations. Here, we argue that segmentation should be treated as a central unresolved problem in spatial omics rather than a routine preprocessing step. We review current approaches, highlight key methodological limitations, including the lack of appropriate metrics and gold-standard benchmarks, and propose a community-driven path forward. Establishing shared evaluation frameworks, scalable benchmark datasets, and transparent reporting standards will be essential for transforming SRT into a robust and reproducible foundation for biological discovery and clinical translation.

2606.09672 2026-06-09 cs.AI cs.CL cs.LG cs.PF q-bio.QM new submission

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

AI summary: Pretrained biomedical language models assign high cosine similarity (0.76-0.92) to unrelated cross-domain pairs, which leads to faulty causal inference. The paper proposes contrastive training (raising separation to 1.63x) and BODHI hard-negative mining (raising it to 2.30x), combined with OpenVINO optimization for a 133x speedup.

Comments
20 pages, 18 figures, 9 tables
Abstract

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
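
To make the two quantities in this abstract concrete, here is a minimal sketch of cosine similarity and of a within-vs-across-domain separation ratio. The ratio is assumed here to be mean within-domain similarity divided by mean across-domain similarity; the abstract does not spell out the formula, and the toy vectors and labels are made up.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def separation_ratio(embeddings, domains):
    """Mean within-domain similarity over mean across-domain similarity.

    Assumed definition for illustration; not taken from the paper.
    """
    within, across = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            (within if domains[i] == domains[j] else across).append(sim)
    return float(np.mean(within) / np.mean(across))

# Toy vectors (made up): all share one large common component, which mimics
# the "everything looks similar" geometry the abstract describes.
toy = np.array([
    [1.0, 0.2, 0.0],   # hypothetical "medical" item
    [1.0, 0.3, 0.0],   # hypothetical "medical" item
    [1.0, 0.0, 0.2],   # hypothetical "finance" item
    [1.0, 0.0, 0.3],   # hypothetical "finance" item
])
labels = ["medical", "medical", "finance", "finance"]

ratio = separation_ratio(toy, labels)
cross_pair = cosine_similarity(toy[0], toy[2])
```

With a shared dominant direction, even the cross-domain pair scores above 0.9 and the ratio sits only slightly above 1, which is the failure mode that contrastive training is meant to correct.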

2606.09558 2026-06-09 q-bio.GN cs.LG new submission

Integrating gene regulatory priors into Transformer attention with scTransformer for interpretable scRNA-seq analysis

Mikele Milia, Louis Fabrice Tshimanga, Henning Mueller, Manfredo Atzori, Barbara Di Camillo

AI summary: Presents scTransformer, described as the first approach to embed gene regulatory prior knowledge in the Transformer attention mechanism. By constraining information flow it learns biologically meaningful representations, improves classification accuracy and cell-type separation on a disease-relevant single-nucleus RNA-seq dataset, and yields attention patterns consistent with known regulatory programs.

Abstract

Motivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowledge, which limits interpretability and robustness. In this paper, we explore whether explicitly incorporating gene regulatory information can improve both model performance and biological insight. Results: We present scTransformer, the first Transformer-based approach that builds a priori knowledge of biological mechanisms into the model's attention patterns. By constraining information flow according to known regulatory structures, the model learns representations that are more biologically meaningful. We evaluate scTransformer on a disease-relevant single-nucleus RNA-seq dataset using supervised cell-type classification. Compared to standard Transformers, our approach improves classification accuracy, enhances separation of cell types in embedding space, and produces attention patterns consistent with known regulatory programs. Overall, our results demonstrate that embedding biological structure into Transformer models can enhance interpretability without sacrificing performance, offering a principled step toward biologically grounded foundation models for single-cell omics.
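
One generic way to constrain information flow with a prior is to mask attention scores so that a gene can only attend to genes permitted by a regulatory graph. The sketch below shows that idea with plain NumPy; the mask, dimensions, and function name are hypothetical, and the abstract does not state that scTransformer uses exactly this mechanism.

```python
import numpy as np

def masked_attention(q, k, v, allowed):
    """Scaled dot-product attention restricted to allowed (query, key) pairs."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(allowed, scores, -np.inf)    # forbid pairs absent from the prior
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_genes, dim = 4, 8
x = rng.normal(size=(n_genes, dim))   # toy gene embeddings

# Hypothetical regulatory prior: prior[i, j] is True if gene j may inform gene i.
# The diagonal is kept so every row has at least one allowed key.
prior = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=bool)

out, attn = masked_attention(x, x, x, prior)
```

Because disallowed pairs receive exactly zero weight, the resulting attention matrix can be read directly against the prior graph, which is one reason masking is attractive for interpretability.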

2606.09494 2026-06-09 q-bio.PE cond-mat.dis-nn new submission

Percolation and clustering in ecological communities: A dynamical theory

Dario Sergo, Cédric Koller, Vittorio Erba, Lenka Zdeborová

AI summary: For competitive ecosystems on random interaction graphs, proposes a discrete generalized Lotka-Volterra model, analyzes the emergence of percolating clusters and the spatial organization of surviving sites, and shows how dynamical accessibility governs the onset of clustering and percolation.

Abstract

Ecological communities with structured interactions exhibit collective phenomena such as percolation and clustering of occupied sites. While these effects have been documented in experiments and simulations, systematic analytical understanding has remained limited. In this paper, we develop a dynamical theory of these phenomena for competitive ecological systems defined on random interaction graphs. We introduce a discrete version of the generalized Lotka-Volterra model that preserves key macroscopic features of continuous ecological dynamics while enabling analytical treatment. Within this framework, we characterize the emergence of percolating clusters and describe the spatial organization of surviving sites. Our analysis uncovers which equilibria can be reached by the dynamics and shows how this dynamical accessibility governs the onset of clustering and percolation. In doing so, our framework complements classical Lotka-Volterra theory by providing a dynamical perspective on the collective organization of structured communities.
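
For orientation, the sketch below simulates the standard continuous generalized Lotka-Volterra equations on a sparse random graph and measures the largest connected cluster of surviving species. It is not the discrete model introduced in the paper; the graph model, competition strength, survival threshold, and all other parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_species, mean_degree, strength = 60, 3.0, 0.6

# Sparse symmetric random interaction graph (Erdos-Renyi), no self-loops.
upper = np.triu(rng.random((n_species, n_species)) < mean_degree / n_species, k=1)
adjacency = upper | upper.T
alpha = strength * adjacency          # competition coefficient on each edge

# Forward-Euler integration of dN_i/dt = N_i * (1 - N_i - sum_j alpha_ij * N_j).
abundance = rng.uniform(0.1, 1.0, size=n_species)
dt = 0.01
for _ in range(20_000):
    growth = 1.0 - abundance - alpha @ abundance
    abundance = np.clip(abundance + dt * abundance * growth, 0.0, None)

survivors = abundance > 1e-3          # assumed survival threshold

def largest_cluster(adj, alive):
    """Size of the largest connected component among surviving species."""
    seen, best = set(), 0
    for start in np.flatnonzero(alive):
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in np.flatnonzero(adj[node] & alive):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best

cluster_size = largest_cluster(adjacency, survivors)
```

Sweeping `mean_degree` or `strength` and tracking `cluster_size` relative to `n_species` is one simple way to look for a percolation-like transition numerically.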

2606.09040 2026-06-09 q-bio.PE cond-mat.stat-mech physics.bio-ph new submission

Natural Selection in the Wake of Catastrophe

Jesse Young Lin, Omer Granek, Joshua Sodicoff, Seppe Kuehn, David Pincus, Vincenzo Vitelli

AI summary: Proposes a theory of natural selection after catastrophe, finds that mean fitness relaxes inversely with time with a coefficient proportional to the number of coupled traits, validates it on E. coli antibiotic experiments, and shows that adaptation follows the Levenberg-Marquardt optimization algorithm.

Comments
11 pages, 3 figures
Abstract

Living organisms, from bacteria to humans, are more likely to survive if their traits enhance fitness. In populations well adapted to their environmental niches, natural selection proceeds via rarely beneficial mutations. But when a catastrophe wipes out niche diversity, sudden adaptation often follows. Here, we present a data-validated theory of natural selection in the wake of catastrophe and unveil a simple law that emerges during recovery: the mean fitness relaxes inversely with time, with a prefactor proportional to the number of traits coupled to the post-catastrophe environment. We put our approach to the test using experimental fitness landscapes measured following antibiotic administration to E. coli. The resulting mean trait adaptation is not described by gradient ascent on a fitness landscape; instead, it follows an algorithm known as Levenberg-Marquardt optimization. Near fitness peaks, evolutionary trajectories are biased against greediness - from an optimization perspective, post-catastrophic selection is optimistic.
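
For readers unfamiliar with the algorithm named here, the sketch below contrasts a plain gradient step with the textbook Levenberg-Marquardt update on a toy least-squares problem. It illustrates the generic optimizer only; the matrix, step sizes, and damping value are made up, and nothing here reproduces the paper's mapping from selection dynamics to optimization.

```python
import numpy as np

A = np.diag([1.0, 10.0])        # ill-conditioned toy problem (made up)
b = np.array([1.0, 10.0])       # optimum at theta = (1, 1)

def residuals(theta):
    return A @ theta - b

def loss(theta):
    r = residuals(theta)
    return 0.5 * float(r @ r)

def gradient_step(theta, lr=0.01):
    """Plain gradient descent on 0.5 * ||r||^2."""
    return theta - lr * A.T @ residuals(theta)

def lm_step(theta, damping=1.0):
    """Levenberg-Marquardt update: solve (J^T J + damping * I) delta = -J^T r."""
    J = A                       # Jacobian of the residuals (constant here)
    delta = np.linalg.solve(J.T @ J + damping * np.eye(2), -J.T @ residuals(theta))
    return theta + delta

theta_gd = np.zeros(2)
theta_lm = np.zeros(2)
for _ in range(50):
    theta_gd = gradient_step(theta_gd)
    theta_lm = lm_step(theta_lm)

loss_gd, loss_lm = loss(theta_gd), loss(theta_lm)
```

The damping term interpolates between a Gauss-Newton step (small damping) and a short gradient step (large damping), so the update rescales each direction by local curvature instead of following the raw gradient.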

2606.08973 2026-06-09 q-bio.QM cs.LG new submission

A systematic investigation of molecular encoding methods for drug property predictions across neural network and Transformer encoder-based model

Sheng-Ya Chen, Shan-Ju Yeh

AI summary: Systematically studies how different molecular encoding methods affect drug property prediction using MLP and MLP+TL models; finds that MACCS and PubChem fingerprints combined with attention weights can identify key chemical groups, with average AUC above 0.9.

Abstract

Fundamental investigations into how different molecular encoding methods affect molecular property prediction remain relatively limited. In this study, we extensively examined the optimal molecular encoding methods for molecular property prediction using two prevalent structure designs: a classical neural network model (MLP) and a Transformer encoder-based model (MLP+TL). For molecular encoding methods, we investigated several types of fingerprints, including traditional topological fingerprints, substructure-based fingerprints, and string-based representations. These two models were trained on seven well-known molecular datasets to evaluate different input molecular encoding methods based on evaluation metrics. On several biologically relevant classification tasks, including toxicity, mutagenicity, and side-effect prediction, our models consistently achieved average AUC values above 0.9. Rather than relying on external post-hoc explanation methods such as the local interpretable model-agnostic explanation (LIME) or the Deep SHapley Additive exPlanations (SHAP), we leveraged the model's intrinsic attention weights as an internal interpretability signal for identifying potentially important features. The MLP+TL model using MACCS and PubChem as input can capture chemically interpretable groups that determine the major blood-brain barrier (BBB) permeability and mutagenicity in Salmonella typhimurium. In particular, a comparison between Morphine and Heroin highlighted the role of hydroxyl-related substructures in BBB permeability prediction, which was consistently reflected in the attention weights. Overall, our findings provide practical guidance for selecting effective molecular encoding methods and contribute to the development of interpretable molecular informatics approaches for drug discovery.

2606.08897 2026-06-09 cs.CV cs.AI q-bio.QM new submission

A multi-agent system for spine MRI report generation from multi-sequence imaging

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu, Yi Yao, Zachary D. Miller, William E. King, Mohammed M. Kanani, Jalal B. Andre, Sammy Chu, Ming Zhang, Paul E. Kinahan, Nathan M. Cross, Sheng Wang

Affiliations: University of Washington; Peking University; University of Wisconsin–Madison; New York University; University of Washington Medical Center

AI summary: Proposes SpineAgent, a multi-agent framework that uses a multi-sequence foundation model to integrate information from T1, T2, and other sequences for spine MRI report generation, pathology localization, and image-report retrieval, with strong results in cross-manufacturer and cross-cohort evaluation.

Abstract

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse modalities of sequences, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing a patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.

2606.08825 2026-06-09 physics.chem-ph q-bio.MN new submission

When Three-Dimensional Conformer Ensembles Improve Molecular Property Prediction Beyond Two-Dimensional Fingerprints: A Systematic Study

Bryan Cheng, Austin Jin, Jasper Zhang

AI summary: Across roughly 1,000 experiments, finds that conformer ensemble statistics extracted with Distribution Kernel Operators significantly reduce RMSE on solvation-dependent properties (for example ESOL -11.0%) but give no benefit on electronic or steric tasks, attributes this to a physical rather than statistical basis, and establishes a four-tier performance hierarchy.

Comments
10 pages, 4 figures, ACM-BCB 2026 Full Paper
Abstract

When do three-dimensional conformer ensembles improve molecular property prediction beyond two-dimensional fingerprints? We provide the first systematic, mechanistically grounded answer. Through ~1,000 experiments spanning 13 model configurations, 14 regression targets, and 2 classification targets across MoleculeNet, QM9, and MARCEL benchmarks, we discover selective complementarity: conformer ensemble statistics extracted via Distribution Kernel Operators (DKOs) yield statistically significant RMSE reductions on solvation-dependent properties (ESOL -11.0%, p < 10^{-9}; FreeSolv -13.5%, p < 3x10^{-5}; 10-seed paired validation) while providing no benefit for electronic or steric tasks. Three lines of evidence confirm this selectivity has a physical rather than statistical basis: improvement is larger under scaffold splits than random splits (+11.9% vs. +8.5% on ESOL), concentrates on large, flexible molecules (+18.9% for heaviest quartile), and grows monotonically with training data. We establish a four-tier performance hierarchy: end-to-end 3D GNNs (SchNet, PaiNN; 21-42% over fingerprints) >= engineered physicochemical descriptors (PMI/SASA/USR) > Morgan fingerprints + XGBoost > all neural conformer ensemble methods, confirmed by two architecturally diverse GNNs and revealing that the pre-computed feature bottleneck limits ensemble approaches. Feature attribution and mutual information analysis expose the mechanistic asymmetry: conformer mean features carry 2-8x more information per feature than fingerprint bits, yet covariance features contribute <2% of model signal, explaining why five simple scalar invariants outperform all complex covariance architectures (p < 0.001). These findings yield an empirical property taxonomy and a practical decision framework for when conformer generation is worth the investment.

2606.08805 2026-06-09 cond-mat.dis-nn q-bio.NC new submission

Predictable Mean-Field Chaos in Random Recurrent Networks

Alkesh Yadav, Vladimir Shaidurov, Jonathan Kadmon

AI summary: Shows that in random recurrent networks with analytic nonlinearities and sufficiently fast Fourier decay, the past of a mean-field trajectory uniquely determines its future, turning mean-field theory from an ensemble description into a conditional prediction theory for individual trajectories, and introduces Krylov growth to characterize prediction complexity.

Comments
5 pages, 2 figures, Supplementary material
Abstract

Dynamical mean-field theory recasts deterministic chaos in random recurrent networks as an effective stochastic process. We show that for analytic nonlinearities with sufficiently fast Fourier decay, this stochasticity is only apparent: the continuous past of a realized mean-field trajectory uniquely determines its future. The mean-field theory is therefore not merely an ensemble description, but a conditional prediction theory for individual trajectories. Unfolding the power spectrum into a Krylov state space exposes how this latent determinism is organized across an infinite hierarchy of temporal modes. The associated Krylov growth rate sets the complexity of finite-resolution prediction and upper-bounds the largest Lyapunov exponent in this class of networks. Thus, microscopic sensitivity and predictive complexity are distinct aspects of mean-field chaos. Our results extend Krylov growth ideas developed for Hamiltonian chaotic dynamics to classical dissipative systems.

2606.08720 2026-06-09 q-bio.NC new submission

This is how the Neocortex Learns

Randall C. O'Reilly

AI summary: Proposes a framework for neocortical learning that meets computational, algorithmic, and implementational criteria: error-driven predictive learning via temporal derivatives, driven by corticothalamic circuits and implemented through a competitive kinase synaptic plasticity mechanism.

Comments
9 pages, 4 figures
Abstract

A sufficient account of how the neocortex learns must meet three criteria: 1. Computationally, it must approximate a powerful, general-purpose learning algorithm known to scale to human-level intelligence; 2. Algorithmically, it must be implementable using known, well-established neural circuits within the neocortex and associated brain structures; 3. Implementationally, there must be a detailed account for how all of the algorithmic mechanisms actually function at a neurochemical level. At present, there is only one framework that meets all of these criteria: error-driven predictive learning via temporal derivatives, driven by corticothalamic circuits, based on competitive kinase synaptic plasticity induction mechanisms. This has been implemented in the Axon neural simulation framework using spiking neurons, and demonstrated to learn across a wide range of challenging cognitively motivated tasks.

2606.08647 2026-06-09 q-bio.BM cond-mat.mes-hall cond-mat.soft new submission

Protein Dynamics Beyond Structure Prediction

Juliette Griffié, Sviatlana Shashkova, Antonio Ciarlo, Sreekanth K. Manikandan, Claes Andréasson, Malin Bäckström, Tristan Bereau, Hjalmar Brismar, Carlos Bustamante, Marta Carroni, Roberto Covino, Andreas Dahlin, Sebastian Deindl, Lucie Delemotte, Arne Elofsson, John Eriksson, Giovanna Fragneto, Anders Gunnarsson, Per Hammarström, Caroline Ingre, Christian Kaiser, Petronella Kettunen, Mark C. Leake, Benjamin Loos, Anna Månberg, Antonia S. J. S. Mey, Richard Neutze, Thomas Nyström, Karl Palmås, Charley Schaefer, Markus J. Tamás, Nicola Ticozzi, Tomás S. Pilvelic, Jacopo Sacquegno, B. M. Tijms, Gunnar von Heijne, Björn Wallner, Vitali Zhaunerchyk, Simon Olsson, Joana B. Pereira, Julia Fernandez-Rodriguez, Fredrik Westerlund, Giovanni Volpe

AI summary: Reviews progress in moving from static structure prediction toward understanding protein folding dynamics, combining single-molecule experiments with multiscale modeling, with the aim of building a quantitative predictive framework for controlling folding and disease-related misfolding.

Comments
53 pages, 4 figures
Abstract

The ability to predict protein three-dimensional structures from amino acid sequences is a landmark achievement in molecular biology, where recent deep learning approaches such as AlphaFold are the culmination of decades of work. Yet, the quantitative understanding of how protein sequences give rise to dynamic conformational changes and higher-order assemblies remains unsolved. Folding and conformational states are dynamic, stochastic processes, shaped by sequence, energy, co-translational constraints, chaperone machineries, and the physicochemical conditions of the cellular environment. Recent advances now position the field to move beyond static structural endpoints toward a mechanistic understanding of folding dynamics in living systems. Single-molecule techniques enable time-resolved observation of folding trajectories and intermediate states hitherto hidden by traditional structural biology approaches, while computational innovations and data-driven approaches offer new ways to integrate heterogeneous data across scales. In this Roadmap, we review the current conceptual landscape of protein folding, examine the experimental and theoretical gaps that remain, and discuss emerging strategies that integrate high-resolution measurements with multiscale modeling. We outline a roadmap toward a quantitative and predictive science of protein folding dynamics, conformational kinetics, and macromolecular self-assembly. Realizing this vision would transform our understanding of the dynamics of molecular self-organization, from the folding of individual polypeptides to the emergence of dynamic macromolecular complexes. This will enable rational control of folding and misfolding in health and disease, extend protein engineering principles beyond static structural design, and establish a mechanistic foundation for predictive and personalized interventions in proteostasis-related disorders.

2606.08475 2026-06-09 q-bio.QM stat.ME new submission

Parameter uncertainty in dynamical models: a practical identifiability index

Hamed Karami, Alexandra Smirnova, Sunmi Lee, Gerardo Chowell

AI summary: Proposes the Practical Identifiability Index (PII), which quantifies parameter uncertainty from the logarithmic span of confidence intervals, to assess how well parameters are constrained by finite, noisy data.

Abstract

Ordinary differential equation models are widely used to understand and forecast complex dynamical systems, but their predictive value depends on reliable parameter estimation. Structural identifiability assesses whether parameters can be uniquely recovered from ideal observations, whereas practical identifiability depends on finite, noisy and partially observed data. We introduce the Practical Identifiability Index (PII), a marginal uncertainty-width metric based on the logarithmic span of confidence intervals. Expressed on an order-of-magnitude scale, the PII summarises how tightly individual positive-valued parameters are constrained by available observations, enabling comparison across parameters, models, error structures and observation designs. The PII is intended as a complementary diagnostic, not a standalone identifiability test, and should be interpreted alongside coverage, profile likelihoods, posterior summaries, sensitivity analysis or structural identifiability results. Using parametric bootstrap experiments across growth and compartmental epidemic models, we identify consistent principles: uncertainty decreases as calibration windows become more informative, increases with observation noise and parameter coupling, and remains high for latent or indirectly observed processes. Parameters governing early observable dynamics become constrained sooner, while additional observables can improve constraint for latent progression and recovery parameters. The PII provides a simple, reportable summary of marginal parameter uncertainty for dynamical modelling.
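
A plausible reading of "logarithmic span of confidence intervals" on an order-of-magnitude scale is the base-10 logarithm of the ratio between the upper and lower interval bounds. The sketch below computes that quantity from bootstrap estimates using a percentile interval. Both the formula and the interval construction are assumptions for illustration, since the abstract does not give the exact definition, and the bootstrap samples are synthetic.

```python
import numpy as np

def practical_identifiability_index(samples, level=0.95):
    """log10(upper / lower) of a percentile interval for a positive parameter.

    Assumed definition; a value of 1.0 would mean the interval spans one
    order of magnitude.
    """
    samples = np.asarray(samples, dtype=float)
    if np.any(samples <= 0):
        raise ValueError("parameter estimates must be positive")
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(samples, [tail, 1.0 - tail])
    return float(np.log10(upper / lower))

rng = np.random.default_rng(0)

# Synthetic bootstrap estimates (made up): one well constrained, one poorly.
tight = rng.lognormal(mean=np.log(0.3), sigma=0.05, size=2000)
loose = rng.lognormal(mean=np.log(0.3), sigma=1.0, size=2000)

pii_tight = practical_identifiability_index(tight)
pii_loose = practical_identifiability_index(loose)
```

Because the index is a log ratio, it does not depend on the parameter's units or magnitude, which is what makes comparison across parameters and models possible.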

2606.08409 2026-06-09 stat.ME q-bio.PE new submission

Matrix representations and distance metrics for unlabeled ranked phylogenetic networks

Jiayang Wang, Julia A. Palacios, Claudia Solís-Lemus

AI summary: For rooted, ranked, unlabeled phylogenetic networks, proposes a family of distance metrics based on a bijective triangular matrix representation; the metrics support isochronous and heterochronous networks and quantify differences in topology, timing, and number of hybridizations.

Comments
25 pages, 11 figures. Submitted to the Proceedings of the National Academy of Sciences (PNAS)
Abstract

Phylogenetic networks are graphs inferred from molecular sequence data that represent ancestral histories shaped by reticulate processes such as recombination, hybridization, and horizontal gene transfer. We introduce a family of distance metrics for rooted, ranked, unlabeled phylogenetic networks, extending a previously developed distance for ranked trees. Our approach relies on a bijective triangular matrix representation of phylogenetic networks that captures the temporal order of internal events, speciations, and hybridizations. Our metrics, defined as standard matrix norms, allow efficient quantitative comparisons of network topologies, timed networks and networks with differing numbers of hybridizations. Our distance can be used for both isochronous networks where all tips are sampled at one time point, and heterochronous networks where tips are allowed to be sampled at different time points. We show that our metrics capture biologically meaningful differences among evolutionary histories in both simulations and empirical posterior distributions of viral phylogenetic networks. These tools fill a methodological gap, enabling principled comparisons of ranked, unlabeled phylogenetic networks, including ancestral recombination graphs.
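
Once two networks are encoded as triangular matrices of the same size, a distance defined as a standard matrix norm reduces to a norm of their entrywise difference. The sketch below shows only that last step. The two matrices are made-up placeholders, not valid encodings produced by the paper's representation, and the function name is hypothetical.

```python
import numpy as np

def matrix_distance(m1, m2, order=2):
    """Entrywise L1 (order=1) or L2/Frobenius (order=2) distance between encodings."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    if m1.shape != m2.shape:
        raise ValueError("encodings must have the same shape")
    diff = np.tril(m1 - m2)   # only the lower triangle carries information
    if order == 1:
        return float(np.abs(diff).sum())
    if order == 2:
        return float(np.sqrt((diff ** 2).sum()))
    raise ValueError("order must be 1 or 2")

# Placeholder lower-triangular encodings (made up for illustration).
enc_a = np.array([[2, 0, 0],
                  [1, 3, 0],
                  [0, 2, 4]])
enc_b = np.array([[2, 0, 0],
                  [2, 3, 0],
                  [1, 2, 4]])

d1 = matrix_distance(enc_a, enc_b, order=1)
d2 = matrix_distance(enc_a, enc_b, order=2)
```

A norm-based distance inherits the metric properties of the norm, provided the encoding is bijective, which is why the abstract stresses that property of the representation.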

2606.08391 2026-06-09 q-bio.PE q-bio.QM 新提交

Cruise Ship-Associated Andes Virus Cluster aboard MV Hondius, 2026: A Stochastic Scenario Analysis

2026年MV Hondius号邮轮相关安第斯病毒聚集性疫情:随机情景分析

Raj Kumar Subedi, Hamed Karami, Kaustubh Wagh, Kenji Mizumoto, Gerardo Chowell

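AI summary: Uses a stochastic epidemic model to analyze the first documented cruise ship-associated Andes virus cluster, aboard the MV Hondius in 2026; finds that two latent infections at embarkation are most consistent with the observed outbreak and highlights the importance of exposure-history assessment and early surveillance.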

详情
AI中文摘要

2026年4月,MV Hondius号探险邮轮成为首次记录的邮轮相关安第斯汉坦病毒(ANDV)聚集性疫情的发生地,在149名乘客和船员中出现13例确诊和疑似病例,其中3例死亡。我们应用随机流行病模型,基于已发表的ANDV估计的再生数,评估了四种登船情景。情景D(登船时有两名潜伏感染者)与观察到的疫情最为一致,在R0=2.12时,最终规模≥13的概率为11.6%,暴发概率为58.5%。近似贝叶斯计算为登船时存在多例潜伏感染提供了补充支持,尤其是E1(0)=1和E3(0)=2,但R0仍难以识别。在第35天减少传播对该反事实模型中的暴发概率影响很小。研究结果支持对来自ANDV流行区的旅行者进行暴露史评估、早期船上监测、快速隔离有症状病例以及下船后监测。

英文摘要

In April 2026, the MV Hondius expedition cruise ship became the site of the first documented cruise ship-associated Andes hantavirus (ANDV) cluster, with 13 confirmed and probable cases and 3 deaths among 149 passengers and crew. We applied a stochastic epidemic model to evaluate four embarkation scenarios under reproductive numbers anchored to published ANDV estimates. Scenario D, involving two latent infected persons at embarkation, was most consistent with the observed outbreak, yielding P(final size >= 13) = 11.6% and P(takeoff) = 58.5% at R0 = 2.12. Approximate Bayesian computation provided complementary support for multiple latent infections at embarkation, especially E1(0)=1 and E3(0)=2, but R0 remained weakly identifiable. A day-35 transmission reduction changed takeoff probability little in this counterfactual model. Findings support exposure-history assessment, early onboard surveillance, rapid isolation of symptomatic cases, and postdisembarkation monitoring for travelers from ANDV-endemic regions.

2606.08366 2026-06-09 q-bio.QM cs.MS 新提交

MetaboliSim: a Python implementation of the Mader model for dynamic and steady-state simulation of muscular energy metabolism

MetaboliSim:用于肌肉能量代谢动态和稳态模拟的Mader模型的Python实现

Katharina Dunst, Vincent Scharf, Clemens Hesse, Alexander Asteroth

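AI summary: Presents MetaboliSim, an open-source Python implementation of the Mader model for dynamic and steady-state simulation of muscular energy metabolism; verifies implementation correctness and enables independent reproduction.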

详情
AI中文摘要

Mader模型是德语体育科学中最广泛使用的肌肉能量代谢数学框架,支撑乳酸诊断、最大乳酸稳态(MLSS)估计和训练处方。尽管已使用数十年,但其动态ODE公式和稳态方程均未以开放代码形式提供,导致基于该模型的结果无法独立复现。我们通过MetaboliSim填补了这一空白,这是一个开源Python实现,包含两种公式:一个动态模型,使用四阶Runge-Kutta方案积分五变量ODE系统(磷酸盐势、$\dot{V}\mathrm{O}_2$、肌肉和血乳酸、糖原);以及一个稳态模型,计算MLSS功率和乳酸-功率关系,提供单室和双室变体。我们对照已发表的参考值验证了实现的正确性,并在恒负荷、阶梯测试、冲刺和跑步协议中评估了生理合理性。该实现能在规定容差内重现已发表的参考输出,并保持数值稳定(时间步长减半使血乳酸变化小于0.01 mmol/L),两种公式产生一致的MLSS估计。关键生理行为($\dot{V}\mathrm{O}_2$开启动力学、乳酸积累、PCr动力学以及亚/超MLSS分离)直接从模型方程中产生,无需协议特定调整,敏感性分析显示MLSS功率随$\dot{V}\mathrm{O}_{2\max}$近似线性变化,随$\dot{V}\mathrm{La}_{\max}$非线性变化。作为完整Mader模型的第一个公开可用实现(AGPL-3.0),MetaboliSim允许独立团队复现、验证和基于已发表的模型结果进行构建。源代码:https://codeberg.org/3phos/metabolisim;平台:https://metabolisim.org

英文摘要

The Mader model is the most widely used mathematical framework for muscular energy metabolism in German-language sport science, underpinning lactate diagnostics, maximal lactate steady state (MLSS) estimation and training prescription. Despite decades of use, neither its dynamic ODE formulation nor its steady-state equations have been available as open code, leaving results based on the model impossible to reproduce independently. We close this gap with MetaboliSim, an open-source Python implementation of both formulations: a dynamic model that integrates the five-variable ODE system (phosphate potential, $\dot{V}\mathrm{O}_2$, muscle and blood lactate, and glycogen) with a fourth-order Runge-Kutta scheme, and a steady-state model that computes MLSS power and the lactate-power relationship in one- and two-compartment variants. We verified implementation correctness against published reference values and assessed physiological plausibility across constant-load, step-test, sprint and running protocols. The implementation reproduces the published reference output within stated tolerances and remains numerically stable throughout (halving the time step changes blood lactate by less than 0.01 mmol/L), with both formulations yielding congruent MLSS estimates. Key physiological behaviour ($\dot{V}\mathrm{O}_2$ on-kinetics, lactate accumulation, PCr dynamics and the sub/supra-MLSS separation) emerges directly from the model equations without protocol-specific tuning, and a sensitivity analysis shows MLSS power varying approximately linearly with $\dot{V}\mathrm{O}_{2\max}$ and nonlinearly with $\dot{V}\mathrm{La}_{\max}$. As the first openly available implementation of the complete Mader model (AGPL-3.0), MetaboliSim lets independent groups reproduce, verify and build on published model-based results. Source code: https://codeberg.org/3phos/metabolisim; Platform: https://metabolisim.org
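The step-halving stability check mentioned in the abstract can be sketched with a generic fourth-order Runge-Kutta integrator. This is a minimal sketch on a toy first-order decay equation, not the five-variable Mader system; the function names, the rate constant 0.5, and the step sizes are assumptions chosen for illustration and do not come from MetaboliSim.

```python
import math


def rk4_step(f, t, y, h):
    """Advance the state y by one classical fourth-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
    k4 = f(t + h, [yi + h * k for yi, k in zip(y, k3)])
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def integrate(f, y0, t_end, h):
    """Integrate from t = 0 to t_end with a fixed step h."""
    t, y = 0.0, list(y0)
    for _ in range(round(t_end / h)):
        y = rk4_step(f, t, y, h)
        t += h
    return y


def toy_decay(t, y):
    # Toy kinetics dy/dt = -0.5 * y; a placeholder, NOT the Mader equations
    return [-0.5 * y[0]]


coarse = integrate(toy_decay, [10.0], 4.0, 0.1)[0]
fine = integrate(toy_decay, [10.0], 4.0, 0.05)[0]
halving_change = abs(coarse - fine)  # small if the step size is adequate
```

If halving the step changes the result by less than a chosen tolerance, the coarser step can be considered adequate for that quantity; the abstract reports a threshold of 0.01 mmol/L for blood lactate.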

2606.08202 2026-06-09 stat.ML cs.LG physics.data-an q-bio.NC 新提交

Vector Space of Cycles

循环向量空间

Moo K. Chung, Anass B. El-Yaagoubi, Hernando Ombao

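AI summary: Proposes a variational framework that represents cyclic interactions as edge flows on a simplicial complex; energy-minimizing dynamics separate transient components from persistent harmonic flows, yielding a low-dimensional cycle space that supports projection, averaging, comparison, and statistical inference on cyclic structure.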

详情
AI中文摘要

大多数用于有向交互的统计和机器学习方法关注变量之间的成对效应。即使现有的循环模型也主要通过节点级依赖表示反馈,使得大规模循环组织难以估计和比较。这一限制在生物和神经系统中尤为突出,其中交互高度循环且涉及许多重叠的循环。我们引入了一个用于循环交互统计推断的变分框架。有向交互被表示为单纯复形上的边流,并在能量最小化动力系统下演化。由此产生的动力学将瞬态交互分量与持久谐波流分离,产生一个捕获稳定循环组织的低维循环空间。该框架不是枚举单个循环,而是将循环交互表示为希尔伯特空间的元素,从而实现投影、平均、比较和群体级统计推断。我们建立了谐波投影的理论性质,包括循环空间的表征、方差减少和群体推断。模拟表明,与现有的有向交互方法相比,该方法在密集循环系统中显著改善了循环结构的恢复。应用于400名人类受试者的静息态fMRI,该框架揭示了通过边平均无法检测的可重复的大规模循环组织。这些结果为研究高维动力系统中的循环交互提供了一个可扩展的统计框架。

英文摘要

Most statistical and machine learning methods for directed interactions focus on pairwise effects among variables. Even existing cyclic models represent feedback primarily through node-level dependencies, making large-scale recurrent organization difficult to estimate and compare. This limitation is particularly acute in biological and neural systems, where interactions are highly recurrent and involve many overlapping cycles. We introduce a variational framework for statistical inference on cyclic interactions. Directed interactions are represented as edge flows on a simplicial complex and evolved under an energy-minimizing dynamical system. The resulting dynamics separate transient interaction components from persistent harmonic flows, yielding a low-dimensional cycle space that captures stable recurrent organization. Rather than enumerating individual cycles, the proposed framework represents cyclic interactions as elements of a Hilbert space, enabling projection, averaging, comparison, and population-level statistical inference. We establish theoretical properties of the harmonic projection, including characterization of the cycle space, variance reduction, and population inference. Simulations demonstrate substantially improved recovery of cyclic structure in dense recurrent systems compared with existing directed-interaction methods. Applied to resting-state fMRI from 400 human subjects, the framework reveals reproducible large-scale cyclic organization that is not detectable through edgewise averaging. These results provide a scalable statistical framework for studying recurrent interactions in high-dimensional dynamical systems.
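To make the idea of a harmonic edge flow concrete, the sketch below projects a flow on a tiny complex onto its harmonic component using the standard Hodge decomposition. It is a minimal sketch under stated assumptions: the hollow-triangle example, the function name `harmonic_part`, and the closed-form pseudoinverse projection are illustrative choices, whereas the paper obtains the harmonic part through an energy-minimizing dynamical system whose details are not given in the abstract.

```python
import numpy as np

# Oriented edges (tail -> head) of a hollow triangle with no filled 2-simplex
edges = [(0, 1), (1, 2), (2, 0)]
n_nodes = 3

# Node-by-edge boundary matrix B1: -1 at the tail, +1 at the head of each edge
B1 = np.zeros((n_nodes, len(edges)))
for j, (u, v) in enumerate(edges):
    B1[u, j] = -1.0
    B1[v, j] = 1.0


def harmonic_part(flow, B1, B2=None):
    """Project an edge flow onto the harmonic space ker(B1) ∩ ker(B2^T)."""
    flow = np.asarray(flow, dtype=float)
    # Remove the gradient component, i.e. the projection onto im(B1^T)
    rest = flow - B1.T @ np.linalg.pinv(B1 @ B1.T) @ B1 @ flow
    if B2 is not None and B2.size:
        # Remove the curl component, i.e. the projection onto im(B2)
        rest = rest - B2 @ np.linalg.pinv(B2.T @ B2) @ B2.T @ rest
    return rest


cycle = np.array([1.0, 1.0, 1.0])            # flow circulating around the loop
gradient = B1.T @ np.array([0.0, 1.0, 3.0])  # flow induced by node potentials
recovered = harmonic_part(cycle + gradient, B1)
```

Because harmonic components live in a vector space, they can be averaged across subjects entry by entry, which is the kind of operation the abstract refers to as projection and averaging.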

2606.08191 2026-06-09 cs.LG cs.AI q-bio.QM 新提交

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

频域潜在注意力门控用于跨域令牌聚合

Kewei Li, Rongying Zhang, Xueli Wang, Xiwen Gong, Zhongjian Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou

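AI summary: Proposes the FLaG module, which aggregates tokens across domains via a real FFT, summarization of spectral components with learnable latent queries, a channel-wise gate, and time-domain reconstruction; reports its clearest gains on AMP prediction and CIFAR-100 while remaining competitive on text classification.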

详情
AI中文摘要

令牌聚合是将令牌表示映射到样本级预测的模型中的常见瓶颈,然而大多数池化方法仅在原始令牌域中操作。我们提出FLaG,一个即插即用的聚合模块,它使用实FFT变换令牌表示,用可学习的潜在查询汇总频谱分量,应用通道门控,并重建增强的时域令牌以进行最终池化。我们在使用ESM2的抗菌肽(AMP)活性预测、使用ResNet18在CIFAR-10和CIFAR-100上的图像分类,以及使用RoBERTa在IMDB和GLUE上的文本分类中评估FLaG。FLaG在ESM2-8M抗菌肽任务和CIFAR-100上取得了最明显的提升,同时在IMDB和GLUE上与强文本基线保持竞争力。然后,我们通过频带消融、门控汇总、残基扰动、潜在查询读出和结构代理分层来探究其在AMP设置中的行为。我们发现低频带贡献最大,其余高频带模式更具样本特异性。门控充当广泛共享的频谱重加权阶段,交叉注意力模式是样本特异性的,具有轻微的查询差异,并且高螺旋肽在两种细菌中表现出更强的平均频谱敏感性。补充材料、源代码和数据发布在https://www.healthinformaticslab.org/supp/ 和 https://github.com/Kewei2023/AMPCliff/tree/FLaG。

英文摘要

Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs enhanced time-domain tokens for final pooling. We evaluate FLaG on antimicrobial peptide (AMP) activity prediction with ESM2, image classification with ResNet18 on CIFAR-10 and CIFAR-100, and text classification with RoBERTa on IMDB and GLUE. FLaG achieves its clearest gains on the ESM2-8M antimicrobial peptide tasks and on CIFAR-100, while remaining competitive with strong text baselines on IMDB and GLUE. We then probe its behavior in the AMP setting with band knockouts, gate summaries, residue perturbations, latent-query readouts, and structure-proxy stratification. We find that low-frequency bands contribute the most overall, while the remaining higher-band pattern is more sample-specific. The gate acts as a broadly shared spectral reweighting stage, the cross-attention patterns are sample-specific with mild query-wise differentiation, and higher-helix peptides exhibit stronger average spectral sensitivity in both bacteria. The supplementary materials, source code and data are released at https://www.healthinformaticslab.org/supp/ and https://github.com/Kewei2023/AMPCliff/tree/FLaG.
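The transform, gate, reconstruct and pool pipeline can be sketched in NumPy as follows. This is a simplified sketch, not the FLaG implementation: the learnable latent-query summarization is omitted and replaced by fixed hand-set gates, and giving the gate one value per frequency bin and channel is an assumption. The function name `spectral_gate` and the array sizes are hypothetical.

```python
import numpy as np


def spectral_gate(tokens: np.ndarray, gate: np.ndarray) -> np.ndarray:
    """Reweight token features in the frequency domain.

    tokens: (n_tokens, n_channels) real-valued token representations
    gate:   (n_tokens // 2 + 1, n_channels) multiplicative weights
    """
    n_tokens = tokens.shape[0]
    spectrum = np.fft.rfft(tokens, axis=0)  # real FFT along the token axis
    # Apply the gate, then return to the token (time) domain
    return np.fft.irfft(spectrum * gate, n=n_tokens, axis=0)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 4))
n_freq = tokens.shape[0] // 2 + 1

identity_gate = np.ones((n_freq, 4))  # leaves every band untouched
low_pass_gate = np.zeros((n_freq, 4))
low_pass_gate[0, :] = 1.0             # keep only the zero-frequency band

restored = spectral_gate(tokens, identity_gate)
smoothed = spectral_gate(tokens, low_pass_gate)
pooled = smoothed.mean(axis=0)        # final mean pooling over tokens
```

Zeroing individual rows of the gate is also one simple way to run a band knockout of the kind the abstract describes, although the authors' exact procedure may differ.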

2606.08147 2026-06-09 q-bio.GN cs.LG 新提交

Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction

面向可解释调控DNA活性预测的生物学推理引导回归

Yi Duan, Zhao Yang, Jiwei Zhu, Ying Ba, Chuan Cao, Bing Su

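AI summary: Proposes the R3LM framework, which teaches LLMs reasoning-informed regression through structured biological knowledge; achieves state-of-the-art enhancer prediction and provides interpretable mechanistic explanations.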

详情
Comments
Accepted at KDD 2026 AI4Sciences Track
AI中文摘要

DNA顺式调控元件(CREs)如增强子控制基因表达水平。从DNA序列准确预测调控活性是有价值但具有挑战性的,因为它需要理解复杂的生物调控过程。现有方法通常以黑盒方式从序列回归活性分数,限制了可解释性和回归性能。同时,大型语言模型(LLMs)受益于显式推理过程,但直接将LLMs应用于原始DNA序列表现不佳。在本文中,我们通过引入R3LM框架弥合这一差距,该框架通过结构化生物学知识教LLMs对调控DNA进行推理引导回归。具体来说,我们设计了一种基于生物学的数据格式,结构化DNA的调控信息以改善LLM理解,并构建了CRE-ReasonBench,这是第一个将DNA序列和活性分数与机制推理轨迹关联的数据集。通过两阶段训练,首先教LLMs对结构化生物信息进行推理,然后进行回归,R3LM在三种细胞类型的增强子预测上达到了最先进性能,优于使用原始序列输入的LLMs和专门的DNA模型,同时提供了可解释的机制解释。我们期望R3LM作为一种可解释的奖励模型,能够有效辅助生物学家进行CRE设计。代码可在https://github.com/DuanYi516/R3LM获取。

英文摘要

DNA cis-regulatory elements (CREs) such as enhancers control gene expression levels. Accurately predicting regulatory activity from DNA sequences is valuable but challenging, as it requires understanding complex biological regulatory processes. Existing methods typically regress activity scores from sequences in a black-box manner, limiting both interpretability and regression performance. Meanwhile, large language models (LLMs) benefit from explicit reasoning processes, yet directly applying LLMs to raw DNA sequences performs poorly. In this paper, we bridge this gap by introducing R3LM, a framework that teaches LLMs reasoning-informed regression on regulatory DNA through structured biological knowledge. Specifically, we design a biologically grounded data format that structures DNA's regulatory information for improved LLM understanding, and construct CRE-ReasonBench, the first dataset that associates DNA sequences and activity scores with mechanistic reasoning traces. Through two-stage training that first teaches LLMs to reason over structured biological information and then performs regression, R3LM achieves state-of-the-art performance on enhancer prediction across three cell types, outperforming both LLMs with raw sequence input and specialized DNA models while providing interpretable mechanistic explanations. We expect R3LM to serve as an interpretable reward model that can effectively assist biologists in CRE design. Code is available at https://github.com/DuanYi516/R3LM.

2606.08138 2026-06-09 physics.bio-ph physics.app-ph q-bio.SC 新提交

DNA Replication under Thermal, Chemical, and Genotoxic Stress

热、化学和基因毒性应激下的DNA复制

Chinmaya Pradhan, Bhakti Mehta, Nirjharini Saha, Mrinal Srivastava, Anupam Gupta

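AI summary: Develops a lattice-based stochastic Monte Carlo framework that simulates whole-genome replication in Saccharomyces cerevisiae; shows that replication fork-speed heterogeneity gives rise to Erlang-distributed S-phase durations and anomalously prolonged replication events, and predicts non-monotonic thermal behavior and power-law scaling under hydroxyurea stress.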

详情
Comments
5 Figures, 13 pages
AI中文摘要

真核DNA复制必须在热、化学和基因毒性应激下保持稳健,尽管复制动力学存在大幅波动。本文开发了一个基于晶格的随机蒙特卡洛框架,用于酿酒酵母全基因组复制,达到单碱基对分辨率,结合了概率性起始点激活、复制叉速度分布以及一个控制细胞复制资源可用性的时间依赖性限制因子。该模型在应用于应激条件之前,通过实验复制谱进行定量基准测试,并仅使用两个有效参数再现了多种复制应激反应。重要的是,分析揭示了复制叉速度异质性导致了实验观察到的Erlang分布S期持续时间和罕见异常延长复制事件的出现,这些现象在大肠杆菌和人类细胞系中观察到,并预测了酿酒酵母中的类似行为。该框架进一步预测了非单调热行为、羟基脲应激下的幂律标度以及多种基因毒性条件下的总复制时间动态。

英文摘要

Eukaryotic DNA replication must remain robust under thermal, chemical, and genotoxic stress despite large fluctuations in replication dynamics. Here, we develop a lattice-based stochastic Monte Carlo framework for whole-genome replication in Saccharomyces cerevisiae at single base-pair resolution, incorporating probabilistic origin firing, replication fork-speed distributions, and a time-dependent limiting factor that governs the availability of cellular replication resources. The model is benchmarked quantitatively against experimental replication profiles before being applied to stress conditions, and reproduces diverse replication stress responses using only two effective parameters. Importantly, the analysis reveals that replication fork-speed heterogeneity underlies the emergence of Erlang-distributed S-phase durations and rare, anomalously prolonged replication events observed experimentally in Escherichia coli and human cell lines, while predicting similar behavior in S. cerevisiae. The framework further predicts non-monotonic thermal behavior, power-law scaling under hydroxyurea stress, and total replication-time dynamics under diverse genotoxic conditions.

2606.07949 2026-06-09 q-bio.PE cs.CV eess.IV 新提交

Feasibility to detect rapid change and disappearance of seagrass: Lessons from nearly 80 years of vegetation change in the Ako, Seto Inland Sea, Japan

检测海草快速变化和消失的可行性:来自日本濑户内海Ako近80年植被变化的教训

Takehisa Yamakita, Yoji Igarashi, Akira Eto, Ken Ishida, Masaaki Iiyama

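AI summary: Using nearly 80 years of aerial and satellite imagery with YOLO-based deep-learning segmentation, analyzes long-term seagrass dynamics on the Ako tidal flat, Japan; finds that Zostera marina almost completely disappeared within a single year in 2025, interprets this as a rapid ecosystem shift most plausibly driven by elevated summer water temperatures, and recommends improvements to seagrass monitoring indicators.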

详情
AI中文摘要

本研究分析了日本濑户内海的Ako潮滩,该地的大叶藻(Zostera marina)在2025年一年内几乎全部消失。利用1940年代以来的航拍照片、高分辨率卫星影像、GRUS图像(2.5-5米)以及每月Sentinel-2合成图像(10米),我们重建了约80年的海草分布。基于深度学习的YOLO分割在这些数据集上实现了高精度(总体精度≥0.9);尽管无法区分物种,但模型捕捉了植被面积的主要时间动态。长期平均海草面积为6.8公顷,但数值波动很大,从1974年的3.5公顷到1989年的41.3公顷,除2025年的0.2公顷外。2019年至2026年的Sentinel-2合成图像显示出明显的季节性,植被在初夏增加,秋季开始减少。然而,2025年夏季后面积急剧下降,并在2025-2026年整个冬季保持异常低值。我们的结果表明,2025年的事件并非正常波动,而是一次快速生态系统转变,涉及优势冠层物种的丧失,最可能的原因是区域夏季水温升高。这些发现对海草基本海洋变量(EOVs)和TNFD对齐的自然相关披露中使用的自然状态(SoN)指标也有影响。与森林不同,海草草甸需要更精细的时间分辨率,因为显著的季节性和突然崩溃都会强烈影响面积指标。因此,除了先前指出的物种级分类精度等问题外,我们建议:(1)基线应在最长的可用记录上定义并进行生态学论证;(2)在年际比较前应用季节性标准化;(3)将面积异常极端的年份标记出来,而非用作参考点。

英文摘要

This study analyses the Ako tidal flat in the Seto Inland Sea, Japan, where nearly all Zostera marina disappeared within a single year in 2025. Using aerial photographs from the 1940s onward, high-resolution satellite imagery, GRUS images (2.5-5 m), and monthly Sentinel-2 composites (10 m), we reconstructed approximately 80 years of seagrass distribution. YOLO-based segmentation using deep learning achieved high accuracy (overall accuracy >= 0.9) across these datasets; although species could not be discriminated, the models captured the major temporal dynamics in vegetation area. The long-term mean seagrass area was 6.8 ha, but values fluctuated widely, from 3.5 ha in 1974 to 41.3 ha in 1989, excluding the 0.2 ha recorded in 2025. Sentinel-2 composites from 2019 to 2026 revealed clear seasonality, with vegetation increasing in early summer and declining from autumn. In 2025, however, the area decreased sharply after summer and remained anomalously low throughout the winter of 2025-2026. Our results indicate that the 2025 event was not a normal fluctuation but a rapid ecosystem shift involving the loss of the dominant canopy-forming species, most plausibly driven by regionally elevated summer water temperatures. The findings also have implications for seagrass Essential Ocean Variables (EOVs) and the State of Nature (SoN) metrics used in TNFD-aligned nature-related disclosures. Unlike forests, seagrass meadows require finer temporal resolution because both pronounced seasonality and abrupt collapse strongly influence area-based indicators. Therefore, in addition to previously noted issues such as species-level classification accuracy, we recommend that (1) baselines be defined over the longest available record and justified ecologically, (2) seasonal standardization be applied before inter-annual comparisons, and (3) years with extreme area anomalies be flagged rather than used as reference points.

2606.07798 2026-06-09 cs.AI cs.LG q-bio.NC 新提交

Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings

在资源受限环境中利用常规数据重建和预测阿尔茨海默病患者的疾病轨迹

Ratnadeep Das, Atri Chatterjee, Sitikantha Roy

发表机构 * Yardi School of Artificial Intelligence (ScAI), Indian Institute of Technology Delhi(印度理工学院德里分校亚迪人工智能学院) Department of Neurology, Vardhman Mahavir Medical College and Safdarjung Hospital(瓦尔丹·马哈维尔医学院和萨夫达戎医院神经内科) Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系)

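AI summary: Proposes GNOVA, a variational autoencoder combining a GRU encoder and a Neural ODE decoder, which uses routine clinical data (no neuroimaging or biomarkers) for bidirectional prediction of cognitive scores, interpolation and extrapolation, and uncertainty estimation; reports mean absolute errors of 1.35 (CDR-SB) and 2.28 (MMSE) on the ADNI dataset.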

详情
AI中文摘要

阿尔茨海默病是一种进行性神经退行性疾病,其进展在不同患者间差异显著。现有工作旨在预测患者未来的认知状态,但很少关注从既往就诊中重建状态。此外,当前研究中,量化预测不确定性仍未被充分探索,且依赖于MRI、PET和CSF等昂贵模态,限制了在资源有限环境中的部署。在本研究中,我们的主要目标是:第一,从不规则就诊中双向预测认知评分,以呈现完整的疾病轨迹;第二,实现插值和外推能力,以辅助临床医生做出知情预后决策;第三,为所有预测提供校准良好的不确定性估计;最后,利用常规就诊中可用的模态实现上述目标。我们提出了一个统一框架GNOVA:GRU-神经ODE变分自编码器。该架构在变分自编码器框架内结合了门控循环单元编码器和神经ODE解码器。在我们的工作中,我们预测了CDR-SB和MMSE评分。GRU编码器允许在任何时间点输入任意数量的数据。神经ODE解码器执行连续估计,允许在任何期望的时间点进行插值和外推。变分自编码器允许预测中的不确定性估计。我们使用了ADNI数据集中1727名患者超过10年的数据;该模型在无需任何神经影像或生物标志物数据的情况下,对CDR-SB和MMSE评分分别实现了1.35和2.28的平均绝对误差。特征消融研究表明,年龄、BMI和APOE4状态是强预测因子。所提出的框架能够重建不完整的患者病史并预测未来的认知状态。

英文摘要

Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work aims to forecast patients' future cognitive state, with minimal focus on reconstructing the state from past visits. Furthermore, quantifying predictive uncertainty remains underexplored in current research, and existing approaches rely on costly modalities such as MRI, PET, and CSF, limiting deployment in resource-limited settings. In this research, our primary objectives are: first, to predict cognitive scores bidirectionally from irregular visits so as to present the complete disease trajectory; second, to enable interpolation and extrapolation to assist clinicians in informed prognostic decision making; third, to provide well-calibrated uncertainty estimates for all predictions; and finally, to achieve these objectives using only the modalities available during routine visits. We propose a unified framework, GNOVA: a GRU-Neural ODE Variational Autoencoder. The architecture combines a Gated Recurrent Unit encoder and a Neural ODE decoder within a variational autoencoder framework. In our work, we forecast the CDR-SB and MMSE scores. The GRU encoder accepts any number of inputs at any time points. The Neural ODE decoder performs continuous estimation, allowing interpolation and extrapolation at any desired time point. The variational autoencoder formulation allows for uncertainty estimation in predictions. We worked with 1,727 patients from the ADNI dataset over 10 years; the model achieved mean absolute errors of 1.35 and 2.28 for CDR-SB and MMSE scores, respectively, without requiring any neuroimaging or biomarker data. Feature-ablation studies revealed that age, BMI, and APOE4 status were strong predictors. The proposed framework enables the reconstruction of incomplete patient histories and the anticipation of future cognitive states.

2606.07676 2026-06-09 q-bio.GN cs.AI 新提交

Single-Cell Cross-Modal Transfer by Adversarial Fine-Tuning of Foundation Models

通过基础模型的对抗微调实现单细胞跨模态迁移

Joseph Boyd, Matthew Lyon, Martino Mansoldo, Christian Hurry, Finnian Firth

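AI summary: Proposes adversarial fine-tuning of a single-cell foundation model for cross-modal translation between unpaired spatial transcriptomics and single-cell RNA sequencing data; the method performs favourably against approaches built for multi-omics translation.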

详情
AI中文摘要

空间转录组学(ST)是探索组织中依赖于结构、邻近性和相互作用的生物学特性的强大工具。支撑ST的方法正在快速发展,但在亚细胞尺度上分析数千个基因的能力有限。尽管从组织中解离,但已知单细胞RNA测序(scRNA-seq)中细胞的全转录组读数保留了其先前原位邻域的信息,这激发了恢复该信息的计算方法。虽然配对的ST和scRNA-seq数据集稀缺,但每种模态本身都很丰富。因此,我们提出在未配对的ST和scRNA-seq数据之间进行跨模态翻译。在这项工作中,我们展示了单细胞基础模型可以通过对抗微调执行这种翻译。我们证明了我们的方法优于为多组学翻译构建的方法。

英文摘要

Spatial transcriptomics (ST) is a powerful tool for exploring biological properties dependent on structure, proximity, and interaction in tissue. The methods underpinning ST are developing rapidly but are limited in their ability to profile many thousands of genes at a subcellular scale. Although dissociated from tissue, it is known that the whole-transcriptome readouts of cells in single-cell RNA sequencing (scRNA-seq) retain information about their former in situ neighbourhoods, motivating computational methods to recover it. While paired ST and scRNA-seq datasets are scarce, each modality in its own right is abundantly available. We therefore propose to perform cross-modal translation between unpaired ST and scRNA-seq data. In this work we show that a single-cell foundation model can perform this translation via adversarial fine-tuning. We demonstrate that our method performs favourably against methods built for multi-omics translation.

2606.07674 2026-06-09 cs.CV q-bio.NC 新提交

Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model

同时性多动症表型分析:基于常规视频、无标记姿态估计和表格基础模型的跨队列儿科迁移研究

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar, Mayté Castro Jiménez, Cécile A. Hubsch, Sophie Huby, Morgan Dornadic, Gun-Marie Hariz, Eduardo M. Moraud, Jocelyne Bloch, Gabriella A. Horvath, Xavier Vasques

发表机构 * Lausanne University Hospital (CHUV)(洛桑大学医院) University of Lausanne (UNIL)(洛桑大学) Institut du Neurone(神经元研究所) Clinique Beau Soleil(博索莱伊诊所) Institut Mutualiste Montpelliérain(蒙彼利埃互助研究所) Military University Hospital of Sfax(斯法克斯军事大学医院) University of Edinburgh(爱丁堡大学) Hospital Sant Joan de Déu(圣琼德迪乌医院) European Reference Network for Rare Neurological Diseases (ERN-RND)(欧洲罕见神经系统疾病参考网络) Instituto de Salud Carlos III(卡洛斯三世健康研究所) CHU Montpellier(蒙彼利埃大学医院) Umeå University(于默奥大学) University Hospital Lausanne(洛桑大学医院) Ecole Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) British Columbia Children’s Hospital(不列颠哥伦比亚儿童医院)

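AI summary: Proposes a video-based framework combining markerless pose estimation, kinematic descriptors, and a pretrained foundation model; trained on adult data and transferred to a pediatric cohort, it detects multiple hyperkinetic movement disorder phenomenologies simultaneously after lightweight calibration.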

详情
AI中文摘要

目的:开发并外部测试一个基于视频的框架,用于同时检测多动症运动障碍现象:肌张力障碍、震颤、肌阵挛、舞蹈症、手足徐动症、投掷症、刻板动作和抽动,使用常规临床记录,并明确测试从成人到儿科人群的外部跨队列迁移。方法:在这项概念验证研究中,该框架结合了无标记姿态估计、运动学描述符和预训练基础模型。在21名确诊多动症的成人和4名健康对照(按标准化方案评估)上开发了共享预测骨干。外部验证在一个独立的外部队列上进行:一个真实世界的儿科样本(n=12,单基因联合多动症)。对于外部数据集,骨干网络未经重新训练直接部署;轻量校准仅调整最终受试者级别的决策步骤,使用由临床医生选择的小标记子集(代表队列表型范围)。结果:在临床医生选择的子集上对决策层进行本地校准后,在保留的儿科患者(n=7)上性能持续提升:汉明准确率从0.804提高到0.839,Jaccard指数从0.548提高到0.633。当评估限制在临床医生一致性更高的现象时,校准后的性能得以保持,Jaccard指数进一步提高(汉明准确率0.9,Jaccard指数0.786),表明增益并非依赖于最不可靠的标签。

英文摘要

Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic movement disorder (MD) phenomenologies (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics) using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations. Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort's phenotypic range. Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.
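For readers unfamiliar with the two multilabel metrics reported above, the sketch below shows one common way to compute them. The abstract does not specify the averaging scheme, so the per-subject averaging of the Jaccard index, the convention for empty label sets, and the toy label matrices are assumptions.

```python
def hamming_accuracy(y_true, y_pred):
    """Fraction of individual (subject, label) decisions that agree."""
    total = sum(len(row) for row in y_true)
    agree = sum(
        t == p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return agree / total


def jaccard_index(y_true, y_pred):
    """Mean over subjects of |true ∩ predicted| / |true ∪ predicted|."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(row_t, row_p) if t and p)
        union = sum(1 for t, p in zip(row_t, row_p) if t or p)
        # Assumed convention: two empty label sets count as a perfect match
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Toy data: 2 subjects, 4 phenomenologies each (1 = present); made-up values
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0]]

ham = hamming_accuracy(y_true, y_pred)  # 6 of 8 decisions agree
jac = jaccard_index(y_true, y_pred)     # each subject scores 1/2
```

The Jaccard index ignores labels that are absent in both the reference and the prediction, which is why it tends to be lower than Hamming accuracy when most phenomenologies are absent in most subjects.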

2606.07607 2026-06-09 cs.LG q-bio.GN 新提交

Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods

立场:基因组模型研究必须超越可解释性方法的轶事评估

Shasha Zhou, Mingyu Huang, Ke Li

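AI summary: Through a transcription factor binding benchmark, shows that different interpretability methods often yield contradictory explanations, fail to localize known regulatory motifs, and do not faithfully reflect model decisions; argues for a systematic validation framework analogous to clinical trials.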

详情
AI中文摘要

机器学习和计算能力的进步释放了人类基因组的预测潜力,但生物学家现在要求这些模型也能阐明潜在的生物学机制。尽管可解释机器学习(IML)技术已被越来越多地用于弥合这一差距,但普遍存在对轶事验证的依赖:绝大多数研究仅依赖单一IML方法,并仅报告孤立的成功实例。通过对转录因子结合的基准测试,我们展示了当前实践的风险。我们表明,不同的IML方法通常可能(1)对同一预测产生矛盾的解释,(2)无法定位已知的调控基序,以及(3)未能忠实反映模型的内部决策过程。鉴于此,我们主张建立一个类似于临床试验的验证框架:正如试验需要严格的设计和不良事件报告,基因组可解释性必须超越挑选的合理性,转向对一致性、忠实性和生物学有效性的系统评估。为促进这一点,我们提出了一个分层框架,以指导基因组IML方法的严格评估和报告。

英文摘要

Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable machine learning (IML) techniques have been increasingly applied to bridge this gap, there has been a pervasive reliance on anecdotal validation: the vast majority of research relies on a single IML method and reports only isolated successful instances. Through a benchmarking study on transcription factor binding, we demonstrate the risks of current practices. We show that different IML methods can often (1) yield contradictory explanations for the same predictions, (2) fail to localize known regulatory motifs, and (3) fail to faithfully reflect the model's internal decision process. In light of this, we argue for a validation framework analogous to clinical trials: just as trials require rigorous design and adverse-event reporting, genomic interpretability must move beyond cherry-picked plausibility toward systematic assessment of consistency, faithfulness, and biological validity. To facilitate this, we propose a tiered framework to guide rigorous evaluation and reporting of genomic IML methods.

2606.07567 2026-06-09 q-bio.BM cs.AI cs.CE 新提交

SurfDesign: Effective Protein Design on Molecular Surfaces

SurfDesign:基于分子表面的高效蛋白质设计

Fang Wu, Shuting Jin, Xiangru Tang, Mark Gerstein, Xiangxiang Zeng, Yejin Choi, Jure Leskovec, Jinbo Xu

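AI summary: Proposes SurfDesign, a framework that models molecular surfaces as continuous geometric manifolds and integrates pretrained protein language models; surface-based equivariant message passing captures geometric features, and the method outperforms prior approaches on de novo binder and enzyme design benchmarks.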

详情
Journal ref
KDD 2026 AI4Science
AI中文摘要

蛋白质功能很大程度上由分子表面几何和物理化学互补性决定,然而大多数蛋白质设计方法仅以主链结构为条件。我们引入了SurfDesign,一个表面条件蛋白质设计框架,将分子表面建模为连续几何流形,并将其与预训练蛋白质语言模型集成。SurfDesign采用基于表面的等变消息传递来捕捉表面法线、曲率和方向几何,同时采用参数高效的微调策略。专注于功能性蛋白质设计,我们表明SurfDesign在从头设计结合子和酶设计基准上始终优于先前的表面条件和仅主链方法。我们还报告了在逆折叠基准上的强劲性能,作为结构兼容性的诊断。我们的结果强调了流形感知表面表示作为功能性蛋白质和酶设计的原理基础。代码可在https://github.com/smiles724/SurfDesign获取。

英文摘要

Protein function is largely determined by molecular surface geometry and physicochemical complementarity, yet most protein design methods condition only on backbone structure. We introduce SurfDesign, a surface-conditioned protein design framework that models molecular surfaces as continuous geometric manifolds and integrates them with pretrained protein language models. SurfDesign employs surface-based equivariant message passing to capture surface normals, curvature, and directional geometry, together with a parameter-efficient fine-tuning strategy. Focusing on functional protein design, we show that SurfDesign consistently outperforms prior surface-conditioned and backbone-only methods on de novo binder and enzyme design benchmarks. We also report strong performance on inverse-folding benchmarks as a diagnostic of structural compatibility. Our results highlight manifold-aware surface representations as a principled foundation for functional protein and enzyme design. Code is available at https://github.com/smiles724/SurfDesign.

2606.07562 2026-06-09 q-bio.BM cs.AI 新提交

The Montparnasse Algorithm for RNA Design

RNA设计的蒙帕纳斯算法

Tristan Cazenave

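AI summary: Proposes Montparnasse, a Monte Carlo search framework based on Generalized Nested Rollout Policy Adaptation with a problem-specific prior and lexicographic multicriteria evaluation; it reaches full coverage of the Eterna100 V1 benchmark more than three times faster than DesiRNA, the previous state of the art, and for hemoglobin alpha mRNA it finds sequences with more paired bases than the MFE-optimal solution of LinearDesign.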

详情
AI中文摘要

RNA设计包括发现一个优化预定义标准(如二级结构)的核苷酸序列。它对合成生物学、医学和纳米技术很有用。我们提出了Montparnasse,一个基于广义嵌套滚动策略适应的蒙特卡洛搜索框架,并增加了问题特定的先验、第1级的慢速和长期适应,以及字典序多准则评估。Montparnasse在所有时间限制下一致地比现有最优方法DesiRNA更快地解决了Eterna100 V1基准的所有100个谜题,总体达到完全覆盖的速度快三倍以上。在血红蛋白α的信使RNA二级结构优化中,它识别出的序列比LinearDesign的MFE最优解具有更多的配对碱基。

英文摘要

RNA design consists of discovering a nucleotide sequence that optimizes predefined criteria, such as secondary structure. It is useful for synthetic biology, medicine, and nanotechnology. We propose Montparnasse, a Monte Carlo search framework based on Generalized Nested Rollout Policy Adaptation, augmented with a problem-specific prior, slow and long adaptation at level 1, and a lexicographic multicriteria evaluation. Montparnasse solves all 100 puzzles of the Eterna100 V1 benchmark consistently faster than DesiRNA, the previous state of the art, across all time limits, reaching full coverage more than three times faster overall. On messenger RNA secondary structure optimization for hemoglobin alpha, it identifies sequences with more paired bases than the MFE-optimal solution of LinearDesign.

2606.00568 2026-06-09 cs.LG q-bio.GN 版本更新

On the Recoverability of Causal Relations from Bulk Gene Expression Data

从批量基因表达数据中恢复因果关系的可能性

Gongxu Luo, Boyang Sun, Kun Zhang

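AI summary: Formalizes consistency under aggregation and derives necessary and sufficient conditions for recovering causal relations from bulk gene expression data; recoverability holds only for linear aggregation combined with affine structural equations, while empirical data deviate from linearity.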

详情
AI中文摘要

批量基因表达谱分析将生物样本中所有细胞的RNA混合后测量,在单细胞时代仍然重要,因为它通常比单细胞检测噪声更低、灵敏度更高且成本效益更好。因此,越来越多的计算方法试图从批量表达数据中恢复基因间的因果关系。然而,聚合是对底层细胞系统的有损、不可逆的粗化,目前尚不清楚是否以及在何种条件下可以从聚合的批量基因表达数据中恢复因果关系。为了回答这个问题,我们通过两种一致性概念(函数形式一致性和条件独立性一致性)形式化了聚合下的可恢复性。然后,我们推导了可恢复性的必要和充分条件,表明这些性质仅在线性聚合(如求和/均值)与仿射结构方程结合时得以保持。为了评估这些条件的实际可行性,对四个批量基因表达数据集和四个单细胞基因表达数据集的分析进一步揭示,两种数据类型中估计的基因间成对调控函数均偏离线性,为可恢复性所需的线性假设提供了有限的经验支持。总之,这些结果告诫我们,在没有强额外假设的情况下,不应从聚合的批量表达数据中恢复因果关系。

英文摘要

Bulk gene expression profiling, which aggregates pooled RNA across cells within a biological sample, remains important in the single-cell era because it is typically less noisy, more sensitive, and more cost-effective than single-cell assays. Accordingly, a growing body of computational methods seeks to recover causal relations among genes from bulk expression data. However, aggregation is a lossy, non-invertible coarsening of the underlying cellular system, and it remains unclear whether and under what conditions causal relations are recoverable from aggregated bulk gene expression data. To answer this, we formalize recoverability under aggregation through two notions of consistency: functional-form consistency and conditional-independence consistency. We then derive necessary and sufficient conditions for recoverability, showing that these properties are preserved only under linear aggregations (e.g., sum/mean) coupled with affine structural equations. To assess the practical plausibility of these conditions, analyses of four bulk and four single-cell gene expression datasets further reveal that the estimated pairwise regulatory functions among genes deviate from linearity in both data types, providing limited empirical support for the linearity assumptions required for recoverability. Together, these results caution against recovering causal relations from aggregated bulk expression data without strong additional assumptions.
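A two-cell toy example illustrates why an affine per-cell relation survives summation while a nonlinear one does not. This is only an intuition aid for the functional-form side of the argument, not the paper's formal conditions; the coefficients, the squared relation, and the sample values are arbitrary assumptions.

```python
import numpy as np


def bulk(cells: np.ndarray) -> float:
    """Linear aggregation: sum expression over the cells of one sample."""
    return float(cells.sum())


def affine(x: np.ndarray) -> np.ndarray:
    # Hypothetical per-cell affine structural equation
    return 3.0 * x + 1.0


def square(x: np.ndarray) -> np.ndarray:
    # Hypothetical per-cell nonlinear structural equation
    return x ** 2


# Two samples with different per-cell regulator levels but equal bulk level
sample_a = np.array([1.0, 1.0])
sample_b = np.array([0.0, 2.0])

regulator_bulk = (bulk(sample_a), bulk(sample_b))
affine_targets = (bulk(affine(sample_a)), bulk(affine(sample_b)))
square_targets = (bulk(square(sample_a)), bulk(square(sample_b)))
```

Both samples have the same bulk regulator level, and under the affine relation they also have the same bulk target level, so the bulk target is still a function of the bulk regulator. Under the squared relation the bulk targets differ, so no function of the bulk regulator alone can reproduce them.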

2606.00196 2026-06-09 q-bio.PE physics.soc-ph 版本更新

Evolution of cooperation in the multiplex

多层网络中的合作演化

Zijie Chen, Xingru Chen, Feng Fu

AI总结 基于多层网络中的多表型同质性,推导了自然选择促进合作的分析条件,揭示了表型多样性通过划分同配生态位促进合作,并发现囚徒困境的强度改变合作对策略突变的依赖性。

详情
Comments
53 pages, 23 figures (including Supplementary Information)
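AI-generated abstract (translated from Chinese)

In biological and social systems, cooperation often depends on phenotypic cues rather than random encounters. To account for real-world interactions that unfold along several dimensions at once, we develop a general framework for the evolution of cooperation in multiplex networks governed by multi-phenotype homophily. We derive analytical conditions under which natural selection favors cooperation, for phenotypic traits that are independent or epistatic and under different modes of mutation coupling. Although fitness is integrated across layers, the conditions for cooperation resolve into layer-specific $\sigma$-rules that depend only on the local payoff structure, the effective number of phenotypes, and the mutation rates. We show that phenotypic diversity promotes cooperation by partitioning the population into assortative niches. Moreover, in finite populations, intensifying the prisoner's dilemma shifts the dependence of cooperation on strategy mutation from monotonically decreasing, through U-shaped, to monotonically increasing. Our work offers a unified account of how multi-phenotype homophily underpins the evolutionary dynamics of cooperation in heterogeneous populations.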

English abstract

Across biological and social systems, cooperation often depends on phenotypic cues rather than random encounters. To account for real-world interactions unfolding across multiple, simultaneous dimensions, here we develop a general framework for the evolution of cooperation in multiplex networks governed by multi-phenotype homophily. We derive analytical conditions for natural selection to favor cooperation across phenotypic traits that are independent or exhibit epistasis and under different modes of mutation coupling. Despite the integration of fitness across layers, the conditions for cooperation resolve into layer-specific $σ$-rules, depending only on the local payoff structure, the effective number of phenotypes, and the mutation rates. We show that phenotypic diversity fosters cooperation by partitioning populations into assortative niches. Furthermore, in finite populations, intensifying the prisoner's dilemma shifts the dependence of cooperation on strategy mutation from monotonically decreasing, through U-shaped, to monotonically increasing. Our work provides a unified account of how multi-phenotype homophily underpins the evolutionary dynamics of cooperation in heterogeneous populations.
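For readers unfamiliar with σ-rules, the sketch below shows the generic single-parameter form known from the structured-population literature: with payoffs R (mutual cooperation), S (sucker's payoff), T (temptation) and P (mutual defection), cooperation is favored when σR + S > T + σP. The abstract does not give the layer-specific expressions for σ, so here σ is a free input, and whether the paper's rules take exactly this form is an assumption. The function names are hypothetical.

```python
def cooperation_favored(sigma, R, S, T, P):
    """Generic sigma-rule: cooperation is favored if sigma*R + S > T + sigma*P."""
    return sigma * R + S > T + sigma * P

def donation_game(b, c):
    """Payoffs of the donation game: pay cost c to give the partner benefit b."""
    return {"R": b - c, "S": -c, "T": b, "P": 0.0}

def critical_benefit_to_cost(sigma):
    """Threshold on b/c implied by the sigma-rule for the donation game.

    Substituting the donation-game payoffs gives b/c > (sigma + 1)/(sigma - 1),
    which can only be satisfied when sigma > 1.
    """
    if sigma <= 1.0:
        return float("inf")
    return (sigma + 1.0) / (sigma - 1.0)

# Illustrative value only; in the paper sigma would depend on the effective
# number of phenotypes and the mutation rates of the layer in question.
sigma = 3.0
print(critical_benefit_to_cost(sigma))
print(cooperation_favored(sigma, **donation_game(b=3.0, c=1.0)))
print(cooperation_favored(sigma, **donation_game(b=1.5, c=1.0)))
```

A larger σ weights the diagonal payoffs more strongly, which lowers the benefit-to-cost ratio that cooperation needs; at σ = 1 the donation game never favors cooperation.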

2605.31498 2026-06-09 cs.LG q-bio.BM updated version

Scalable Inference-Time Annealing with Surrogate Likelihood Estimators

Scalable inference-time annealing with surrogate likelihood estimators

Daniel Peñaherrera, Rishal Aggarwal, David Ryan Koes

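AI summary Proposes scalable inference-time annealing (SITA), which uses an energy-based model to provide fast surrogate likelihoods and avoid expensive divergence computations, achieving state-of-the-art performance on alanine dipeptide and alanine tripeptide.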

Details
Comments
26 pages, 5 figures, submitted to JMLR 2026
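AI-generated abstract (translated from Chinese)

A long-standing challenge in computational chemistry and biophysics is efficiently sampling the Boltzmann distribution of molecules. Advances in generative modeling have been proposed to address the limitations of traditional sampling techniques by eliminating the computational cost of simulation. One promising direction is to iteratively fine-tune diffusion models along a temperature ladder, with training data generated by importance sampling during inference-time annealing. Unfortunately, these methods must compute a divergence over the score field to estimate the importance weights, which makes them intractable for larger systems. Here we present scalable inference-time annealing (SITA), which retrains flow-based models to generate samples at progressively lower temperatures, using an energy-based model to provide fast surrogate likelihoods. We demonstrate state-of-the-art performance on alanine dipeptide and alanine tripeptide while avoiding the costly divergence term. Our code is available at https://github.com/countrsignal/sita.git.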

English abstract

A long-standing challenge in computational chemistry and biophysics is efficiently sampling the Boltzmann distribution of molecules. Advances in generative modeling have been proposed to address the limitations of conventional sampling techniques by eliminating the computational cost of simulation. A promising direction is iteratively fine-tuning diffusion models along a temperature ladder, whereby training data is generated via importance sampling during inference-time annealing. Unfortunately, these methods require computing a divergence over the score field to estimate importance weights, rendering them intractable for larger systems. Here we present scalable inference-time annealing (SITA), which retrains flow-based models to generate samples at progressively lower temperatures, using an energy-based model to facilitate fast surrogate likelihoods. We demonstrate state-of-the-art performance on both alanine dipeptide and alanine tripeptide while avoiding costly divergence terms. Our code is available at https://github.com/countrsignal/sita.git.
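The importance-sampling step along a temperature ladder can be sketched on a toy problem. The code below is not SITA and does not use the authors' repository: it assumes a one-dimensional harmonic potential whose Boltzmann samples can be drawn exactly, so the density ratio between two inverse temperatures is known in closed form. In the setting the abstract describes, samples come from a generative model whose likelihood must be estimated, which is where the divergence term or a surrogate likelihood would enter. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def reweight(energies, beta_from, beta_to):
    """Self-normalized importance weights from beta_from to beta_to.

    For exact Boltzmann samples at beta_from, the unnormalized weight of a
    configuration with energy U is exp(-(beta_to - beta_from) * U).
    """
    log_w = -(beta_to - beta_from) * energies
    log_w = log_w - log_w.max()  # stabilize the exponential
    w = np.exp(log_w)
    return w / w.sum()

def effective_sample_size(weights):
    """Kish effective sample size of normalized weights."""
    return 1.0 / np.sum(weights ** 2)

rng = np.random.default_rng(0)
k = 1.0                          # spring constant of the toy potential U = k x^2 / 2
beta_hot, beta_cold = 1.0, 2.0   # one rung of a hypothetical temperature ladder

# Exact samples at the higher temperature: Gaussian with variance 1/(beta*k).
x = rng.normal(0.0, np.sqrt(1.0 / (beta_hot * k)), size=50_000)
u = 0.5 * k * x ** 2

w = reweight(u, beta_hot, beta_cold)
var_cold = np.sum(w * x ** 2)    # estimates the variance at beta_cold, 1/(beta_cold*k)
ess = effective_sample_size(w)

print(f"reweighted variance: {var_cold:.3f}")
print(f"effective sample size: {ess:.0f} of {x.size}")
```

The effective sample size shrinks as the temperature gap grows, which is one reason such schemes step down a ladder of nearby temperatures instead of reweighting in a single jump.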