arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28818 2026-05-28 cs.CL q-bio.NC

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

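AI summary: Comparing LLMs and VLMs under a strictly text-only setting, the study finds that multimodal pretraining brings no global advantage in human alignment during natural reading, although VLMs show a selective advantage on sentences with strong visual semantic content.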

Comments
17 pages, 10 figures
Abstract

Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.

2605.28739 2026-05-28 cs.LG cs.AI cs.NE q-bio.QM

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Tirtharaj Dash

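AI summary: Proposes BIRDNet, which mines Boolean implication relationships between features and encodes them as a sparse, interpretable neural network, substantially reducing parameters while maintaining high accuracy and recovering known biological signatures in transcriptomic and proteomic data.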

Comments
5 pages; 1 figure, 4 tables
Abstract

Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most $2/d$ of the weights in each BIR layer are active, where $d$ is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to $96\times$ fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: https://github.com/MAHI-Group/BIRDNet.
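
The $2/d$ sparsity bound follows directly from each rule unit binding to exactly two of the $d$ input features. The sketch below illustrates that counting argument only; the function names and the rule list are hypothetical and are not taken from the BIRDNet code base.

```python
def bir_layer_mask(rules, d):
    """Binary connectivity mask: one row per mined rule, one column per feature.

    Each rule is a pair (a, b) of distinct feature indices (illustrative format).
    """
    mask = []
    for a, b in rules:
        row = [0] * d
        row[a] = 1  # first feature of the implication
        row[b] = 1  # second feature of the implication
        mask.append(row)
    return mask


def active_fraction(mask):
    """Share of weights in the layer that are allowed to be non-zero."""
    total = sum(len(row) for row in mask)
    active = sum(sum(row) for row in mask)
    return active / total


# Hypothetical mined rules over d = 8 features.
rules = [(0, 3), (1, 3), (2, 7), (4, 5)]
mask = bir_layer_mask(rules, d=8)
fraction = active_fraction(mask)  # 2 active weights per row of 8, i.e. 2/d
```

Because every row has two active entries out of $d$, the fraction is independent of how many rules are mined.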

2605.28693 2026-05-28 q-bio.NC cs.AI

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Joséphine Raugel, Maximilian Seitzer, Marc Szafraniec, Huy V. Vo, Jérémy Rapin, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean-Rémi King

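AI summary: Using fMRI and MEG recordings of human brain responses to natural images, the study finds that backpropagated gradients of pretrained models predict higher-level visual cortex and late signals, but their spatiotemporal organization does not match the brain's hierarchy, suggesting that deep networks and the brain may rely on different learning mechanisms.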

Comments
13 pages, 9 figures
Abstract

Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward activations to map backpropagated gradients onto neural data. Focusing on a recent self-supervised vision model (DINOv3) and reproducing results on eight vision models, we find that backpropagated gradients can reliably predict both fMRI and MEG signals, specifically in higher-level visual cortex and for later latencies. However, the spatial and temporal organization of these backpropagated gradients in the brain diverges from the patterns expected under a biologically plausible backpropagation mechanism: specifically, both the order in which gradients are computed and their spatial organization diverge from the temporal and spatial hierarchies of the human brain. Together, these results suggest that, although deep networks and the brain may share similar representational content, they likely rely on fundamentally different mechanisms to learn those representations.

2605.28652 2026-05-28 q-bio.QM nlin.CD q-bio.PE q-bio.SC

Widespread quasi-steady state assumption in biological interaction modeling mischaracterizes system transitions

Pan-Jun Kim

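AI summary: Derives a theoretical framework that accounts for the relaxation processes ignored by the quasi-steady state approximation, showing that near transition points the QSSA misestimates transition durations and oscillation onsets, while relaxation dynamics produce counterintuitive time-delay effects through feedback interactions.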

Comments
Main manuscript and supplementary information provided
Abstract

From molecular and cellular to ecological systems, the modeling of biological processes often rests on the assumption that fast components immediately reach equilibrium at each moment (quasi-steady state) and only slow components govern the relevant system dynamics. This quasi-steady state approximation (QSSA) simplifies the modeling but discards the effects of the relaxation towards each quasi-steady state. The QSSA's suitability around the transition point, a specific condition where the system changes to a qualitatively different state, remains unclear. To address this, we derive a theoretical framework for the near-transition dynamics of biological systems, explicitly considering the relaxation processes overlooked by the QSSA. Numerical simulations verify our predictions for cellular decision-making, metabolic oscillations, and ecological cycles. Despite the extreme slowdown near the transition point, the QSSA alone misestimates the duration of the transition from one state to another. Moreover, the QSSA erroneously predicts the transition point itself for the onset of oscillations, while the relaxation dynamics facilitates or suppresses the oscillation onset with a counterintuitive time-delay effect. Common feedback interactions between biological components are pivotal to those relaxation effects. Our study provides an analytical foundation to understand the rich transient or rhythmic dynamics of interacting biological components near the transitions.
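
For orientation, the approximation discussed above can be written in a generic textbook fast-slow form. The symbols $x$ (fast component), $y$ (slow component), $f$, $g$, and $\varepsilon$ are illustrative assumptions and are not taken from the paper's own framework.

```latex
\begin{align}
\varepsilon \frac{dx}{dt} &= f(x, y), &
\frac{dy}{dt} &= g(x, y), &
0 < \varepsilon &\ll 1, \\
\text{QSSA:}\quad 0 &= f\bigl(x^{*}(y), y\bigr), &
\frac{dy}{dt} &= g\bigl(x^{*}(y), y\bigr).
\end{align}
```

The second line replaces the fast variable by its quasi-steady value $x^{*}(y)$ and so drops the transient during which $x$ relaxes towards $x^{*}(y)$. The abstract's claims concern regimes near a transition where that dropped relaxation is no longer negligible.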

2605.28651 2026-05-28 cond-mat.soft physics.bio-ph q-bio.MN

Determinants of Phase-Separation Propensities, Material States, and Material Properties of Biomolecular Condensates

Huan-Xiang Zhou

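AI summary: Through theoretical frameworks together with experimental and computational studies, explains the phase-separation propensities (the relation between threshold concentration and excess chemical potential), material states (the formation mechanisms of liquid droplets, amorphous dense liquids, reversible aggregates, and gels), and material properties (stress relaxation time governing viscoelastic behavior) of biomolecular condensates.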

Comments
58 pages, 4 figures
Abstract

Phase separation of various materials has been studied for one and a half centuries. In the last two decades, phase separation of proteins and nucleic acids has received enormous attention, due to its relevance to cellular functions. However, many of the observations on the resulting biomolecular condensates lack a theoretical underpinning. The first goal of this Account is to put forward theoretical frameworks for the phase-separation propensities, material states, and material properties of biomolecular condensates. Using these frameworks, I rationalize mechanistic interpretations from our recent experimental and computational studies, and synthesize these studies with prior literature to draw new conclusions. For phase-separation propensities, I relate the threshold (or saturation) concentration to the excess chemical potential in the dense phase, which in turn depends on intermolecular interaction strength and valency. For material states, I posit that liquid droplets form via complete phase separation, whereas amorphous dense liquids, reversible aggregates, and gels arise from premature termination of spinodal decomposition, due to overly weak or overly strong interactions or directional interactions. In particular, gels and aggregates are different forms of dynamically arrested states, with gels driven by tip growth via directional interactions, whereas aggregates are driven by monomer addition at interior sites to maximize valency. For material properties, I highlight the crucial role of the stress relaxation time, which is determined by the mean lifetime of intermolecular bonds in a condensate. This relaxation time dictates how the condensate manifests viscoelasticity, including shear thickening and shear thinning, and accounts for the wide variation in zero-shear viscosity among different condensates.

2605.28545 2026-05-28 q-bio.PE

PhyloFrame: A DataFrame-based Library for Fast, Flexible Phylogenetic Computation

Matthew Andres Moreno, Jeet Sukumaran, Luis Zaman, Emily Dolson

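AI summary: Presents PhyloFrame, a DataFrame-based Python library that uses array-backed storage and JIT compilation for efficient computation on large trees (≥ 300,000 taxa), with performance comparable to native-code libraries.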

Abstract

PhyloFrame is a Python library for phylogenetic computation targeting the gap between specialist, compiler-optimized operations and flexible, script-based workflows -- with emphasis on fast, memory-efficient operations for very large tree sizes (e.g., $\geq$ 300,000 taxa). PhyloFrame is built around a DataFrame-based tree representation, where each row corresponds to a node and columns record ancestor relationships, branch lengths, taxon labels, and any user-defined attributes. Crucial for scalability, such array-backed storage allows both library and end-user code alike to seamlessly harness Just-in-Time (JIT) compilation (e.g., Numba) and vectorized execution (e.g., NumPy, Polars). At large tree sizes, performance generally matches or exceeds Python libraries backed by native code -- notably, achieving strong performance in topological-order traversals and Newick I/O. DataFrame-based representation affords several additional conveniences, including:
- succinct bulk operations (e.g., NumPy);
- powerful queries and transformations (e.g., Polars expressions, Pandas indexing, SQL-style joins and merges);
- compatibility with modern tabular data formats that are compression-friendly, type-aware, nullable, and highly portable (e.g., Parquet); and
- broad interoperation with table-oriented data science tools (e.g., Seaborn, Plotly, Vega-Altair, tidyverse, Excel).
Current library features include tree input/output, synthetic tree generation, taxon-based queries, tree traversals, tree metrics, tree manipulation, tree downsampling, and tree comparison. Most functionality supports both Pandas and Polars DataFrames, and is available through programmatic and CLI-based interfaces.
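
To make the row-per-node idea concrete, here is a minimal sketch of a column-oriented tree table and a single-pass traversal over it. It uses plain Python lists rather than Pandas or Polars, and the column names and the "root lists itself as ancestor" convention are assumptions made for illustration, not PhyloFrame's documented schema or API.

```python
# Column-oriented tree table: entry i of every column describes node i.
# Assumed convention (hypothetical): the root lists itself as its own ancestor.
tree = {
    "id":            [0, 1, 2, 3, 4],
    "ancestor_id":   [0, 0, 0, 1, 1],
    "branch_length": [0.0, 1.0, 2.5, 0.5, 1.5],
    "taxon_label":   [None, None, "C", "A", "B"],
}


def root_distances(ancestor_id, branch_length):
    """Root-to-node distances in one forward pass.

    Valid only when rows are topologically ordered, i.e. every
    ancestor row appears before its descendants.
    """
    dist = [0.0] * len(ancestor_id)
    for node, parent in enumerate(ancestor_id):
        if parent != node:  # the root keeps distance 0
            dist[node] = dist[parent] + branch_length[node]
    return dist


def leaf_ids(ancestor_id):
    """Nodes that never appear as another node's ancestor."""
    internal = {p for n, p in enumerate(ancestor_id) if p != n}
    return [n for n in range(len(ancestor_id)) if n not in internal]


dist = root_distances(tree["ancestor_id"], tree["branch_length"])
leaves = leaf_ids(tree["ancestor_id"])
```

A flat loop over integer arrays like this is the kind of code that JIT compilers and vectorized libraries handle well, which is the scalability argument the abstract makes.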

2605.28200 2026-05-28 cs.LG q-bio.GN

Geometry-First Generative Spatial Single-Cell Reconstruction

Ehtesamul Azim, Muhtasim Noor Alif, Tae Hyun Hwang, Yanjie Fu, Wei Zhang

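AI summary: Proposes GEARS, a geometry-first framework that combines a diffusion model with a permutation-equivariant generator to reconstruct spatial geometry from single-cell RNA sequencing data, without cell-type labels or histological images.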

Comments
32nd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
Abstract

Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.

2605.26910 2026-05-28 cs.LG cs.AI q-bio.NC

EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models

Xianheng Wang, Yige Yang, Damien Coyle

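AI summary: Proposes the EEG-FM-Audit pipeline, which systematically evaluates EEG foundation models through ASHA-driven benchmarking, paradigm-level ablation studies, and neurophysiological probing, and finds that tuned supervised baselines can match or outperform advanced foundation models.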

Comments
26 pages
Abstract

Large EEG Foundation Models (FMs) have shown great potential for decoding EEG signals across diverse cognitive tasks. However, existing EEG-FM studies exhibit three critical limitations: opaque supervised baseline tuning, unverified contributions of complex learning paradigms, and a lack of transparency in model decision-making. To address these, we propose EEG-FM-Audit, a comprehensive evaluation and analysis pipeline designed to systematize the assessment of EEG-FMs. EEG-FM-Audit consists of three primary components: (1) an ASHA-driven benchmarking protocol that ensures fair comparisons by transparently optimizing supervised baselines; (2) paradigm-level ablation studies to evaluate the effectiveness of learning paradigms in FMs; and (3) a neurophysiological probing (NPP) framework, which explores whether FMs leverage valid temporal, spatial, and spectral EEG properties. We apply EEG-FM-Audit to four state-of-the-art EEG-FMs and five representative supervised models across three public datasets. Our results reveal that properly tuned supervised baselines can match or outperform advanced FMs, despite requiring significantly fewer parameters. Furthermore, we find that the effectiveness of learning paradigms of FMs is highly dependent on dataset scale and architecture. Finally, NPP analysis demonstrates how FMs rely on specific physiological features, establishing a framework for more interpretable neural decoding.
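
ASHA is an asynchronous variant of successive halving. As a rough illustration of the underlying idea only, the sketch below implements plain synchronous successive halving, not ASHA itself and not the paper's protocol. The objective function, the candidate list, and all names are hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Repeatedly keep the top 1/eta of configurations while growing the budget.

    `evaluate(config, budget)` returns a score where higher is better.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta  # survivors get more training budget at the next rung
    return survivors[0]


def toy_score(config, budget):
    """Hypothetical objective: best at lr = 0.01, mildly improving with budget."""
    return -abs(config["lr"] - 0.01) + 0.001 * budget


candidates = [{"lr": lr} for lr in (0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001)]
best = successive_halving(candidates, toy_score)
```

Most of the budget goes to configurations that survive early rungs, which is what makes tuning many supervised baselines affordable.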

2605.27986 2026-05-28 cs.CL q-bio.QM

An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences

Dhawa Sang Dong, Mausam Gurung, Suraj Kandel

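AI summary: Proposes a two-stage BERT-GA framework that combines deep learning with a genetic algorithm to optimize mRNA sequences, balancing translation efficiency, structural stability, and low immunogenicity.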

Abstract

Messenger RNA (mRNA) sequences as therapeutics require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization, as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (BERT-like Large Language Model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness functions combine translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6% (reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64), maintained a high and consistent codon-pair bias (0.97), and improved ribosomal accessibility at the 5' end, with an unpaired_30 fraction reaching 0.87. The global minimum free energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, achieving structural stability with approximately 84% of bases paired, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation. Whereas Linear Design produces hyper-stable transcripts (MFE < -2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints, our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the in-silico design and optimization of mRNA sequences.
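
Two of the simpler fitness ingredients named above, GC content and CpG/UpA motif frequency, can be computed directly from the sequence string. The sketch below shows only those two; the function names and the toy sequence are hypothetical, and how the paper weights these terms into its fitness score or immune penalty is not stated in the abstract.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def dinucleotide_count(seq, motif):
    """Number of (possibly overlapping) occurrences of a two-base motif."""
    seq = seq.upper()
    return sum(seq[i:i + 2] == motif for i in range(len(seq) - 1))


seq = "AUGGCCGAUACGUAA"  # toy RNA string for illustration, not from the paper
gc = gc_content(seq)
cpg = dinucleotide_count(seq, "CG")
upa = dinucleotide_count(seq, "UA")
```

A synonymous mutation changes these counts without altering the encoded protein, which is why they can serve as optimization targets in the GA stage.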

2605.27929 2026-05-28 q-bio.NC cs.LG

Exploratory Experience Shapes the Geometry of Predictive Representations

Kseniia Shilova, Abdelrahman Sharafeldin, Advay Balakrishnan, Hannah Choi

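AI summary: By building an online learning agent in a tree-like maze, the study examines how exploratory and exploitative behavioral strategies affect the geometry of predictive-coding-based internal representations, finding that exploration promotes more spatially organized representations, consistent with mouse experimental data.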

Abstract

Active sensing links behavior and learning through an action-perception loop: actions determine the observations used to update internal predictive models of perception, which subsequently guide the next actions. Predictive-coding frameworks provide a natural way to model this process, since internal representations are continuously updated to predict future observations. Here, we ask how exploratory and exploitative behavioral strategies shape these internal predictive representations. We build an online learning agent in a tree-like maze with a controllable parameter regulating the balance between exploratory and exploitative regimes. The agent updates a predictive-coding-based perception model from experience generated by its own behavior. The model predicts both future maze states and reward probability, allowing the agent to select actions either by expected information gain during exploration or by predicted reward during exploitation. We show that the resulting internal predictive representations depend strongly on the agent's behavioral regime. Exploratory agents develop representations that are more spatially organized and better preserve the structure of maze transitions in latent space. In contrast, exploitative agents learn less organized representations. We then train this predictive model on natural trajectories of water-deprived mice navigating the same maze and compare the resulting representations with those learned from agent trajectories. More exploratory mice show representational geometries that closely match those of exploratory agents, whereas mice with more restricted visitation patterns resemble reward-driven, exploitative agents. Together, these findings suggest that exploration enables predictive models to form generalized internal representations by organizing latent space around both spatial location and transition context in artificial agents and animals.

2605.27861 2026-05-28 cs.LG cs.AI q-bio.QM

From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction: An Ablation Study with Acetylsalicylic Acid Validation

Juergen Dietrich

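AI summary: Through a systematic ablation study comparing three graph neural network architectures, the study finds that cross-attention (CrossAtt) yields a much larger improvement for drug-drug interaction type prediction (multi-class) than for binary detection, and achieves 10/10 correct predictions in the acetylsalicylic acid validation.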

Comments
12 pages, 1 figure
Abstract

Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation study of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%), confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis. Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.

2605.27677 2026-05-28 q-bio.PE

ESL-PSC Toolkit: a graphical software environment for linking shared genetic changes to convergent phenotypes

John B. Allard, Sudhir Kumar

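AI summary: Presents a GUI-based ESL-PSC analysis toolkit that simplifies identifying genes and sites associated with convergent traits from sequence alignments, with interactive operation and performance optimizations.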

Abstract

Convergent evolution provides a useful framework for testing whether independent origins of similar traits share common genetic mechanisms. Evolutionary Sparse Learning with Paired Species Contrast (ESL-PSC) is an approach to identify genes and sites associated with convergent traits from aligned sequences by fitting sparse predictive models to phylogenetically informed species contrasts. However, practical use of ESL-PSC currently requires substantial command-line fluency for data assembly, species-pair design, execution, and output interpretation. Here we present an integrated ESL-PSC analysis environment (ESL-PSC Toolkit) centered on a graphical user interface (GUI). ESL-PSC Toolkit is designed to assist users from experimental design through data interpretation without requiring extensive technical expertise. It supports guided input validation, interactive tree-based pair selection, command preview, live execution, post-run exploration of ranked genes and aligned sites, a complementary substitution-counting method, and analysis of continuous quantitative convergent traits. The computational backend has been reimplemented in Rust with many performance optimizations and parallelism, greatly reducing runtime for most analyses and enabling cross-platform packaged distributions. Downloadable GUI and CLI toolkit software packages for Mac, Windows, and Linux are available at https://github.com/John-Allard/ESL-PSC/releases/latest.

2605.27573 2026-05-28 q-bio.QM

Cycle Based Computational Pipeline for Extracting Instantaneous Whisking Frequency

Guanghui Li, Fangyuan Li, Barbara Lykke Lind, Rune W Berg

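AI summary: Proposes a computational pipeline based on cycle period estimation that extracts instantaneous frequency cycle by cycle through markerless whisker position detection and peak-trough identification, and compares it with a Fourier-transform method, showing that it better captures temporal variability.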

Abstract

Whisking is a rhythmic and adaptive behavior that rodents use to probe and interact with their environment, and the frequency of movement reflects both sensorimotor processing and internal brain states. A robust and traditional method of whisker frequency estimation uses power spectral analysis of whisker position spanning several cycles. To improve the temporal resolution of whisker movement, we here estimate the period for each cycle, hence indirectly extracting an instantaneous frequency. We do this using markerless estimation of whisker position and identifying the peak and trough for each cycle. The cycle period is extracted, and artifacts are rejected with a ripple exclusion validator based on peak prominence and sequential amplitude filtering. The method is compared with power spectral estimation, using the Fourier transform of a temporal window of 0.5 seconds. We find that frequency estimation using a fixed window does not capture transient variability, while the cycle by cycle method recovers higher, time-resolved frequencies. The cycle by cycle approach also reveals the expected cycle-level variability. Artifact rejection through subsequence filtering removed spurious frequencies above 30 Hz, aligning refined frequencies with established physiological bounds (4 to 28 Hz). This pipeline provides an alternative solution for real time compatible frequency estimation, which better captures temporal variation at the expense of precision in frequency estimation.
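
The cycle-by-cycle idea can be sketched as follows: locate successive peaks, take the time between them as the cycle period, and invert it to get a frequency. This is a simplified illustration under stated assumptions. It uses a plain local-maximum test and a fixed 30 Hz ceiling in place of the paper's prominence-based ripple exclusion validator, and the function names and the synthetic signal are hypothetical.

```python
import math


def find_peaks(signal):
    """Indices of strict local maxima (no prominence filtering)."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] > signal[i + 1]]


def cycle_frequencies(signal, fs, max_hz=30.0):
    """One frequency estimate per cycle, from peak-to-peak intervals.

    `fs` is the sampling rate in Hz; estimates above `max_hz` are discarded
    as artifacts (a stand-in for the paper's validator).
    """
    peaks = find_peaks(signal)
    freqs = []
    for start, end in zip(peaks, peaks[1:]):
        freq = fs / (end - start)  # period in samples -> frequency in Hz
        if freq <= max_hz:
            freqs.append(freq)
    return freqs


fs = 1000  # assumed sampling rate in Hz
signal = [math.sin(2 * math.pi * 10 * n / fs) for n in range(500)]  # 10 Hz, 0.5 s
freqs = cycle_frequencies(signal, fs)
```

For comparison, a Fourier transform over a 0.5 second window has a frequency bin spacing of 1/0.5 = 2 Hz and yields one estimate per window, whereas the loop above yields one estimate per cycle.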

2605.27459 2026-05-28 q-bio.OT cs.CE

Real-Time In Silico Modeling of Postprandial Macronutrient Kinetics: A Validated Computational Engine for Nutrition Research and Digital Health

Alberto Calderone

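AI summary: Presents a computational engine based on bi-compartmental Bateman kinetics, gamma-variate distributions, and a finite state machine that simulates postprandial metabolism by solving differential equations in real time, with an average response time of about 135 ms, validated across multiple dietary conditions with a global MAPE of about 18%.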

Abstract

Simulation of post-prandial pharmacokinetics, such as muscle protein synthesis (MPS) through mTORC1 and insulin-induced glucose uptake, is often challenging due to the computational intensity of the multi-compartmental approach. In this study, I introduce an in silico metabolic simulator that uses bi-compartmental Bateman kinetic processes, gamma-variate distributions, and finite state machine reasoning to solve temporal differential equations instantaneously, generating metabolic curves and predictions depending on input meals. The novel underlying algorithm was custom-built entirely independent of third-party libraries or external services. This original computational engine, bridging the gap between academia and the digital health sector, is integrated within a web dashboard and provided as a service via REST APIs. The average response time is approximately 135 ms with a maximum below 750 ms. The multi-dimensional model was calibrated using a Landmark Validation approach across diverse dietary conditions (Whey Protein, mixed meal, OGTT) and optimized via Grid Search. Ultimately, the system achieved a global physiologically optimal Mean Absolute Percentage Error (MAPE) of $\sim18\%$ while maintaining an algorithmic complexity of $O(n \log n)$.
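
For readers unfamiliar with the terms, the sketch below shows the classic Bateman function for first-order absorption followed by first-order elimination, together with the MAPE metric. It is a textbook form offered as an assumption; the abstract does not give the engine's actual equations, and the rate constants, dose, and function names here are hypothetical.

```python
import math


def bateman(t, dose, ka, ke):
    """Classic Bateman curve: first-order absorption (ka) and elimination (ke).

    Requires ka != ke. Units are arbitrary in this sketch.
    """
    return dose * ka / (ka - ke) * (math.exp(-ke * t) - math.exp(-ka * t))


def time_of_peak(ka, ke):
    """Time at which the Bateman curve reaches its maximum."""
    return math.log(ka / ke) / (ka - ke)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    errors = [abs(o - p) / abs(o) for o, p in zip(observed, predicted)]
    return 100 * sum(errors) / len(errors)


ka, ke = 2.0, 0.5  # hypothetical rate constants per hour
t_peak = time_of_peak(ka, ke)
peak_value = bateman(t_peak, dose=1.0, ka=ka, ke=ke)
example_error = mape([100.0, 200.0], [90.0, 220.0])
```

Because the curve has a closed form, evaluating it at any time point needs no numerical integration, which is consistent with the fast response times the abstract reports.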

2605.27413 2026-05-28 q-bio.BM cs.AI

Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design

Chen Wei, Fanding Xu, Minghao Sun, Zhiyuan Liu, Lin Wang, Tianrui Jia, Yihang Zhou, Yang Zhang

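AI summary: Proposes ProtLiD$^2$, a ligand-conditioned discrete diffusion model that jointly generates amino-acid sequences and discrete structure tokens through geometry-aware cross-attention, enabling protein sequence-structure co-design under ligand constraints and substantially improving global fold confidence and ligand-aware pass rates.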

Comments
19 pages, 6 figures
Abstract

Proteins perform their biological functions through three-dimensional structures encoded by amino acid sequences, and ligand-binding protein co-design requires models that generate sequence-structure compatible proteins under explicit ligand constraints. Although continuous diffusion and flow-based models support ligand-aware design in coordinate or latent spaces, existing discrete diffusion protein language models mainly operate over sequence or structure tokens without direct small-molecule conditioning. We introduce \textbf{ProtLiD$^2$}, a \textbf{Prot}ein \textbf{L}igand-conditioned \textbf{D}iscrete \textbf{D}iffusion model for protein sequence-structure co-design. ProtLiD$^2$ jointly generates amino-acid sequence and discrete structure tokens while incorporating ligand chemical and geometric information through geometry-aware cross-attention. Trained on over one million ligand-protein complexes, ProtLiD$^2$ extends masked discrete diffusion to ligand-aware functional protein design. We further propose maximum confidence-margin guided ReMask decoding, an inference-time self-correction strategy that retains confident predictions and remasks uncertain tokens. ProtLiD$^2$ improves global fold confidence over Complexa in whole-protein design, increasing TM-score from 0.672 to 0.802 and pLDDT from 64.55 to 73.00. In pocket co-design, ProtLiD$^2$ reduces active-site BB-RMSD from 3.46/3.40Å for FAIR/PocketGen to 1.97Å, and improves ligand-aware pass rates over PocketGen from 14.86% to 59.73% and from 6.08% to 23.49% under stricter docking thresholds. These results support ligand-conditioned discrete diffusion as an effective token-space framework for functional protein co-design. Code will be available at https://github.com/auroua/ProtLiD.
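
The abstract describes ReMask decoding only at a high level (keep confident predictions, remask uncertain tokens). The sketch below shows one plausible reading, assuming the confidence margin is the gap between the two most probable tokens at a position and that a fixed threshold decides what is kept. The names `confidence_margin`, `remask_step`, the threshold value, and the toy probabilities are all hypothetical and not taken from the paper.

```python
MASK = "<mask>"


def confidence_margin(probs):
    """Gap between the highest and second-highest token probability."""
    ranked = sorted(probs.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up


def remask_step(position_probs, threshold):
    """Keep the top token where the margin is large enough, otherwise remask."""
    tokens = []
    for probs in position_probs:
        best = max(probs, key=probs.get)
        keep = confidence_margin(probs) >= threshold
        tokens.append(best if keep else MASK)
    return tokens


# Toy per-position distributions over three amino acids (made-up numbers).
predictions = [
    {"A": 0.90, "G": 0.07, "L": 0.03},  # clear winner, kept
    {"A": 0.45, "G": 0.40, "L": 0.15},  # near tie, remasked
]
decoded = remask_step(predictions, threshold=0.3)
```

In an iterative decoder the remasked positions would be predicted again in the next step, which is what makes the scheme self-correcting.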

2605.23122 2026-05-28 q-bio.NC

Beyond Neural Activity Prediction: Probing Latent Representations in Mouse V1 Digital Twins

Adriano Lima, Yuchen Hou, Michael Beyeler, Marius Schneider

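AI summary: By systematically probing the latent representations of a family of mouse V1 digital twins, this study finds that prediction accuracy correlates with representational properties, yet models with similar predictive power can still differ substantially in their latent representations; it proposes multi-level representational probing as a complement to neural-prediction evaluation.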

Abstract

Digital twins of sensory cortex serve as powerful response oracles. Although prediction accuracy is the central metric by which these models are evaluated, it provides limited insight into the latent representations that support those predictions. This becomes increasingly important as digital twins are used as in silico experimental systems for stimulus design and hypothesis generation: models with similar prediction accuracy may rely on different latent representations. We address this gap by systematically probing a family of digital twins of mouse V1 trained to predict neural activity from naturalistic videos recorded in freely moving mice. The models share the same training data and neural-prediction objective, but differ in visual-encoder architecture. For each frozen model, we characterize latent representations along three levels: (i) linear decodability from controlled visual probes of orientation, contrast, and motion; (ii) latent-unit tuning to canonical visual features including orientation selectivity, contrast response, spatial-frequency tuning; and (iii) population geometry of hidden-layer activity. Across architectures, better neural-response prediction correlates with stronger probe accuracy. Additionally, highly predictive models exhibit flatter hidden-population eigenspectra, indicating higher-dimensional representations closer to population-geometry signatures reported in mouse V1. Although these representational properties covary with prediction accuracy across architectures, digital twins with comparable prediction scores can still differ substantially in probe performance and latent-unit tuning. These results establish multi-level representational probing as a complement to standard neural-prediction evaluation, providing a framework for understanding digital twins not only as predictors, but also as substrates for studying visual computations.
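
The link between a flatter eigenspectrum and a higher-dimensional representation can be made concrete with the participation ratio, a common summary of effective dimensionality. This is an illustrative choice: the abstract does not say which measure the authors use, and the eigenvalues below are made up.

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality: (sum of eigenvalues)^2 / sum of squared eigenvalues."""
    total = sum(eigenvalues)
    squares = sum(v * v for v in eigenvalues)
    if squares == 0:
        raise ValueError("at least one eigenvalue must be nonzero")
    return total * total / squares


# Hypothetical covariance eigenspectra of hidden-layer activity.
flat_spectrum = [1.0, 1.0, 1.0, 1.0]    # variance spread evenly over 4 dimensions
steep_spectrum = [1.0, 0.0, 0.0, 0.0]   # all variance in a single dimension

flat_dim = participation_ratio(flat_spectrum)
steep_dim = participation_ratio(steep_spectrum)
```

A perfectly flat spectrum gives a participation ratio equal to the number of dimensions, while a spectrum dominated by one eigenvalue gives a value near 1.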

2510.21484 2026-05-28 q-bio.QM

eMZed 3: flexible and interactive development of scalable LC-MS/MS data analysis workflows in Python

Uwe Schmitt, Jethro L. Hemmann, Nicola Zamboni, Julia A. Vorholt, Patrick Kiefer

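AI summary: Introduces eMZed 3, a modern Python 3 framework for flexible, interactive analysis of LC-MS/MS data that supports scalable workflow development and includes support for chromatogram-based data, an SQLite backend, and rich visualization tools.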

Comments
7 pages, 1 figure
Abstract

Liquid chromatography-mass spectrometry (LC-MS/MS) data analysis requires adaptable software solutions to meet diverse analytical needs. We present eMZed 3, a modern Python framework for flexible and interactive analysis of LC-MS/MS data. eMZed 3 enables users to develop scalable workflows tailored to their specific requirements while leveraging Python's extensive ecosystem of libraries. Building on its predecessor, eMZed 3 is now Python 3-based and includes substantial enhancements, including support for chromatogram-based LC-MS data, a new SQLite-based backend supporting optional out-of-memory processing, and rich interactive visualization tools. Compared to the previous version, eMZed 3 is now split into three packages: emzed (core functionalities), emzed-gui (interactive data visualization), and emzed-spyder (an integrated development environment). This modular architecture allows straightforward integration of the emzed core library into headless Python environments, including computational notebooks (such as Jupyter) or high-performance computing clusters. eMZed 3 incorporates well-established libraries such as OpenMS, and is highly suited for both targeted and untargeted metabolomics. Overall, eMZed 3 supports the efficient development of scalable and reproducible LC-MS data analysis and is accessible to both novice and advanced programmers. Availability and Implementation: eMZed 3 and its documentation are freely available at https://emzed.ethz.ch, the source code is hosted at https://gitlab.com/groups/emzed3. An online-executable example workflow is available on Binder at: https://mybinder.org/v2/gl/emzed3%2Femzed-example-workflow/HEAD?labpath=example.ipynb.

2605.00025 2026-05-28 q-bio.NC cs.CL cs.HC cs.LG eess.AS

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis

Yuanhao Chen, Peter Chin

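AI summary: Proposes the MoDAl framework, which discovers complementary neural modalities across multiple brain areas through the interplay of a contrastive alignment loss and a decorrelation loss, reducing word error rate from 26.3% to 21.6% on the Brain-to-Text Benchmark '24.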

Abstract

Speech neuroprosthesis systems decode intended speech from neural activity in the absence of audible output, offering a path to restoring communication for individuals with speech-impairing conditions. Current approaches decode predominantly from motor cortical areas, discarding others -- such as area 44, part of Broca's area -- that may encode complementary linguistic information. We introduce MoDAl (Modality Decorrelation and Alignment), a framework that discovers complementary neural modalities through the interplay of two objectives in a shared projection space. A contrastive loss aligns each of several parallel brain encoders with the text embeddings of a pretrained large language model (LLM), while a decorrelation loss prevents the encoders from coalescing to duplicative representations. We prove that these objectives are in productive tension: Contrastive alignment induces transitive modality coalescence, which decorrelation must counteract for the framework to discover diverse neurolinguistic modalities. On the Brain-to-Text Benchmark '24, MoDAl reduces word error rate (WER) from 26.3% to 21.6% compared to the previous best end-to-end method, with the gain from incorporating previously discarded area 44 signals arising entirely from the decorrelation mechanism. Analysis of the discovered modalities reveals functional specialization: Encoders receiving area 44 input capture structural and syntactic properties (sentence length, grammatical voice, wh-words), consistent with the neurolinguistic understanding of Broca's area.
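
The headline result is reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and example sentences are hypothetical, and the benchmark's own scoring script may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the processed reference words and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.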

2603.26630 2026-05-28 physics.bio-ph physics.optics q-bio.NC

Revisiting claims of extracranial biophoton detection from the human brain

Vahid Salari, Vishnu Seshan, Rishabh Rishabh, Daniel Oblak, Christoph Simon

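AI summary: Shows experimentally that, under properly dark conditions, ultraweak photon emission from the head is far weaker than some studies report and that the large reported signals can be explained by background light contamination. It also notes that photons with wavelengths < 600 nm are strongly attenuated by scalp and skull, so any detected signal would come mainly from the scalp rather than the brain.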

Comments
8 pages (single-column), 5 figures
Abstract

Ultraweak photon emission, also referred to as biological autoluminescence or biophoton emission, is the spontaneous emission of extremely low levels of light from a broad range of biological systems. Recent studies have reported that UPE measured extracranially can serve as a potential non-invasive biomarker of brain activity. Here, we show that this interpretation suffers from serious problems. We show that, when observed under properly dark conditions, the UPE from the head is much weaker than what is reported in certain papers on 'brain UPE' from human heads. We also show that the large signals reported in these studies can be explained by background light contamination. Furthermore, photons with wavelengths < 600 nm are strongly attenuated by scalp and skull tissues, and longer wavelengths fall largely outside the effective spectral sensitivity of the photomultiplier tubes (PMTs) used. As a consequence, even if UPE from the head is detected under properly background-free conditions, it is likely to be dominated by emission from the scalp rather than from the brain, certainly as long as PMTs are used. Our results emphasize the importance of careful experimental design to make genuine progress on this important question.

2603.17754 2026-05-28 q-bio.PE

Slow evolution towards generalism in a model of variable dietary range

Elliot M. Butterworth, Tim Rogers

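AI summary: Studies the evolution of species' dietary range in a mathematical model, finding that stochastic effects drive the long-term evolution of generalist diets and that path-dependent quasi-stable states exist.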

Abstract

Species sharing a habitat will co-evolve to make use of the available resources, as consumption is modulated by competition and negative feedback loops between consumers and resources. The dietary range of a given species determines the resources it has access to and thus the other species with which it competes. A narrow dietary range avoids competition at the cost of over-reliance on a small selection of resources; conversely a wide dietary range provides more alternatives but also more chance of competition with other species. Here, we investigate the evolution of dietary range within a mathematical model of niche formation. We find highly path dependent co-evolution dynamics characterised by long-lived quasi-stable states. Ultimately, stochastic effects drive the evolution of generalist diets, as we uncover in our analysis and simulations.

2602.18982 2026-05-28 cs.LG q-bio.PE

Conditionally Site-Independent Neural Evolution of Antibody Sequences

Stephen Zhewen Lu, Aakarsh Vermani, Kohei Sanno, Jiarui Lu, Frederick A Matsen, Milind Jagota, Yun S. Song

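AI summary: Proposes CoSiNE, a continuous-time Markov chain parameterized by a deep neural network that bridges phylogenetic models and deep learning to model antibody sequence evolution. It outperforms existing language models in zero-shot variant effect prediction, and the paper introduces Guided Gillespie sampling to optimize antibody affinity.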

Comments
28 pages, 15 figures. Accepted as a poster at ICML 2026
Abstract

Common deep learning approaches for antibody engineering focus on modeling the marginal distribution of sequences. By treating sequences as independent samples, however, these methods overlook affinity maturation as a rich and largely untapped source of information about the evolutionary process by which antibodies explore the underlying fitness landscape. In contrast, classical phylogenetic models explicitly represent evolutionary dynamics but lack the expressivity to capture complex epistatic interactions. We bridge this gap with CoSiNE, a continuous-time Markov chain parameterized by a deep neural network. Mathematically, we prove that CoSiNE provides a first-order approximation to the intractable sequential point mutation process, capturing epistatic effects with an error bound that is quadratic in branch length. Empirically, CoSiNE outperforms state-of-the-art language models in zero-shot variant effect prediction by explicitly disentangling selection from context-dependent somatic hypermutation. Finally, we introduce Guided Gillespie, a classifier-guided sampling scheme that steers CoSiNE at inference time, enabling efficient optimization of antibody binding affinity toward specific antigens.
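
To make the sampling idea concrete, the sketch below runs a plain Gillespie simulation of single-site substitutions along one branch of a continuous-time Markov chain. It is not CoSiNE or Guided Gillespie: the toy alphabet, the constant `uniform_rate` function (standing in for a learned, context-dependent rate model), and the function names are all hypothetical.

```python
import random

ALPHABET = "ACDE"  # toy four-letter alphabet, not the 20 amino acids


def uniform_rate(sequence, site, new_residue):
    """Placeholder rate; a learned model would depend on the sequence context."""
    return 0.1


def gillespie_evolve(sequence, rate_fn, branch_length, rng):
    """Apply single-site substitutions until the elapsed time reaches branch_length."""
    seq = list(sequence)
    elapsed = 0.0
    while True:
        # Enumerate every possible single-site substitution and its rate.
        events = [
            (site, residue, rate_fn(seq, site, residue))
            for site in range(len(seq))
            for residue in ALPHABET
            if residue != seq[site]
        ]
        total_rate = sum(rate for _, _, rate in events)
        if total_rate <= 0:
            break
        elapsed += rng.expovariate(total_rate)  # exponential waiting time
        if elapsed >= branch_length:
            break
        # Choose one event with probability proportional to its rate.
        pick = rng.uniform(0.0, total_rate)
        cumulative = 0.0
        for site, residue, rate in events:
            cumulative += rate
            if pick <= cumulative:
                seq[site] = residue
                break
    return "".join(seq)


child = gillespie_evolve("ACDE", uniform_rate, branch_length=2.0, rng=random.Random(0))
```

A guided variant would presumably reweight the event rates before sampling, for example with a classifier score, but the abstract does not specify how, so that step is left out here.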

2601.10464 2026-05-28 stat.AP q-bio.GN

MitoFREQ: A Novel Approach for Mitogenome Frequency Estimation from Top-level Haplogroups and Single Nucleotide Variants

Mikkel Meyer Andersen, Nicole Huber, Kimberly S Andreaggi, Tóra Oluffa Stenberg Olsen, Walther Parson, Charla Marshall

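AI summary: Proposes MitoFREQ, which uses SNV allele frequencies for top-level haplogroups from the HelixMTdb and gnomAD databases to estimate mitogenome population frequencies by weighting the rarest SNV frequency, and provides the open-source R package mitofreq.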

Abstract

Lineage marker population frequencies can serve as one way to express evidential value in forensic genetics. However, for high-quality whole mitochondrial DNA genome sequences (mitogenomes), population data remain limited. In this paper, we offer a new method, MitoFREQ, for estimating the population frequencies of mitogenomes. MitoFREQ uses the mitogenome resources HelixMTdb and gnomAD, harbouring information from 195,983 and 56,406 mitogenomes, respectively. Neither HelixMTdb nor gnomAD can be queried directly for individual mitogenome frequencies, but offers single nucleotide variant (SNV) allele frequencies for each of 30 "top-level" haplogroups (TLHG). We propose using the HelixMTdb and gnomAD resources by classifying a given mitogenome within the TLHG scheme and subsequently using the frequency of its rarest SNV within that TLHG weighted by the TLHG frequency. We show that this method is guaranteed to provide a higher population frequency estimate than if a refined haplogroup and its SNV frequencies were used. Further, we show that top-level haplogrouping can be achieved by using only 227 specific positions for 99.9% of the tested mitogenomes, potentially making the method available for low-quality samples. The method was tested on two types of datasets: high-quality forensic reference datasets and a diverse collection of scrutinised mitogenomes from GenBank. This dual evaluation demonstrated that the approach is robust across both curated forensic data and broader population-level sequences. This method produced likelihood ratios in the range of 100-100,000, demonstrating its potential to strengthen the statistical evaluation of forensic mtDNA evidence. We have developed an open-source R package `mitofreq` that implements our method, including a Shiny app where custom TLHG frequencies can be supplied.
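
The core estimate described above is a short piece of arithmetic: the frequency of the mitogenome's rarest SNV within its top-level haplogroup, weighted by that haplogroup's frequency. The sketch below works through it in Python for illustration only. It is not the `mitofreq` R package, the function names and all numbers are hypothetical, and the likelihood ratio is assumed here to be the reciprocal of the frequency estimate, which the abstract does not state.

```python
def mitogenome_frequency(tlhg_frequency, snv_frequencies_within_tlhg):
    """TLHG frequency multiplied by the rarest SNV frequency within that TLHG."""
    if not snv_frequencies_within_tlhg:
        raise ValueError("at least one SNV frequency is required")
    return tlhg_frequency * min(snv_frequencies_within_tlhg)


def likelihood_ratio(frequency):
    """Assumed form: reciprocal of the estimated population frequency."""
    return 1.0 / frequency


# Made-up inputs: a haplogroup carried by 40% of the population, and three SNVs
# whose allele frequencies within that haplogroup are 30%, 5%, and 1%.
estimate = mitogenome_frequency(0.40, [0.30, 0.05, 0.01])
ratio = likelihood_ratio(estimate)
```

With these inputs the estimate is 0.004 and the ratio is 250, which falls inside the 100-100,000 range mentioned in the abstract.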

2512.18566 2026-05-28 cs.LG cs.SY eess.SY q-bio.NC

Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment

Ruiqi Chen, Giacomo Vedovati, Todd Braver, ShiNung Ching

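AI summary: Proposes the DFORM framework, which aligns the trajectories of two dynamical systems through a nonlinear coordinate transformation to assess topological equivalence and locate low-dimensional dynamical motifs in high-dimensional models.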

Journal ref
Neural Computation (2026) 38 (6): 1006-1061
Comments
57 pages, 18 figures. For associated code, see https://github.com/rq-Chen/DFORM_stable
Abstract

Dynamical systems models such as recurrent neural networks (RNNs) are increasingly popular in theoretical neuroscience for hypothesis-generation and data analysis. Evaluating the dynamics in such models is key to understanding their learned generative mechanisms. However, such evaluation is impeded by two major challenges: First, comparison of learned dynamics across models is difficult because there is no enforced equivalence of their coordinate systems. Second, identification of mechanistically important low-dimensional motifs (e.g., limit sets) is intractable in high-dimensional nonlinear models such as RNNs. Here, we propose a comprehensive framework to address these two issues, termed Diffeomorphic vector field alignment FOR learned Models (DFORM). DFORM learns a nonlinear coordinate transformation between the state spaces of two dynamical systems, which aligns their trajectories in a maximally one-to-one manner. In so doing, DFORM enables an assessment of whether two models exhibit topological equivalence, i.e., similar mechanisms despite differences in coordinate systems. A byproduct of this method is a means to locate dynamical motifs on low-dimensional manifolds embedded within higher-dimensional systems. We verified DFORM's ability to identify linear and nonlinear coordinate transformations using canonical topologically equivalent systems, RNNs, and systems related by nonlinear flows. DFORM was also shown to provide a quantification of similarity between topologically distinct systems. We then demonstrated that DFORM can locate important dynamical motifs including invariant manifolds and saddle limit sets within high-dimensional models. Finally, using a set of RNN models trained on human functional MRI (fMRI) recordings, we illustrated that DFORM can identify limit cycles from high-dimensional data-driven models, which agreed well with prior numerical analysis.

2511.02558 2026-05-28 cs.CV cs.LG q-bio.NC

Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction

Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka

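AI summary: Studies the prediction of future brain MRI from baseline MRI, using five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on the ADNI and AIBL datasets to achieve high-fidelity voxel-level prediction, and validates cross-cohort generalization.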

Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), Apr. 2026
Abstract

Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.

2511.00278 2026-05-28 q-bio.GN

SCUDDO: An unsupervised clustering algorithm for single-cell Hi-C maps using diagonal diffusion operators

Luka Maisuradze, Corey S. O'Hern, Mark D. Shattuck

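AI summary: To address the difficulty of clustering highly sparse single-cell Hi-C maps, proposes SCUDDO, an unsupervised algorithm that embeds and clusters the maps using diagonal diffusion operators and improves ARI by more than 0.2 on several datasets.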

Journal ref
Bioinformatics 42(5), btag284 (2026)
Abstract

Motivation: Advances in high-throughput chromatin conformation capture have provided insight into the three-dimensional structure and organization of chromatin. While bulk Hi-C experiments capture spatio-temporally averaged chromatin interactions across millions of cells, single-cell Hi-C experiments report on the chromatin interactions of individual cells. Supervised and unsupervised algorithms have been developed to embed single-cell Hi-C maps and identify different cell types. However, single-cell Hi-C maps are often difficult to cluster due to their high sparsity, with state-of-the-art algorithms achieving a maximum Adjusted Rand Index (ARI) of only < 0.4 on several datasets while requiring labels for training. Results: We introduce a novel unsupervised algorithm, Single-cell Clustering Using Diagonal Diffusion Operators (SCUDDO), to embed and cluster single-cell Hi-C maps. We evaluate SCUDDO on three previously difficult-to-cluster single-cell Hi-C datasets, and show that it can outperform other current algorithms in ARI by > 0.2. Further, SCUDDO outperforms all other tested algorithms even when we restrict the number of intrachromosomal maps for each cell type and when we use only a small fraction of contacts in each Hi-C map. Thus, SCUDDO can capture the underlying latent features of single-cell Hi-C maps and provide accurate labeling of cell types even when cell types are not known a priori. Availability: SCUDDO is freely available at www.github.com/lmaisuradze/scuddo. The tested datasets are publicly available and can be downloaded from the Gene Expression Omnibus.
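
Since the comparison is stated in terms of the Adjusted Rand Index, the sketch below computes ARI from its standard pair-counting definition, which shows why a score below 0.4 indicates weak agreement with the true cell types. This is a generic pure-Python illustration, not part of SCUDDO, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table: 1.0 is a perfect match, about 0 is chance."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells placed together in both labelings, in the true one, in the predicted one.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (together_both - expected) / (maximum - expected)


# Cluster names do not matter, only the grouping does.
perfect = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
scrambled = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI is corrected for chance, so a clustering that agrees with the true labels less often than random assignment gets a negative score.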

2507.09466 2026-05-28 cs.LG q-bio.QM

La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching

Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, Arash Vahdat

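AI summary: Proposes La-Proteina, which uses a partially latent representation and flow matching to jointly generate fully atomistic protein structures and amino acid sequences, reaching state-of-the-art performance on multiple benchmarks.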

Abstract

Recently, many generative models for de novo protein structure design have emerged. Yet, only few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.

2506.04219 2026-05-28 q-bio.CB cond-mat.dis-nn cond-mat.stat-mech

Collective gene dynamics leave signatures of decision landscapes in cell fate coordinates

Maria Yampolskaya, Laertis Ikonomou, Pankaj Mehta

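AI summary: Proposes a phenomenological model combining low-dimensional gradient-like dynamics with high-dimensional Hopfield networks to identify experimental signatures of decision-making classes in single-cell RNA-sequencing time-series data. Applied to mouse data on lung alveolar cell maturation and hematopoietic differentiation, it shows that cell fate dynamics are consistent with developmental landscapes containing intermediate progenitors and saddle points.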

Comments
19 pages, 4 figures
Abstract

Multicellular organisms contain a wide variety of highly specialized cell types. The consistency and robustness of developmental trajectories suggest that complex gene regulatory networks effectively act as low-dimensional cell fate landscapes. Prior work inspired by dynamical systems theory argues that cell fate transitions fall into universal decision-making classes, but the theory connecting these geometric landscapes to high-dimensional gene expression space is still in its infancy. Here, we introduce a phenomenological model that identifies experimental signatures of decision-making classes in single-cell RNA-sequencing time-series data. The model combines low-dimensional gradient-like dynamics with high-dimensional Hopfield networks to capture the interplay between cell fate, gene expression, and signaling. We apply the framework to experimental mouse data on maturing lung alveolar cells and lineage-traced hematopoietic differentiation and show that the measured cell fate dynamics are consistent with developmental landscapes containing intermediate progenitors and saddle points. We further show that the framework can be used to understand spatial patterning and cell fate organization, focusing on Notch signaling in lung airways. Together, these results provide evidence that collective transcriptomic dynamics carry signatures of landscape features associated with universal decision-making classes.

2408.00057 2026-05-28 q-bio.BM cs.LG

GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning

Dan Kalifa, Uriel Singer, Kira Radinsky

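AI summary: Proposes the GOProteinGNN architecture, which enhances protein language models by integrating protein knowledge graph information and performs graph learning at both the amino acid and protein levels, achieving the best performance on multiple downstream tasks.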

Journal ref
CIKM 2025: Proceedings of the 34th ACM International Conference on Information and Knowledge Management
Abstract

Proteins play a vital role in biological processes and are indispensable for living organisms. Accurate representation of proteins is crucial, especially in drug development. Recently, there has been a notable increase in interest in utilizing machine learning and deep learning techniques for unsupervised learning of protein representations. However, these approaches often focus solely on the amino acid sequence of proteins and lack factual knowledge about proteins and their interactions, thus limiting their performance. In this study, we present GOProteinGNN, a novel architecture that enhances protein language models by integrating protein knowledge graph information during the creation of amino acid level representations. Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process through graph-based learning. By doing so, we can capture complex relationships and dependencies between proteins and their functional annotations, resulting in more robust and contextually enriched protein representations. Unlike previous methods, GOProteinGNN uniquely learns the entire protein knowledge graph during training, which allows it to capture broader relational nuances and dependencies beyond mere triplets as done in previous work. We perform a comprehensive evaluation on several downstream tasks demonstrating that GOProteinGNN consistently outperforms previous methods, showcasing its effectiveness and establishing it as a state-of-the-art solution for protein representation learning.