arXivDaily: arXiv daily academic digest, updated Monday to Friday
q-bio.QM Quantitative Methods (12)
2606.12209 2026-06-11 q-bio.QM New submission

Interpretable enzyme function prediction via sparse autoencoder features of ESMC across the microbial protein universe

Yue Hu, Wanyu Cheng, Junqing Wang, Yingchao Liu

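AI summary: Using sparse autoencoder features from the ESMC-6B protein language model, enzyme function can be predicted accurately without task-specific training, reaching 78.9% top-1 accuracy on a microbial enzyme benchmark and identifying about 169,000 dark enzyme candidates.

Comments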
17 pages, 5 figures, 3 tables
Abstract

Microbial genomes and metagenomes contain millions of proteins whose enzymatic functions remain unknown, the enzyme dark matter. While deep learning has improved protein function prediction, most methods are black boxes relying on sequence or structural similarity, limiting discovery of novel catalytic activities. The ESMC-6B protein language model and its sparse autoencoder with a 16,384-dimensional codebook of interpretable biological concepts, each annotated by GPT-5, create a new opportunity: using these features directly as semantic signatures for enzyme function. Here, we show that ESMC-SAE features enable accurate and interpretable enzyme commission (EC) number prediction without task-specific training or GPU-intensive computation. On a balanced benchmark of 4,868 microbial SwissProt enzymes across 161 EC3 subclasses, ESMC-SAE binary features achieve 78.9% top-1 and 88.5% top-5 accuracy, 37.6% higher than 3-mer baselines (57.3%). In leave-one-EC3-class-out evaluation simulating discovery of novel enzyme classes, SAE features recover the EC1 superclass in 47.7% of cases (3.3x random, 14.3%), versus 26.6% for sequence methods. Discriminative features correspond to mechanistically interpretable concepts: catalytic triad geometry for hydrolases, NAD(P)H-binding Rossmann folds for oxidoreductases, phosphate-binding P-loops for transferases. We also survey the ESM Atlas of 7.7 million clusters and identify 169,859 dark enzyme-like candidates across all major microbial phyla. Our results establish a paradigm for enzyme function discovery in microbial dark matter: interpretable by design, scalable without GPU clusters, and applicable to the billions of proteins in the ESM Atlas.

2606.11876 2026-06-11 q-bio.QM cs.LG stat.ME New submission

Seeing Below the Limit of Detection: A Censored-Poisson Bayesian Latent-Growth Change-Point Detector (the Span Detector) for Serial ctDNA in HR+/HER2- Metastatic Breast Cancer

Aarchi Singh Thakur, Abhijoy Sarkar

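AI summary: Proposes the Span detector, which uses a censored-Poisson Bayesian latent-growth change-point model to treat ctDNA non-detects as left-censored observations and detects upward change-points in the variant detection rate through a sequential generalised-likelihood-ratio statistic; at a 10% false-alarm rate it raises the fraction of progressions caught three months ahead from 11% to 25%.

Comments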
9 pages, 4 figures, 2 tables. Code and synthetic data generator: this https URL
Abstract

Circulating-tumour DNA (ctDNA) carries evidence of drug resistance months before imaging shows it, but the earliest evidence lives below the assay's limit of detection (LoD): a nascent subclone is detected only intermittently, producing a flickering sequence of faint detects and non-detects. Commercial liquid biopsies treat each draw as an independent snapshot and a non-detect as nothing. We argue a non-detect is a left-censored observation, and the pattern of non-detects and faint detects over time carries actionable evidence of growth before any single value is trustworthy. We introduce Span, a censored-Poisson Bayesian latent-growth change-point detector that models the binary detection process, accumulates a sequential generalised-likelihood-ratio statistic for an upward change-point in the per-variant detection rate, and raises a competing-risks alarm with calibrated false-alarm control. Span has no learned weights, so there is nothing to overfit. On a synthetic cohort of HR+/HER2- metastatic breast cancer on first-line CDK4/6-inhibitor plus endocrine therapy, at a matched 10% false-alarm rate, Span roughly doubles the fraction of impending progressions caught three months ahead (indolent regime: 25% vs 11% for the snapshot), with a falsifiable dose-response: large for indolent emergence, vanishing for fast emergence. A value-trajectory baseline performs identically to the snapshot, isolating the gain to the censored detection model. The survival backbone matches a Cox baseline on real breast-cancer data (GBSG-2, n=686; C-index 0.67 vs 0.68), and on a real longitudinal cohort with clean biomarkers (PBC2, n=312) the same pipeline correctly declines to win, a falsifiable boundary test confirming the mechanism is regime-specific. All ctDNA trajectories are synthetic.
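
The sequential statistic described above can be illustrated with a much simpler stand-in. The sketch below is not the authors' Span implementation: it drops the Bayesian latent-growth model and the competing-risks alarm, and only shows a generalised-likelihood-ratio statistic for an upward change in a Bernoulli detection probability. It assumes that a draw counts as a detect when at least one mutant molecule is sampled from a Poisson count; the baseline rate, the detect/non-detect sequence, and the alarm threshold are all hypothetical values chosen for illustration.

```python
import math


def detect_prob(lam: float) -> float:
    """P(at least one mutant molecule is sampled) when the count is Poisson(lam)."""
    return 1.0 - math.exp(-lam)


def bernoulli_kl(p_hat: float, p0: float) -> float:
    """KL(Bernoulli(p_hat) || Bernoulli(p0)), using the convention 0 * log(0) = 0."""
    total = 0.0
    if p_hat > 0.0:
        total += p_hat * math.log(p_hat / p0)
    if p_hat < 1.0:
        total += (1.0 - p_hat) * math.log((1.0 - p_hat) / (1.0 - p0))
    return total


def glr_upward(detections: list[int], p0: float) -> list[float]:
    """Sequential GLR statistic for an upward change in detection probability.

    At each draw t, the log-likelihood ratio is maximised over every candidate
    change-point k <= t, with the segment mean as the post-change estimate.
    Segments whose mean does not exceed the baseline p0 are ignored.
    """
    stats = []
    for t in range(1, len(detections) + 1):
        best = 0.0
        hits = 0
        # Walk the candidate change-point backwards from the latest draw.
        for n, d in enumerate(reversed(detections[:t]), start=1):
            hits += d
            p_hat = hits / n
            if p_hat > p0:
                best = max(best, n * bernoulli_kl(p_hat, p0))
        stats.append(best)
    return stats


baseline = detect_prob(0.05)          # hypothetical sub-LoD baseline detection rate
draws = [0, 0, 0, 0, 1, 0, 1, 1, 1]   # hypothetical flicker of non-detects and detects
stats = glr_upward(draws, baseline)
threshold = 5.0                       # hypothetical alarm threshold
alarm_at = next((i for i, s in enumerate(stats) if s > threshold), None)
print(alarm_at, [round(s, 2) for s in stats])
```

In this toy sequence a single isolated detect does not cross the threshold; the statistic only does so once detects start to cluster, which is the sense in which the pattern of non-detects carries evidence that no single draw does.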

2606.11868 2026-06-11 cs.LG q-bio.QM New submission

MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

Dongxin Lyu, Jingbo Zhou, Hongxin Xiang, Yuqiang Li, Jun Xia

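Affiliations: Westlake University, Hunan University, Shanghai Artificial Intelligence Laboratory, HKUST-GZ & HKUST

AI summary: To address existing Transformer models' over-reliance on generated-sequence priors at the expense of spectral evidence in de novo peptide sequencing, proposes MemNovo, a training-free plug-and-play mechanism that builds a persistent spectral memory bank and injects spectral features at the decoding stage through an ultra-conservative residual connection, significantly improving amino acid and peptide precision.

Comments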
Code: this https URL
Abstract

De novo peptide sequencing from tandem mass spectrometry is pivotal in proteomics, enabling identification of novel peptides without reference databases. While recent Transformer-based encoder-decoder models have achieved remarkable performance, we uncover a critical pathology in their inference dynamics. Through comprehensive feature scaling experiments, we demonstrate that existing auto-regressive peptide decoders tend to over-rely on generated-sequence priors while progressively under-utilizing fine-grained physical evidence from the input mass spectrum. This phenomenon leads to suboptimal results, where generated peptide sequences are biologically plausible yet not faithful to the input spectrum. To rectify this, we propose MemNovo, a training-free and plug-and-play mechanism that re-balances peptide and spectral contributions at inference time. MemNovo alleviates the information bottleneck by establishing a persistent spectral memory bank and injecting retrieved features directly into the final decoding stage via an ultra-conservative residual connection. Theoretical analysis confirms that this mechanism restores the mutual information between the decoder state and the raw spectrum. Extensive experiments on the Nine Species benchmark with two representative baselines, Casanovo and InstaNovo, demonstrate that MemNovo consistently improves both amino acid precision and peptide precision, achieving up to 39.1% relative improvement in peptide precision for Casanovo and up to 3.9% for InstaNovo, with negligible computational overhead.

2606.11775 2026-06-11 math.MG q-bio.QM stat.ML New submission

Magnitude-Based Features for Multispecies Spatial Data

Julia Sollberger, Joshua Bull, Sara Kališnik, Bernadette Stolz

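AI summary: Proposes magnitude-based global and local feature vectors for analysing interactions in multispecies spatial data, and validates their ability to identify spatial heterogeneity and to classify on synthetic tumour microenvironment data and human colorectal cancer tissue microarray data.

Comments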
32 pages, 24 figures
Abstract

Multispecies spatial data arise in many applications where interactions between different entities are central to system behaviour, including biomedical imaging, geospatial analysis, and species ecology. Despite their importance, relatively few quantitative tools exist to capture such interactions. In this work, we propose magnitude-based features for the analysis of multispecies spatial data. Magnitude is a real-valued invariant of finite metric spaces that can be interpreted as an effective number of points, incorporating both spatial configuration and scale. We develop global and local magnitude feature vectors and demonstrate their utility on synthetic tumour microenvironment data and on tissue microarray data from human colorectal cancer samples. Locally, the method identifies distinct neighbourhood types and reveals spatial heterogeneity; in the model, this includes radial patterns associated with different qualitative outcomes of the simulations, while in the real-world data it reflects the importance of tertiary lymphoid structure-like interactions between B and T cell populations. Globally, the approach recovers known classifications of long-term simulation outcomes across parameter regimes in synthetic data, and suggests important roles for CD4+ T cells and CD163+ macrophages in distinguishing patients with favourable Crohn's-like reactions from unfavourable diffuse immune infiltration. Together, these results suggest that magnitude-based features provide a powerful and flexible tool for the analysis of multispecies spatial data.
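
To make "effective number of points" concrete, here is a minimal sketch of the standard magnitude of a finite metric space: build the similarity matrix with entries exp(-t * d(i, j)), solve for the weighting vector w in Z w = 1, and sum the weights. It uses only the Python standard library and Euclidean distance between 2D points. It does not reproduce the paper's global or local feature vectors, and the point coordinates and scales are hypothetical.

```python
import math


def solve_linear(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def magnitude(points: list[tuple[float, float]], scale: float) -> float:
    """Magnitude of a finite point set at a given scale t.

    Z[i][j] = exp(-t * d(i, j)); the weighting w solves Z w = 1 and the
    magnitude is the sum of the weights.
    """
    z = [[math.exp(-scale * math.dist(p, q)) for q in points] for p in points]
    weights = solve_linear(z, [1.0] * len(points))
    return sum(weights)


# Hypothetical cell positions: two cells close together and one far away.
cells = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
for scale in (0.01, 1.0, 100.0):
    print(scale, round(magnitude(cells, scale), 3))
```

For this configuration the value moves from close to 1 at small scales (everything looks like one point), through roughly 2 at intermediate scales (two clusters), towards 3 at large scales (three separate points), which is the scale dependence the abstract refers to.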

2606.11651 2026-06-11 cs.LG q-bio.QM stat.AP New submission

DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics

Shuni Li, Zhiyuan Ruan, Andy Shen, Ivan Jayapurna, Ting Xu, Haiyan Huang

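AI summary: Proposes DeepRHP, a hybrid variational autoencoder that combines a feature-based VAE with a classical VAE in a semi-supervised framework, captures key chemical features and sequence patterns in the latent space to guide random heteropolymer design, and is validated on its ability to stabilize membrane proteins.

Comments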
Oral presentation at AAAI 2023 Workshop on AI to Accelerate Science and Engineering
Abstract

Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non-native environments and cross-validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.

2606.11646 2026-06-11 cs.LG q-bio.QM stat.ML New submission

Tree-Structured Orthonormal Decomposition of the Aitchison Simplex

Daisuke Yamada, Qijun Zhang, Travis Pence, Barbara B. Bendlin, Federico Rey, Vikas Singh

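AI summary: Proposes PolyILR, which uses tree structure to build an orthonormal decomposition of compositional data, yields stable and interpretable features on microbiome and single-cell data, and establishes a theoretical connection to softmax classifiers.

Comments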
Accepted at ICML 2026. To appear in PMLR vol. 306
Abstract

Compositional data -- vectors encoding relative proportions -- arise across scientific domains, including ecology, geochemistry, and genomics. The features in these data often come with known hierarchical structure (e.g., taxonomies, phylogenies, ontologies), yet existing methods either ignore this structure, discard the intrinsic Aitchison geometry, are designed for binary trees, or yield incomplete coordinate systems. We describe PolyILR, a canonical orthonormal decomposition of the Aitchison tangent space aligned with any tree topology. Our construction defines a weighted local geometry at each internal node capturing full branching structure, then lifts these to a global orthonormal basis where every coordinate corresponds to a specific tree location. On microbiome and single-cell benchmarks, PolyILR yields stable, interpretable features and enables inference at multiscale tree resolution. We also establish a novel theoretical connection to softmax classifiers, suggesting possible applications to probabilistic modeling.
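
For readers unfamiliar with Aitchison geometry, the sketch below shows the classical binary-tree case that the abstract says existing methods are limited to. It is not PolyILR. It computes the centred log-ratio (clr) transform and the standard isometric log-ratio balances for a sequential binary partition. The four-part composition and the tree ((0, 1), (2, 3)) are hypothetical.

```python
import math


def clr(x: list[float]) -> list[float]:
    """Centred log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def balance(x: list[float], left: tuple[int, ...], right: tuple[int, ...]) -> float:
    """ILR balance between two groups of parts at one binary split.

    Equals sqrt(r*s/(r+s)) * log(gmean(left) / gmean(right)), where r and s
    are the group sizes and gmean is the geometric mean.
    """
    r, s = len(left), len(right)
    mean_left = sum(math.log(x[i]) for i in left) / r
    mean_right = sum(math.log(x[j]) for j in right) / s
    return math.sqrt(r * s / (r + s)) * (mean_left - mean_right)


# Hypothetical binary tree over four parts: root split, then one split per child.
SPLITS = [((0, 1), (2, 3)), ((0,), (1,)), ((2,), (3,))]


def tree_ilr(x: list[float]) -> list[float]:
    """One orthonormal coordinate per internal node of the binary tree."""
    return [balance(x, left, right) for left, right in SPLITS]


composition = [0.1, 0.2, 0.3, 0.4]    # hypothetical relative abundances
coords = tree_ilr(composition)
print([round(c, 4) for c in coords])
```

Each coordinate is tied to one internal node of the tree, the coordinates do not change if the composition is rescaled, and their squared norm matches that of the clr vector. Extending this from binary splits to arbitrary branching is the gap the paper addresses.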

2606.11510 2026-06-11 q-bio.QM q-bio.PE stat.ML New submission

Continuous biome representations from Earth observation embeddings

Maxwell B. Joseph, Flávia De Souza Mendes, Dieu My T. Nguyen, Camile Sothe, Christopher B. Anderson (Planet Labs PBC)

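AI summary: To address the way discrete biome maps compress ecological continuity, proposes learning a continuous probabilistic representation from satellite image embeddings; validated on data from 6 Brazilian biomes and 4,672 plant species, it predicts species distributions better than discrete labels.

Comments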
8 pages, 4 figures
Abstract

Biotic communities vary continuously across space, yet biome maps impose categorical boundaries that compress this variation, particularly at ecotones where transitional communities are ecologically distinct. Could Earth observation (EO) foundation models, which encode spectral, spatial, and temporal information with dense embeddings, convert discrete biome maps into continuous representations that better capture ecological variation? Here, we fit a linear classifier on Clay v1.5 satellite image embeddings to predict biome labels from a categorical map. The softmax output yields a continuous probability vector whose dimensions correspond to named biome classes. We evaluate this approach using six Brazilian biomes, 1.3 million embeddings, and 10,015 withheld forest inventory plots spanning 4,672 plant species. The continuous biome representation outperforms discrete biome labels for predicting species occurrence (mean per-species AUC 0.618 vs. 0.570 across 10 spatial cross-validation folds). Decomposing this gain shows that continuity in the graded probability output, rather than label reassignment, accounts for the improvement; the pattern holds across all distances from biome boundaries. The raw 1024-dimensional embedding remains the strongest predictor we tested (mean AUC 0.646 vs. 0.618), but the continuous representation recovers most of the embedding's gain over discrete labels. This simple approach provides a probabilistic replacement for categorical map labels, preserving their meaning while encoding graded variation that discrete maps suppress.
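
The core step, a linear classifier whose softmax output is kept as a graded probability vector instead of being collapsed to a single label, can be sketched as follows. The weights, bias, class names, and the 4-dimensional embedding are hypothetical placeholders. The paper uses 1024-dimensional Clay v1.5 embeddings and six biome classes, and its fitted parameters are not reproduced here.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biome_probabilities(
    embedding: list[float],
    weights: list[list[float]],
    bias: list[float],
) -> list[float]:
    """Linear classifier followed by softmax: one probability per biome class."""
    logits = [
        sum(w * e for w, e in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical classes and parameters (one weight row per class).
CLASSES = ["biome_a", "biome_b", "biome_c"]
WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
BIAS = [0.0, 0.0, 0.0]

embedding = [0.2, -0.1, 0.4, 0.0]     # hypothetical embedding of one location
probs = biome_probabilities(embedding, WEIGHTS, BIAS)
discrete_label = CLASSES[max(range(len(probs)), key=probs.__getitem__)]
print(discrete_label, [round(p, 3) for p in probs])
```

The discrete label keeps only the arg-max class, while the probability vector also records how close the location is to the other classes, which is the graded information the abstract credits for the improvement.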

2606.11508 2026-06-11 cs.LG q-bio.QM New submission

Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

Yifan Xue, Srimukh Prasad Veccham, Saee Paliwal, Tyler Shimko, Micha Livne

Affiliation: NVIDIA

AI summary: Proposes a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information, optimizes reconstruction, contrastive, and chemistry tasks through a unified probabilistic latent-variable objective, and uses task-specific MLP heads in multi-task fine-tuning, with average improvements of 7.6% to 9.9% across three datasets.

Abstract

Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective. For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly-improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.

2606.11426 2026-06-11 math.OC math.CA q-bio.QM New submission

Sharpness characterizes Hill functions

Marc Stephan

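AI summary: Rigorously proves that, among rational functions, Hill functions are the only ones whose sharpness (the supremum of the derivative in semi-log scale) attains the maximum, and that sharpness does not exceed $n/4$ for Hill coefficient $n$.

Comments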
10 pages, 2 figures
Abstract

While long treated as empirical fits, Hill functions have been postulated to be the universal Hopfield barrier for sharpness of input-output responses by Martinez-Corral, Nam, DePace, and Gunawardena. A Hopfield barrier is a fundamental limit on how well biological systems can process information without expending energy. Their case rested on numerical findings for Hill coefficients $4$ and $6$. We give a precise formulation and proof of this: measuring sharpness by the supremum of the derivative in semi-log scale, any rational function $r(x)=(\alpha_0+\alpha_1 x+ \cdots +\alpha_n x^n)/(\beta_0 + \beta_1 x+ \cdots + \beta_n x^n)$ with real coefficients $0\leq \alpha_i\leq \beta_i$ has sharpness at most $n/4$, with equality if and only if $r$ is a Hill function with Hill coefficient $n$.
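
As a worked check of why $n/4$ is the natural bound, consider the standard Hill function with half-saturation constant $K$. Two things here are assumptions rather than statements from the abstract: the symbol $K$, and the reading of "derivative in semi-log scale" as $\mathrm{d}r/\mathrm{d}\ln x$.

$$
h(x) = \frac{x^n}{K^n + x^n}, \qquad
\frac{\mathrm{d}h}{\mathrm{d}\ln x} = x\,h'(x) = \frac{n\,K^n x^n}{(K^n + x^n)^2} = n\,h(x)\,\bigl(1 - h(x)\bigr) \le \frac{n}{4},
$$

with equality exactly at $x = K$, where $h = 1/2$. This only verifies that a Hill function attains $n/4$. The claim that no other rational function of the stated form does is the content of the theorem.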

2606.11415 2026-06-11 q-bio.NC cs.LG physics.data-an q-bio.QM New submission

Spatially Masked Regression Reveals Local and Distributed Predictability in Electrophysiological Recordings

Maryam Ostadsharif Memar, Nima Dehghani

AI summary: Proposes the Spatially Masked Regression (SMR) framework, which quantifies the contributions of local and distributed information to electrode signals by progressively enlarging the masked region; applied to intracranial and scalp EEG data, it finds that neighbouring electrodes contribute strongly but do not account for everything, indicating that the signals contain both local redundancy and global structure.

Abstract

Neural recordings are often interpreted as local measurements, yet the signal at any one sensor can also reflect structured activity distributed across the broader network. This raises a basic question: to what extent does an electrode's signal reflect local versus distributed information in the underlying system? More specifically, how much of an electrode's activity is carried by its immediate neighborhood, and how much is embedded more broadly across the array? We address this with a Spatially Masked Regression (SMR) framework that reconstructs each electrode's timeseries from the remaining electrodes while excluding a configurable neighborhood around the target. By progressively increasing this mask, spatial locality becomes an experimental control for quantifying how much predictive information survives after nearby channels are withheld. We apply SMR to intracranial EEG with heterogeneous electrode coverage and to scalp EEG with standardized montages over sensorimotor cortex. Using distance correlation between original and reconstructed signals, we find strong within-subject reconstruction in both modalities, substantial residual predictability even when local neighbors are excluded, and markedly stronger cross-subject transfer in EEG than in iEEG. Masking shows that nearby electrodes contribute strongly to reconstruction but do not account for all of it, indicating that individual channels reflect both local redundancy and broader distributed structure. Surrogates that preserve selected marginal or spectral properties while disrupting phase structure or temporal ordering substantially reduce performance, supporting the conclusion that SMR depends on structured temporal and cross-channel organization rather than on marginal statistics alone. These results position SMR as an interpretable framework for quantifying the balance between local and distributed information in recordings.
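
The masking idea can be sketched on a toy array. The code below is an illustration under several assumptions, not the authors' pipeline. It uses ordinary least squares with a tiny ridge term as the regressor, scores the fit in-sample, and uses Pearson correlation as a stand-in for the distance correlation reported in the abstract. Electrode positions and signals are synthetic and hypothetical.

```python
import math


def solve_linear(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def predictors_outside_mask(positions, target: int, radius: float) -> list[int]:
    """Channels other than the target that lie farther than `radius` from it."""
    return [i for i, p in enumerate(positions)
            if i != target and math.dist(p, positions[target]) > radius]


def reconstruct(signals, keep: list[int], target: int, ridge: float = 1e-8):
    """In-sample least-squares reconstruction of the target from kept channels."""
    cols = [signals[i] for i in keep]
    y = signals[target]
    gram = [[sum(u * v for u, v in zip(ci, cj)) + (ridge if i == j else 0.0)
             for j, cj in enumerate(cols)] for i, ci in enumerate(cols)]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    w = solve_linear(gram, rhs)
    return [sum(wj * col[t] for wj, col in zip(w, cols)) for t in range(len(y))]


def pearson(a: list[float], b: list[float]) -> float:
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def smr_score(signals, positions, target: int, radius: float) -> float:
    keep = predictors_outside_mask(positions, target, radius)
    return pearson(signals[target], reconstruct(signals, keep, target))


# Toy array: the target mixes a component shared with its neighbour ("local")
# and a component also present on distant channels ("shared").
steps = range(200)
shared = [math.sin(0.3 * t) for t in steps]
local = [math.cos(1.1 * t) for t in steps]
other = [math.sin(2.3 * t + 1.0) for t in steps]
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
signals = [
    [g + l for g, l in zip(shared, local)],   # target electrode
    local,                                    # nearest neighbour
    shared,                                   # farther electrode
    [g + n for g, n in zip(shared, other)],   # most distant electrode
]
unmasked = smr_score(signals, positions, target=0, radius=0.0)
masked = smr_score(signals, positions, target=0, radius=1.5)
print(round(unmasked, 3), round(masked, 3))
```

Masking the nearest neighbour lowers the score but leaves it well above zero, because the distant channels still carry the shared component. This is a toy version of the "substantial residual predictability" pattern the abstract describes.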

2606.11264 2026-06-11 q-bio.QM cs.AI New submission

OmniBioTwin: A System-of-Twinned-Systems Framework for Health Digital Twins

Zhaohui Wang, Yu Huang, Jiang Bian

AI summary: Proposes the OmniBioTwin framework, which achieves system-level integration of cross-scale health digital twins through modular twins and interaction operators in a multi-layer network architecture, and validates it on GLP-1 signaling pathways in Alzheimer's disease.

Abstract

Health digital twins (HDTs) promise patient-specific modeling and decision support, but current approaches remain structurally fragmented: monolithic models that address a single organ or task lack cross-scale fidelity, while system-level twins lack generalizable architectural frameworks. We propose OmniBioTwin, a System-of-Twinned-Systems (SoTS) framework that organizes HDTs as modular computational entities coupled through explicit interaction operators within a multi-layer network architecture. The framework comprises seven coordinated layers spanning data integration, autonomous twin modeling, cross-scale coupling, temporal synchronization, and human-in-the-loop decision support. We demonstrate OmniBioTwin by instantiating a multiscale twin for glucagon-like peptide-1 (GLP-1) signaling pathways in Alzheimer's disease, illustrating how molecular, cellular, and organ-level twins can be composed and coupled within a unified system.

2605.00545 2026-06-11 cs.LG cs.AI math-ph q-bio.GN q-bio.QM Updated version

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

Junda Ying, Yuxuan Wang, Bowen Yang, Peijie Zhou, Lei Zhang

AI summary: To address stochasticity and non-conservative mass dynamics (such as cell proliferation and apoptosis) in single-cell snapshot data, proposes Unbalanced Schrödinger Bridge (USB), a simulation-free framework that models jump-like birth-death dynamics at single-cell resolution via the discrete branching Schrödinger bridge problem, enabling efficient trajectory reconstruction and discrete simulation.

Abstract

Inferring cellular trajectories from destructive snapshots is complicated by stochasticity and by non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that integrates both stochastic and unbalanced effects and also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.