arXivDaily: arXiv daily academic digest, updated Monday to Friday
Q-BIO Quantitative Biology 38
2606.09906 2026-06-10 stat.ME q-bio.PE new submission

An information-geometric framework for mapping maximum potential biodiversity

Shinto Eguchi

AI summary: Proposes an information-geometric framework that defines a potential composition and a diversity gap through a constrained variational principle, treats Hill-type diversity and Rao's quadratic entropy in a unified way, and provides benchmark comparisons for ecological conservation.

Comments
22 pages, 1 figure

Abstract

Biodiversity measures are often used descriptively: one computes a diversity index from an observed or estimated community composition and maps the resulting values across space. Conservation planning, however, also requires a site-specific benchmark against which the observed community can be compared. This chapter develops an information-geometric framework for such \emph{potential diversity} and the associated \emph{diversity gap}. The central object is a pair of probability vectors on the species simplex: an observed or realized composition \(p^{\mathrm{obs}}\), and a potential composition \(p^{\mathrm{pot}}\) obtained by a constrained variational principle. The gap is then defined by comparing a diversity functional at these two compositions. The framework is developed for both Hill-type diversity, which measures abundance and evenness, and Rao's quadratic entropy, which incorporates trait, phylogenetic, or ecological dissimilarities among species. A spatial point-process interpretation clarifies how local ecological capacities can be defined before passing to the simplex. Escort constraints, capacity constraints, and divergence projections then provide a unified way to define nontrivial benchmarks beyond the uniform distribution. The resulting formulation separates two distinct questions: how diverse a community is, and how far it is from a locally admissible potential benchmark. It also connects the ecological idea of dark diversity with a continuous, abundance-weighted comparison on the probability simplex. We also outline a dynamic extension in which capacities, species migration, and climate-driven shifts vary over time. Empirical implementation with large-scale citizen-science biodiversity data and trait databases is left for future work.
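
The two diversity functionals named in this abstract have standard closed forms, so the idea of a gap can be illustrated numerically. The sketch below is only an illustration, not the chapter's construction: it computes the Hill number of order q and Rao's quadratic entropy, and it assumes that the gap is the plain difference between the functional at the potential composition and at the observed one, with a uniform composition standing in for the potential benchmark. The function names, the example compositions, and the dissimilarity matrix are hypothetical.

```python
import math

def hill_number(p, q):
    """Hill number (effective number of species) of order q for composition p."""
    p = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        # The q -> 1 limit is the exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

def rao_quadratic_entropy(p, d):
    """Rao's quadratic entropy: sum over i, j of d[i][j] * p[i] * p[j]."""
    n = len(p)
    return sum(d[i][j] * p[i] * p[j] for i in range(n) for j in range(n))

def diversity_gap(functional, p_obs, p_pot):
    """Assumed gap: functional at the potential minus functional at the observed composition."""
    return functional(p_pot) - functional(p_obs)

# Hypothetical three-species site.
p_obs = [0.7, 0.2, 0.1]
p_pot = [1 / 3, 1 / 3, 1 / 3]   # uniform benchmark, the trivial case
d = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]           # all species equally dissimilar

hill_gap = diversity_gap(lambda p: hill_number(p, 2), p_obs, p_pot)
rao_gap = diversity_gap(lambda p: rao_quadratic_entropy(p, d), p_obs, p_pot)
print(hill_gap, rao_gap)
```

When every pairwise dissimilarity equals 1, Rao's quadratic entropy reduces to the Gini-Simpson index, so the two gaps in this toy case measure the same unevenness on different scales. A nontrivial benchmark would replace `p_pot` with a composition obtained under capacity or escort constraints, which the abstract does not specify in enough detail to reproduce here.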

2606.10891 2026-06-10 q-bio.NC new submission

Bilinear gating of motor primitives: a principle linking dendritic computation to rapid goal-directed adaptation

Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi

AI summary: Finds that the burst fraction of neurons in macaque motor cortex encodes goal information, proposes a bilinear gating mechanism to explain its origin, and shows that the mechanism supports zero-shot generalisation and rapid adaptation.

Abstract

Movement requires the motor cortex to specify both \emph{what} action to produce and \emph{which goal} it serves, yet how individual neurons separate these factors is not understood. Here we show that in macaque motor cortex the \emph{burst fraction} of a neuron, the proportion of its spikes emitted in high-frequency bursts, encodes reach direction far more selectively than its overall firing rate. This dissociation is highly consistent: it holds in every one of 12 recording sessions spanning three animals and two laboratories (all $p<10^{-12}$) and survives controls that remove any contribution of firing rate, showing that goal information is concentrated specifically in bursts. We then show that this coding signature is the predicted consequence of dendritic coincidence detection in layer-5 pyramidal neurons: when a goal-related apical input coincides with a state-related basal drive the neuron bursts, so burst probability computes the product of goal and state, a bilinear gate $G(g)\,Y(s)$. A minimal two-compartment spiking model reproduces the effect, and the same multiplicative gate, embedded in a reinforcement-learning agent, supports zero-shot generalisation to new goals and rapid online adaptation, providing a computational rationale for segregating goal information into bursts. These results identify burst fraction as a goal-selective code in motor cortex, tie it to a concrete cellular mechanism, and show that the mechanism confers a learning advantage.
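
The burst fraction is defined in the abstract as the proportion of a neuron's spikes emitted in high-frequency bursts. A minimal sketch of one way to compute it follows. The burst criterion used here (a spike counts as a burst spike if a neighbouring spike lies within 10 ms) is an assumption made for illustration; the abstract does not state the paper's criterion, and the function name and example spike train are hypothetical.

```python
def burst_fraction(spike_times, max_isi=0.010):
    """Fraction of spikes with at least one neighbouring spike within max_isi seconds."""
    n = len(spike_times)
    if n == 0:
        return 0.0
    t = sorted(spike_times)
    in_burst = [False] * n
    for i in range(n - 1):
        # A short inter-spike interval marks both spikes as burst members.
        if t[i + 1] - t[i] <= max_isi:
            in_burst[i] = True
            in_burst[i + 1] = True
    return sum(in_burst) / n

# Hypothetical spike train (seconds): a doublet, an isolated spike, a triplet.
spikes = [0.000, 0.005, 0.200, 0.500, 0.504, 0.508]
print(burst_fraction(spikes))
```

Because the measure is a ratio of spike counts, it can stay constant while the overall firing rate changes, which is what allows the two quantities to be compared as separate codes.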

2606.10879 2026-06-10 q-bio.OT new submission

From the microscope to High Performance Computing centers, a national effort toward automated data workflows for microscopy facility users in France

Guillaume Gay, Théo Barnouin, Marc Mongy, Guillaume Maucort, Perrine Paul-Gilloteaux, Emmanuel Faure

AI summary: To address fragmented data management in biological microscopy facilities, France BioImaging, the French national bioimaging infrastructure, has developed the BioImage Cloud platform on open-source technologies such as OMERO and iRODS, automating the full data lifecycle from acquisition to archiving and supporting integration with HPC centers and public databases.

Comments
25 pages, 3 figures

Abstract

Modern biological microscopy routinely generates large and complex image datasets, including multidimensional, multimodal, and time-resolved acquisitions. While imaging technologies have rapidly evolved, data management infrastructures within microscopy facilities often remain fragmented, relying on heterogeneous local solutions that are difficult to maintain, scale, and integrate with High-Performance Computing (HPC) centers and public data repositories. To address these issues, France BioImaging (FBI), the French national infrastructure for biological imaging, has developed FBI.DATA and the associated BioImage Cloud platform. This initiative aims to provide a coordinated national infrastructure connecting microscopy facilities, centralized storage resources, HPC environments, and public bioimaging archives through interoperable and scalable workflows. The proposed architecture combines open-source technologies including OMERO for image management, iRODS for distributed data orchestration, Authentik for federated authentication, and emerging standards such as OME-Zarr and REMBI metadata recommendations. The infrastructure is designed to support the complete imaging data lifecycle, from acquisition and transfer to visualization, analysis, sharing, and long-term archiving. Beyond the technical implementation, this work presents the organizational and governance strategies required to deploy a shared national infrastructure across distributed imaging facilities. We discuss the challenges associated with interoperability, metadata standardization, sustainability, and user adoption, as well as the perspectives opened by tighter integration between imaging data and large-scale computing resources for future AI-driven bioimage analysis workflows.

2606.10873 2026-06-10 q-bio.QM new submission

Spatial Model Selection and Uncertainty Quantification: Comparing Continuous and Discrete Wound Healing Models

John T. Nardini, Jana L. Gevertz

AI summary: Addresses the choice between partial differential equation and agent-based models for spatial processes by proposing a model selection pipeline based on approximate Bayesian computation; finds that mean-field PDEs outperform ABMs in computational speed and model selection, and applies the pipeline to wound healing data.

Abstract

All data-driven modeling tasks (e.g., parameter estimation, uncertainty quantification, and data forecasting) require the selection of a mathematical model. An overlooked aspect of model selection is modality; for example, there are no guidelines on when to use a partial differential equation (PDE) model or an agent-based model (ABM) for spatial processes. To address this, we created a model selection pipeline that uses approximate Bayesian computation to perform parameter estimation, uncertainty quantification, and model selection (using both information criteria and out-of-sample forecasting). Applying the pipeline to artificial datasets (generated from ABMs) reveals that while both modalities yield comparable parameter estimation performance, the ABM estimates exhibit higher uncertainty, and the PDE models compute more than 1,000$\times$ faster. Surprisingly, the mean-field PDE is often selected over the true generative ABM model using both information criteria and data forecasting. Applying the pipeline to public wound healing data indicates that a PDE model with cell pulling and a time delay is the most appropriate model for these data; however, this model has high levels of parametric uncertainty. This methodology establishes a preliminary framework for selecting the appropriate modeling modality for spatial biological data.
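
Approximate Bayesian computation replaces likelihood evaluation with simulation, which is why it applies equally to a PDE and to an ABM. The sketch below shows the simplest variant, rejection ABC, on a toy one-parameter simulator. It is an assumption-laden illustration: the abstract does not say which ABC variant, summary statistics, or tolerances the authors use, and the simulator, prior, and all names here are hypothetical.

```python
import random
import statistics

def abc_rejection(observed_summary, simulate, prior_sample, distance, eps, n_draws, rng):
    """Keep prior draws whose simulated summary lies within eps of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if distance(simulate(theta, rng), observed_summary) <= eps:
            accepted.append(theta)
    return accepted

def simulate(theta, rng):
    """Toy stand-in for a spatial model: the summary is the mean of 20 noisy draws."""
    return statistics.fmean([rng.gauss(theta, 1.0) for _ in range(20)])

rng = random.Random(0)  # fixed seed so the run is repeatable
accepted = abc_rejection(
    observed_summary=3.0,
    simulate=simulate,
    prior_sample=lambda r: r.uniform(0.0, 10.0),
    distance=lambda a, b: abs(a - b),
    eps=0.2,
    n_draws=5000,
    rng=rng,
)
posterior_mean = statistics.fmean(accepted)
posterior_sd = statistics.stdev(accepted)
print(len(accepted), posterior_mean, posterior_sd)
```

The spread of the accepted draws is the uncertainty estimate, and the cost is dominated by the number of simulator calls. This is where a fast PDE solver and a slow stochastic ABM diverge most in practice.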

2606.10809 2026-06-10 q-bio.PE new submission

Chaos and stability in the marine trophic network: the importance of interactions over complexity

Ilaria Cunico, Guido Occhipinti, Gregor Fussmann, Paolo Lazzari

AI summary: Uses numerical simulations to study the dynamics of a complex marine trophic network, finding that longer trophic chains and more consumers increase chaoticity whereas omnivorous interactions promote stability, which indicates that interactions rather than complexity are the key drivers of stability.

Abstract

Understanding the dynamics of real-world complex networks is crucial for assessing their predictability and resilience and for improving ecosystem management, especially in the context of climate change. The relationship between stability and complexity in ecological networks is still debated in the literature. In this modeling study, we investigate whether a complex marine trophic network, characterized by multiple trophic interactions and environmental constraints, exhibits predominantly stable, periodic, or chaotic dynamics. We incorporate the microbial loop into a trophic network model, which includes one to three primary producers, one or two consumers, and up to three trophic levels of predators. The microbial loop is a key process in which bacteria recycle detritus from higher trophic levels into nutrients available for the growth of primary producers, ensuring mass conservation within the system. We perform numerical simulations to investigate the dynamic behavior of the network, exploring several configurations by turning off predator-prey links between species and varying the high-dimensional parameter space. Our results show that (i) longer trophic chains and (ii) a higher number of consumers increase system chaoticity, whereas (iii) omnivorous interactions promote stability. Notably, many of the configurations exhibit high percentages of chaotic behavior. Feedback loop analysis suggests that the balance between negative and positive interactions plays a key role in the convergence of the system toward a steady state. This study shows that interactions and feedback, rather than complexity, are key drivers of stability, pointing to the absence of a clear stability-complexity relationship and instead highlighting a stability-interaction dependence. Chaotic dynamics may also play an important role, with potential implications for predictability and ecosystem management.

2606.10636 2026-06-10 q-bio.OT new submission

Compositional proofreading through critical self-tuning

Omer Karin

AI summary: Proposes that critical tuning through competition implements proofreading in multicomponent systems by concentrating the population into persistent components, and predicts that de-pinning transitions may explain aberrant activation in cancer, immunodeficiency, and ageing.

Abstract

High-dimensional multicomponent systems, including immune and epigenetic repertoires, must selectively retain rare, beneficial components while purging a massive influx of suboptimal variants. We demonstrate that critical tuning of component control parameters through competition naturally implements proofreading in these systems. Competition for shared inputs pins the system to the marginal stability threshold of the most persistent species. This grants dominant species extended lifetimes, concentrating the population into dominant components while forcing less-stable variants into rapid drift-driven turnover. When aggregate drive exceeds a characteristic scale, this pinning fails, producing a non-selective state where component lifetimes scale as a universal power law with aggregate drive. Applying this framework to biological memory, we identify the hallmarks of this effect in plasma cell accumulation dynamics and propose that de-pinning transitions may represent failure points across biological domains, including cancer, immunodeficiencies, and the aberrant activation of harmful genomic elements during ageing.

2606.10605 2026-06-10 q-bio.PE new submission

Modeling pest dynamics in trap cropping to improve yield: the effects of attraction, retention, and land allocation

Matthew H Holden

AI summary: Uses a yield-maximisation model to study how trap-crop attractiveness, pest retention, and the share of land allocated affect the efficacy and feasibility of pest control, finding that reducing pest dispersal from the trap crop to the main crop greatly reduces the trap area required.

Abstract

Trap crops reduce damage to a cash (main) crop by attracting pests away from it. Yet this protection is weakened when pests disperse back into the cash crop. In this paper, we focus on the importance of preventing this backflow, showing that effective trap cropping depends jointly on how strongly pests are attracted to trap plants and how rarely they leave them. Together with the proportion of the field devoted to trap plants, these processes determine both the efficacy and feasibility of trap cropping at commercial scales. We formalise this relationship using a simple yield-maximisation framework, in which growers weigh pest suppression benefits against the land sacrificed to trap plants. The model shows that when dispersal from trap plants equals that from the cash crop, optimal trap coverage can exceed 20 to 30 percent of the landscape, levels rarely acceptable to growers. However, reducing pest dispersal off trap plants to just one-quarter of cash crop dispersal lowers the optimal required trap area to approximately 5 percent, transforming trap cropping from impractical to feasible. Understanding these relationships can guide trap-cropping design, from plant choice to targeted interventions that reduce pest movement, to minimise damage, maximise yield, and make trap cropping a reliable component of sustainable pest management.

2606.10109 2026-06-10 q-bio.OT new submission

When is Enough Enough? A Proposed Termination Point for the Number of Replicates in Computational Simulations

Eric T. Lofgren, Kellen Myers, Nina H. Fefferman

AI summary: To address the fact that statistical significance in computational simulations can be obtained simply by increasing the number of trials, proposes the Ω test as a termination criterion for the number of simulation replicates, aiming to improve efficiency and establish a shared standard.

Abstract

Computational simulation provides a powerful toolkit for in silico experimentation. However, while the field has developed best practices for the design and implementation of such models, there remains ambiguity in discussions about how to understand and/or interpret their results due to their inherent ability to overwhelm traditional frequentist statistics by simply increasing the number of trials simulated. This fails the discipline in two ways: first, it leaves the community unsure of what constitutes a best practice for uniform understanding, and second, it potentially overburdens computational studies that burn clock cycles solely to ensure "enough runs to satisfy peers" without any theoretical underpinning for a definition of "enough". We propose a simple and straightforward standard for when to stop simulating additional trials, the Ω test, designed to be analogous to the function of traditional frequentist P-tests. Community adoption of a reasonable and uniform standard will permit more efficient computational experimentation and clear communication and interpretation of the findings discovered in this way.

2606.10889 2026-06-10 q-bio.NC cs.LG new submission

Sleep EEG Signal Criticality as a Non-Invasive Predictor of Cognitive Decline in Dementia

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

AI summary: Quantifies sleep EEG signal criticality with multifractal detrended fluctuation analysis, finding that cognitively healthy individuals are closer to an optimally critical state and that DFA exponents in the dementia group shift toward 1.0; this suggests that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms and could serve as an early screening tool.

Comments
4 pages, 2 figures, accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026

Abstract

Early detection of neurodegeneration remains a critical clinical challenge. This study investigates whether sleep EEG signal criticality, quantified via Multifractal Detrended Fluctuation Analysis (MFDFA), serves as a non-invasive biomarker for future cognitive decline. We analyzed longitudinal data from the National Sleep Research Resource (NSRR) Study of Osteoporotic Fractures (SOF) cohort, comparing baseline sleep EEG dynamics between women who remained cognitively normal and those who later progressed to dementia-related impairment ($3MS < 78$). Our results reveal significant group-level differences in Hurst exponent $H(q)$ distributions, particularly during non-REM stages N2 and N3. Cognitively healthy individuals exhibited signal dynamics significantly closer to an optimally critical state across all electrode locations ($p \leqslant 0.001$), supporting the Brain Criticality Hypothesis. Supervised UMAP projections confirmed clear spatial separation between groups throughout the overnight sleep architecture. The dementia group demonstrated a shift in DFA exponents toward $1.0$, suggesting that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms. These findings highlight the potential for MFDFA-derived measures to be integrated into automated, sleep-based screening tools, enabling earlier preventative interventions during the prodromal window of dementia.

2606.10238 2026-06-10 q-bio.NC cs.AI new submission

Hyperbolic Neural Population Geometry Benefits Computation

Dennis Wu, Yi-Chun Hung, Braden Yuille, James E. Fitzgerald, Han Liu

AI summary: Presents a theoretical framework in which hippocampal population activity induces hyperbolic geometry, proves that the Modern Hopfield Network update rule computes the minimum mean-squared-error estimator, and introduces a new associative memory model in hyperbolic space with significantly larger capacity than existing models.

Comments
Accepted at ICML 2026, 37 pages, 5 figures

Abstract

Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure underlies population activity in the hippocampus. Here we provide a theoretical framework for this phenomenon. First, we propose a plausible construction of hippocampal tuning curves that statistically induces hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the minimum mean-squared-error (MMSE) estimator. Finally, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than leading models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, improving both memory capacity and decoding accuracy.
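
The stated link between the Modern Hopfield Network update and MMSE estimation can be checked numerically in one simple Euclidean setting. The sketch below assumes stored patterns of equal norm, a uniform prior over them, and isotropic Gaussian observation noise with variance sigma squared; under those assumptions the posterior mean coincides with the softmax update at inverse temperature beta = 1 / sigma squared. These assumptions are chosen for illustration and may differ from the conditions in the paper, and all names are hypothetical. The hyperbolic model itself is not reproduced.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def weighted_sum(weights, patterns):
    """Convex combination of the stored patterns."""
    dim = len(patterns[0])
    return [sum(w * x[k] for w, x in zip(weights, patterns)) for k in range(dim)]

def hopfield_update(patterns, query, beta):
    """Modern Hopfield retrieval: softmax over scaled inner products, then average."""
    scores = [beta * sum(a * b for a, b in zip(x, query)) for x in patterns]
    return weighted_sum(softmax(scores), patterns)

def posterior_mean(patterns, y, sigma):
    """MMSE estimate of the stored pattern given y = pattern + Gaussian noise."""
    log_lik = [-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2)
               for x in patterns]
    return weighted_sum(softmax(log_lik), patterns)

# Hypothetical equal-norm patterns and a noisy observation of the first one.
patterns = [[1.0, 1.0, -1.0], [1.0, -1.0, 1.0], [-1.0, 1.0, 1.0]]
y = [0.9, 0.8, -0.7]
sigma = 0.5

retrieved = hopfield_update(patterns, y, beta=1.0 / sigma ** 2)
mmse = posterior_mean(patterns, y, sigma)
print(retrieved, mmse)
```

The two results agree because expanding the squared distance leaves only the inner-product term once the pattern norms are equal; the remaining terms are constant across patterns and cancel inside the softmax.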

2606.11144 2026-06-10 cs.LG q-bio.GN q-bio.QM stat.AP new submission

OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

Abhijoy Sarkar, Aarchi Singh Thakur

Affiliation: Span AI

AI summary: To address the lack of a public benchmark for predicting resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer, proposes the OncoTraj benchmark, which harmonizes data from 813 patients and defines three tasks; finds that single-timepoint tissue NGS features leave all models near chance, while TP53 co-mutation is associated with a higher progression rate.

Comments
24 pages, 7 figures, 4 tables. Code, data, and trained model weights: https://github.com/span-ai-labs/oncotraj. Python package: pip install oncotraj. Dataset: https://huggingface.co/datasets/span-ai-labs/oncotraj-v1

Abstract

Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.

2606.11091 2026-06-10 eess.SY cs.SY q-bio.NC new submission

QUIET: Quantifying Underutilized Influential Edges for Targeted Synchronization

Sovesh Mohapatra, Christoffer G. Alexandersen, Panagiotis Fotiadis, Max B. Kelz, John A. Detre, Fabio Pasqualetti, Dani S. Bassett

AI summary: Proposes QUIET, an edge-centric framework that combines structural controllability with functional mutual information to identify energy-efficient synchronization pathways, and validates it on synthetic networks and human connectomes.

Comments
38 Pages; 6 Figures; 8 SIs

Abstract

Network control theory can be used to model intrinsic and extrinsic strategies to steer neural dynamics. Standard approaches are node-centric, structural, and focused on achieving desired instantaneous states. Here, we develop an edge-centric approach that incorporates both structure and function to achieve extended patterns of neural dynamics characterized by desired synchronization states. Our method, Quantifying Underutilized Influential Edges for Targeted Synchronization (QUIET), is an edge-centric framework that integrates structural controllability of individual white matter connections and mutual information between pairwise functional timeseries to identify energy-efficient synchronization pathways. QUIET identifies quiet highways, edges that are structurally influential but functionally underutilized, to optimize regional synchronization. We validated QUIET across 75 synthetic configurations, where QUIET-ranked edge sets significantly outperformed random selection in 93% of cases (p<0.01). The framework, tested on Human Connectome Project participants, revealed that the control energy required for synchronization of the salience network correlates with fluid intelligence. QUIET, applied to healthy adults undergoing dexmedetomidine-induced unresponsiveness, showed that the frontoparietal and default-mode networks exhibited the largest control energy required for synchronization in both awake and sedated states. QUIET is released as stand-alone software for studying theoretically defined synchronization pathways, which in turn could inform testable hypotheses in perturbative studies.

2606.11066 2026-06-10 cs.LG q-bio.NC new submission

GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

Xiangsheng Ge, Yang Xie

AI summary: Proposes GRAFT, a model that separates reusable temporal dynamics from a recalibratable neuron interface, reaching state-of-the-art performance on the MC Maze dataset and achieving cross-day recalibration by updating only 9.21% of parameters.

Abstract

Neural population activity models can recover rich temporal structure from binned spikes, but their read-in and readout layers often remain tied to a fixed set of recorded neurons. This coupling limits reuse in long-term brain-computer interfaces, where recorded neuron identities, counts, and response statistics can change across days. We introduce GRAFT, a Transformer-based neural population activity model that separates reusable temporal dynamics from a recalibratable neuron interface. The neuron interface controls how recorded neurons enter and leave the shared backbone, and auxiliary gain and positional mechanisms support neural activity modeling inside the Transformer. On MC Maze under the standard NLB'21 protocol, GRAFT reaches 0.3866 co-bps as an ensemble, setting a new state of the art on the primary co-bps metric among public and reported NLB'21 results. In a cross-day protocol constructed from the NLB'21 MC Maze dataset series, GRAFT recalibrates from MC Maze to the scaled MC Maze datasets (Large/Medium/Small) by updating only 9.21% of parameters, reaching 0.3749, 0.3112, and 0.3152 co-bps with restricted target-day support sets. These results show that the same interface-backbone separation supports both strong Transformer-based neural population activity modeling and data-efficient cross-day recalibration.

2606.11057 2026-06-10 cs.LG q-bio.BM stat.ML new submission

Flexible Kernels for Protein Property Prediction

Martin Jankowiak, Yerdos Ordabayev, Rudraksh Tuwani, Henry N. Ward, Hunter Nisonoff, James M. McFarland, Gevorg Grigoryan

AI summary: Proposes sequence kernels that exploit evolutionary substitution matrices and local linearity, combines them with Gaussian processes for data-efficient protein property prediction, and incorporates structural information for multi-task learning.

Comments
50 pages; to appear at ICML 2026

Abstract

Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore, by learning what are in effect structure-aware substitution matrices, we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.

2606.10543 2026-06-10 cs.LG cs.AI cs.ET q-bio.QM new submission

Flexible Flows for Biological Sequence Design

Yogesh Verma, Dani Korpela, Harri Lähdesmäki, Vikas Garg

Affiliations: Aalto University; YaiYai Ltd; OpenProtein.AI

AI summary: Proposes a structured coupling, a latent edit-based rate parameterization, and a latent classifier-free guidance mechanism that enable variable-length sequence generation and fine-grained control, achieving state-of-the-art performance across a range of biological sequence tasks.

Abstract

Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.

2606.10410 2026-06-10 cs.LG eess.SP q-bio.QM new submission

A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection

生理信号中的综合推理时增强框架:应用于基于PPG的房颤检测

Davood Fattahi, Runze Yan, Saurabh Kataria, Zhaoliang Chen, Xiao Hu

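AI summary: Proposes a unified inference-time augmentation framework with 13 augmentation methods and hyperparameters tuned by Bayesian optimization; on PPG-based AF detection it significantly improves AUROC and AUPRC and reduces the false positive rate.

Details
Comments
22 pages, 11 figures, 4 tables. Under review at Physiological Measurement
AI Chinese abstract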

目标:在真实部署中,生理信号的准确分类面临传感器噪声、运动伪影以及训练数据与部署数据之间分布偏移的挑战。推理时增强(ITA)在推理过程中应用增强而非重新训练,提供了一种简单、模型无关的机制来提高鲁棒性。然而,ITA在生理信号中的应用范围仍然狭窄,依赖于有限的增强方法和固定的未优化参数。本文提出一个统一的ITA框架以解决这一差距。方法:该框架包含13种增强方法,涵盖时域、幅值域、频域和伪影注入变换,并通过贝叶斯优化优化超参数。我们使用GPT-PPG和ResNet在五个数据集(包含400多名患者和约9,800小时记录)上评估基于30秒PPG信号的房颤(AF)检测。主要结果:标准ITA持续改善了AUROC(GPT-PPG最高提升8.5%,ResNet最高提升0.7%)和AUPRC(GPT-PPG最高提升10.6%,ResNet最高提升0.8%)。选择性ITA进一步将非AF数据集上的平均FPR降低了高达4.4%(GPT-PPG)和1.3%(ResNet)。意义:这些发现确立了ITA作为一种实用的、模型无关的方法,用于在无法重新训练的部署环境中提高基于PPG的房颤分类可靠性,并具有更广泛的生理信号分析适用性。

English abstract

Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and approximately 9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
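
Illustrative note (not from the paper): the core idea of inference-time augmentation, averaging a fixed model's predictions over perturbed copies of one input, can be sketched in a few lines. Everything named here is an assumption made for illustration: `toy_model` is a hypothetical stand-in for a trained classifier, only two of the many possible augmentations are shown, and the noise scale and number of views are fixed by hand rather than tuned by Bayesian optimization as the abstract describes.

```python
import random
from statistics import mean


def jitter(signal, scale, rng):
    """Amplitude-domain augmentation: add small Gaussian noise to each sample."""
    return [x + rng.gauss(0.0, scale) for x in signal]


def time_shift(signal, k):
    """Time-domain augmentation: circularly shift the signal by k samples."""
    k %= len(signal)
    return signal[k:] + signal[:k]


def ita_predict(model, signal, n_views=8, noise_scale=0.05, seed=0):
    """Average the model's score over the original signal and augmented views."""
    rng = random.Random(seed)
    views = [signal]
    for _ in range(n_views):
        shifted = time_shift(signal, rng.randrange(len(signal)))
        views.append(jitter(shifted, noise_scale, rng))
    return mean(model(v) for v in views)


def toy_model(signal):
    """Hypothetical classifier: maps signal variance to a score in [0, 1)."""
    m = mean(signal)
    var = mean((x - m) ** 2 for x in signal)
    return var / (1.0 + var)


if __name__ == "__main__":
    segment = [0.0, 1.0] * 20  # made-up stand-in for a PPG segment
    print(ita_predict(toy_model, segment))
```

The model itself is never retrained; only the inputs seen at inference change, which is why the approach is described as model-agnostic.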

2606.10407 2026-06-10 cs.SD cs.CV q-bio.QM new submission

Time-frequency localization of bird calls in dense soundscapes

密集声景中鸟鸣的时频定位

Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre

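Affiliations: Acoustic Research Laboratory, National University of Singapore; Tropical Marine Science Institute, National University of Singapore; School of Marine Science and Technology, Northwestern Polytechnical University

AI summary: Treats bird call detection as an object detection task on spectrograms, trains YOLO11 models to localize bird calls in dense tropical soundscapes, and introduces the IoMin evaluation metric; outperforms the baseline on both in-distribution and out-of-distribution data.

Details
AI Chinese abstract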

被动声学监测能够大规模观测野生动物,但大多数生物声学分类器仅预测时间窗口内的物种存在,而无法在时间或频率上精确定位发声,限制了后续分析。我们将鸟鸣检测视为频谱图上的目标检测任务,训练YOLO11模型在新加坡密集热带声景中定位鸟鸣。此外,我们引入了一个开源的基于浏览器的标注工具,并提出了Intersection over Minimum (IoMin)评估指标,该指标比标准IoU更好地处理模糊的声学边界,更适合当前问题。最佳YOLO模型在新加坡的分布内声景中几乎将基线性能翻倍(81.8% vs. 42.1% IoMin@50 F1分数),同时在夏威夷的未见分布外录音上仍优于基线(58.6% vs. 48.6%)。这些结果表明,目标检测框架是复杂声景中动物发声时频定位的一种有前景的方法。

English abstract

Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
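
Illustrative note (not from the paper): the abstract names Intersection over Minimum but does not give its formula. The sketch below reads the name literally, as intersection area divided by the area of the smaller box, and compares it with standard IoU. The box layout `(t_start, f_low, t_end, f_high)` and the example numbers are assumptions for illustration only.

```python
def area(box):
    """Area of a time-frequency box (t_start, f_low, t_end, f_high)."""
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def intersection(a, b):
    """Overlap area of two time-frequency boxes (0 if they do not overlap)."""
    t0, f0 = max(a[0], b[0]), max(a[1], b[1])
    t1, f1 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def iou(a, b):
    """Standard Intersection over Union."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a, b):
    """Intersection over the smaller box's area (literal reading of 'IoMin')."""
    smaller = min(area(a), area(b))
    return intersection(a, b) / smaller if smaller > 0 else 0.0


if __name__ == "__main__":
    annotation = (0.0, 2000.0, 10.0, 4000.0)  # loosely drawn label
    prediction = (2.0, 2500.0, 8.0, 3500.0)   # tight box inside the label
    print(iou(annotation, prediction), iomin(annotation, prediction))
```

Under this reading, a tight prediction lying entirely inside a generously drawn annotation scores low on IoU but reaches the maximum on IoMin, which matches the abstract's remark about ambiguous acoustic boundaries.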

2606.10107 2026-06-10 cs.CV q-bio.QM new submission

Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

最大匹配精度:利用全局最优匹配的实例分割评估指标

Kaden Stillwagon, Alexandra D. VandeLoo, Craig R. Forest

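AI summary: Proposes Maximum Matching Accuracy (MMA), which uses globally optimal one-to-one matching and per-pixel normalization to overcome the discontinuity, insensitivity, and non-optimal matching of existing metrics for cell segmentation evaluation, giving more stable, sensitive, and interpretable scores.

Details
AI Chinese abstract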

可靠评估实例分割模型需要准确且一致反映分割质量的指标。然而,生物成像中最广泛使用的指标存在根本性的数学缺陷:硬交并比阈值导致不连续、低灵敏度的评分;逐对象归一化在对象大小变化下扭曲分数;以及贪婪或一对多匹配过程产生非最优、顺序依赖的对应关系。这些特性共同导致在常见失败模式(如细胞分裂、细胞合并和细胞边界不精确)下产生不直观且不可靠的模型排名。我们提出最大匹配精度(MMA),一种无阈值连续指标,它找到预测对象与真实对象之间的全局最优一对一匹配,并使用逐像素归一化聚合总重叠。我们在三个实验(合成失败案例、渐进式破坏测试和模型排名比较)中评估MMA与AP@50、PQ、SEG和AJI。MMA产生的分数比现有替代方案更稳定、更敏感、更可解释,为生物细胞成像中的公平实例分割基准测试提供了原则性基础。

English abstract

Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or one-to-many matching procedures that yield non-optimal, order-dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold-free continuous metric that finds a globally optimal one-to-one matching between predicted and ground truth objects and aggregates total overlap using per-pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.
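
Illustrative note (not from the paper): the abstract describes MMA only in words, so the following is a rough sketch of the two ingredients it names, a globally optimal one-to-one matching and a per-pixel normalization, and not the paper's exact definition. The normalization used here (matched overlap divided by all pixels covered by any object) is an assumption, as are the function names. The matching is found by brute force over permutations, which is only practical for a handful of objects; a Hungarian-algorithm solver would be the usual choice at scale.

```python
from itertools import permutations


def best_matching(gt, pred):
    """Search all one-to-one matchings for the one with maximal pixel overlap."""
    # Pad the shorter list with empty objects so both sides have equal length.
    n = max(len(gt), len(pred))
    gt_p = gt + [set()] * (n - len(gt))
    pred_p = pred + [set()] * (n - len(pred))
    best_total, best_perm = -1, None
    for perm in permutations(range(n)):
        total = sum(len(gt_p[i] & pred_p[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm


def matching_score(gt, pred):
    """Matched overlap divided by all pixels covered by any object (assumed form)."""
    matched, _ = best_matching(gt, pred)
    covered = set().union(*gt, *pred)
    return matched / len(covered) if covered else 1.0


if __name__ == "__main__":
    cell = {(0, c) for c in range(8)}        # one ground-truth cell, 8 pixels
    left = {(0, c) for c in range(4)}        # prediction splits it in two
    right = {(0, c) for c in range(4, 8)}
    print(matching_score([cell], [cell]), matching_score([cell], [left, right]))
```

Because each ground-truth object may be matched to at most one prediction, the split cell is credited for only one of its halves, and the score changes continuously with the overlap instead of jumping at an IoU threshold.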

2606.10080 2026-06-10 cs.LG cs.AI q-bio.QM new submission

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

VFUSE: 基于稀疏自编码器的毒力特征理解

Michael Yu, Matthew L. Olson

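AI summary: Proposes VFUSE, which trains sparse autoencoders (SAEs) on diffusion-transformer activations to identify hazard-related features in protein design models, improving interpretability without sacrificing performance.

Details
AI Chinese abstract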

生成模型在蛋白质设计等领域取得了显著进展,但这种能力也使得危险蛋白质的生成变得不透明。在这项工作中,我们引入了VFUSE(基于稀疏自编码器的毒力特征理解),这是一种机制可解释性方法,通过在扩散-Transformer激活上训练SAE来审计蛋白质模型中的危险感知特征。我们将VFUSE应用于RoseTTAFold3和RFDiffusion3,这些是流行的开源蛋白质折叠和合成模型。我们发现,对于某些模块,线性探针在SAE潜在空间中的拟合效果显著优于原始模型表示,从而在不牺牲模型性能的情况下提高了可解释性。此外,我们识别出SAE中的单语义特征,这些特征仅在危险设计上激活,AUROC高达0.84(q < 10^{-13})。据我们所知,这是首次在全原子扩散模型上训练SAE,也是首次对蛋白质设计模型进行特征级毒力审计,为安全且可解释的蛋白质设计铺平了道路。

English abstract

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.

2606.09898 2026-06-10 cs.LG cs.MA q-bio.QM new submission

TRAPS: Therapeutic Response Analysis via Pathway-informed Stratification

TRAPS: 基于通路信息分层的治疗反应分析

Sujoy Banik, Sayantan Chakraborty, Boishakhi Das Toma, Zainab Ghafoor, Ushashi Bhattacharjee, Koushik Howlader, Tirtho Roy

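AI summary: Presents the first unified benchmark evaluating three pathway-guided deep learning models on joint prediction of cancer therapy response and survival, finding that each model has strengths and weaknesses on different tasks.

Details
AI Chinese abstract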

癌症治疗规划需要同时考虑多个临床维度。临床医生必须确定患者是否应接受靶向分子治疗、放疗,以及他们是否可能存活超过六个月。现有的通路引导深度学习模型是孤立开发和测试的,无法进行跨架构的公平比较。我们提出了第一个用于通路引导治疗反应建模的统一基准,评估了三种生物信息学架构:BINN、GraphPath 和 PATH,使用了来自癌症基因组图谱的五个癌症队列,代表 2,622 名患者,这些患者使用 Reactome 通路活性评分进行编码。每个模型在相同的数据和评估条件下联合训练所有三个临床结果,这是第一项将通路结构化深度学习视为联合治疗和生存预测问题的研究。我们的结果表明,没有一个架构在所有任务中获胜:PATH 在整体靶向分子治疗预测中表现最佳,BINN 在生存预测中最可靠,而没有一个模型能对放疗产生有用的预测,因为该决策的关键驱动因素是基因表达数据中未捕获的临床变量。最引人注目的是,GraphPath 在前列腺靶向分子治疗预测中达到了 0.92 的 AUROC,是整个基准中的最高分,这表明当与具有狭窄靶向驱动程序的队列匹配时,即使存在极端类别不平衡(仅 11% 阳性率),横向共调控结构也能产生卓越的判别能力。

English abstract

Cancer treatment planning requires decisions across multiple clinical dimensions at once. Clinicians must determine whether a patient should receive targeted molecular therapy, whether they should receive radiation therapy, and whether they are likely to survive beyond six months. Existing pathway-informed deep learning models have been developed and tested in isolation, making fair comparison across architectures impossible. We present the first unified benchmark for pathway-guided therapy response modeling, evaluating three biologically informed architectures, BINN, GraphPath, and PATH, across five cancer cohorts drawn from The Cancer Genome Atlas, representing 2,622 patients encoded using Reactome pathway activity scores. Each model is trained jointly on all three clinical outcomes under identical data and evaluation conditions, making this the first study to treat pathway-structured deep learning as a combined therapy and survival prediction problem. Our results show that no single architecture wins across all tasks: PATH performs best for targeted molecular therapy prediction overall, BINN is most reliable for survival prediction, and no model produces useful predictions for radiation therapy, as the key drivers of that decision are clinical variables not captured in gene expression data. Most strikingly, GraphPath achieves an AUROC of 0.92 on prostate targeted molecular therapy prediction, the highest score in the entire benchmark, demonstrating that lateral co-regulation structure produces exceptional discriminative power when matched to a cohort with a narrow targetable driver programme, even under conditions of extreme class imbalance at only 11% positive prevalence.

2606.09952 2026-06-10 q-bio.QM physics.med-ph new submission

Adjusted trajectory of medication exposure taking into account the periodicity of dispensations and the number of dispensed packs and comparative analysis on EFEMERIS database

考虑配药周期性和配药包数的药物暴露轨迹调整方法及基于EFEMERIS数据库的比较分析

Cécile Chouquet, Anna-Belle Beau, Christine Damase-Michel, David Jeauneau, Isabelle Lacroix, Sabine Mercier

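AI summary: Proposes a method that adjusts medication exposure trajectories according to the number of dispensed packs and the type of dispensation (occasional or regular); comparing three trajectory-calculation scenarios on EFEMERIS data shows that the adjustment improves clustering quality and affects the analysis of neonatal outcomes.

Details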
Journal ref
only, 2025, vol. 20, no 2, p. e0308767
Comments
10 pages, 2 figures, 3 tables
AI Chinese abstract

我们提出了一种基于配药包数和配药类型(偶尔或规律)计算药物暴露轨迹的调整方法。基于EFEMERIS数据进行了比较研究,使用了三种不同的轨迹计算场景,取决于是否考虑配药包数和配药周期性。通过所有暴露女性的限定日剂量(DDD)数量的全局指标突出了场景的影响;研究了从一个场景到另一个场景个体轨迹的变化;我们还比较了分为四组的聚类结果。如果65%的轨迹保持不变,我们可以在其余轨迹中观察到DDD数量和/或个体暴露概况的显著变化。我们观察到4%的轨迹被分配到不同的聚类,并且调整方法的聚类质量更好。根据研究背景,某些母亲特征和新生儿结局的聚类分布可能受到影响。例如,属于高剂量精神药物聚类的母亲的新生儿出现新生儿病理的发生率更高,从而强化了先前研究关于高暴露于精神药物与新生儿病理存在关联的结论。

English abstract

We presented an adjustment method for calculating medication exposure trajectories based on the number of dispensed packs and the type of dispensation (occasional or regular). A comparative study on EFEMERIS data was carried out using three trajectory-calculation scenarios, depending on whether or not the number of packs and the periodicity of dispensations were taken into account. The impact of the scenario was assessed using global indicators of the number of Defined Daily Doses (DDD) across all exposed women; we also studied changes in individual trajectories from one scenario to another and compared the results of a clustering into four groups. Although 65% of the trajectories remained unchanged, the remainder showed significant changes in the number of DDD and/or in the individual exposure profile. In total, 4% of trajectories were assigned to a different cluster, and clustering quality was better with the adjustment method. Depending on the study context, the cluster distribution of some maternal characteristics and neonatal outcomes could be affected. For example, neonatal pathology occurred more often among neonates whose mothers belonged to the cluster with high doses of psychotropics, reinforcing the conclusions of previous studies linking high exposure to psychotropic medications with pathology in the newborn.

2606.10878 2026-06-10 physics.bio-ph q-bio.CB new submission

Spontaneous polarization for protrusion-driven cell crawling

自发极化驱动突起介导的细胞爬行

Pierre Recho

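AI summary: Proposes a minimal one-dimensional continuum model in which feedback between cell motion and an external chemical regulator breaks symmetry, producing spontaneous polarization that drives cell crawling; the model predicts realistic crawling speeds and actin-density profiles.

Details
AI Chinese abstract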

我们提出了一个最小的一维连续模型,用于描述在刚性基底上自发启动突起驱动的细胞爬行。细胞骨架被表示为粘性肌动蛋白网络,该网络在体内周转并在两个移动的细胞边缘聚合。对称性破缺源于细胞运动、外部化学调节因子(调控肌动蛋白成核)以及细胞前沿肌动蛋白聚合之间的反馈。当细胞移动时,调节因子在移动边界周围极化,从而在两个边缘施加不同的肌动蛋白成核密度。这产生不等的突起速率,进而增强运动并维持化学极化。当突起活性超过临界值时,静态对称态失稳,系统经历分岔进入运动极化态。根据外部线索如何控制肌动蛋白成核,相变可以是超临界或亚临界,后者导致静态和运动态共存。使用适合角质细胞的参数值,模型预测了真实的爬行速度和肌动蛋白密度分布,包括不对称的边缘局部密度峰值。这些结果确定了一种通用机制,通过该机制,肌动蛋白成核的外部生化调节可以触发沿一维轨道的自发运动,而不需要分子马达、特定的粘附动力学、可变形基底或预先存在的极性。

English abstract

We propose a minimal one-dimensional continuum model for the spontaneous initiation of protrusion-driven cell crawling on a rigid substrate. The cell cytoskeleton is represented as a viscous actin meshwork that turns over in the bulk and polymerizes at two moving cell edges. Symmetry breaking arises from the feedback between cell motion, an external chemical regulator of actin nucleation, and actin polymerization at the cell fronts. When the cell moves, the regulator becomes polarized around the moving boundaries, thereby imposing different actin nucleation densities at the two edges. This generates unequal protrusive rates, which in turn reinforce motion and sustain the chemical polarization. Above a critical protrusive activity, the static symmetric state loses stability and the system undergoes a bifurcation toward a motile polarized state. Depending on how the external cue controls actin nucleation, the transition can be either supercritical or subcritical, leading in the latter case to coexistence between static and motile states. Using parameter values appropriate for keratocyte cells, the model predicts realistic crawling speeds and actin-density profiles, including asymmetric edge-localized density peaks. These results identify a generic mechanism by which external biochemical regulation of actin nucleation can trigger spontaneous motility along a one-dimensional track without requiring molecular motors, specific adhesion dynamics, deformable substrates, or pre-existing polarity.

2606.10355 2026-06-10 nlin.PS q-bio.CB new submission

Mean-field models for morphogenetic processes in physiological contexts

生理背景下形态发生过程的平均场模型

D. Hernández, Alejandro Valdés López, E. C. Herrera-Hernández

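AI summary: Proposes a biophysical formalism that uses coupled reaction-diffusion equations to describe the spatiotemporal evolution of chemical profiles in tissues, modeling tissue compartmentalization and the mechanism by which cells keep the system out of equilibrium, and finds that single-morphogen systems can also produce Turing patterns.

Details
AI Chinese abstract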

本工作引入了一种生物物理形式,用于描述组织中化学轮廓的时空演化,其新颖之处在于对组织区室化以及细胞通过产生和/或降解物质使系统远离热力学平衡的机制进行建模。模型基于守恒定律、化学动力学理论和几何约束推导,同时考虑组织的基本性质以连接理论建模与实验观察。在形态发生背景下,每个形态原由两个耦合的反应-扩散方程描述,代表细胞内和细胞外动力学,通过膜运输过程(如非线性、交叉和反常扩散)连接。我们通过扩散驱动的不稳定性探索模型的形态发生潜力,并讨论自然组织异质性如何影响图灵不稳定性和自组织现象。数学结构揭示,双形态原系统可以产生具有多个特征长度尺度的图灵斑图,而系统的维度性使得充分混合动力学中出现混沌行为。此外,由于域耦合,单形态原系统也允许图灵不稳定性。我们使用Schnakenberg动力学证明,即使激活剂扩散快于抑制剂(d<1),图灵斑图也会出现,从而扩展了斑图形成的参数空间。我们的结果表明,组织空间结构对图灵不稳定性机制具有重要影响,在某些情况下削弱了其出现的通常条件,同时拓宽了可能产生的斑图。所提出的框架为探索生物和合成背景下的涌现动力学提供了最小数学基础,在发育生物学和组织工程中具有潜在应用。

English abstract

This work introduces a biophysical formalism to describe the spatiotemporal evolution of the chemical profile in tissues, with the novelty of modeling tissue compartmentalization and the mechanism by which cells maintain the system far from thermodynamic equilibrium via production and/or degradation of substances. The models were derived from conservation laws, chemical kinetic theory, and geometric constraints, while considering fundamental properties of tissues to connect theoretical modeling with experimental observations. In a morphogenetic context, each morphogen is described by two coupled reaction-diffusion equations, representing intra- and extracellular dynamics, linked through membrane transport processes such as nonlinear, cross, and anomalous diffusion. We explore the models' morphogenetic potential through diffusion-driven instabilities and discuss how natural tissue heterogeneities influence Turing instabilities and self-organized phenomena. The mathematical structure reveals that two-morphogen systems can produce Turing patterns with multiple characteristic length scales, while the system's dimensionality enables chaotic behavior in well-mixed dynamics. Moreover, due to domain coupling, Turing instabilities are allowed for single-morphogen systems. We used Schnakenberg kinetics to demonstrate that Turing patterns arise even when the activator diffuses faster than the inhibitor ($d<1$), thereby expanding the parameter space for pattern formation. Our results suggest that tissue spatial structure has important consequences for Turing instability mechanisms, in some cases weakening the usual conditions for its emergence while widening the possible patterns it can produce. The proposed framework offers a minimal mathematical basis to explore emergent dynamics in biological and synthetic contexts, with potential applications in developmental biology and tissue engineering.

2606.10955 2026-06-10 q-bio.BM cond-mat.soft q-bio.QM new submission

A kinetic model of shear-induced rupture of short dsDNA

短双链DNA剪切诱导断裂的动力学模型

Ayman Hussein, Ralf Bundschuh

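AI summary: Builds a master equation framework on a force-dependent nucleation-zipper pathway to compute dissociation rates and transition state distances for short dsDNA under shear force, revealing the key role of helical geometry and giving a unified interpretation of experimental data across force regimes.

Details
Comments
Supporting Information is provided at the end of the main text
AI Chinese abstract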

短双链DNA(dsDNA)的力诱导解离是单分子生物物理学和DNA纳米技术的核心问题,但目前仍缺乏一个物理基础的动力学描述来解释有限长度结构在剪切诱导下的断裂。本文基于力依赖的成核-拉链路径(单碱基转移)建立了一个主方程框架,能够直接计算宽力范围内的解离速率和过渡态距离。将该模型应用于恒定剪切力下的DNA-金纳米颗粒-DNA结构,该模型准确再现了所覆盖力区间的室温实验数据,并为所有力区间内类似剪切双链体的先前测量提供了统一解释。一个核心结果是,dsDNA的三维螺旋几何对于在短dsDNA的棒状聚合物模型中正确定义剪切下的末端距离至关重要。我们进一步证明,在实验相关范围内,提取的过渡态距离对ssDNA聚合物参数的变化具有鲁棒性。最后,我们分析了过渡态距离的温度依赖性,并讨论了我们的框架如何捕捉全局加热的断裂,同时识别了金纳米颗粒耦合结构中局域等离子体加热引入的额外复杂性。这些结果为解释力断裂实验以及设计力和温度驱动的DNA纳米结构提供了预测性的动力学基础。

English abstract

Force-induced dissociation of short double-stranded DNA (dsDNA) is central to single-molecule biophysics and DNA nanotechnology, yet a physically grounded kinetic description of shear-induced rupture for finite-length constructs remains lacking. Here we develop a master equation framework built on a force-dependent nucleation-zipper pathway with single-base transitions, enabling direct calculation of dissociation rates and transition state distances over a broad force range. Applied to a DNA-gold nanoparticle-DNA construct under constant shear force, the model accurately reproduces the experimental room-temperature data in the covered force regime and provides a unified interpretation of prior measurements on similarly sheared duplexes across all force regimes. A central result is that the three-dimensional helical geometry of dsDNA is essential for correctly defining the end-to-end distance under shear in the rod-like polymer model of short dsDNA. We further show that the extracted transition state distances are robust to variations in ssDNA polymer parameters within the experimentally relevant regime. Finally, we analyze the temperature dependence of the transition state distance and discuss how our framework captures globally heated rupture while identifying the additional complications introduced by localized plasmonic heating in gold nanoparticle-coupled constructs. These results provide a predictive kinetic foundation for interpreting force-rupture experiments and for designing force- and temperature-actuated DNA nanostructures.

2606.10222 2026-06-10 q-bio.NC cond-mat.dis-nn q-bio.QM new submission

Multifractal Signatures of Ageing and Dementia Development: A Multifractal Space-Filling Curve Analysis

衰老与痴呆发展的多重分形特征:一种多重分形空间填充曲线分析

Marta Lotka, Jacek Grela, Zbigniew Drogosz, Jeremi K. Ochab, Paweł Oświęcimka

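AI summary: Proposes Multifractal Space-filling Curve Analysis (MFSCA) to quantify the correlation structure of multidimensional data and applies it to MRI data from Alzheimer's disease; the multifractality of brain tissue weakens with age and dementia progression, shifting from multifractal to monofractal.

Details
AI Chinese abstract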

多重分形是量化复杂数据非线性、无标度特性的有效形式。在本研究中,我们提出了一种新颖且高效的方法,称为多重分形空间填充曲线分析(MFSCA),用于量化多维数据的相关结构。在该框架内,原始多维数据——在保留局部和长程组织特性的同时——通过分形空间填充曲线投影到一维表示上。然后使用多重分形算法分析得到的一维信号。我们通过人工生成的多重分形结构和真实数据展示了该方法的实用性。特别是,我们将MFSCA应用于分析不同痴呆阶段阿尔茨海默病患者的磁共振成像(MRI)数据。基于结果,我们估计了不同年龄健康受试者以及痴呆患者大脑的多重分形轮廓。分析表明,脑结构的空间组织(以多重分形程度衡量)随着年龄和痴呆的发展逐渐减弱。在对照组中,比较年轻对照组和老年对照组时,以及在年龄相似但处于不同疾病阶段(即早期痴呆和轻度认知障碍)的痴呆受试者中,均观察到从多重分形到单分形的转变。因此,从多尺度特性的角度来看,脑空间组织的异质性特征在恶化条件下退化,导致同质且弱相关的结构。这些发现不仅有效捕捉了脑组织的关键方面,而且表明MRI数据的多重分形性可作为脑结构变化的标志。

English abstract

Multifractality is an effective formalism for quantifying the nonlinear, scale-free properties of complex data. In this study, we propose a novel and efficient methodology, termed Multifractal Space-filling Curve Analysis (MFSCA), for quantifying the correlation structure of multidimensional data. Within this framework, the original multidimensional data are projected onto a one-dimensional representation using a fractal space-filling curve, while preserving both local and long-range organisational properties. The resulting one-dimensional signal is then analysed using multifractal algorithms. We demonstrate the utility of the method using both artificially generated multifractal structures and real data. In particular, we apply MFSCA to analyse magnetic resonance imaging (MRI) data from patients with Alzheimer's disease at different stages of dementia. Based on the results, we estimate the multifractal profiles of the brain for healthy subjects of different ages as well as for dementia patients. The analysis reveals that the spatial organization of brain structures, as measured by the degree of multifractality, progressively weakens with age and the development of dementia. A transition from multifractality to monofractality is observed both in control groups, when comparing the Young Control and Elderly Control groups, and among dementia subjects of similar age but at different stages of the disease, namely early dementia and mild cognitive impairment. Thus, from the perspective of multiscaling properties, the heterogeneous characteristics of spatial brain organization deteriorate under worsening conditions, leading to a homogeneous and weakly correlated structure. These findings not only effectively capture key aspects of brain organisation, but also demonstrate that the multifractality of MRI data can serve as a marker of structural brain changes.
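
Illustrative note (not from the paper): the projection step, turning a multidimensional array into a one-dimensional signal along a space-filling curve, can be sketched with the classic Hilbert curve on a 2D grid. The choice of the Hilbert curve, the 2D setting, and the function names are assumptions for illustration; the abstract says only "a fractal space-filling curve", and the subsequent multifractal analysis is not shown.

```python
def _rotate(s, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curves join end to end."""
    if ry == 0:
        if rx == 1:
            x, y = s - 1 - x, s - 1 - y
        x, y = y, x
    return x, y


def hilbert_d2xy(n, d):
    """Map position d along the Hilbert curve to (x, y) on an n-by-n grid.

    n must be a power of two; this is the standard iterative construction.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_flatten(image):
    """Read a square 2D array (side a power of two) along the Hilbert curve."""
    n = len(image)
    if n == 0 or n & (n - 1) or any(len(row) != n for row in image):
        raise ValueError("image must be square with a power-of-two side")
    return [image[y][x] for x, y in (hilbert_d2xy(n, d) for d in range(n * n))]


if __name__ == "__main__":
    print(hilbert_flatten([[1, 2], [3, 4]]))
```

Consecutive positions on the curve are always neighbouring grid cells, so the flattened signal keeps spatial locality far better than a row-by-row scan, which is the property the method relies on.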

2606.02386 2026-06-10 cs.AI q-bio.QM updated version

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

AgentPLM:具有推理增强解码的智能体蛋白质语言模型用于蛋白质序列设计

Sahil Rahman, Maxx Richard Rahman

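AI summary: Proposes AgentPLM, which uses Reasoning-Augmented Decoding and Contrastive Agent Policy Optimisation so that a pre-trained protein language model can use external biophysical feedback for online error correction, achieving state-of-the-art results on several protein design tasks.

Details
Journal ref
Workshop on Generative and Agentic AI for Biology, 43rd International Conference on Machine Learning (ICML 2026)
AI Chinese abstract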

蛋白质语言模型(PLM)是被动预言机:它们通过单次前向传递生成序列,没有机制来咨询外部生物物理反馈或在候选序列违反热力学或结构约束时重定向生成。我们引入AgentPLM,通过为预训练PLM配备i)推理增强解码(RAD),该解码将自回归生成与工具调用(ESMFold、FoldX、AutoDock Vina)交错进行,以及ii)对比智能体策略优化(CAPO),这是直接偏好优化的轨迹级扩展,它端到端地训练策略以学习何时预言机反馈具有信息性,而不仅仅是模仿高适应度序列。我们在基准任务上评估AgentPLM,涵盖从头酶设计、抗体优化、热稳定性、PPI界面设计和零样本适应度预测,使用标准化的预言机API和受控的序列同一性划分。AgentPLM取得了最先进的结果,抗体前10%命中率相比最强被动基线有所提升,提供了无需显式回溯的在线纠错的机制证据。

English abstract

Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.

2605.26921 2026-06-10 cs.CV q-bio.NC updated version

Similarity-based matrix factorization for revealing interpretable dimensions in representational data

揭示大脑、行为和AI中表征的核心维度

Florian P. Mahner, Ka Chun Lam, Francisco Pereira, Martin N. Hebart

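AI summary: Proposes Similarity-Based Representation Factorization (SRF), which recovers low-dimensional, non-negative, interpretable embeddings from similarity matrices to reveal the latent dimensions of representations in neural, behavioral, and computational data.

Details
AI Chinese abstract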

表征研究广泛存在于神经科学、心理学和人工智能等领域。虽然通常通过刺激之间的相似性来研究和比较表征,但现有方法仅能有限地访问塑造这些表征的维度,且可解释性有限。为克服这些挑战,本文引入相似性基表示因子分解(SRF),一种通用的计算方法,用于从测量数据导出的相似性矩阵中恢复低维、非负、可解释的嵌入。在模拟以及多种神经、行为和计算数据集中,SRF能从各种形式的表征数据中恢复可解释的维度,即使对于非常稀疏采样、不完整的数据也是如此。从这些数据集中导出的维度与任务特定模型获得的维度相匹配,预测独立的行为属性,改进探索性分析,并且与比较相似性矩阵相比,为验证性假设检验提供更高的统计功效。这些结果共同确立了SRF作为一种通用方法,在揭示、理解和利用表征背后的维度方面具有广泛的应用前景。

English abstract

The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and using the dimensions underlying representations.

2601.14653 2026-06-10 cs.LG q-bio.GN updated version

Efficient Imputation for Patch-based Missing Single-cell Data via Cluster-regularized Optimal Transport

基于聚类正则化最优传输的块缺失单细胞数据高效插补

Yuyu Liu, Jiannan Yang, Ziyang Yu, Weishen Pan, Fei Wang, Tengfei Ma

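AI summary: Proposes the CROT algorithm, which uses optimal transport to handle patch-based missingness in single-cell data, achieving high imputation accuracy while significantly reducing runtime.

Details
Comments
Accepted to ACM-BCB 2026
AI Chinese abstract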

单细胞测序数据集中的缺失数据对提取有意义的生物学见解构成了重大挑战。然而,现有的插补方法通常假设数据均匀且完整,难以处理存在大片缺失数据的情况。在本文中,我们提出了CROT(聚类正则化最优传输),一种基于最优传输的插补算法,旨在处理表格格式中的块缺失数据。我们的方法在存在显著缺失的情况下有效捕捉底层数据结构。值得注意的是,它在显著减少运行时间的同时实现了优越的插补精度,展示了其在大规模数据集上的可扩展性和效率。这项工作为具有结构化数据缺失的异质性高维数据集提供了一种鲁棒的插补解决方案,解决了生物学和临床数据分析中的关键挑战。我们的代码可在GitHub上获取,https://github.com/yuyuliu11037/CROT。

English abstract

Missing data in single-cell sequencing datasets poses significant challenges for extracting meaningful biological insights. However, existing imputation approaches, which often assume uniformity and data completeness, struggle to address cases with large patches of missing data. In this paper, we present CROT (Cluster-Regularized Optimal Transport), an optimal transport-based imputation algorithm designed to handle patch-based missing data in tabular formats. Our approach effectively captures the underlying data structure in the presence of significant missingness. Notably, it achieves superior imputation accuracy while significantly reducing runtime, demonstrating its scalability and efficiency for large-scale datasets. This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in both biological and clinical data analysis. Our code is available on GitHub, https://github.com/yuyuliu11037/CROT.

2605.11197 2026-06-10 q-bio.QM physics.data-an updated version

The Same Problem by Different Names: Unifying Regression Dilution and Regression to the Mean

同一个问题的不同名称:统一回归稀释和回归到均值

José F. Fontanari, Mauro Santos

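AI summary: Uses a linear errors-in-variables framework to show the underlying connection between regression to the mean and regression dilution, unifying different approaches from the clinical and ecological literatures and offering guidance on choosing the appropriate tool.

Details
Journal ref
Mathematics 14 (2026) 2052
AI Chinese abstract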

回归到均值(RTM)和回归稀释通常被视为临床和生态学文献中的无关问题。本文证明,在基线变量受瞬时时间或测量噪声影响的线性误差变量框架中,这两种现象具有相同的数学特征。通过比较专用临床工具如Berry收缩修正与标准符号无关的结构估计器如主轴(MA)和缩减主轴(RMA)回归,本文统一了这些不同的传统。通过分析框架,评估了这些方法在不同噪声与信号比和样本量下的闭式总体极限和有限样本性能。结果表明,Berry方法是为预期1:1关系的临床场景设计的专用工具。但将其应用于生态学中的负斜率贸易时会导致严重误差。本文提供了最优性地图,以识别在不同条件下哪种估计器最准确地恢复真实生物信号。通过协调这些不同方法,本文为研究人员提供了基于数据噪声特征选择正确工具的原理性指南。

English abstract

Regression to the Mean (RTM) and Regression Dilution are traditionally treated as unrelated issues in the clinical and ecological literatures. In this work, we demonstrate that within a linear errors-in-variables framework where baseline variables are subject to transient temporal or measurement noise, these two phenomena share an identical underlying mathematical signature. We unify these disparate traditions by comparing specialized clinical tools, such as the Berry shrinkage correction, with standard sign-agnostic structural estimators like Major Axis (MA) and Reduced Major Axis (RMA) regression. Using an analytical framework, we evaluate the closed-form population limits and finite-sample performance of these methods across various noise-to-signal ratios and sample sizes. Our results show that the Berry method is a specialized tool designed for clinical scenarios where a 1:1 relationship is expected. However, applying it to ecological trade-offs with negative slopes can lead to severe errors. We provide maps of optimality to identify which estimator most accurately recovers the true biological signal under different conditions. By reconciling these disparate methods, we offer a principled guide for researchers to choose the correct tool based on their data's noise profile rather than their disciplinary tradition.
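
Illustrative note (not from the paper): the attenuation at the heart of regression dilution can be reproduced with a small simulation. This is a textbook errors-in-variables sketch under assumed parameters (unit-variance true predictor, Gaussian noise, a made-up negative slope), not the authors' analysis, and it does not implement the Berry, MA, or RMA estimators they compare. In this setting the ordinary least squares slope on the noisy predictor shrinks toward zero by the reliability ratio, signal variance divided by signal plus noise variance.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def reliability(signal_var, noise_var):
    """Attenuation factor for classical measurement error in the predictor."""
    return signal_var / (signal_var + noise_var)


def simulate(beta=-0.8, noise_sd=1.0, n=20000, seed=1):
    """Return (slope on the true predictor, slope on the noisy predictor)."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    observed_x = [x + rng.gauss(0.0, noise_sd) for x in true_x]
    y = [beta * x + rng.gauss(0.0, 0.3) for x in true_x]
    return ols_slope(true_x, y), ols_slope(observed_x, y)


if __name__ == "__main__":
    true_slope, observed_slope = simulate()
    print(true_slope, observed_slope, -0.8 * reliability(1.0, 1.0))
```

With equal signal and noise variance the reliability ratio is 0.5, so the slope estimated from the noisy baseline should land near half the true slope, whatever its sign.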

2604.04287 2026-06-10 cs.LG cs.CL q-bio.GN updated version

Entropy, Disagreement, and the Limits of Foundation Models in Genomics

熵、分歧与基因组基础模型的局限性

Maxime Rochkoulets, Lovro Vrček, Mile Šikić

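AI summary: By analyzing how entropy affects model learning, finds that the high entropy of genomic sequences leads to near-uniform output distributions, large disagreement across models, and unstable static embeddings, and that Fisher information concentrates in embedding layers, suggesting that self-supervised training on sequences alone may not be suited to genomic data.

Details
Comments
Accepted to LMLR Workshop at ICLR 2026
AI Chinese abstract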

基因组学中的基础模型与自然语言处理中的基础模型相比,成功程度参差不齐。然而,其有效性有限的原因仍不清楚。在这项工作中,我们研究了熵作为限制此类模型从训练数据中学习并发展基础能力的基本因素的作用。我们在文本和DNA序列上训练模型集成,并分析它们的预测、静态嵌入和经验Fisher信息流。我们表明,从未见标记预测的角度来看,基因组序列的高熵导致输出分布接近均匀、模型间分歧大以及静态嵌入不稳定,即使模型在架构、训练和数据上匹配也是如此。然后,我们证明在DNA上训练的模型将Fisher信息集中在嵌入层,似乎未能利用标记间关系。我们的结果表明,仅从序列进行自监督训练可能不适用于基因组数据,这质疑了当前训练基因组基础模型方法背后的假设。

English abstract

Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences, from the point of view of unseen token prediction, leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.
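
Illustrative note (not from the paper): what "near-uniform output distributions" means in entropy terms can be shown with Shannon entropy over a four-letter nucleotide vocabulary. The probability vectors below are made up for illustration and are not results from the study.

```python
from math import log2


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log2 of the vocabulary size."""
    return shannon_entropy(probs) / log2(len(probs))


if __name__ == "__main__":
    near_uniform = [0.26, 0.25, 0.25, 0.24]  # hypothetical next-nucleotide prediction
    peaked = [0.97, 0.01, 0.01, 0.01]        # hypothetical confident prediction
    print(normalized_entropy(near_uniform), normalized_entropy(peaked))
```

A normalized entropy close to 1 means the prediction carries almost no information beyond the vocabulary size, which is the regime the abstract associates with genomic sequences.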