arXivDaily: arXiv daily academic digest, updated Monday through Friday
q-bio.GN Genomics (4)
2606.12219 2026-06-11 q-bio.GN q-bio.MN New submission

m6A-FORM: A Foundation Model for Decoding N6-methyladenosine Biology

Tinghe Zhang, Sumin Jo, Shou-Jiang Gao, Yufei Huang

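AI summary: Proposes m6A-FORM, a Transformer-based foundation model that uses MeRIP-seq peaks as priors. After pretraining and fine-tuning, it predicts m6A sites more accurately than existing methods, and it also supports regulator binding-site prediction and analysis of tissue-conserved sites.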

English abstract

N6-methyladenosine (m6A) is the most abundant internal modification in eukaryotic mRNA. However, most existing predictors use adenosine-centered formulations that are computationally inefficient and prone to false positives. Here we present m6A-FORM, a transformer-based foundation model for RNA methylation that uses MeRIP-seq peaks as methylation-enriched priors and is pretrained on approximately 22 million peak-derived sequences from 143 human MeRIP-seq studies. After fine-tuning with high-confidence single-nucleotide m6A annotations from m6A-Atlas v2.0 and GLORI, m6A-FORM-sites achieves state-of-the-art m6A site prediction performance, with a PR-AUC of 0.635 and ROC-AUC of 0.988, improving PR-AUC by at least 0.14 over existing methods while enabling substantially faster inference. Task-specific adaptation further supports prediction of binding sites for 19 m6A-associated regulators and identification of YTHDF2-bound m6A sites associated with mRNA degradation. Applying m6A-FORM across 67 datasets from 24 human tissues identifies 19,631 tissue-conserved sites with distinct localization, clustering, methylation, expression, RBP-interaction, and decay-associated signatures.

2606.11276 2026-06-11 q-bio.GN New submission

A mathematical framework for centromere-aware evaluation of human genome assemblies

Luca Franco, Matteo Migliarini, Matteo Tommaso Ungaro, Egnald Çela, Luca Corda, Andreas Giannis, Ester Mondelli, Fabio Galasso, Simona Giunta

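AI summary: Proposes a KL-divergence metric based on the distribution of distances between functional centromeric motifs, enabling quantitative evaluation of genome assembly accuracy in repetitive regions, and applies it to T2T genomes.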

English abstract

Accurate evaluation of genome assemblies within highly repetitive regions, such as centromeres, remains a major open challenge in genomics. Conventional benchmarking relies on sequence alignment, which becomes problematic in regions of high homogeneity and divergence. Here, we frame centromere assembly evaluation as a comparative distribution problem in a compact centeny representation by computing genomic distances between functional motifs, rather than relying on nucleotide sequence. Our distribution-based metric assesses agreement between a query and a target chromosome by comparing their centromeric inter-motif distances using KL divergence. When applied genome-wide to currently available human telomere-to-telomere (T2T) genomes, this approach yields an accuracy ranking for the entire assembly and for each individual chromosome. Altogether, we present a rapid and robust scoring system, based on a numerical rendering of inter-motif distances within genomes, that provides a quantitative standard of assembly integrity in repetitive DNA regions and establishes a bona fide framework for chromosome-level genome-to-genome comparison.
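
The idea of comparing inter-motif distance distributions with KL divergence can be illustrated with a minimal sketch. This is not the authors' implementation: the motif coordinates, the bin width, the pseudocount, the use of consecutive-motif distances, and the direction KL(query || target) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def inter_motif_distances(positions):
    """Distances between consecutive motif positions (assumed definition)."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def histogram(distances, bin_width, n_bins, pseudocount=1e-6):
    """Normalized histogram; the pseudocount keeps every bin non-zero."""
    # Distances beyond the last bin edge are clamped into the last bin.
    counts = Counter(min(d // bin_width, n_bins - 1) for d in distances)
    raw = [counts.get(i, 0) + pseudocount for i in range(n_bins)]
    total = sum(raw)
    return [x / total for x in raw]

def kl_divergence(p, q):
    """KL(P || Q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def assembly_score(query_positions, target_positions, bin_width=100, n_bins=10):
    """Hypothetical score: KL(query || target); lower means closer agreement."""
    p = histogram(inter_motif_distances(query_positions), bin_width, n_bins)
    q = histogram(inter_motif_distances(target_positions), bin_width, n_bins)
    return kl_divergence(p, q)

# Hypothetical motif coordinates (bp) on one chromosome.
target = [0, 171, 342, 513, 684, 855, 1026]
query_same = [0, 171, 342, 513, 684, 855, 1026]
query_gappy = [0, 171, 342, 855, 1026, 1900, 2071]  # irregular spacing

score_same = assembly_score(query_same, target)
score_gappy = assembly_score(query_gappy, target)
print(score_same, score_gappy)
```

Because KL divergence is asymmetric, swapping query and target changes the score; a symmetric variant such as Jensen-Shannon divergence would be a reasonable alternative if no direction is preferred.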

2606.08493 2026-06-11 q-bio.GN cs.LG stat.ML Updated version

Querying Counterfactuals on Tissue Graphs with Supervised Disentanglement

Abdul Moeed, Stefan Schrod, Martin Rohbeck, Marc Jan Bonder, Pavlo Lutsik, Oliver Stegle, Daniel Dimitrov

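AI summary: Formalizes tissue graph counterfactuals as spatial interventions and proposes Cellina, a framework that uses supervised disentanglement to separate a cell's intrinsic state from its spatial context for counterfactual prediction. It outperforms existing methods on colorectal cancer and mouse brain data.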

English abstract

Tissue graph counterfactuals ask how a cell's expression would change under altered spatial neighbor contexts. Such queries are central to predicting cell behavior in tissues, but lack a unified definition, with existing methods targeting specific intervention types or treating cells as i.i.d. In this work, we first formalize tissue graph counterfactuals as a class of spatial interventions that either rewire connections between cells (edge perturbation) or modify the expression of their neighbors (node perturbation). We then introduce Cellina (https://cellina.readthedocs.io), a framework that uses supervised disentanglement to decompose a cell's intrinsic state from its spatial context, using the latter as a conditioning input for counterfactual predictions. Across benchmarks spanning over 2.5 million spatially-resolved cells in colorectal cancer and mouse brain, Cellina outperforms spatially-informed and non-spatial competitors in in-silico graph perturbations, disentanglement, and scalability. Additionally, we show that Cellina reveals biologically distinct cancer subdomains in an unsupervised manner and enables targeted neighbor perturbation simulations.
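
The two intervention types named in the abstract, edge perturbation and node perturbation, can be made concrete on a toy graph. The sketch below does not use the Cellina API and makes no claim about how Cellina models context: the cell names, the expression values, and the use of mean neighbor expression as a stand-in for "spatial context" are all hypothetical.

```python
from copy import deepcopy

# Toy tissue graph: hypothetical cells with 2-gene expression vectors.
expression = {
    "a": [1.0, 0.0],
    "b": [0.0, 2.0],
    "c": [4.0, 4.0],
    "d": [0.0, 0.0],
}
edges = {("a", "b"), ("a", "c")}  # undirected edges stored as ordered pairs

def neighbors(edges, cell):
    """Cells sharing an edge with `cell`."""
    return sorted({v if u == cell else u for u, v in edges if cell in (u, v)})

def neighbor_context(expression, edges, cell):
    """Mean neighbor expression: an assumed summary of spatial context."""
    nbrs = neighbors(edges, cell)
    n_genes = len(expression[cell])
    if not nbrs:
        return [0.0] * n_genes
    return [sum(expression[n][g] for n in nbrs) / len(nbrs) for g in range(n_genes)]

def edge_perturbation(edges, remove, add):
    """Rewire the graph: drop one edge and add another."""
    return (set(edges) - {remove}) | {add}

def node_perturbation(expression, cell, new_values):
    """Change one cell's expression, leaving the graph untouched."""
    out = deepcopy(expression)
    out[cell] = list(new_values)
    return out

factual = neighbor_context(expression, edges, "a")
rewired = neighbor_context(expression, edge_perturbation(edges, ("a", "c"), ("a", "d")), "a")
modified = neighbor_context(node_perturbation(expression, "c", [0.0, 0.0]), edges, "a")
print(factual, rewired, modified)
```

In both cases cell "a" itself is unchanged; only the context a model would condition on differs, which is the quantity a counterfactual predictor would take as input.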

2605.00545 2026-06-11 cs.LG cs.AI math-ph q-bio.GN q-bio.QM Updated version

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

Junda Ying, Yuxuan Wang, Bowen Yang, Peijie Zhou, Lei Zhang

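AI summary: To address stochasticity and non-conservative mass dynamics (such as cell proliferation and apoptosis) in single-cell snapshot data, proposes Unbalanced Schrödinger Bridge (USB), a simulation-free framework that models jump-like birth-death dynamics at single-cell resolution via a discrete Branching Schrödinger Bridge problem, enabling efficient trajectory reconstruction and discrete simulation.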

English abstract

Inferring cellular trajectories from destructive snapshots is complicated by stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that integrates both stochastic and unbalanced effects and models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.
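
The microscopic picture described in the abstract, cells that diffuse by Brownian motion and occasionally divide or die, can be sketched as a forward simulation. This is not the USB method, which is described as simulation-free and learns dynamics from data; it only illustrates the kind of process being modeled. The constant birth and death rates, the one-dimensional state, the absence of drift, and the first-order time discretization are all simplifying assumptions.

```python
import math
import random

def simulate_branching(x0, steps, dt, sigma, birth_rate, death_rate, seed=0):
    """Simulate 1-D Brownian particles with discrete birth-death jumps.

    Per step, each cell dies with probability death_rate * dt, divides with
    probability birth_rate * dt, and otherwise just diffuses. This first-order
    scheme assumes (birth_rate + death_rate) * dt <= 1.
    """
    rng = random.Random(seed)
    cells = list(x0)
    for _ in range(steps):
        next_cells = []
        for x in cells:
            # Brownian increment with variance sigma^2 * dt.
            x = x + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            u = rng.random()
            if u < death_rate * dt:
                continue  # death: the cell is removed
            next_cells.append(x)
            if u < (death_rate + birth_rate) * dt:
                next_cells.append(x)  # birth: a daughter starts at the same state
        cells = next_cells
    return cells

# Hypothetical parameters: 100 cells starting at 0 with net growth.
final_cells = simulate_branching([0.0] * 100, steps=50, dt=0.02, sigma=1.0,
                                 birth_rate=1.0, death_rate=0.3, seed=1)
print(len(final_cells))
```

The cell count changes in integer steps, in contrast to the continuous mass used by fluid-style unbalanced OT formulations.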