AI for Science - arXivDaily Topic Digest

2508.20275 2026-06-18 cs.LG cs.CL q-bio.QM 95%

A Systematic Review on the Generative AI Applications in Human Medical Genomics

Anton Changalidis, Yury Barbitoff, Yulia Nasykhova, Andrey Glotov

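Affiliations: Dept. of Genomic Medicine; D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology

Topic match (Proteins and Biomolecules): a systematic review of generative AI applications in human medical genomics, covering genomic variant identification and annotation.

AI summary: This paper systematically reviews generative AI applications in genetic research and diagnostics for rare and common diseases, analyzes the role of LLMs in genomic variant identification, annotation, and medical imaging, and points out the challenges of multimodal data integration and clinical deployment.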

Comments 31 pages, 5 figures

Journal ref Frontiers in Genetics 16 (2026) 1694070

DOI: 10.3389/fgene.2025.1694070

English abstract

Although traditional statistical techniques and machine learning methods have contributed significantly to genetics and, in particular, inherited disease diagnosis, they often struggle with complex, high-dimensional data, a challenge now addressed by state-of-the-art deep learning models. Large language models (LLMs), based on transformer architectures, have excelled in tasks requiring contextual comprehension of unstructured medical data. This systematic review examines the role of LLMs in the genetic research and diagnostics of both rare and common diseases. Automated keyword-based search in PubMed, bioRxiv, medRxiv, and arXiv was conducted, targeting studies on LLM applications in diagnostics and education within genetics and removing irrelevant or outdated models. A total of 172 studies were analyzed, highlighting applications in genomic variant identification, annotation, and interpretation, as well as medical imaging advancements through vision transformers. Key findings indicate that while transformer-based models significantly advance disease and risk stratification, variant interpretation, medical imaging analysis, and report generation, major challenges persist in integrating multimodal data (genomic sequences, imaging, and clinical records) into unified and clinically robust pipelines, facing limitations in generalizability and practical implementation in clinical settings. This review provides a comprehensive classification and assessment of the current capabilities and limitations of LLMs in transforming hereditary disease diagnostics and supporting genetic education, serving as a guide to navigate this rapidly evolving field.

2606.18703 2026-06-18 cs.LG q-bio.QM New submission 90%

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

Yanjun Shao, Yundi Chen, Yashvi Patel, Aurelien Pelissier, María Rodríguez Martínez

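Affiliations: Biomedical Informatics and Data Science, Yale School of Medicine

Topic match (Proteins and Biomolecules): cross-modal alignment of biological language models for protein-ligand prediction.

AI summary: Proposes the LOGICA framework, which performs contrastive learning in output-logit space and uses gated cross-modal adapters to preserve the pretrained likelihood interface, enabling context-conditioned prediction across models with distinct vocabularies. It outperforms existing methods on protein-ligand binding, TCR-peptide activity, and drug-resistance prediction tasks.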

English abstract

Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet these distributions are learned from broad unlabeled corpora and are not naturally conditioned on task-specific biological contexts such as interaction partners, cellular environments, or therapeutic interventions. Existing contextual matching methods often distort this interface through pooled embeddings, contrastive latent spaces, or task-specific prediction heads. We introduce LOGICA (Logit-space Contrastive Alignment), a framework for context-conditioned prediction that performs contrastive learning directly in output-logit space. Using gated cross-modal adapters compatible with each model's native token head, LOGICA preserves the pretrained likelihood interface and converts contextualized token log-likelihoods into matching scores. Alignment is defined through context-sensitive token probabilities rather than proximity in a shared embedding space, enabling learning from sparse paired data across models with distinct vocabularies, without a shared tokenizer or decoder. LOGICA is particularly effective for mutation-local variant ranking, where comparisons reduce to context-conditioned likelihoods of mutant tokens at perturbed sites. Across protein-ligand binding, TCR-peptide activity, and drug-conditioned resistance prediction, LOGICA improves over prior state-of-the-art methods, including matched latent-contrastive and conditional MLM baselines, while retaining a token-level interface for interpretation and generation. On held-out-gene single-mutation drug-resistance prediction, LOGICA improves AUC from near-random latent-space baselines of ~0.55 to ~0.65.
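
Editor's sketch: the abstract notes that mutation-local ranking reduces to comparing token likelihoods at the perturbed site. The snippet below illustrates only that generic idea as a log-likelihood ratio between mutant and wild-type tokens. It is not the LOGICA implementation: the four-letter vocabulary, the logit values, and the function names are all hypothetical, and the "context" here is simply a second set of made-up logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mutation_score(site_logits, vocab, wild_type, mutant):
    """Log-likelihood ratio log p(mutant) - log p(wild_type) at one site."""
    log_probs = log_softmax(site_logits)
    return log_probs[vocab.index(mutant)] - log_probs[vocab.index(wild_type)]

# Hypothetical vocabulary and hypothetical logits at the mutated position,
# once without any context and once conditioned on a made-up drug context.
vocab = ["A", "G", "L", "V"]
logits_no_context = [2.0, 0.5, 1.0, 0.0]
logits_with_drug = [1.0, 0.5, 1.0, 2.5]

base = mutation_score(logits_no_context, vocab, wild_type="A", mutant="V")
ctx = mutation_score(logits_with_drug, vocab, wild_type="A", mutant="V")
print(f"no context: {base:+.2f}, with context: {ctx:+.2f}")
```

Because the normalizer cancels in the ratio, the score equals the difference of the two raw logits, which is why a token-level head is enough to rank single-site mutants in this toy setting.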

2606.18672 2026-06-18 cs.LG cs.AI q-bio.GN New submission 90%

scGTN: Deep Siamese Graph Transformer Network for Single-cell RNA Sequencing Clustering

Jinke Wu, Yifan Wang, Siyu Yi, Caiyang Yu, Ziyue Qiao, Nan Yin, Jiancheng Lv, Wei Ju

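Affiliations: Sichuan University; University of International Business and Economics; Great Bay University; The Education University of Hong Kong

Topic match (Proteins and Biomolecules): single-cell RNA sequencing clustering with a Siamese graph transformer network.

AI summary: Proposes the scGTN framework, which integrates gene expression and intercellular structural information through a Siamese graph transformer network and uses an optimal transport strategy for self-supervised clustering, outperforming existing methods on multiple datasets.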

Comments Accepted by Proceedings of the Thirty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2026)

English abstract

Single-cell RNA sequencing (scRNA-seq) serves a pivotal role in characterizing gene expression at the cellular level, enabling the identification of cell types and advancing the understanding of cellular heterogeneity. Despite the significant progress in scRNA-seq data clustering, we argue that current methods always ignore the sparsity and noise, as well as the complex intercellular structural information inherent in scRNA-seq data. Toward this end, in this paper, we propose a novel single-cell RNA-seq clustering framework via deep Siamese Graph Transformer Network (termed scGTN), which explicitly integrates gene expression profile and intercellular structural dependencies for cell clustering. In particular, we formulate scRNA-seq data as a graph and construct two augmented graph views that serve as dual views to capture complementary intercellular information. Then, a Siamese graph transformer network is employed to explicitly incorporate shortest-path information and node-wise distances for capturing richer structural relationships between cells. Finally, we employ an optimal transport strategy to guide the cell clustering in a self-supervised manner. Extensive experiments on multiple benchmark scRNA-seq datasets demonstrate that our scGTN consistently outperforms existing methods. Our code is available at https://github.com/W-RMSL/scGTN.
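
Editor's sketch: the abstract says an optimal transport strategy guides clustering but does not give details. One common way to use optimal transport for self-supervised clustering is to balance soft assignments with Sinkhorn-Knopp iterations so that no cluster collapses. The snippet below shows that generic technique only; whether scGTN uses this exact formulation is an assumption, and the score matrix, `epsilon`, and the function name are hypothetical.

```python
import math

def sinkhorn_assignments(scores, epsilon=0.5, n_iters=200):
    """Balance soft cluster assignments with Sinkhorn-Knopp iterations.

    scores[i][j] is the similarity of cell i to cluster j. Returns a matrix
    whose rows sum to 1 and whose columns each sum to about n_cells / n_clusters.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating, for numerical stability
    m = max(max(row) for row in scores)
    q = [[math.exp((s - m) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Scale columns so every cluster receives total mass n / k
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * (n / k) / col[j] for j in range(k)] for i in range(n)]
        # Scale rows so every cell distributes total mass 1
        q = [[v / sum(row) for v in row] for row in q]
    return q

# Hypothetical similarities: three of four cells prefer cluster 0
scores = [[2.0, 0.0], [1.5, 0.2], [1.0, 0.8], [0.1, 1.9]]
targets = sinkhorn_assignments(scores)
for row in targets:
    print([round(v, 3) for v in row])
```

The balanced matrix can then serve as a pseudo-label target for the network's predicted assignments, which is the usual role of this step in self-supervised clustering pipelines.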

2606.18302 2026-06-18 q-bio.OT cs.LG New submission 85%

Protein-Based Fish Species Identification: Dataset, Models, and Insights from Native Bangladeshi Fish

Md Nasiat Hasan Fahim, Md. Abid Ullah Muhib, Mohammad Shahidur Rahman

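Affiliations: Shahjalal University of Science

Topic match (Proteins and Biomolecules): fish protein sequence classification with a lightweight hybrid model.

AI summary: This study builds the first protein sequence dataset for native Bangladeshi fish, systematically evaluates seven architectures, and proposes a lightweight hybrid model, MotifCNN-Transformer+TA-PE, that is more practical than the large protein language model ProtBERT in resource-constrained settings (about 5x faster and 42x smaller, with a statistically insignificant accuracy gap).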

Comments Published in 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN). © 2026 IEEE. Personal use of this material is permitted

Journal ref 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)

DOI: 10.1109/QPAIN69676.2026.11546620

English abstract

Correct identification of fish species is highly significant for food security, economic development, and climate resilience in Bangladesh. Protein sequences directly reflect functional and evolutionary constraints which are important for species authentication and biodiversity monitoring. Yet there exists no benchmark for native Bangladeshi fish species identification from protein sequence. In this study, we addressed this gap by introducing the first curated dataset for nine native Bangladeshi fish species of 2845 high quality protein sequences. We also established the first protein sequence classification baseline for this domain through a systematic benchmarking of seven architectural paradigms. Moreover, we propose a realistic deployable novel hybrid architecture of MotifCNN and Transformer with Terminal-Aware Positional-Encoding (MotifCNN-Transformer+TA-PE). Our novel architecture achieves 79.80% accuracy with macro-F1 of 0.80. The highest 83.04% accuracy is achieved by finetuned protein language model ProtBERT that has 420M parameters and requires dual 16GB GPUs for inference. According to McNemar's test, ProtBERT's 3.24% accuracy gain over our MotifCNN-Transformer+TA-PE is statistically insignificant (p = 0.1120). Our novel architecture beats it among six of the nine classes in per class identification. Also our MotifCNN-Transformer+TA-PE is approximately 5x faster, 42x smaller, and supports 16x larger batch size than ProtBERT and has GPU free inference, making it more practical for deployment in resources constrained areas such as rural Bangladesh. Beyond this, our foundational work shows effects of phylogenetic relationships on sequence similarity and establishes pathways for fisheries management, food authentication and biodiversity conservation in South Asia's protein dependent economy.
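
Editor's sketch: the abstract compares two classifiers on the same test set with McNemar's test, which looks only at the samples where the two models disagree. The snippet below is a minimal exact (binomial) version of that test. The abstract does not say which variant the authors used, and the discordant counts here are hypothetical, so the printed p-value is not a reproduction of the reported p = 0.1120.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts.

    b: samples model A classified correctly and model B misclassified.
    c: samples model B classified correctly and model A misclassified.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0 each discordant pair favors either model with probability 0.5
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: A alone correct on 10 samples, B alone correct on 5
p_value = mcnemar_exact(10, 5)
print(f"p = {p_value:.4f}")
```

Samples that both models classify correctly, or both misclassify, carry no information about which model is better, so they do not enter the statistic.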

2606.18961 2026-06-18 cs.LG New submission 85%

Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization

Lanqing Li, Shentong Mo, Yang Yu, Pheng-Ann Heng

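Affiliations: The Chinese University of Hong Kong; MBZUAI; Hong Kong University of Science and Technology

Topic match (Proteins and Biomolecules): steering protein language model generation via unsupervised reward optimization.

AI summary: Proposes an unsupervised reward optimization framework that combines model uncertainty and semantic consistency as proxy rewards and optimizes PLMs with the SRO and BRO algorithms, achieving controllable protein generation without labeled data while approaching oracle performance.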

Comments 24 pages, 2 figures, 13 tables

English abstract

Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post-training adaptation typically relies on costly wet-lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground-truth labels. Our key insight is that task-agnostic rewards, which combine intrinsic model uncertainty with extrinsic semantic consistency informed by protein representation models, exhibit strong correlation with controllability measures across base models and temperature regimes. Building upon this discovery, we propose two offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), which effectively maximize the classical RLHF objective induced by these proxy rewards. Extensive experiments on compositional out-of-distribution prompts demonstrate that both methods significantly outperform competitive baselines (DPO, KTO), while approaching oracle performance across multiple sampling temperatures, model scales and protein families. Moreover, PLMs fine-tuned with unsupervised rewards can achieve consistently higher coverage compared to their base model in pass@k evaluations. By enabling self-improvement of PLMs through their own generated experience, our framework provides a scalable pathway toward controllable biomolecular design in settings where labeled preferences or experimental feedback are scarce or unavailable.
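
Editor's sketch: the abstract reports coverage through pass@k evaluations. A widely used way to estimate pass@k from n generations, c of which succeed, is the combinatorial estimator from code-generation benchmarks, shown below. The abstract does not state which estimator the authors used or how a protein generation counts as a success, so both are assumptions here, and the numbers are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them successful) succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical prompt: 10 generated sequences, 2 satisfy the control target
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 2, k):.4f}")
```

Averaging this quantity over prompts gives the coverage curve as a function of k.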

2606.18495 2026-06-18 physics.chem-ph physics.bio-ph physics.comp-ph q-bio.BM New submission 80%

Bayesian Sampling of Structural Ensembles: The Role of Ensemble-Counting Measures

Ivan Gilardoni, Giovanni Bussi

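Topic match (Proteins and Biomolecules): Bayesian sampling of structural ensembles, RNA simulations.

AI summary: Proposes the Jeffreys measure as an ensemble-counting measure to resolve the non-normalizable posterior that the flat measure in Lagrange-multiplier space produces for finite reference trajectories in the BELT framework, and shows on RNA oligomer simulations that the choice of measure affects Bayesian estimates.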

English abstract

Structural ensemble refinement is widely used to integrate molecular simulations with experimental measurements. While most applications focus on the maximum-a-posteriori (MAP) ensemble, Bayesian sampling of the posterior distribution can provide uncertainty estimates and posterior averages for arbitrary observables. A notable step in this direction was introduced by the Bayesian Energy Landscape Tilting (BELT) framework, where sampling is performed on a family of maximum-entropy ensembles parametrized by Lagrange multipliers. Here, we show that Bayesian sampling in this setting requires an explicit choice of ensemble-counting measure. In particular, the flat measure in Lagrange-multiplier space used in the original BELT formulation leads to a posterior distribution that is formally non-normalizable for finite reference trajectories. We propose the Jeffreys measure as an invariant ensemble-counting prescription, restoring normalizability in the finite-sample situations considered here, and providing a consistent definition of posterior averages. Using both an analytically tractable Gaussian model and maximum-entropy refinement of RNA oligomer simulations, we compare different ensemble-counting measures and show that they can significantly affect Bayesian estimates. The resulting methodology has been implemented in the MDRefine software package.
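
Editor's sketch: for a maximum-entropy (exponential-family) ensemble, the Fisher information with respect to the Lagrange multipliers equals the covariance of the restrained observables, so the Jeffreys density is the square root of its determinant. The toy below evaluates this for one observable over a finite set of frames and contrasts it with a flat measure. It does not use the MDRefine API: the frame values, the sign convention exp(-lambda * g), and the function names are assumptions for illustration.

```python
import math

def maxent_weights(g, lam):
    """Frame weights w_i proportional to exp(-lam * g_i), normalized to 1."""
    exponents = [-lam * x for x in g]
    m = max(exponents)  # shift for numerical stability
    unnorm = [math.exp(e - m) for e in exponents]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def jeffreys_density(g, lam):
    """Unnormalized Jeffreys density sqrt(Var_lam[g]) for one observable."""
    w = maxent_weights(g, lam)
    mean = sum(wi * gi for wi, gi in zip(w, g))
    var = sum(wi * (gi - mean) ** 2 for wi, gi in zip(w, g))
    return math.sqrt(var)

# Hypothetical observable values on four frames of a reference trajectory
frames = [0.0, 1.0, 2.0, 3.0]
for lam in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"lambda={lam:+5.1f}  flat=1.0000  jeffreys={jeffreys_density(frames, lam):.4f}")
```

With finitely many frames, large |lambda| concentrates the weight on a single extreme frame, so the reweighted variance and hence the Jeffreys density shrink toward zero, whereas the flat measure stays constant. This is consistent with the abstract's point that the choice of measure matters for finite reference trajectories.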
