Structure-Informed Multiple Sequence Alignment: A Formal Model and Hardness Results
Yoshiki Kanazawa, Naphan Benchasattabuse, Michal Hajdušek, Rodney Van Meter
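AI summary: This paper introduces MSA-S, a structure-informed multiple sequence alignment problem, and uses a formal model to prove that its decision problem is NP-complete and that its optimization problem admits no polynomial-time approximation scheme unless P = NP.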
We formulate a structure-informed multiple sequence alignment problem, denoted MSA-S. The model abstracts biological sequences as strings and structural information as designated position-pairs. It augments a fixed pairwise string score, defined by a fixed non-gap symbol-pair scoring rule and fixed affine gap penalties, with a binary overlap score on designated position-pairs, which can be interpreted as a contact-map overlap score in structural applications. This yields a fixed-score, integer-valued optimization model suitable for complexity-theoretic analysis. Under this formulation, we show that the decision problem MSA-S-DEC is NP-complete for a broad class of fixed pairwise string scoring schemes. We also show that NP-hardness persists even under the restriction that every designated position-pair set is nonempty and the pair-overlap threshold is strictly positive. For the associated scalarized optimization problem MSA-S-OPT(λ) with any fixed rational constant λ ≥ 1, we further show that, under the canonical unit scheme for the non-gap symbol-pair scoring rule, MSA-S-OPT(λ) admits no polynomial-time approximation scheme (PTAS) even for two input strings (k = 2), unless P = NP. These results establish a formal complexity-theoretic baseline for structure-informed multiple sequence alignment.
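To make the two score components concrete, the following is a minimal Python sketch of how a single given pairwise alignment (k = 2) might be scored. The abstract does not give the exact definitions, so everything specific here is an assumption made for illustration: the function names, the match/mismatch values standing in for the "unit scheme", the gap convention (a run of L gaps costs `gap_open + gap_extend * L`), 0-based positions, and the scalarization as string score plus λ times the pair overlap. It only evaluates one alignment; it does not search for an optimal one.

```python
from fractions import Fraction

GAP = "-"

def column_map(row):
    """Return, for each non-gap position of a gapped row, its alignment column."""
    return [c for c, ch in enumerate(row) if ch != GAP]

def string_score(row_a, row_b, match=1, mismatch=0, gap_open=1, gap_extend=1):
    """Pairwise string score: symbol-pair rule plus affine gap penalties.

    All numeric values are illustrative assumptions, not the paper's scheme.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap = None  # row that held a gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            raise ValueError("all-gap column")
        if x == GAP or y == GAP:
            side = "a" if x == GAP else "b"
            if side != prev_gap:
                score -= gap_open  # a new gap run starts here
            score -= gap_extend
            prev_gap = side
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score

def pair_overlap(row_a, row_b, pairs_a, pairs_b):
    """Count designated pairs of A whose aligned images form a designated pair of B."""
    cols_a = column_map(row_a)
    col_to_b = {c: j for j, c in enumerate(column_map(row_b))}
    designated_b = {frozenset(p) for p in pairs_b}
    count = 0
    for i, j in pairs_a:
        bi = col_to_b.get(cols_a[i])  # None if position i faces a gap
        bj = col_to_b.get(cols_a[j])
        if bi is not None and bj is not None and frozenset((bi, bj)) in designated_b:
            count += 1
    return count

def scalarized_score(row_a, row_b, pairs_a, pairs_b, lam=Fraction(1)):
    """Assumed scalarization: string score + lam * pair overlap."""
    return string_score(row_a, row_b) + lam * pair_overlap(row_a, row_b, pairs_a, pairs_b)

# Hypothetical example: ACGT aligned against ACGGT with one gap in the first row.
row_a, row_b = "AC-GT", "ACGGT"
pairs_a = [(0, 3), (1, 2)]  # designated position-pairs of ACGT
pairs_b = [(0, 4)]          # designated position-pairs of ACGGT
total = scalarized_score(row_a, row_b, pairs_a, pairs_b, lam=Fraction(2))
```

In this hypothetical example, four matching columns and one gap run of length 1 give a string score of 2, exactly one designated pair is preserved, and with λ = 2 the total is 4. Using `Fraction` for λ keeps the score exact for any rational constant.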