rsx: A high-performance streaming toolkit for RAD-seq sex determination
Rohit Goswami, Ruhila Goswami
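AI summary: To address the memory and efficiency problems posed by large-scale data in RAD-seq sex determination, the authors propose rsx, a toolkit implemented in Rust. It applies optimizations including 2-bit DNA keys, parallel ingestion, memory mapping, external sorting, bitset group counts, and streamed Gram-matrix PCA, and adds conjugate Beta-Binomial Bayes factors, achieving an 8.38-fold geometric-mean speedup while keeping results consistent.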
Comments: 37 pages, 12 figures. Software: https://github.com/HaoZeke/rsx-rs . Reproducibility archive: https://doi.org/10.5281/zenodo.20531539
Restriction site-associated DNA sequencing (RAD-seq) is widely used to discover sex-linked markers in non-model organisms, but large studies produce marker tables with millions of RAD tags. RADSex provides the reference workflow for building marker-by-individual depth tables and testing sex-biased marker distributions, but its depth, merge, and related table-building commands grow memory-hungry, and its standard output reports frequentist calls with no posterior evidence and no direct Python or C integration. We present rsx, a Rust implementation of the complete RADSex command set that preserves marker-table semantics and command-line compatibility. rsx combines 2-bit DNA keys, parallel ingestion, memory-mapped marker tables, external sorting, bitset group counts, and streamed Gram-matrix PCA so that memory stays bounded by the number of individuals or by explicit buffers. It adds conjugate Beta-Binomial Bayes factors and posterior probabilities under XY and ZW hypotheses, returning strict, posterior-supported, and Bayes-factor-only evidence grades. A portable, libm-independent minimax approximation of the error function keeps the chi-squared tail reproducible across platforms without changing the underlying Yates test. On four real RAD-seq datasets comprising 41.9 billion bases and 29 million markers, rsx reproduced published RADSex v1.2.0 calls, achieved an 8.38-fold geometric-mean speedup across 56 paired timings (2.77-fold for FASTQ processing), and recovered every Bonferroni-significant positive-control marker. In Danio albolineatus, treated as null in the source publication, the posterior layer surfaced 30 W-linked marker hypotheses; in Notothenia rossii it withheld 400 Bayes-factor-only rows compatible with a low-prevalence null. Python bindings, a C API, and a reproducibility archive provide the workflows used for all reported numbers. rsx is released under GPL-3.0-or-later.
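The abstract names 2-bit DNA keys as one of the memory-saving techniques but does not describe their layout. The following is a minimal sketch of the general idea, not rsx's actual implementation: each of the four bases takes two bits, so up to 32 bases fit in one `u64` that can be compared, sorted, and hashed as an integer. The function names `pack_2bit` and `unpack_2bit`, the A=0, C=1, G=2, T=3 code assignment, and the choice to reject ambiguous bases such as `N` are all assumptions made for illustration.

```rust
/// Pack a DNA sequence of at most 32 bases into a u64, two bits per base.
/// Hypothetical helper: the encoding A=0, C=1, G=2, T=3 is an assumption.
/// Returns None if the sequence is longer than 32 bases or contains any
/// character other than A, C, G, T (case-insensitive), for example N.
pub fn pack_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut key: u64 = 0;
    for &base in seq {
        let code: u64 = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Shift earlier bases left and append the new base in the low bits.
        key = (key << 2) | code;
    }
    Some(key)
}

/// Recover the sequence from a key. The length must be supplied because
/// leading A bases encode as zero bits and are otherwise indistinguishable
/// from padding.
pub fn unpack_2bit(key: u64, len: usize) -> Option<String> {
    if len > 32 {
        return None;
    }
    let mut out = vec![b'A'; len];
    let mut remaining = key;
    // Fill from the last base backwards, reading two bits at a time.
    for slot in out.iter_mut().rev() {
        *slot = b"ACGT"[(remaining & 3) as usize];
        remaining >>= 2;
    }
    String::from_utf8(out).ok()
}
```

With this layout, keys of equal length sort in the same order as the sequences they encode, which is one reason packed keys pair naturally with external sorting. RAD tags are usually longer than 32 bases, so a real tool would need several words per key or a wider integer type; this sketch leaves that out.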