Abstract
PhyloFrame is a Python library for phylogenetic computation targeting the gap between specialist, compiler-optimized operations and flexible, script-based workflows -- with emphasis on fast, memory-efficient operations for very large tree sizes (e.g., $\geq$ 300,000 taxa). PhyloFrame is built around a DataFrame-based tree representation, where each row corresponds to a node and columns record ancestor relationships, branch lengths, taxon labels, and any user-defined attributes. Crucial for scalability, such array-backed storage allows library and end-user code alike to seamlessly harness Just-in-Time (JIT) compilation (e.g., Numba) and vectorized execution (e.g., NumPy, Polars). At large tree sizes, performance generally matches or exceeds that of Python libraries backed by native code, with notably strong performance in topological-order traversals and Newick I/O.
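To make the row-per-node idea concrete, the following is a minimal sketch of such a table built with plain Pandas and NumPy. It does not use PhyloFrame's own API. The column names (`id`, `ancestor_id`, `branch_length`, `taxon_label`), the convention that the root lists itself as its ancestor, and the helper `root_distances` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical schema: one row per node; the root points at itself.
tree_df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "ancestor_id": [0, 0, 0, 1, 1],
        "branch_length": [0.0, 1.0, 2.5, 0.5, 1.5],
        "taxon_label": [None, None, "C", "A", "B"],  # inner nodes unlabeled
    }
)


def root_distances(ancestor_ids: np.ndarray, branch_lengths: np.ndarray) -> np.ndarray:
    """Return the summed branch length from the root to every node.

    Assumes row i holds node id i and that rows are topologically
    sorted, i.e., every ancestor appears before its descendants.
    """
    dist = np.zeros(len(ancestor_ids), dtype=np.float64)
    for node in range(len(ancestor_ids)):
        parent = ancestor_ids[node]
        if parent != node:  # skip the root, which is its own ancestor
            dist[node] = dist[parent] + branch_lengths[node]
    return dist


# A single forward pass over the arrays visits nodes in topological order.
tree_df["root_distance"] = root_distances(
    tree_df["ancestor_id"].to_numpy(),
    tree_df["branch_length"].to_numpy(),
)
print(tree_df)
```

Because `root_distances` is a simple loop over flat NumPy arrays, it is the kind of function that a JIT compiler such as Numba (via its `njit` decorator) is designed to accelerate; the decorator is left off here only to keep the sketch dependency-light.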
The DataFrame-based representation affords several additional conveniences, including:
- succinct bulk operations (e.g., NumPy);
- powerful queries and transformations (e.g., Polars expressions, Pandas indexing, SQL-style joins and merges);
- compatibility with modern tabular data formats that are compression-friendly, type-aware, nullable, and highly portable (e.g., Parquet); and
- broad interoperation with table-oriented data science tools (e.g., Seaborn, Plotly, Vega-Altair, tidyverse, Excel).
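As a rough illustration of the first two conveniences, the sketch below applies a vectorized NumPy operation and an SQL-style left join to a small node table using only Pandas and NumPy. The schema, the `traits_df` metadata table, and its `habitat` column are hypothetical and are not taken from PhyloFrame itself.

```python
import numpy as np
import pandas as pd

# Hypothetical node table: one row per node; the root points at itself.
tree_df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "ancestor_id": [0, 0, 0, 1, 1],
        "branch_length": [0.0, 1.0, 2.5, 0.5, 1.5],
        "taxon_label": [None, None, "C", "A", "B"],
    }
)

# Bulk operation: count each node's children in one vectorized call.
is_root = tree_df["id"] == tree_df["ancestor_id"]
child_counts = np.bincount(
    tree_df.loc[~is_root, "ancestor_id"].to_numpy(),
    minlength=len(tree_df),
)
tree_df["is_leaf"] = child_counts == 0

# SQL-style join: attach per-taxon metadata (made-up values) to the tree.
traits_df = pd.DataFrame(
    {
        "taxon_label": ["A", "B", "C"],
        "habitat": ["marine", "marine", "soil"],
    }
)
annotated_df = tree_df.merge(traits_df, on="taxon_label", how="left")

# Query: labels of leaf nodes whose joined metadata matches a condition.
marine_mask = annotated_df["is_leaf"] & (annotated_df["habitat"] == "marine")
marine_leaves = annotated_df.loc[marine_mask, "taxon_label"].tolist()
print(marine_leaves)
```

The same table could then be written to a columnar format or handed to a plotting library without any conversion step, which is the interoperability point made in the last two list items.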
Current library features include tree input/output, synthetic tree generation, taxon-based queries, tree traversals, tree metrics, tree manipulation, tree downsampling, and tree comparison. Most functionality supports both Pandas and Polars DataFrames, and is available through programmatic and CLI-based interfaces.