FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
Honglin Zhu, Jiaping Cao, Jiang Shao, Siyuan Feng, Qian Qiu, Peng Chen, Xu Zhang, Yixian Zhou, Man Lung Yiu, Guang Ji, Minwen Deng, Jintao Meng, Wenxi Zhu
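AI summary: FalconGEMM automates the deployment and optimization of lower-complexity matrix multiplication algorithms to improve DL performance, outperforming conventional GEMM libraries and competitors such as AlphaTensor on both GPUs and CPUs.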
Peak-breaking matrix multiplication is a promising technique for improving deep learning (DL) performance, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. It rests on three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model that selects the optimal strategy based on matrix shapes and hardware profiles. We evaluate FalconGEMM extensively on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM delivers peak-breaking performance, outperforming GEMM libraries (e.g., cuBLAS, CUTLASS, Intel MKL) by 7.59%-17.85% and LCMA competitors such as AlphaTensor by 12.41%-55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.
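To make the idea of a lower-complexity matrix multiplication algorithm concrete, the sketch below implements Strassen's algorithm, a classic LCMA that replaces the 8 block products of the standard 2x2 block scheme with 7. The abstract does not say which LCMAs FalconGEMM deploys, so Strassen is used here purely as an assumed example. The function names (`strassen`, `naive_matmul`, `split`) and the `cutoff` parameter are hypothetical, the code assumes square matrices, and it is a plain Python illustration rather than anything resembling the framework's generated kernels.

```python
# Illustrative sketch only: Strassen's algorithm as one example of an LCMA.
# All names here are hypothetical and are not part of FalconGEMM.

def add(A, B):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def naive_matmul(A, B):
    """Standard triple-loop matrix product, used as the base case."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def split(A):
    """Split a square matrix of even size into four quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def strassen(A, B, cutoff=1):
    """Multiply square matrices A and B with 7 recursive block products.

    Falls back to the naive product when the size is at most `cutoff`
    or is odd (assumption: no padding is attempted in this sketch).
    """
    n = len(A)
    if n <= cutoff or n % 2:
        return naive_matmul(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven block products instead of the usual eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    # Recombine the products into the four quadrants of the result.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Each level of recursion trades one block multiplication for extra block additions and subtractions. Those additions increase memory traffic, which is presumably why a framework in this space needs both bandwidth-aware execution and a model that decides when an LCMA is worth using over a conventional GEMM for a given shape and device.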