arXivDaily: arXiv daily academic digest, updated Monday to Friday
cs.MS (Mathematical Software): 3 papers
2606.08339 2026-06-11 cs.MS math.NA (updated version)

Floating-point autotuning with customized precisions

Xinye Chen, Thibault Hilaire, Fabienne Jézéquel

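AI summary: Proposes a method for automated precision tuning using customized floating-point formats. It combines numerical validation with a systematic search to generate program variants that meet accuracy requirements, and experiments on linear solvers and Rodinia benchmark applications show that a substantial proportion of variables can safely be lowered in precision.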

Abstract

Reduced-precision arithmetic offers significant opportunities to improve performance, memory usage, and energy efficiency in numerical applications, provided that numerical accuracy is preserved. This work investigates automated precision tuning through customized floating-point formats with user-defined exponent and significand sizes, enabling the emulation of emerging low-precision formats and the exploration of non-standard precision configurations within a unified mixed-precision framework. The proposed methodology, implemented in the PROMISE precision autotuning tool, combines numerical validation with a systematic search to generate program variants that satisfy user-defined accuracy requirements. To address the computational cost of this exploration, a containerized benchmarking framework supports parallel execution across multiple algorithms and parameter configurations. The approach is evaluated on a suite of numerical programs, including linear solvers and applications from the Rodinia benchmark. Results show that a substantial proportion of variables can be safely reduced to lower precision while preserving accuracy, indicating that standard double precision is often over-provisioned. These findings highlight the potential of automated precision tuning to derive efficient mixed-precision configurations tailored to application-specific accuracy requirements.
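
To make "user-defined exponent and significand sizes" concrete, here is a minimal sketch of how a double can be rounded to a custom binary format in software. It is not taken from PROMISE, whose internals the abstract does not describe; the function name `round_to_custom` and its parameters `exp_bits` and `frac_bits` are hypothetical. The sketch assumes an IEEE-754-style layout (biased exponent, implicit leading bit, gradual underflow) and round-to-nearest, ties-to-even.

```python
import math


def round_to_custom(x: float, exp_bits: int, frac_bits: int) -> float:
    """Round a double to the nearest value of a hypothetical IEEE-like format.

    exp_bits  : width of the exponent field (assumed biased, as in IEEE 754)
    frac_bits : number of explicitly stored significand (fraction) bits
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x

    bias = 2 ** (exp_bits - 1) - 1
    emax = bias          # largest exponent of a normal number
    emin = 1 - bias      # smallest exponent of a normal number

    _, e = math.frexp(x)         # x = m * 2**e with 0.5 <= |m| < 1
    exponent = max(e - 1, emin)  # clamp so subnormals share one spacing

    # Spacing between neighbouring representable values near x.
    ulp = 2.0 ** (exponent - frac_bits)

    # Scaling by a power of two is exact; Python's round() on a float
    # rounds halfway cases to the nearest even integer.
    y = round(x / ulp) * ulp

    largest_finite = (2.0 - 2.0 ** (-frac_bits)) * 2.0 ** emax
    if abs(y) > largest_finite:
        return math.copysign(math.inf, x)  # overflow
    return y


# binary16 corresponds to exp_bits=5, frac_bits=10;
# bfloat16 corresponds to exp_bits=8, frac_bits=7.
print(round_to_custom(0.1, exp_bits=5, frac_bits=10))  # 0.0999755859375
```

Applying such a rounding step after each arithmetic operation on a chosen variable is one simple way to emulate a lower-precision type; a tuning tool would then search for the smallest formats that keep the final result within the user's accuracy target.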

2605.06057 2026-06-11 cs.DC cs.MS (updated version)

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

Honglin Zhu, Jiaping Cao, Jiang Shao, Siyuan Feng, Qian Qiu, Peng Chen, Xu Zhang, Yixian Zhou, Man Lung Yiu, Guang Ji, Minwen Deng, Jintao Meng, Wenxi Zhu

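AI summary: FalconGEMM automates the deployment and optimization of lower-complexity matrix multiplication algorithms to improve DL performance, outperforming conventional GEMM libraries and competitors such as AlphaTensor on both GPUs and CPUs.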

Abstract

Peak-breaking matrix multiplication is a promising technique for improving deep learning (DL) performance, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. There are three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model that selects the optimal strategy based on matrix shapes and hardware profiles. Extensive evaluation is conducted on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM delivers peak-breaking performance, outperforming GEMM libraries (e.g., cuBLAS, CUTLASS, and Intel MKL) by 7.59%-17.85% and LCMA competitors such as AlphaTensor by 12.41%-55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.
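
As an illustration of what a lower-complexity matrix multiplication algorithm looks like, the sketch below implements Strassen's classic scheme, which multiplies two 2x2 block matrices with 7 block products instead of 8. The abstract does not say which LCMAs FalconGEMM deploys, so Strassen is used here only as the best-known member of the family; the helper names (`strassen`, `naive_matmul`, `split`) and the `cutoff` parameter are hypothetical, and the code assumes square inputs of equal size.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def add(A: Matrix, B: Matrix) -> Matrix:
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A: Matrix, B: Matrix) -> Matrix:
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def naive_matmul(A: Matrix, B: Matrix) -> Matrix:
    """Classical triple-loop product, used as the base case."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def split(M: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Split an even-sized square matrix into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A: Matrix, B: Matrix, cutoff: int = 1) -> Matrix:
    """Strassen multiplication; falls back to the classical product
    for small or odd-sized blocks."""
    n = len(A)
    if n <= cutoff or n % 2 == 1:
        return naive_matmul(A, B)

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive block products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)

    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)

    # Stitch the quadrants back together.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Each recursion level trades one block multiplication for extra block additions, which is why a practical deployment needs to manage data reuse and memory bandwidth carefully, and why the choice of when to stop recursing (here the `cutoff` argument) depends on matrix shape and hardware.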

2508.10076 2026-06-11 cs.MS cond-mat.str-el quant-ph (updated version)

TensorKit.jl: A Julia package for large-scale tensor computations, with a hint of category theory

Lukas Devos, Jutho Haegeman

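AI summary: Introduces the Julia package TensorKit.jl, which handles abelian, non-abelian, and anyonic symmetries through its TensorMap type to provide flexible, high-performance tensor computations.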

Comments
69 pages, 4 figures
Abstract

TensorKit.jl is a Julia-based software package for tensor computations, especially focusing on tensors with internal symmetries. This paper introduces the design philosophy, core functionalities, and distinctive features, including how to handle abelian, non-abelian, and anyonic symmetries through the "TensorMap" type. We highlight the software's flexibility, performance, and its capability to extend to new tensor types and symmetries, illustrating its practical applications through select case studies.
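
To give a feel for why internal symmetries matter computationally, the sketch below shows the simplest (abelian) case: a linear map that conserves a charge is block-diagonal in the charge sectors, so only the blocks need to be stored and multiplied. This is a generic Python illustration, not TensorKit.jl's API or data layout; the names `BlockMap`, `block_matmul`, and `to_dense` are hypothetical, and both operands are assumed to carry the same set of charges.

```python
from typing import Dict, List

Matrix = List[List[float]]
BlockMap = Dict[int, Matrix]  # charge sector -> dense block for that sector


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def block_matmul(A: BlockMap, B: BlockMap) -> BlockMap:
    """Compose two charge-conserving maps sector by sector."""
    return {q: matmul(A[q], B[q]) for q in sorted(A.keys() & B.keys())}


def to_dense(blocks: BlockMap) -> Matrix:
    """Place the blocks on the diagonal of a dense matrix (sorted by charge)."""
    charges = sorted(blocks)
    n_rows = sum(len(blocks[q]) for q in charges)
    n_cols = sum(len(blocks[q][0]) for q in charges)
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    r0 = c0 = 0
    for q in charges:
        block = blocks[q]
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                dense[r0 + i][c0 + j] = value
        r0 += len(block)
        c0 += len(block[0])
    return dense


# Two maps with a 2x2 block in charge sector 0 and a 1x1 block in sector 1.
A = {0: [[1.0, 2.0], [3.0, 4.0]], 1: [[5.0]]}
B = {0: [[0.0, 1.0], [1.0, 0.0]], 1: [[2.0]]}

# The blockwise product agrees with the dense product but skips the zeros.
assert to_dense(block_matmul(A, B)) == matmul(to_dense(A), to_dense(B))
```

Non-abelian and anyonic symmetries require more structure than a per-charge dictionary of blocks, which is the part the paper addresses through its "TensorMap" type.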