Low-Cost Multi-Precision Systolic Arrays for Accelerating FHE NTTs on AI ASICs
George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos
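AI summary: To address the performance bottleneck that precision mismatch causes for FHE on AI hardware, the paper proposes a minimally modified multi-precision systolic array that natively performs full-precision output reconstruction under a uniform dataflow, achieving a speedup of at least 1.33x.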
Details
Fully Homomorphic Encryption (FHE) ensures robust data privacy but suffers from prohibitive computational overhead. Accelerating FHE on AI hardware such as Tensor Processing Units (TPUs) is promising, yet fundamentally limited by a precision mismatch: TPUs are optimized for 8-bit arithmetic, whereas FHE and its critical parts, such as the Number Theoretic Transform (NTT), demand high precision. Current approaches bridge this gap by using matrix decomposition to execute NTT computations on low-precision matrix engines. However, reconstructing the full-precision results requires shift-and-add accumulation, which does not match the dataflow of matrix multiplication. Reconstruction must therefore be offloaded from the matrix engines to vector processors, which disrupts the matrix multiplication dataflow and creates a significant performance bottleneck. To resolve this limitation, we propose a minimally modified multi-precision systolic array that performs full-precision output reconstruction natively within the array, in sync with low-precision matrix multiplication and under a uniform dataflow. Synthesized at 7 nm with OpenROAD, our design incurs negligible hardware overhead. Cycle-accurate simulations using SCALE-Sim demonstrate that natively executing NTTs on the proposed architecture achieves a speedup of at least 1.33x for transform sizes from 2^12 to 2^16 on 128x128 matrix engines, enabling standard AI hardware to support high-precision FHE acceleration.
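The decomposition-and-reconstruction step that the abstract describes can be illustrated with a small software sketch. It is not the authors' implementation. It assumes 8-bit limbs, four limbs per operand, the modulus 65537 with generator 3, and hypothetical helper names (`split_limbs`, `matmul`, `decomposed_matmul_mod`), none of which come from the paper. The sketch expresses a small NTT as a matrix product, runs it as products of 8-bit limb matrices, and rebuilds the full-precision result with the shift-and-add accumulation that the abstract says falls outside the matrix multiplication dataflow.

```python
# Illustrative sketch only: limb width, limb count, modulus, and helper names
# are assumptions for this example, not details taken from the paper.
LIMB_BITS = 8   # width of each low-precision slice
NUM_LIMBS = 4   # 4 x 8-bit limbs cover operands below 2**32


def split_limbs(matrix):
    """Split every entry into NUM_LIMBS 8-bit slices, least significant first."""
    mask = (1 << LIMB_BITS) - 1
    return [
        [[(x >> (LIMB_BITS * k)) & mask for x in row] for row in matrix]
        for k in range(NUM_LIMBS)
    ]


def matmul(a, b):
    """Plain integer matrix product, standing in for a low-precision matrix engine."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
        for row in a
    ]


def decomposed_matmul_mod(a, b, q):
    """Multiply via 8-bit limb products, then reconstruct with shift-and-add."""
    a_limbs, b_limbs = split_limbs(a), split_limbs(b)
    rows, cols = len(a), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    for i, a_i in enumerate(a_limbs):
        for j, b_j in enumerate(b_limbs):
            partial = matmul(a_i, b_j)     # low-precision product of two limb matrices
            shift = LIMB_BITS * (i + j)    # weight of this limb pair
            for r in range(rows):
                for c in range(cols):
                    acc[r][c] += partial[r][c] << shift  # shift-and-add accumulation
    return [[v % q for v in row] for row in acc]


# Small NTT written as a matrix-vector product over Z_q (assumed parameters).
q = 65537                          # Fermat prime; 3 is a primitive root mod q
n = 8                              # toy transform size
omega = pow(3, (q - 1) // n, q)    # primitive n-th root of unity mod q
ntt_matrix = [[pow(omega, i * j, q) for j in range(n)] for i in range(n)]
x = [[(12345 * (k + 1)) % q] for k in range(n)]   # input as an n x 1 column

direct = [[v % q for v in row] for row in matmul(ntt_matrix, x)]
via_limbs = decomposed_matmul_mod(ntt_matrix, x, q)
print(via_limbs == direct)
```

In this sketch the inner `acc[r][c] += partial[r][c] << shift` loop is the reconstruction step. On hardware following the current approaches described above, that step would run on a vector processor, whereas the proposed array performs it inside the systolic array alongside the limb products.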