Randomized Sketching is Robust to Low-Precision Rounding on GPUs
Aryaman Jeendgar, Clément Flint, Hartwig Anzt
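AI summary: Studies the performance and accuracy of randomized sketching at low precision on GPUs, presents a SparseStack generalization of CountSketch, and finds that the FP16 rounding mode has little effect on embedding quality; the sketch distribution matters more than the quantization rule.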
Comments: 14 pages, 3 figures
Randomized sketching is a core primitive in randomized numerical linear algebra. On modern hardware architectures, in particular on GPUs, the performance of sparse sketches is limited by memory traffic and atomic accumulation rather than by floating-point throughput. This makes sketching a natural target for mixed precision, provided that low-precision accumulation does not degrade the embedding quality. We study mixed-precision GPU implementations of sparse oblivious subspace embeddings, focusing on a SparseStack generalization of the GPU CountSketch kernel of Higgins et al. SparseStack improves embedding quality relative to CountSketch on coherent inputs, but its additional nonzeros per column increase atomic-update contention and reduce throughput. We therefore implement FP16 SparseStack variants using deterministic round-to-nearest, exact stochastic rounding, and dithered rounding, and compare them with FP32 SparseStack, CountSketch, mixed-precision CountSketch, and FlashSketch. Our main empirical finding is that, for the tested regimes, SparseStack embedding quality is insensitive to the FP16 rounding rule. FP16 SparseStack with deterministic, stochastic, and dithered rounding produces nearly identical subspace distortion and sketch-and-solve least-squares accuracy across incoherent, coherent, and adversarial test problems. The dominant accuracy factor is the sketch distribution rather than the quantization rule: SparseStack variants substantially improve distortion on coherent inputs, while all methods behave similarly on incoherent inputs. Since deterministic rounding has the lowest overhead, it provides the best performance-accuracy tradeoff among the FP16 SparseStack variants.
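The comparison described in the abstract can be illustrated with a small CPU emulation in NumPy. This is only a sketch under stated assumptions, not the authors' GPU kernels. It assumes that SparseStack stacks k independent CountSketch blocks of d/k rows each, scaled by 1/sqrt(k), so that k = 1 reduces to CountSketch. FP16 accumulation is emulated by rounding after every scatter-add. Stochastic rounding uses a common bit trick that is valid only for values in the FP16 normal range. Distortion is taken as the largest deviation of the squared singular values of SQ from 1, which is one common definition. All function names are hypothetical, and dithered rounding is omitted because the abstract does not define it.

```python
import numpy as np

def stochastic_round_fp16(x, rng):
    """Round float32 values to FP16 stochastically (assumes FP16 normal range).

    Adds uniform noise to the 13 float32 mantissa bits that FP16 drops, then
    truncates them, so the magnitude rounds up with probability equal to the
    dropped fraction.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 13, size=bits.shape, dtype=np.uint32)
    bits = (bits + noise) & np.uint32(0xFFFFE000)
    return bits.view(np.float32).astype(np.float16)

def draw_sparse_stack(n, d, k, rng):
    """Draw row indices and signs: one nonzero per column in each of k blocks."""
    block = d // k
    rows = rng.integers(0, block, size=(k, n)) + block * np.arange(k)[:, None]
    signs = rng.choice(np.array([-1.0, 1.0], dtype=np.float32), size=(k, n))
    return rows, signs

def apply_sketch(A, rows, signs, d, round_fn=None):
    """Compute S @ A by sequential scatter-adds, as atomic updates would.

    round_fn=None accumulates in FP32; otherwise the accumulator is FP16 and
    round_fn maps each float32 partial sum back to FP16.
    """
    k, n = rows.shape
    scale = np.float32(1.0 / np.sqrt(k))
    acc_dtype = np.float32 if round_fn is None else np.float16
    SA = np.zeros((d, A.shape[1]), dtype=acc_dtype)
    for i in range(n):
        for b in range(k):
            r = rows[b, i]
            total = SA[r].astype(np.float32) + scale * signs[b, i] * A[i]
            SA[r] = total if round_fn is None else round_fn(total)
    return SA

def distortion(SQ):
    """Largest deviation of squared singular values of SQ from 1."""
    s = np.linalg.svd(SQ.astype(np.float64), compute_uv=False)
    return max(abs(s[0] ** 2 - 1.0), abs(s[-1] ** 2 - 1.0))

def compare_roundings(n=1024, p=4, d=64, k=4, seed=0):
    """Apply one fixed sketch to an incoherent orthonormal basis three ways."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    Q = Q.astype(np.float32)
    rows, signs = draw_sparse_stack(n, d, k, rng)
    return {
        "fp32": distortion(apply_sketch(Q, rows, signs, d)),
        "fp16_rn": distortion(
            apply_sketch(Q, rows, signs, d, lambda t: t.astype(np.float16))),
        "fp16_sr": distortion(
            apply_sketch(Q, rows, signs, d,
                         lambda t: stochastic_round_fp16(t, rng))),
    }
```

Calling `compare_roundings()` returns the three distortions for the same random sketch, so any gap between them is due to the rounding rule alone. Passing `k=1` gives the CountSketch case under the assumed construction.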