Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode
Josef Chen
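AI summary: By measuring batch-1 autoregressive decode performance across different GPUs, this paper finds that physical-AI inference is limited not only by memory bandwidth but also by launch overhead, and notes that the real gains from quantised paths depend on the runtime implementation.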
Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
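The memory-floor argument in the abstract can be made concrete with a small calculation. The sketch below is one plausible way to compute an analytic floor, not necessarily the authors' exact definition: it assumes every decode step reads all weights plus the full K and V cache exactly once. The model shape (about 7.6e9 parameters, 28 layers, 4 KV heads, head dimension 128) and the 300 GB/s peak bandwidth for L4 are assumptions taken from public specifications, not from the abstract. Only the ms/step latencies come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Shape of a GQA transformer. Values passed in below are assumptions."""
    n_params: float          # total parameter count
    n_layers: int            # number of transformer layers
    n_kv_heads: int          # key/value heads under grouped-query attention
    head_dim: int            # dimension of each attention head
    bytes_per_elem: int = 2  # bf16 stores 2 bytes per element


def bytes_per_decode_step(m: ModelShape, ctx: int) -> float:
    """Bytes streamed per generated token: all weights plus the K and V cache."""
    weight_bytes = m.n_params * m.bytes_per_elem
    # Factor 2 covers K and V; one cached vector per layer, KV head and position.
    kv_bytes = 2 * m.n_layers * m.n_kv_heads * m.head_dim * ctx * m.bytes_per_elem
    return weight_bytes + kv_bytes


def memory_floor_ms(m: ModelShape, ctx: int, peak_gb_per_s: float) -> float:
    """Lower bound on step latency if memory ran at peak bandwidth."""
    return bytes_per_decode_step(m, ctx) / (peak_gb_per_s * 1e9) * 1e3


def fraction_of_floor(floor_ms: float, measured_ms: float) -> float:
    """Share of the analytic floor that a measured latency achieves."""
    return floor_ms / measured_ms


# Assumed shape for a Qwen-2.5-7B-like model (hypothetical inputs, see lead-in).
qwen_like = ModelShape(n_params=7.6e9, n_layers=28, n_kv_heads=4, head_dim=128)

# Assumed L4 peak bandwidth of 300 GB/s; 62.32 ms/step is the bf16 baseline
# quoted in the abstract.
L4_BF16_MS = 62.32
l4_floor = memory_floor_ms(qwen_like, ctx=2048, peak_gb_per_s=300.0)
l4_fraction = fraction_of_floor(l4_floor, L4_BF16_MS)
print(f"L4 floor: {l4_floor:.1f} ms, fraction achieved: {l4_fraction:.2f}")

# Realised speedup of each quantised path over bf16, against the ideal 4x
# weight-traffic reduction of 4-bit weights (latencies from the abstract).
quantised_ms = {"bnb-nf4": 59.36, "AutoAWQ+Marlin": 45.24, "GPTQ+ExLlamaV2": 17.36}
speedups = {name: L4_BF16_MS / ms for name, ms in quantised_ms.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x of an ideal 4x")
```

Under these assumptions the L4 floor comes out near 51 ms, or about 0.82 of the measured 62.32 ms, which is close to the roughly 81 percent quoted above. The realised quantisation speedups are about 1.05x for bnb-nf4, 1.38x for AutoAWQ+Marlin and 3.59x for GPTQ+ExLlamaV2, so only the last path approaches the 4x that the reduced weight traffic would suggest.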