ARGUS: Production-Scale Tracing and Performance Diagnosis for over 10,000-GPU Clusters
Jiasheng Zhou, Longbin Zeng, Clavis Chen, Ruiming Lu, Qinwei Yang, Leyi Ye, Ray Ying, Key Zhang
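AI summary: Proposes ARGUS, a low-overhead, fine-grained, always-on tracing and real-time analysis system. By decomposing observation along the training call hierarchy, building a unified data pipeline, and applying a progressive diagnosis framework, it delivers continuous fault detection and performance optimization at under 2% overhead on clusters of more than 10,000 GPUs.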
Details
Large-scale LLM training requires always-on, fine-grained observability for effective performance diagnosis at scale. Coarse resource monitors alone cannot localize root causes, and fine-grained profilers incur prohibitive (5%-30%) overheads and massive trace volumes, making always-on deployment impractical in large production clusters. We propose ARGUS, a low-overhead, fine-grained, always-on tracing and real-time analysis system for training workloads in 10,000+ GPU-scale production clusters. ARGUS decomposes observation along the training call hierarchy into CPU call stacks, framework semantics, and GPU kernel execution, with always-on collection under a combined overhead of less than 2%. It builds a unified data pipeline and compresses raw kernel events by approximately 3,700x from 10 MB to 2.7 KB per rank per step. Its progressive diagnosis framework automatically isolates anomalous windows, straggler ranks, and degraded kernels through iteration-time, phase-level, and kernel-level analysis. Deployed for over six months on a 10,000+ GPU production cluster, ARGUS has supported continuous fail-slow detection and performance optimization. Our case studies further demonstrate its effectiveness across representative anomalies, including compute stragglers, link degradation, pipeline-bubble amplification, FlashAttention JIT stalls, and compute stragglers masked by communication symptoms.
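The abstract names the three stages of progressive diagnosis (iteration-time, phase-level, kernel-level) but not the rules behind them. The following is only a minimal sketch of how such a narrowing procedure could look, not ARGUS's actual method: the median-ratio rule, the `factor` threshold, the phase and kernel names, the data layout, and the helpers `slow_indices` and `diagnose` are all hypothetical and chosen purely for illustration.

```python
from statistics import median


def slow_indices(values, factor=1.2):
    """Indices whose value exceeds `factor` times the median (illustrative rule)."""
    med = median(values)
    return [i for i, v in enumerate(values) if v > factor * med]


def diagnose(step_times, phase_times, kernel_times, factor=1.2):
    """Narrow from anomalous steps to straggler ranks to degraded kernels.

    step_times:   list of iteration times, indexed by step
    phase_times:  {step: [ {phase: seconds} per rank ]}
    kernel_times: {step: [ {kernel: seconds} per rank ]}
    All three layouts are assumptions made for this sketch.
    """
    findings = []
    # Stage 1: iteration-time analysis flags anomalous steps.
    for step in slow_indices(step_times, factor):
        # Stage 2: phase-level analysis finds ranks slow in the compute phase.
        compute = [phases["compute"] for phases in phase_times[step]]
        for rank in slow_indices(compute, factor):
            # Stage 3: kernel-level analysis compares each kernel on the
            # straggler with the median of that kernel across all ranks.
            per_rank = kernel_times[step]
            degraded = [
                name
                for name, duration in per_rank[rank].items()
                if duration > factor * median(r[name] for r in per_rank)
            ]
            findings.append(
                {"step": step, "rank": rank, "kernels": sorted(degraded)}
            )
    return findings


# Synthetic example: step 2 is slow, rank 2 is the straggler, "gemm" is degraded.
step_times = [1.0, 1.0, 1.5, 1.0]
phase_times = {
    2: [
        {"compute": 0.8, "comm": 0.7},
        {"compute": 0.8, "comm": 0.7},
        {"compute": 1.3, "comm": 0.2},
        {"compute": 0.8, "comm": 0.7},
    ]
}
kernel_times = {
    2: [
        {"gemm": 0.5, "attn": 0.3},
        {"gemm": 0.5, "attn": 0.3},
        {"gemm": 1.0, "attn": 0.3},
        {"gemm": 0.5, "attn": 0.3},
    ]
}
findings = diagnose(step_times, phase_times, kernel_times)
print(findings)
```

In the synthetic data, the straggler shows a long compute phase and a short communication phase while its peers show the reverse, which is the pattern a phase-level view can separate and a job-level iteration time alone cannot. A production system would presumably need more robust statistics and streaming aggregation than this per-step comparison.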