Fair Comparison of Scheduling Algorithms on Heterogeneous Edge Clusters: A Continuous Adaptive Benchmark
Zihang Wang, Boris Sedlak, Juan Luis Herrera, Schahram Dustdar
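AI summary: Presents an open-source benchmark platform for fairly comparing continuous multi-mode scheduling algorithms on heterogeneous edge clusters. Using a unified interface, a closed-loop workload driver, and dual-metric SLO scoring, it shows that controller rankings depend strongly on configuration and that separating raw SLO from steady-state SLO exposes switching costs.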
Modern Artificial Intelligence (AI) workloads deployed across the heterogeneous tiers of an edge-cloud continuum must satisfy multi-dimensional Service Level Objectives (SLOs) over latency, throughput, and output quality. For each incoming task, the scheduler picks both a target node and a processing mode (e.g., full or reduced inference precision). We call this class of problems Continuous Multi-Mode Scheduling (CMMS). Comparing CMMS algorithms fairly is difficult because prior studies typically evaluate each controller in its own stack, under a single workload, and without reporting per-decision overhead. To close these gaps, we present an open-source benchmark platform that features (i) a unified controller interface, (ii) a closed-loop workload driver covering multiple workload patterns, and (iii) dual-metric SLO scoring that reports raw SLO (overall compliance) and steady-state SLO (compliance during stable operation) separately. Running six controllers across five cluster configurations and two load regimes (424 episodes), we find that controller rankings are strongly configuration-dependent: a deep reinforcement-learning controller that wins under light workloads loses to a rule-based heuristic by nearly 29 percentage points once load intensifies, while incurring roughly 500× the per-decision operational overhead. We further show that separating raw from steady-state SLOs exposes switching costs that a single aggregate score would otherwise conflate.
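To make the dual-metric idea concrete, here is a minimal Python sketch of how raw and steady-state SLO compliance could diverge on one episode. It is an illustration under stated assumptions, not the platform's actual scoring code: the `Step` record, the function names, and the `settle` window (the number of steps discarded after each switch) are all hypothetical, and the abstract does not specify how "stable operation" is delimited.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One scheduling decision (hypothetical record layout)."""
    slo_met: bool   # True if the task met every SLO dimension
    switched: bool  # True if the controller changed node or mode here


def raw_slo(steps: List[Step]) -> float:
    """Fraction of all steps that met the SLO (overall compliance)."""
    if not steps:
        return 0.0
    return sum(s.slo_met for s in steps) / len(steps)


def steady_state_slo(steps: List[Step], settle: int = 1) -> float:
    """Compliance over steps outside switch windows.

    Assumption: a switch step and the `settle` steps that follow it
    count as transient and are excluded from the score.
    """
    cooldown = 0
    kept: List[Step] = []
    for s in steps:
        if s.switched:
            cooldown = settle + 1  # the switch step plus `settle` followers
        if cooldown > 0:
            cooldown -= 1
            continue  # transient step: not part of stable operation
        kept.append(s)
    if not kept:
        return 0.0
    return sum(s.slo_met for s in kept) / len(kept)


# Toy episode: one switch at index 2 that causes two SLO misses.
episode = [
    Step(True, False),
    Step(True, False),
    Step(False, True),
    Step(False, False),
    Step(True, False),
    Step(True, False),
    Step(False, False),
    Step(True, False),
]

print(f"raw SLO:          {raw_slo(episode):.3f}")
print(f"steady-state SLO: {steady_state_slo(episode, settle=1):.3f}")
```

In this toy episode the gap between the two scores comes entirely from the steps around the switch, which is the kind of switching cost a single aggregate number would hide.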