arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.20374 2026-06-19 cs.DC New submission

ARGUS: Production-Scale Tracing and Performance Diagnosis for over 10,000-GPU Clusters

Jiasheng Zhou, Longbin Zeng, Clavis Chen, Ruiming Lu, Qinwei Yang, Leyi Ye, Ray Ying, Key Zhang

AI summary: Proposes ARGUS, a low-overhead, fine-grained, always-on tracing and real-time analysis system. By decomposing the training call hierarchy, unifying the data pipeline, and applying a progressive diagnosis framework, it delivers continuous fault detection and performance optimization at under 2% overhead on clusters of more than 10,000 GPUs.

Abstract

Large-scale LLM training requires always-on, fine-grained observability for effective performance diagnosis at scale. Coarse resource monitors alone cannot localize root causes, and fine-grained profilers incur prohibitive (5%-30%) overheads and massive trace volumes, making always-on deployment impractical in large production clusters. We propose ARGUS, a low-overhead, fine-grained, always-on tracing and real-time analysis system for training workloads in 10,000+ GPU-scale production clusters. ARGUS decomposes observation along the training call hierarchy into CPU call stacks, framework semantics, and GPU kernel execution, with always-on collection under a combined overhead of less than 2%. It builds a unified data pipeline and compresses raw kernel events by approximately 3,700x from 10 MB to 2.7 KB per rank per step. Its progressive diagnosis framework automatically isolates anomalous windows, straggler ranks, and degraded kernels through iteration-time, phase-level, and kernel-level analysis. Deployed for over six months on a 10,000+ GPU production cluster, ARGUS has supported continuous fail-slow detection and performance optimization. Our case studies further demonstrate its effectiveness across representative anomalies, including compute stragglers, link degradation, pipeline-bubble amplification, FlashAttention JIT stalls, and compute stragglers masked by communication symptoms.
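The compression figures in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 MB = 10^6 bytes, 1 KB = 10^3 bytes) and a hypothetical cluster of exactly 10,000 ranks, neither of which the abstract states.

```python
# Back-of-the-envelope check of the reported trace compression.
# Assumptions (not from the abstract): decimal units, exactly 10,000 ranks.
RAW_BYTES_PER_RANK_STEP = 10 * 10**6   # 10 MB of raw kernel events
COMPRESSED_BYTES_PER_RANK_STEP = 2_700  # 2.7 KB after the unified pipeline
HYPOTHETICAL_RANKS = 10_000


def compression_ratio(raw_bytes: float, compressed_bytes: float) -> float:
    """Ratio of raw to compressed trace volume."""
    return raw_bytes / compressed_bytes


def cluster_bytes_per_step(bytes_per_rank: float, ranks: int) -> float:
    """Trace volume produced by the whole cluster in one training step."""
    return bytes_per_rank * ranks


ratio = compression_ratio(RAW_BYTES_PER_RANK_STEP, COMPRESSED_BYTES_PER_RANK_STEP)
raw_total = cluster_bytes_per_step(RAW_BYTES_PER_RANK_STEP, HYPOTHETICAL_RANKS)
compressed_total = cluster_bytes_per_step(
    COMPRESSED_BYTES_PER_RANK_STEP, HYPOTHETICAL_RANKS
)

print(f"compression ratio: about {ratio:.0f}x")
print(f"raw volume per step: {raw_total / 10**9:.0f} GB")
print(f"compressed volume per step: {compressed_total / 10**6:.0f} MB")
```

Under these assumptions the ratio comes out near 3,700x, consistent with the abstract, and the per-step volume for the whole cluster drops from about 100 GB to about 27 MB.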

2606.19989 2026-06-19 cs.DC cs.LG New submission

Online Dynamic Batching with Formal Guarantees for LLM Training

Dian Li, Zekun Wang, Yaoru Wang, Jiahong Yan

AI summary: Proposes Online Dynamic Batching (ODB), a system that defers batch construction on the DataLoader side until each sample's true cost is observable, addressing the invisibility of preprocessing costs in offline batch sampling. It achieves 1.58-4.43x throughput gains and provides formal guarantees of deadlock-free bounded termination.

Comments 29 pages, 3 figures, 21 tables

Abstract

Modern LLM training breaks a core assumption behind offline batch samplers: the true training cost of a sample is only observable after preprocessing, augmentation, templating, tokenization, and multimodal visual-token expansion. Unless one pays for a preprocessing- and augmentation-dependent length cache, batch construction is therefore blind to the quantity that determines padding, memory use, and GPU saturation. We introduce Online Dynamic Batching (ODB), a DataLoader-side drop-in system that moves batch formation to this point of accurate observability while preserving DDP step alignment. We formalize this synchronization requirement as the Distributed Group Alignment Problem and prove deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure. ODB requires no model, optimizer, or attention-kernel changes and is released as online-dynamic-batching with lightweight trainer adapters. Across public 2B/8B Qwen3-VL runs on UltraChat/LLaVA/ShareGPT4o, ODB improves literal emitted-sample throughput vs. fixed-batch Standard by 1.58-2.51x on single-node Full FT/LoRA and 1.71-3.78x on two-node Full FT, with Standard-comparable quality; production MM-Mix reaches 4.43x. Against GMT/BMT offline token-budget oracles, ODB is within 15% on UltraChat/LLaVA and faster on high-CV ShareGPT4o: 2.24-2.39x single-node Full FT/LoRA and 3.06-3.69x two-node Full FT. Together, ODB occupies the online/drop-in regime for high-heterogeneity LLM fine-tuning: large throughput gains at Standard-comparable quality, formal DGAP guarantees, and no length-cache precompute or kernel rewrites.
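To make the padding argument concrete, here is a minimal sketch comparing fixed-size batching with a greedy token-budget batcher that runs once true sample lengths are known. It is not the ODB algorithm: the function names, the greedy closing rule, and the sample lengths are all hypothetical, and the DDP step alignment that the paper formalizes is not modeled at all.

```python
# Illustrative only: fixed-size batches vs. a greedy padded-token budget.
# All names and numbers here are hypothetical, not taken from ODB.

def fixed_batches(lengths, batch_size):
    """Group samples in arrival order into batches of a fixed sample count."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def token_budget_batches(lengths, max_padded_tokens):
    """Close the current batch when adding the next sample would push the
    padded size (longest sample * sample count) over the budget."""
    batches, current = [], []
    for n in lengths:
        candidate_max = max(current + [n])
        if current and candidate_max * (len(current) + 1) > max_padded_tokens:
            batches.append(current)
            current = []
        current.append(n)  # an oversized sample ends up alone in its batch
    if current:
        batches.append(current)
    return batches


def padded_tokens(batches):
    """Tokens actually processed when every batch is padded to its longest sample."""
    return sum(max(b) * len(b) for b in batches)


# Post-tokenization lengths with a few very long (e.g. multimodal) samples.
lengths = [100, 120, 2000, 90, 110, 1900, 80, 95]
real_tokens = sum(lengths)
fixed_cost = padded_tokens(fixed_batches(lengths, batch_size=4))
online_cost = padded_tokens(token_budget_batches(lengths, max_padded_tokens=2000))

print(real_tokens, fixed_cost, online_cost)  # 4495 15600 4550
```

In this toy run the fixed batcher pads 4,495 real tokens up to 15,600, while the budgeted batcher processes 4,550, which is the kind of waste that batching after true lengths are observable can avoid.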

2606.19869 2026-06-19 cs.DC New submission

EVM Workloads in the Wild: Evidence for Multi-Dimensional Gas Metering, State Growth, Delayed Execution, and Parallelism

Lioba Heimbach, Kushal Babel, Jason Milionis

AI summary: Analyzes block traces from Ethereum L1 and Base L2 in 2025 and finds that the resource mix is unstable, state growth is underpriced, and execution outcomes are sensitive to historical state, providing an empirical foundation for multi-dimensional gas metering and explicit pricing of state growth.

Abstract

Gas metering on EVM-compatible blockchains assumes that execution conditions are stable: that the resource mix is constant enough to justify collapsing execution costs into a single scalar with fixed relative prices, and that state drift between submission and execution does not materially alter a transaction's outcome. We measure the extent to which this assumption fails. We present a trace-level measurement study of EVM workloads on Ethereum (L1) and Base (L2) throughout 2025, sampling 3,000 blocks per day per chain. We decompose each transaction into opcode-level execution gas, intrinsic gas, refunds, and persistent state deltas. To measure state sensitivity, we re-execute transactions from September 2025 on older states and record how gas usage and storage access patterns change. We find the resource mix to be far from stable: on Base, storage reads and compute account for 29.2% and 24.3% of execution gas, while Ethereum devotes 34.9% to storage writes. Ethereum's gas limit doubling during 2025 shifted its own profile toward compute-heavier, Base-like patterns. Base also exhibits a higher fraction of cold storage reads (49.7% versus 39.6% on Ethereum). Persistent state growth, a permanent cost priced as a transient one, reaches 456 GB on Base versus 38 GB on Ethereum. Execution outcomes are equally unstable: gas estimates vary across nearby historical states for 46.0% of transactions on Base, compared to 13.9% on Ethereum, with especially high sensitivity for MEV and DeFi activity. Storage access patterns also diverge across states, limiting the effectiveness of access lists and complicating parallel execution. Our work provides an empirical foundation for multi-dimensional gas metering and explicit pricing of state growth. The results also show that state-sensitive execution behavior complicates workload estimation, directly affecting transaction predictability and user experience.

2606.19746 2026-06-19 cs.DC New submission

SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL

Ruiyang Ma, Teng Ma, Junru Li, Hantian Zha, Xuchun Shang, Qingda Hu, Zheng Liu, Xinjun Yang, Tao Ma, Guojie Luo

AI summary: Targets the bottleneck caused by transferring the full KV cache during long-context inference with sparse attention models. Proposes SAC, a disaggregated cache system that uses CXL to fetch top-k KV entries on demand, achieving 2.1x higher throughput and 9.7x lower TTFT than RDMA-based approaches.

Abstract

The scaling of LLMs toward long-context inference has shifted the primary serving system bottleneck from computation to memory capacity. Traditional solutions for dense attention models rely on RDMA-based disaggregated memory pools, which perform coarse-grained fetching of the entire prefix KV cache from remote storage to local memory before decoding. However, this approach is fundamentally inefficient for emerging sparse attention models. While only a small fraction of KV entries are active during decoding, these systems still fetch the full KV cache locally, leading to severe transmission bottlenecks and local memory wastage. To address this, we propose SAC, the first efficient disaggregated KV cache system optimized for sparse attention models. By leveraging the low-latency, cache-line granularity load/store semantics of Compute Express Link (CXL), SAC fetches only the required top-k KV entries on demand during inference. Evaluations on DeepSeek-V3.2 using SGLang show that SAC achieves 2.1x higher throughput, 9.7x lower TTFT, and 1.8x lower TBT compared to RDMA-based baselines, establishing CXL-based disaggregation as the superior infrastructure for emerging sparse attention models.
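The gap between fetching the whole prefix and fetching only the active entries can be illustrated with a small sketch. Everything here is assumed for illustration: the helper names, the relevance scores, the prefix length, the value of k, and the per-entry size are hypothetical and are not taken from SAC or DeepSeek-V3.2.

```python
import heapq

# Illustrative only: bytes moved by a full-prefix fetch vs. an on-demand
# top-k fetch. All sizes and names below are hypothetical.


def select_top_k(scores, k):
    """Indices of the k highest-scoring KV entries, in position order."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


def bytes_full_prefix(num_tokens, bytes_per_entry):
    """Coarse-grained scheme: copy every KV entry of the prefix locally."""
    return num_tokens * bytes_per_entry


def bytes_top_k(k, bytes_per_entry):
    """Fine-grained scheme: load only the k entries attention will read."""
    return k * bytes_per_entry


toy_scores = [0.1, 0.9, 0.3, 0.8, 0.05]
picked = select_top_k(toy_scores, k=2)  # positions 1 and 3

full = bytes_full_prefix(num_tokens=128_000, bytes_per_entry=1024)
sparse = bytes_top_k(k=2048, bytes_per_entry=1024)
print(picked, full / sparse)
```

With these made-up sizes the on-demand path moves 62.5 times fewer bytes per decoding step per layer, which is why a load/store interconnect with cache-line granularity suits this access pattern better than bulk transfers.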

2606.19529 2026-06-19 cs.DC New submission

The Sheaf Laplacian: A Topological Framework for Data Fusion and Consensus in Distributed Sensing Networks

Manuel Hernández, Eduardo Sánchez-Soto

AI summary: Proposes sheaf theory as an alternative to traditional graph models and uses the sheaf Laplacian for data fusion and consensus in heterogeneous distributed sensing networks.

Abstract

We argue here that traditional network models, which are overwhelmingly based on the mathematical construct of a simple graph, are fundamentally insufficient for capturing the complexity of modern distributed systems. Such systems are characterized by heterogeneous agents with diverse capabilities, high-dimensional and multi-modal data streams, and intricate, context-dependent relationships that cannot be adequately described by a simple connection or a scalar weight. The limitations of these classical models necessitate a new mathematical language, one with far greater expressive power. We have found that sheaf theory provides us with such a language. Moreover, we show that the sheaf Laplacian is a suitable mechanism for data fusion and establishing consensus within distributed sensing networks.
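For readers unfamiliar with the construction, the following sketch shows the standard sheaf Laplacian $L = \delta^{\top}\delta$ of a cellular sheaf on a graph, together with the diffusion $x \leftarrow x - \eta L x$ that drives node data toward agreement on each edge. The two-sensor network, the restriction maps, and the step size are hypothetical choices made for this example, not taken from the paper.

```python
# Minimal sheaf-Laplacian diffusion in pure Python (illustrative assumptions:
# the network, restriction maps, and step size are invented for this example).


def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_t_vec(M, v):
    """Transposed product M^T v."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]


def coboundary(edges, x):
    """(delta x)_e = F_{v->e} x_v - F_{u->e} x_u for each edge e = (u, v)."""
    return [
        [b - a for a, b in zip(matvec(Fu, x[u]), matvec(Fv, x[v]))]
        for (u, v, Fu, Fv) in edges
    ]


def sheaf_laplacian_apply(edges, x):
    """Compute L x = delta^T (delta x) without forming L explicitly."""
    out = {node: [0.0] * len(val) for node, val in x.items()}
    for (u, v, Fu, Fv), d in zip(edges, coboundary(edges, x)):
        for j, g in enumerate(mat_t_vec(Fu, d)):
            out[u][j] -= g
        for j, g in enumerate(mat_t_vec(Fv, d)):
            out[v][j] += g
    return out


def diffuse(edges, x, step=0.2, iters=200):
    """Gradient flow on the disagreement energy |delta x|^2 / 2."""
    for _ in range(iters):
        Lx = sheaf_laplacian_apply(edges, x)
        x = {n: [xi - step * li for xi, li in zip(x[n], Lx[n])] for n in x}
    return x


# Sensor "u" reports a 2-D position (x, y); sensor "v" reports only x.
# The edge compares x-coordinates, so the restriction maps are [1 0] and [1].
edges = [("u", "v", [[1.0, 0.0]], [[1.0]])]
readings = {"u": [1.0, 5.0], "v": [3.0]}
fused = diffuse(edges, readings)
print(fused)
```

The two sensors have different data dimensions, which a scalar edge weight cannot express. After diffusion the shared x-coordinate settles at 2.0 on both nodes, while the y-coordinate that only `u` observes stays at 5.0.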

2606.19519 2026-06-19 cs.DC New submission

A Topos-Theoretic Interpretation of Blockchain Systems: Sheaves of Consensus and the Logic of Decentralized Truth

Manuel Hernández, Eduardo Sánchez-Soto

AI summary: Proposes topos theory (the theory of categories of sheaves) as the mathematical language for blockchain systems, modeling consensus as the construction of global truth from local consistency and going beyond traditional finite state machine models.

Abstract

The predominant formal models for blockchain systems, particularly smart contracts, have largely been drawn from the classical theory of computation, with the finite state machine (FSM) or labeled transition system serving as the primary conceptual tool. However, the FSM relegates the most difficult and novel aspect of a blockchain -- the achievement of consensus in a decentralized environment -- to a complex, often messy, implementation detail that lies outside the formal model itself. But the process of consensus is not an ancillary feature; it is the very essence of the computational phenomenon. To model it faithfully, a new mathematical language is required. The central thesis of this work is that topos theory, the theory of categories of sheaves, provides the native mathematical language for systems defined by local consistency and the construction of global truth.

2606.19834 2026-06-19 cs.DC cs.IT cs.NI math.IT New submission

Multi-Orientation Edge-Minimum Repair for Non-Redundant Fault-Tolerant Broadcasting in Dense Eisenstein--Jacobi Networks

Bader Albader

AI summary: For dense Eisenstein-Jacobi networks, proposes EJ-MOEM, a multi-orientation edge-minimum repair method that evaluates hexagonal broadcast-tree orientations, selects a fault-aware candidate, contracts the fault-pruned tree, and rebuilds a spanning tree using external component-crossing repair edges. It proves a repair depth of at most t+1 for one fault and at most t+2 for two faults, and all tests through t=200 succeed.

Comments Preprint also available on Zenodo: https://doi.org/10.5281/zenodo.20691537

Abstract

Dense Eisenstein--Jacobi (EJ) networks are degree-six algebraic interconnection networks whose finite quotient geometry is naturally represented by a hexagonal axial-coordinate ball. This paper studies non-redundant one-to-all broadcast repair in the dense EJ network generated by $α=(t+1)+tω$, where $t$ is the network diameter. We propose EJ-MOEM, a multi-orientation edge-minimum repair method that evaluates a constant-size family of hexagonal broadcast-tree orientations, selects a fault-aware candidate, contracts the fault-pruned tree into healthy components, and reconnects these components using external component-crossing repair edges. The resulting structure is a rooted spanning tree of the healthy subgraph: every healthy node receives the message exactly once, no faulty node is used, and the original healthy tree components are preserved. We prove that, for a chosen orientation whose fault-pruned component graph is connected, exactly $c-1$ external repair edges are necessary and sufficient, where $c$ is the number of healthy components. We also prove a depth-certificate theorem for EJ coordinate-reduction trees: every one-fault placement admits a repair of depth at most $t+1$, and every two-fault placement admits a repair of depth at most $t+2$. The proof uses the three-strip representation of EJ hexagons, a sector-suffix attachment lemma, a non-adjacent-sector separation lemma, and a six-direction shielding classification for paired cuts. Extended validation includes exhaustive one- and two-fault enumeration for $t=2,\ldots,12,14,16,18$ (up to $N=1027$ and 525,825 two-fault placements at $t=18$), structured theorem-critical tests through $t=30$, and large random tests through $t=200$, all with 100\% success and no violation of the theorem.
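The validation sizes quoted above can be reproduced with a short calculation. Two assumptions are made here that the abstract does not state explicitly: the node count is the norm $a^2+ab+b^2$ of the generator $a+b\omega$ with $a=t+1$ and $b=t$, and two-fault placements are unordered pairs drawn from the $N-1$ nodes other than the broadcast source. Both readings are consistent with the reported figures for $t=18$.

```python
from math import comb

# Assumptions: N = a^2 + ab + b^2 for alpha = a + b*omega, and faults are
# placed only on the N - 1 non-source nodes. Neither is stated in the abstract.


def dense_ej_nodes(t: int) -> int:
    """Node count of the dense EJ network with diameter t (equals 3t^2 + 3t + 1)."""
    a, b = t + 1, t
    return a * a + a * b + b * b


def two_fault_placements(t: int) -> int:
    """Number of unordered pairs of faulty nodes, excluding the source."""
    return comb(dense_ej_nodes(t) - 1, 2)


for t in (2, 12, 18):
    print(t, dense_ej_nodes(t), two_fault_placements(t))
```

For $t=18$ this gives $N=1027$ nodes and 525,825 two-fault placements, matching the exhaustive enumeration reported in the abstract.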

2606.19833 2026-06-19 cs.DC cs.IT cs.NI math.IT New submission

Fault-Tolerant Shared-Relay Communication in Circulant Interconnection Networks

Bader Albader, Galal Hassan, Mohamed R. Al-Mulla

AI summary: Studies the two-hop fault-tolerant shared-relay problem in directed circulants, builds a network-design framework around the cyclic difference-multiplicity condition, analyzes the relationship between relay redundancy and degree budget, and shows that generator choice critically affects relay survivability.

Comments Preprint also available on Zenodo: https://doi.org/10.5281/zenodo.20691084

Abstract

Circulant interconnection networks provide symmetric addressing, compact generator descriptions, and uniform local connectivity. This paper maps a degree--redundancy landscape for a fault-tolerant two-hop primitive in directed circulants: given $n$ nodes and degree budget $m$, how large can the worst-case shared-relay multiplicity $R(n,m)$ be? A node is a shared relay for an ordered terminal pair if it has outgoing links to both terminals; an $f$-relay-fault-tolerant circulant requires at least $f+1$ such relays for every pair. The underlying feasibility condition is a cyclic difference-multiplicity condition, which we use as a mathematical tool rather than claim as a new object. The contribution is the network-design framework around this tool: the parameters $R(n,m)$ and $D_f(n)$, a negative theorem for interval circulants, relay-table preprocessing and lookup algorithms, adversarial and random failure guarantees, load-balance scope, certified upper-bound interpretation of heuristic designs, exact small-$n$ calibration, a software lookup-versus-search microbenchmark, and a reproducible study of 526,539 generator sets. The results show that generator choice critically determines worst-case relay survivability: optimized threshold designs achieve $f$-relay-fault tolerance within about $1.16$--$1.63$ of the counting lower bound, while standard interval generators can fail structurally even at much larger degrees.
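The shared-relay definition translates directly into a brute-force check for small networks. The sketch below assumes node `w` of the directed circulant has out-links to `w + s mod n` for each generator `s`, considers only pairs of distinct terminals, and reads $R(n,m)$ as the best worst-case multiplicity over all generator sets of size $m$. The function names are hypothetical, and this is not the paper's lookup or preprocessing algorithm.

```python
from itertools import combinations

# Brute-force illustration of shared relays in a directed circulant.
# Assumptions: out-links w -> (w + s) mod n, distinct terminals only, and
# R(n, m) read as max over generator sets of the worst-case multiplicity.


def shared_relays(n, gens, u, v):
    """Nodes with outgoing links to both terminals u and v."""
    S = {g % n for g in gens}
    return [w for w in range(n) if (u - w) % n in S and (v - w) % n in S]


def worst_case_multiplicity(n, gens):
    """Minimum relay count over all pairs; by symmetry it suffices to fix u = 0."""
    return min(len(shared_relays(n, gens, 0, v)) for v in range(1, n))


def is_f_relay_fault_tolerant(n, gens, f):
    """Every terminal pair keeps a relay after any f relay failures."""
    return worst_case_multiplicity(n, gens) >= f + 1


def best_multiplicity(n, m):
    """Exhaustive search over generator sets of size m (small n only)."""
    return max(
        worst_case_multiplicity(n, gens) for gens in combinations(range(1, n), m)
    )


print(worst_case_multiplicity(7, [1, 2, 4]))  # difference set: every pair covered
print(worst_case_multiplicity(7, [1, 2, 3]))  # interval generators: a pair is uncovered
print(best_multiplicity(7, 3))
```

For $n=7$ and degree 3, the generator set $\{1,2,4\}$ gives every ordered pair exactly one shared relay, while the interval $\{1,2,3\}$ leaves some pairs with none. This small case mirrors the abstract's point that generator choice, not degree alone, decides relay survivability.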

2606.19832 2026-06-19 cs.DC cs.IT cs.NI math.IT New submission

Certified Euclidean-Residue Minimal-Alignment Switch Decompositions for Three Edge-Disjoint Hamiltonian Cycles in Eisenstein--Jacobi Networks

Bader Albader

AI summary: For non-coprime Eisenstein-Jacobi networks, proposes a minimal-switch decomposition based on a local switch calculus that constructs three edge-disjoint Hamiltonian cycles, with correctness proved through algebraic complement incidence.

Comments Preprint also available on Zenodo: https://doi.org/10.5281/zenodo.20693870

Abstract

Eisenstein--Jacobi (EJ) networks are degree-six quotient-lattice interconnection networks. For a generator $α=a+bρ$, let $N=a^2+ab+b^2$ and $d=\gcd(a,b)$. If $d=1$, the three natural unit directions already give three edge-disjoint Hamiltonian cycles. If $d>1$, each unit direction splits into $d$ cycles and the EDHC problem becomes a cycle-splicing problem. Existing non-coprime EJ decompositions prove existence by using a rectangular representation and exchange schedules. This paper develops a different, local switch calculus in the natural Cayley geometry. The first two Hamiltonian cycles are built using the minimum possible $d-1$ intercomponent switches each, and the third factor is obtained as the unused edge complement. The contribution is deliberately not a new existence theorem for all non-coprime EJ networks; rather, it is a compact, formula-driven, minimal-switch decomposition for Euclidean-residue families whose complement incidence is proved symbolically. The proof separates four ingredients: component-label collapse, anchor cancellation, noncollision of lifted switch representatives, and connected complement incidence. No infinite-family theorem in this manuscript is proved by finite witnesses or by computational enumeration. The theorem scope is stated for the parameter ranges where an algebraic complement-incidence certificate is written down. Tables and CSV data are used only to verify and reproduce the formulas, never as proof of an infinite-family theorem.

2606.20537 2026-06-19 cs.LG cs.DC Cross-list

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

Liang Su

AI summary: For low-latency, small-batch, on-device physical-AI serving, proposes execution-state capsules, which checkpoint and restore the complete restorable state through a graph-bound mechanism. On an RTX 5090, restore is sub-millisecond and the TTFT speedup ranges from 3.9x to 27x.

Comments 27 pages, 9 figures

Abstract

Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.

2606.20520 2026-06-19 cs.CR cs.AI cs.DC cs.LG Cross-list

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

Jun He, Deying Yu

AI summary: Addresses the lack of mandatory authority verification when autonomous agents execute mutations in production. Proposes the Sovereign Execution Broker (SEB), which enforces authority at runtime through certificate verification, state checks, and scoped identity, and evaluates its security and performance on AWS and Kubernetes.

Comments 19 pages, 6 figures, 10 tables

Abstract

Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.

2606.20128 2026-06-19 cs.SE cs.DC cs.LG Cross-list

The Correctness Illusion in LLM-Generated GPU Kernels

Dipankar Sarkar

AI summary: Using a high-precision CPU reference and op-schema-aware fuzzing, shows that the fixed-shape allclose checks in existing benchmarks fail to detect LLM-style transcription bugs, and proposes and validates a new protocol.

Comments 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace

Abstract

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
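The failure mode described above is easy to reproduce with a CPU stand-in. In the sketch below, a blocked sum "kernel" silently drops the tail when the input length is not a multiple of the block size, a bug that a single check at a convenient shape cannot see. The kernels, block size, shapes, tolerance, and seed are all hypothetical, and `math.fsum` stands in for the high-precision reference; this is not the paper's corpus or harness.

```python
import math
import random

# Illustrative stand-in: a fixed-shape check vs. seeded shape fuzzing.
# Kernels, block size, tolerance, and seeds are invented for this example.
BLOCK = 128


def reference_sum(xs):
    """High-precision reference (exactly rounded float64 summation)."""
    return math.fsum(xs)


def correct_kernel(xs):
    """Blocked reduction that also handles the final partial block."""
    return sum(sum(xs[s:s + BLOCK]) for s in range(0, len(xs), BLOCK))


def buggy_kernel(xs):
    """Transcription-style bug: only full blocks are reduced, the tail is lost."""
    return sum(sum(xs[b * BLOCK:(b + 1) * BLOCK]) for b in range(len(xs) // BLOCK))


def fixed_shape_check(kernel, n=1024, atol=1e-6):
    """One shape, one tolerance: the style of check the abstract criticizes."""
    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return abs(kernel(xs) - reference_sum(xs)) <= atol


def seeded_fuzz(kernel, seed, trials=50, atol=1e-6):
    """Vary the shape under a stored seed; return the first failing case or None."""
    rng = random.Random(seed)
    for trial in range(trials):
        n = rng.randint(1, 2000)
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if abs(kernel(xs) - reference_sum(xs)) > atol:
            return (seed, trial, n)
    return None


print(fixed_shape_check(buggy_kernel))     # passes: 1024 is a multiple of 128
print(seeded_fuzz(buggy_kernel, seed=1234))
print(seeded_fuzz(correct_kernel, seed=1234))
```

Because the fuzzer is driven entirely by the seed, any reported failure can be replayed by calling `seeded_fuzz` again with the same seed, which is the property the abstract refers to as byte-for-byte replay.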

2606.19969 2026-06-19 cs.DB cs.DC Cross-list

The Bi-Channel Networking Paradigm for Database Systems in the Cloud

Georg Kreuzmayr, Muhammad El-Hindi, Benjamin Wagner, Tobias Ziegler, Viktor Leis

AI summary: Addresses the kernel TCP stack becoming a database performance bottleneck on modern high-speed cloud networks. Proposes the bi-channel networking paradigm, which separates communication into a high-performance data path and a reliable control path by combining user-space UDP with kernel TCP, achieving high throughput at low overhead in a distributed shuffle and a replicated key-value store.

Comments Accepted to EDBT 2027 (Lille, France)

Abstract

When network links were slow, cloud and distributed database systems could rely on generic kernel abstractions and treat network communication as a black box. With today's fast cloud networks, this approach breaks down: database performance becomes limited by the CPU overhead of the kernel TCP stack. Replacing TCP with user-space UDP can reduce this overhead, but it requires reimplementing essential guarantees, such as reliability and ordering. To solve this conundrum, database systems should no longer treat networking as a black box but co-design it with database operations. We propose the bi-channel paradigm for database systems, which separates communication into two channels: a high-performance data path for latency- and bandwidth-sensitive operations, and a reliable control path for coordination and recovery. We implement the paradigm by combining user-space UDP and kernel-based TCP, though other stack combinations are possible. This design exploits modern NIC capabilities while preserving TCP's reliability. We demonstrate the paradigm's efficiency and simplicity in two representative settings: a distributed shuffle saturating 200 Gbit/s with three CPU cores, and a replicated key-value store processing millions of messages per second.

2606.19576 2026-06-19 cs.DB cs.DC Cross-list

REMOP: REmote-Memory-aware OPerator Optimization

Shiquan Zhang, Yunhao Mao, Yuqiu Zhang, Gengrui Zhang, Jeyhun Karimov, Hans-Arno Jacobsen

AI summary: Addresses excessive data-transfer rounds in query processing over remote memory. Proposes REMOP, a framework that optimizes out-of-memory execution with transfer-round-aware intra-operator memory policies, implemented for three operators in DuckDB, reducing transfer rounds by up to 97% and operator runtime by up to 48%.

Comments 14 pages, 13 figures, 9 tables. Preprint, under review

Abstract

Remote and disaggregated memory tiers expand the effective memory capacity of analytical database engines, but they also reshape the cost structure of out-of-memory query processing. When an operator spills beyond local DRAM, moving pages to remote memory incurs both data-transfer time and a fixed round-trip latency per transfer. Classical operator analyses and buffer-allocation heuristics primarily target disk spilling by minimizing total I/O volume. Under remote memory, these strategies can be suboptimal because they may trigger excessive transfer rounds. We present REMOP, a remote-memory-aware operator optimization framework that uses transfer-round-aware intra-operator memory policies to improve out-of-memory execution under tight memory budgets. REMOP introduces the number of transfer rounds into the latency cost model and derives operator-specific buffer-partitioning strategies, instantiating the approach for blocked nested-loop join, external merge sort, and external hash join in DuckDB. Our evaluation on a two-node compute-memory testbed shows that REMOP reduces transfer rounds by up to 97% and operator runtime by up to 48% on spill-heavy microbenchmarks, and lowers the average end-to-end runtime of spilling TPC-H and TPC-DS queries by 22.7% and 26.4%, respectively.
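The idea of adding transfer rounds to the cost model can be sketched for a blocked nested-loop join. The model below is a simplified illustration and not REMOP's actual cost model: it charges a fixed round-trip latency per transfer plus a per-page transfer time, ignores the output buffer, and uses hypothetical relation sizes, buffer budget, and latency figures.

```python
from math import ceil

# Simplified, hypothetical cost model: cost = rounds * RTT + pages * page_time.
# Relation sizes, buffer budget, and timings are invented for illustration.


def bnl_join_cost(R, S, B, outer_pages, rtt_s, page_transfer_s):
    """Cost of a blocked nested-loop join when `outer_pages` of the B buffer
    pages hold the outer block and the remainder buffers the inner relation."""
    inner_pages = B - outer_pages
    outer_blocks = ceil(R / outer_pages)
    pages_moved = R + outer_blocks * S  # inner is rescanned per outer block
    rounds = outer_blocks + outer_blocks * ceil(S / inner_pages)
    return rounds * rtt_s + pages_moved * page_transfer_s, rounds, pages_moved


def best_split(R, S, B, rtt_s, page_transfer_s):
    """Outer-block size that minimizes the round-aware cost."""
    return min(
        range(1, B),
        key=lambda o: bnl_join_cost(R, S, B, o, rtt_s, page_transfer_s)[0],
    )


R, S, B = 1000, 1000, 101           # pages in outer, inner, and buffer budget
RTT, PAGE_TIME = 10e-6, 0.4e-6      # 10 us per round, 0.4 us per page

# Volume-minimizing split: give the outer block everything but one page.
volume_cost, volume_rounds, volume_pages = bnl_join_cost(R, S, B, B - 1, RTT, PAGE_TIME)

# Round-aware split: search for the cheapest partition under the same budget.
outer = best_split(R, S, B, RTT, PAGE_TIME)
best_cost, best_rounds, best_pages = bnl_join_cost(R, S, B, outer, RTT, PAGE_TIME)

print(volume_rounds, volume_pages, round(volume_cost, 4))
print(outer, best_rounds, best_pages, round(best_cost, 4))
```

With these numbers the volume-minimizing split moves the fewest pages but needs 10,010 transfer rounds, so round-trip latency dominates. A more balanced split moves more pages in far fewer rounds and has a lower total cost, which is the trade-off the abstract describes.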

2606.19537 2026-06-19 cs.MA cs.DC Cross-list

Mesh Inference: A Formal Model of Collective Intelligence Without a Center

Hongwei Xu

AI summary: Proposes a formal model of mesh inference in which a coupled free energy enables center-free collaborative inference among multiple agents. Proves unique convergence, identification completeness, and the observation-only property, and analyzes the latency cost in the linear-Gaussian case.

Comments 21 pages, 2 figures

Abstract

We present a formal model of mesh inference: how a population of independent agents, each holding private state and exchanging only admitted, typed observations, derives a conclusion none of them holds alone, with no central coordinator and no agent exposed. No agent shares weights, gradients, or hidden state, and the agents may span different teams, networks, and organizations. Motivated by the observation that asking a model is energy-minimizing inference, we model the mesh as a coupled free energy that each agent relaxes locally. We show that a single admission/emission policy governs three properties. First, mesh inference converges to a unique answer for any admission, symmetric or not, because the coupling is always an M-matrix. Second, it is identification-complete: it derives the centralized optimum exactly when the contributing views are carrier-connected. Third, it is observation-only: no node transmits its internals, and confidentiality is the dual of identification. Content-addressed lineage is the only global side-channel. In the linear-Gaussian regime every derived answer is determined, hence equal to the centralized optimum, at O(diam^2) latency, the measured price of removing the center. One such derivation is one turn of a center-free learning loop, which we formalize as architecture rather than prove. The open problem we state is when asking improves the collective rather than corrupting it: whether the non-linear closure derives an upgraded answer or a confident error. To our knowledge, this is the first formal model of mesh inference.

2606.20496 2026-06-19 math.NA cs.DC cs.MS cs.NA cross-list

Coarse Solvers for Exascale Solution of Poisson Problems

用于泊松问题百亿亿次求解的粗网格求解器

Thilina Ratnayaka, Paul Fischer, Luke Olson

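AI summary Proposes a two-level Schwarz method to replace algebraic multigrid (AMG) as the coarse-grid solver of a p-multigrid preconditioner; a structured, non-nested coarse space makes interpolation communication-free, and experiments on the Summit/Frontier supercomputers show better performance than BoomerAMG.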

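Details
AI-translated abstract

We propose a two-level Schwarz method as an alternative to algebraic multigrid (AMG) for the last-level (coarse-grid) solver of the p-multigrid (pMG) preconditioner for the pressure Poisson equation arising from spectral/finite element discretization of the incompressible Navier-Stokes equations. The proposed Schwarz method comprises a local problem in the original pMG coarse space and a global coarse problem. The main contribution of this paper is a novel, structured, non-nested coarse space for the global coarse problem. The structured nature of the proposed global coarse space makes interpolation between the original p-multigrid coarse space and the global coarse problem communication-free. Through a series of experiments with Nek5000/RS, a suite of highly scalable incompressible Navier-Stokes solvers, on the Summit/Frontier supercomputers at the Oak Ridge Leadership Computing Facility, we demonstrate the effectiveness of the proposed method relative to the state-of-the-art AMG solver BoomerAMG.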

English abstract

We present a two-level Schwarz method as an alternative to the algebraic multigrid (AMG) method used as the last-level (coarse) solver of the p-multigrid (pMG) preconditioner for the pressure Poisson equation resulting from spectral/finite element discretization of the incompressible Navier-Stokes equations. The proposed Schwarz method consists of a local problem in the original pMG coarse space and a global coarse problem. The main contribution of the paper is a novel, structured, non-nested coarse space for the global coarse problem. The structured nature of the proposed global coarse space enables communication-free interpolation between the original p-multigrid coarse space and the global coarse problem. We demonstrate the effectiveness of the proposed method compared to the state-of-the-art AMG solver BoomerAMG through a series of experiments performed using Nek5000/RS, a suite of highly scalable incompressible Navier-Stokes solvers, on the Summit/Frontier supercomputers at the Oak Ridge Leadership Computing Facility.
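For orientation, a textbook two-level additive Schwarz preconditioner can be written as below. This is a generic sketch, not the paper's operator: the symbols $R_i$ (restriction to local problem $i$), $A_i$, $P_0$ (interpolation from the global coarse space), and $A_0$ are assumed notation, and the abstract does not state whether the local and global parts are combined additively or multiplicatively.

```latex
M^{-1} \;=\; \underbrace{\sum_{i=1}^{N} R_i^{T} A_i^{-1} R_i}_{\text{local problems}}
\;+\; \underbrace{P_0 \, A_0^{-1} \, P_0^{T}}_{\text{global coarse problem}},
\qquad A_i = R_i A R_i^{T}, \quad A_0 = P_0^{T} A P_0 .
```

In this notation, a non-nested coarse space means the columns of $P_0$ need not lie in a subspace inherited from the pMG hierarchy, and "communication-free interpolation" would correspond to applying $P_0$ and $P_0^{T}$ using only data local to each rank.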

2606.20344 2026-06-19 quant-ph cs.DC cs.LG cross-list

Quantum ring all-reduce: communication and privacy advantages for distributed learning

量子环全归约:分布式学习的通信与隐私优势

María Gragera Garcés, Lirandë Pira

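AI summary Proposes a quantum ring all-reduce protocol that uses pre-shared entanglement and superdense coding to halve per-link online communication, and uses verified entanglement to achieve information-theoretically secure, composable ε-secure aggregation, obtaining communication and privacy advantages at the same time.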

Comments 23 pages, 1 figure

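Details
AI-translated abstract

Machine learning models have scaled to unprecedented sizes, making training across distributed devices the de facto standard in the field. In this work, we explore how quantum communication can make distributed training both more communication-efficient and information-theoretically private, for classical and quantum learning models alike. Ring all-reduce is the foundational communication primitive for large-scale distributed training. We present a quantum version that uses pre-shared entanglement and superdense coding to reduce per-link online communication by a provably optimal factor of two, without changing the learning model or the gradient computation. Beyond bandwidth, the primitive enables privacy guarantees that are information-theoretically impossible for any classical protocol, achieving composable ε-secure aggregation via verified entanglement at a 2x overhead in GHZ copies. Our hybrid quantum-classical communication architecture brings simultaneous communication and security advantages to large-scale distributed training, whether the learning itself is quantum or classical. Finally, we characterize quantum advantages in gradient conflict detection for server-to-client communication under bandwidth constraints, a setting that arises after ring all-reduce completes, when broadcasting the full gradient to external clients is infeasible. Two variants of the problem exhibit different separations. For margin-based alignment testing ($\textsc{GapIP}_{\tau}$), the quantum advantage is quadratic in the margin parameter: $\widetilde{O}(\tau^{-1}\log P)$ qubits versus $\widetilde{O}(\min(\tau^{-2},P))$ bits. For sign-consistency auditing against a private parameter matching ($\textsc{TieAudit}_{\epsilon}$), the advantage is an exponential separation in communication complexity: $\Omega(\sqrt{P})$ bits, whereas $O(\epsilon^{-2}\log P)$ qubits suffice.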

English abstract

Machine learning models have scaled to unprecedented sizes, making training across distributed devices the de facto standard in the field. In this work, we explore how quantum communications can make distributed training both more communication-efficient and information-theoretically private, for both classical and quantum learning models. Ring all-reduce is the foundational communication primitive for large-scale distributed training. We present a quantum version that reduces per-link online communication by a provably optimal factor of two using pre-shared entanglement and superdense coding, without requiring the learning model or gradient computation to change. Beyond bandwidth, the primitive enables privacy guarantees that are information-theoretically impossible for any classical protocol, achieving composable ε-secure aggregation, via verified entanglement, at a 2x overhead in GHZ copies. Our hybrid quantum-classical communication architecture yields simultaneous communication and security advantages for large-scale distributed training, regardless of whether the learning itself is quantum or classical. Finally, we characterise quantum advantages in gradient conflict detection for server-to-client communication under bandwidth constraints, a setting that arises after ring all-reduce is completed, when full gradient broadcast to external clients is infeasible. Two variants of the problem admit different separations. For margin-based alignment testing ($\textsc{GapIP}_{\tau}$), the quantum advantage is quadratic in the margin parameter: $\widetilde{O}(\tau^{-1}\log P)$ qubits versus $\widetilde{O}(\min(\tau^{-2},P))$ bits. For sign-consistency auditing against a private parameter matching ($\textsc{TieAudit}_{\epsilon}$), the advantage represents an exponential separation in communication complexity: $\Omega(\sqrt{P})$ bits whereas $O(\epsilon^{-2}\log P)$ qubits suffice.
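As background for the primitive being accelerated, the classical ring all-reduce can be simulated in a few lines. This is a minimal sketch of the standard reduce-scatter plus all-gather schedule, not the paper's quantum protocol; the function name, the 32-bit element width, and the final halving line are illustrative assumptions. The halving reflects only the textbook fact that superdense coding carries two classical bits per transmitted qubit when one entangled pair is pre-shared.

```python
def ring_all_reduce(vectors):
    """Simulate classical ring all-reduce; return (results, elements sent per link)."""
    n = len(vectors)
    length = len(vectors[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    size = length // n
    data = [list(v) for v in vectors]
    sent_per_link = 0

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        msgs = [[data[i][k] for k in chunk((i - step) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, val in zip(chunk((i - step) % n), msgs[i]):
                data[dst][k] += val
        sent_per_link += size

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [[data[i][k] for k in chunk((i + 1 - step) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, val in zip(chunk((i + 1 - step) % n), msgs[i]):
                data[dst][k] = val
        sent_per_link += size
    return data, sent_per_link

workers = [[float(w + 1)] * 8 for w in range(4)]   # 4 workers, 8 gradient elements each
reduced, sent = ring_all_reduce(workers)
classical_bits = sent * 32                 # assumption: 32-bit gradient elements
online_qubits = classical_bits // 2        # superdense coding: 2 classical bits per qubit sent
print(reduced[0], sent, classical_bits, online_qubits)
```

Each link carries 2(n-1) chunks of size length/n, which is the per-link volume that the abstract's factor-of-two claim applies to.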

2606.16106 2026-06-19 cs.PF cs.AR cs.DC cross-list

Edge-Inference Governors Need Memory-Clock State

超越CPU-GPU频率:内存时钟和尾部效应对边缘推理延迟估计的影响

Jaehoon Kang

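AI summary Through measurements on an NVIDIA Jetson Orin Nano, finds that the memory clock is a missing dimension, that aggregate miss rates hide burstiness, and that frequency switching has latency; these phenomena fall outside the scope of conventional frequency-aware latency models.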

Comments 20 pages, 13 figures, 11 tables. Code and data: https://github.com/dankang21/jetson-latency-lab ; traces: https://doi.org/10.5281/zenodo.20745228

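Details
AI-translated abstract

Frequency-aware latency estimators enable deadline-aware DVFS for edge ML inference by modeling latency over CPU and GPU frequencies. We conduct a measurement study on an NVIDIA Jetson Orin Nano and show three phenomena outside that modeling scope. (1) The memory clock is a missing dimension: over the realistic upper EMC range (2133->3199 MHz), it shifts median latency by +11% to +48% depending on the workload, and at the highest GPU clock we observe a repeatable non-monotonic case (-9%) for a synthetic L2-resident kernel. A GPU-frequency estimator profiled under one power profile and deployed under another therefore underestimates latency by up to 32%; enumerating the four lockable EMC points repairs most workloads, whereas a parametric 1/f_emc term does not. (2) Aggregate miss rates hide burstiness: at fixed clocks, 100k-cycle runs show knife-edge distributions whose deadline-miss cliff spans about 1 ms, but misses cluster far beyond what independence would predict: at a 0.1% aggregate miss rate, the probability that the next cycle also misses is as high as 74% (740 times the independent baseline). Gaussian mu+3sigma bounds exceed a 0.1% miss target by 13x to 29x, whereas out-of-sample generalized Pareto bounds stay within ~2x across all eight configurations. (3) Frequency switching is not free: per-domain transition stalls are under 100 microseconds, but a new operating point takes 1/5/8 ms (CPU/GPU/EMC) to take effect, which is a large fraction of a typical inference cycle for a per-inference-cycle governor. We release the full measurement harness and discuss implications for next-generation frequency-aware estimators and governors.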

English abstract

Frequency-aware latency estimators let deadline-aware DVFS governors schedule edge ML inference by modeling latency over CPU and GPU clocks, but they cannot observe the memory clock (EMC) -- a missing deployment state that decides whether a governor meets its deadlines and at what energy. We show this with a deployed, measured governor on a Jetson Orin NX: an EMC-blind GPU-only fit misses 25-28% of cycles at tight deadlines, whereas an EMC-aware refit holds misses to at most 1.3% under a 2% QoS miss budget by selecting a budget-feasible clock -- the energy-minimal one for periodic vision (calibrated module-rail power). The failure generalizes across three workload classes -- MobileNetV2, a ViT transformer, and Qwen2.5 LLM token decode (where saturated decode makes the aware policy lower-energy than the infeasible blind choice): a CPUxGPU estimator sends the deployed governor to an infeasible operating point, and only an EMC-aware model identifies the feasible side of the energy frontier. The effect is real and outside the CPUxGPU state abstraction: across two Orin SKUs sharing the same lockable EMC points it shifts median latency by up to ~45%, replicates on both, and survives a fused TensorRT fp16 engine. CPUxGPU models do not absorb it: per-lockable-point EMC tables are needed, a scoped inversion shows monotone assumptions can pick the wrong direction, and clustered misses make aggregate QoS rates understate deployment risk. We release the harness; this complements, not rebuts, the state of the art within its CPUxGPU scope.
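The difference between an EMC-blind and an EMC-aware governor can be sketched with a lookup table. Every number below is made up for illustration (only the 2133 and 3199 MHz EMC points appear in the digest), and the function names are hypothetical; this is not the released harness. The blind predictor reuses latencies profiled at a single EMC setting, while the aware predictor keeps one entry per lockable EMC point.

```python
# Hypothetical profile: (gpu_mhz, emc_mhz) -> (median latency in ms, energy in mJ per inference).
TABLE = {
    (612, 2133): (14.0, 60.0),
    (612, 3199): (10.0, 70.0),
    (918, 2133): (11.5, 75.0),
    (918, 3199): (7.5, 90.0),
}

def blind_latency(gpu_mhz, emc_mhz, profiled_emc=3199):
    """GPU-only estimate: ignores the deployed EMC and reuses the profiled one."""
    return TABLE[(gpu_mhz, profiled_emc)][0]

def aware_latency(gpu_mhz, emc_mhz):
    """Per-lockable-point EMC table lookup."""
    return TABLE[(gpu_mhz, emc_mhz)][0]

def pick(deadline_ms, predict):
    """Energy-minimal operating point whose predicted latency meets the deadline."""
    feasible = [pt for pt in TABLE if predict(*pt) <= deadline_ms]
    return min(feasible, key=lambda pt: TABLE[pt][1]) if feasible else None

deadline = 10.5
blind_choice = pick(deadline, blind_latency)
aware_choice = pick(deadline, aware_latency)
print(blind_choice, TABLE[blind_choice][0] <= deadline)   # blind pick and whether it really meets the deadline
print(aware_choice, TABLE[aware_choice][0] <= deadline)   # aware pick and whether it really meets the deadline
```

In this toy table the blind governor selects the lowest-energy point because it believes every point is feasible, and that point misses the deadline in practice; the aware governor pays slightly more energy for a point that is actually feasible.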

2606.04101 2026-06-19 cs.DC cs.LG updated version

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

UltraEP:在机架级节点上以近最优负载均衡释放MoE训练与推理

Xinming Wei, Chao Jin, Tuo Dai, Yinmin Zhong, Shan Yu, Chengxu Yang, Bingyang Wu, Zili Zhang, Jing Mai, Qianchao Zhu, Zhouyang Li, Yuliang Liu, Guojie Luo

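AI summary Proposes UltraEP, the first exact-load, real-time balancer, which co-designs plan solving and expert-replication communication to rebalance MoE training and inference per microbatch and per layer on rack-scale nodes, reaching 94.3% of the force-balanced ideal throughput.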

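Details
AI-translated abstract

Large-scale expert parallelism (EP) is becoming key to training and serving frontier MoE models, but it also amplifies device-level expert load imbalance, causing compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers periodically redistribute experts based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and prefill serving on rack-scale nodes (RSNs). Building on the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on the critical path, which requires nontrivial co-design of plan solving and expert-replication communication to minimize exposed overhead. To this end, UltraEP reacts eagerly to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged over training and prefill across MoE models from 106B to 671B parameters, UltraEP achieves 94.3% of the force-balanced ideal throughput, a 1.49x improvement over no balancing, while reducing the final inter-rank imbalance from 1.30-4.01 to 1.01-1.04. In addition, we validate UltraEP's scalability and robustness in production MoE training on 2,560 GPUs.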

English abstract

Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Leveraging the extended scale-up connectivity among dozens of GPUs within RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with an efficient quota-driven planner, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. We evaluate UltraEP in a multi-RSN deployment of up to 256 GPUs, using cutting-edge MoE models from 106B to 671B parameters. Averaged across training and serving, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering a 1.49x improvement over no balancing, while reducing the final inter-rank imbalance from 1.30-4.01 to 1.01-1.04.
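To make "quota-driven planning on exact post-gating load" concrete, here is a minimal greedy sketch. It is not UltraEP's planner: the function name, the data layout, and the greedy rule (push surplus tokens of the hottest expert to the currently least-loaded rank) are assumptions for illustration. It does show the basic idea that, once the exact per-expert token counts are known, replicating hot experts can bring every rank down to a per-rank quota.

```python
from math import ceil

def plan_rebalance(placement, load):
    """placement: expert -> home rank; load: expert -> tokens routed this microbatch.
    Returns (transfers, rank_load, quota); each transfer is (expert, src, dst, tokens)."""
    num_ranks = max(placement.values()) + 1
    rank_load = [0] * num_ranks
    for expert, rank in placement.items():
        rank_load[rank] += load[expert]
    quota = ceil(sum(load.values()) / num_ranks)   # per-rank token quota

    transfers = []
    remaining = dict(load)   # tokens of each expert still served on its home rank
    for src in range(num_ranks):
        # Hottest experts first, so few replicas absorb the surplus.
        experts = sorted((e for e in placement if placement[e] == src),
                         key=lambda e: -remaining[e])
        for e in experts:
            while rank_load[src] > quota and remaining[e] > 0:
                dst = min(range(num_ranks), key=lambda r: rank_load[r])
                moved = min(rank_load[src] - quota, remaining[e], quota - rank_load[dst])
                if moved <= 0:
                    break
                transfers.append((e, src, dst, moved))   # replicate e on dst, reroute tokens
                rank_load[src] -= moved
                rank_load[dst] += moved
                remaining[e] -= moved
    return transfers, rank_load, quota

placement = {e: e // 2 for e in range(8)}   # 8 experts, 2 per rank, 4 ranks
load = {0: 900, 1: 100, 2: 50, 3: 50, 4: 200, 5: 100, 6: 100, 7: 100}
transfers, rank_load, quota = plan_rebalance(placement, load)
print(transfers, rank_load, quota)
```

In this example the initial rank loads are 1000, 100, 300, and 200 tokens (max/mean of 2.5), and three replicas of expert 0 are enough to bring every rank to the quota.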

2606.01183 2026-06-19 cs.DC cs.DB cs.DS cs.PF updated version

The World's Fastest Matching Engine Algorithm

世界上最快的撮合引擎算法

Jake Yoon

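AI summary Proposes two data-structure techniques, the Priority-Indicated Node (PIN) and neighbor-aware tree operations, that eliminate the latency of pointer chasing and root-to-leaf search in the order book, achieving sub-microsecond tail latency and throughput of tens of millions of messages per second.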

Comments 20 pages, 5 figures, 7 tables

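Details
AI-translated abstract

Every electronic exchange relies on an order book whose storage layer determines matching latency. The dominant implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chasing traversal to reach the insertion point, and root-to-leaf search to locate the target price level. Under micro-burst conditions, these costs produce tail-latency spikes that degrade market quality exactly when liquidity is most needed. We present two data-structure contributions that eliminate these costs. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, each carrying a per-slot indicator of the entry's global priority. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves the insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. The second addresses a broader inefficiency: balanced search trees perform a root-to-leaf search on every insertion and deletion, even when the caller already knows the key's in-order neighbors, as in ordered event streams, incremental index maintenance, and electronic trading. Neighbor-aware insertion and deletion use the known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, and apply uniformly to red-black trees, AVL trees, and B/B+-tree variants. A single CPU core sustains 32 million order messages per second at sub-microsecond tail latency under micro-bursts of millions of messages per second, 5-11 times faster than the best open-source matching engine on the same hardware. Scaled out to a single 96-core instance, the engine sustains 640 million messages per second across 10,000 symbols.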

English abstract

A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
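The neighbor-aware insertion idea rests on a simple binary-search-tree fact: if `pred` and `succ` are adjacent in in-order and a new key falls between them, then either `pred` has no right child or `succ` has no left child, so the new node can be attached with one reference write. The sketch below shows only that attachment step on an unbalanced tree; the class and function names are hypothetical, and the single-path rebalancing described in the abstract is deliberately omitted.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert_between(pred, succ, key):
    """Attach `key` given its in-order neighbors, with no root-to-leaf search.
    Pass pred=None for a new minimum and succ=None for a new maximum."""
    node = Node(key)
    if pred is not None and pred.right is None:
        pred.right = node            # free slot directly to the right of the predecessor
    else:
        # pred has a right subtree (or is absent), so succ is the leftmost node
        # of that subtree (or the tree minimum) and its left slot must be free.
        assert succ is not None and succ.left is None
        succ.left = node
    return node

def in_order(root):
    return [] if root is None else in_order(root.left) + [root.key] + in_order(root.right)

root = Node(50)                           # price levels as keys
n30 = insert_between(None, root, 30)      # new minimum: successor is 50
n40 = insert_between(n30, root, 40)       # between 30 and 50
n45 = insert_between(n40, root, 45)       # between 40 and 50
n35 = insert_between(n30, n40, 35)        # between 30 and 40
print(in_order(root))
```

The caller is assumed to hold the neighbor references already, which the abstract says is free in electronic trading; obtaining them by searching would defeat the purpose.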

2601.11646 2026-06-19 cs.DC cs.FL updated version

A Forward Simulation-Based Hierarchy of Linearizable Concurrent Objects

基于前向模拟的可线性化并发对象层次结构

Chao Wang, Ruijia Li, Yang Zhou, Peng Wu, Yi Lv, Jianwei Liao, Jim Woodcock, Zhiming Liu

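AI summary Systematically studies linearizable objects through the forward simulation relation, proves that the sets of linearizable objects satisfying different liveness conditions form bounded semilattices or lattices, and proposes a forward-simulation-based equivalent characterization and universal constructions for verifying linearizability.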

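Details
AI-translated abstract

This paper systematically studies the connection between linearizable objects and forward simulation. We prove that the sets of linearizable objects satisfying wait-freedom (resp., lock-freedom or obstruction-freedom) form a bounded join-semilattice under the forward simulation relation, while the set of linearizable objects without liveness constraints forms a bounded lattice under the same relation. Forward simulation is therefore not only a proof technique for linearizability but also induces an algebraic hierarchy of linearizable objects. As part of the lattice result, we propose an equivalent characterization of linearizability by reducing linearizability checking with respect to a sequential specification $Spec$ to forward-simulation checking with respect to a wait-free universal construction $\mathcal{U}_{Spec}^{WF}$. We also propose the object $\mathcal{U}_{Spec}^s$, which simplifies $\mathcal{U}_{Spec}^{WF}$ and is better suited to verification. We prove that the Herlihy-Wing queue is simulated by $\mathcal{U}_{Queue}^s$, where $Queue$ is the sequential specification of a queue. Our object $\mathcal{U}_{Spec}^s$ can therefore be used to verify linearizability. To demonstrate forward simulation relations between concrete linearizable objects, we prove that the time-stamped queue simulates the Herlihy-Wing queue, whereas the Herlihy-Wing queue cannot simulate the time-stamped queue. All three proofs have been machine-verified in Isabelle/HOL.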

English abstract

In this paper, we systematically investigate the connection between linearizable objects and forward simulation. We prove that the sets of linearizable objects satisfying wait-freedom (resp., lock-freedom or obstruction-freedom) form a bounded join-semilattice under the forward simulation relation, and that the sets of linearizable objects without liveness constraints form a bounded lattice under the same relation. Thus, forward simulation is not only a proof technique for linearizability but also induces an algebraic hierarchy of linearizable objects. As part of our lattice result, we propose an equivalent characterization of linearizability by reducing linearizability checking w.r.t. a sequential specification $Spec$ to forward-simulation checking w.r.t. a wait-free universal construction $\mathcal{U}_{Spec}^{WF}$. We also propose an object $\mathcal{U}_{Spec}^s$, which simplifies $\mathcal{U}_{Spec}^{WF}$ and is more suitable for verification. We prove that the Herlihy-Wing queue is simulated by $\mathcal{U}_{Queue}^s$, with $Queue$ the sequential specification of the queue. Thus, our object $\mathcal{U}_{Spec}^s$ can be used in the verification of linearizability. To demonstrate the forward simulation relation between concrete linearizable objects, we prove that the time-stamped queue simulates the Herlihy-Wing queue, while the Herlihy-Wing queue cannot simulate the time-stamped queue. All three proofs have been machine-verified in Isabelle/HOL.
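For readers new to the property being verified, linearizability with respect to a sequential specification can be checked by brute force on tiny histories. The sketch below is a definition-level checker for a FIFO queue, not the forward-simulation technique of the paper; the history encoding `(call_time, return_time, op, value)` and the function name are assumptions. It searches for a total order of the operations that respects real-time precedence and is a legal sequential queue execution.

```python
from itertools import permutations

def is_linearizable(history):
    """history: list of (call, ret, op, value) with op in {'enq', 'deq'}."""
    n = len(history)
    for order in permutations(range(n)):
        rank = {op: k for k, op in enumerate(order)}
        # Real-time order: if a returned before b was called, a must come first.
        if any(history[a][1] < history[b][0] and rank[a] > rank[b]
               for a in range(n) for b in range(n)):
            continue
        queue, legal = [], True
        for i in order:
            _, _, op, value = history[i]
            if op == "enq":
                queue.append(value)
            elif not queue or queue.pop(0) != value:
                legal = False        # deq returned something a FIFO queue could not
                break
        if legal:
            return True
    return False

# enq(1) finishes before enq(2) starts, yet deq returns 2: no valid order exists.
sequential = [(0, 1, "enq", 1), (2, 3, "enq", 2), (4, 5, "deq", 2)]
# enq(1) and enq(2) overlap, so enq(2) may take effect first.
concurrent = [(0, 3, "enq", 1), (1, 2, "enq", 2), (4, 5, "deq", 2)]
print(is_linearizable(sequential), is_linearizable(concurrent))
```

The search is factorial in the history length, which is one reason proof techniques such as forward simulation are used for real objects instead of enumeration.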

2510.01565 2026-06-19 cs.LG cs.DC updated version

TetriServe: Efficiently Serving Mixed DiT Workloads

TetriServe: 高效服务混合DiT工作负载

Runyu Lu, Shiqi He, Wenxuan Tan, Shenggui Li, Ruofan Wu, Jeff J. Ma, Ang Chen, Mosharaf Chowdhury

Affiliations * University of Michigan, University of Wisconsin-Madison, Nanyang Technological University

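AI summary For heterogeneous DiT workloads with mixed resolutions and deadlines, proposes TetriServe, a system built on step-level sequence parallelism that uses round-based scheduling and adaptive parallelism degrees to raise SLO attainment by up to 32% without degrading image quality.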

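Details
AI-translated abstract

Diffusion Transformer (DiT) models generate high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging because of their high computational cost, especially at large resolutions. Existing serving systems use a fixed degree of sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines and leads to low GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism, which dynamically adjusts the degree of parallelism of an individual request according to its deadline. We present TetriServe, a DiT serving system that implements this strategy for efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment by (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level while minimizing GPU-hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment than existing solutions without degrading image quality.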

English abstract

Diffusion Transformer (DiT) models excel at generating high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging due to their high computational cost, particularly at larger resolutions. Existing serving systems use fixed-degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism to dynamically adjust the degree of parallelism of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment by (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level and minimizing GPU hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality.
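The step-level idea can be sketched as a per-round decision: for each request, pick the smallest sequence-parallel degree that still finishes its remaining denoising steps before the deadline, then admit requests in deadline order until the GPUs run out. This is a simplified greedy sketch, not TetriServe's scheduler (which packs requests jointly); the step-time profile, names, and numbers are hypothetical. The smallest feasible degree is preferred because, in this assumed profile, scaling is sublinear, so GPU-seconds per step grow with the degree.

```python
# Hypothetical profile: seconds per denoising step at each sequence-parallel degree.
STEP_TIME = {1: 0.40, 2: 0.22, 4: 0.13, 8: 0.09}

def min_degree(steps_left, time_left):
    """Smallest degree that finishes all remaining steps before the deadline."""
    for degree in sorted(STEP_TIME):
        if steps_left * STEP_TIME[degree] <= time_left:
            return degree
    return None   # deadline cannot be met even at the highest degree

def schedule_round(requests, num_gpus):
    """requests: list of (name, steps_left, time_left). Earliest deadline first."""
    plan, free = {}, num_gpus
    for name, steps_left, time_left in sorted(requests, key=lambda r: r[2]):
        degree = min_degree(steps_left, time_left)
        if degree is not None and degree <= free:
            plan[name] = degree
            free -= degree
    return plan

requests = [("a", 20, 3.0), ("b", 10, 5.0), ("c", 30, 4.5)]
plan = schedule_round(requests, num_gpus=8)
print(plan)
```

Request `b` has enough slack to run at degree 1, but the two tighter requests consume all eight GPUs in this round, so it waits; re-running the decision every round is what lets the degree change as slack shrinks.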

2507.19712 2026-06-19 cs.DC cs.AI cs.GT cs.LG cs.NI updated version

Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Oranits: 基于Open RAN的智能交通系统中的任务分配与卸载——元启发式与深度强化学习方法

Ngoc Hung Nguyen, Nguyen Van Thieu, Quang-Trung Luu, Anh Tuan Nguyen, Senura Wanasekara, Nguyen Cong Luong, Fatemeh Kavehmadavani, Van-Dinh Nguyen

Affiliations * Department of Smart City, Hanyang University

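AI summary Proposes the Oranits system model, which accounts for mission dependencies and offloading costs under vehicle cooperation and optimizes them with the metaheuristic algorithm CGG-ARO and the deep reinforcement learning framework MA-DDQN, improving the overall benefit by about 7.7% and 12.5%, respectively.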

Comments 16 pages, 13 figures

Journal ref IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2026

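Details
AI-translated abstract

This paper studies mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), in which autonomous vehicles use mobile edge computing for efficient processing. Existing studies often overlook the intricate dependencies between missions and the cost of offloading tasks to edge servers, which leads to suboptimal decisions. To close this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To this end, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, the Chaotic Gaussian-based Global ARO (CGG-ARO), as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, the Multi-agent Double Deep Q-Network (MA-DDQN), which integrates multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations show that CGG-ARO improves the number of completed missions and the overall benefit by approximately 7.1% and 7.7%, respectively. MA-DDQN achieves larger improvements of 11.0% in mission completions and 12.5% in overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.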

English abstract

In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision-making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, namely the Chaotic Gaussian-based Global ARO (CGG-ARO), serving as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, referred to as the Multi-agent Double Deep Q-Network (MA-DDQN), that integrates both multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG-ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA-DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.
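The "Double Deep Q-Network" part of MA-DDQN refers to a standard learning target in which the online network selects the next action and a separate target network evaluates it. The sketch below shows that standard single-agent target only; it is not the paper's multi-agent framework, and the function name and Q-values are illustrative assumptions.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]

# Illustrative Q-values for three candidate assignments in the next state.
next_q_online = [0.2, 0.9, 0.5]   # online network prefers action 1
next_q_target = [0.3, 0.4, 0.8]   # target network's estimates

target = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.5)
vanilla = 1.0 + 0.5 * max(next_q_target)   # plain DQN would take the target net's max
print(target, vanilla)
```

Decoupling selection from evaluation gives a target no larger than the plain DQN target for the same networks, which is the usual motivation for Double DQN: it reduces overestimation of action values.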