arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
cs.DC Distributed Computing (33)
2606.12343 2026-06-11 cs.DC New submission

Fair Comparison of Scheduling Algorithms on Heterogeneous Edge Clusters: A Continuous Adaptive Benchmark

Zihang Wang, Boris Sedlak, Juan Luis Herrera, Schahram Dustdar

AI summary: Proposes an open-source benchmark platform for fairly comparing continuous multi-mode scheduling algorithms on heterogeneous edge clusters. Using a unified interface, a closed-loop workload driver, and dual-metric SLO scoring, it shows that controller rankings depend strongly on configuration and that separating raw SLO from steady-state SLO exposes switching costs.

Abstract

Modern Artificial Intelligence (AI) workloads deployed across the heterogeneous tiers of an edge-cloud continuum must satisfy multi-dimensional Service Level Objectives (SLOs) over latency, throughput, and output quality. For each incoming task, the scheduler picks both a target node and a processing mode (e.g., full or reduced inference precision). We call this class of problems Continuous Multi-Mode Scheduling (CMMS). Comparing CMMS algorithms fairly is difficult because prior studies typically evaluate each controller in its own stack, under a single workload, and without reporting per-decision overhead. To close these gaps, we present an open-source benchmark platform that features (i) a unified controller interface, (ii) a closed-loop workload driver covering multiple workload patterns, and (iii) dual-metric SLO scoring that reports raw SLO (overall compliance) and steady-state SLO (compliance during stable operation) separately. Running six controllers across five cluster configurations and two load regimes (424 episodes), we find that controller rankings are strongly configuration-dependent: a deep reinforcement-learning winner under light workloads loses to a rule-based heuristic by nearly 29 percentage points once load intensifies, at roughly 500$\times$ the per-decision operational overhead. We further show that separating raw from steady-state SLOs exposes switching costs that a single aggregate score would otherwise conflate.
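
The dual-metric scoring idea can be illustrated with a small Python sketch. The abstract does not say how "stable operation" is delimited, so the code assumes, purely for illustration, that a fixed number of decisions starting at each processing-mode switch counts as transient. The names `Decision`, `slo_scores`, and `settle` are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    mode: str      # processing mode chosen for the task, e.g. "full" or "reduced"
    slo_met: bool  # whether the task satisfied every SLO dimension


def slo_scores(decisions, settle=2):
    """Return (raw_slo, steady_state_slo) as fractions in [0, 1].

    Assumption (not from the paper): the `settle` decisions starting at
    each mode switch are transient and excluded from the steady-state score.
    """
    raw_hits = steady_hits = steady_total = 0
    cooldown = 0
    prev_mode = None
    for d in decisions:
        if prev_mode is not None and d.mode != prev_mode:
            cooldown = settle          # a switch opens a transient window
        prev_mode = d.mode
        raw_hits += d.slo_met          # raw SLO counts every decision
        if cooldown > 0:
            cooldown -= 1              # inside the transient window: skip
        else:
            steady_total += 1
            steady_hits += d.slo_met
    raw = raw_hits / len(decisions) if decisions else 0.0
    steady = steady_hits / steady_total if steady_total else 0.0
    return raw, steady
```

A controller that violates SLOs only right after switching modes would score lower on raw SLO than on steady-state SLO under this definition, and that gap is the kind of switching cost a single aggregate score would hide.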

2606.12246 2026-06-11 cs.DC cs.IR New submission

Efficient and Robust Online Learning to Rank in Decentralized Systems

Marcel Gregoriadis, Martijn de Vos, Sayan Biswas, Anne-Marie Kermarrec, Johan Pouwelse

AI summary: Proposes the RankGuard framework, in which users exchange model updates directly and use their private click histories to defend against poisoning attacks. Gives the first convergence proof for decentralized online learning to rank, with up to 62x higher efficiency.

Abstract

In Online Learning to Rank (OLTR), ranking models are trained directly from live user interactions, but existing systems rely on a trusted central server to collect and process these interactions. This leaves operators free to introduce biases that conflict with user interests. Decentralized learning offers an attractive alternative, allowing users to collaboratively train a shared ranking model by exchanging model updates directly with one another, without any central authority. In such settings, however, malicious nodes can send poisoned model updates that degrade the ranking quality of honest nodes. We introduce RankGuard, a decentralized OLTR framework in which users collaboratively train ranking models and exchange model updates directly with other nodes. RankGuard defends against poisoning attacks by carefully evaluating incoming models against the user's own private click history, corrected for position bias. An incoming model is only aggregated if it better explains the user's past interactions than the current local model, making it fundamentally hard for malicious nodes to craft updates that pass this test without also genuinely helping the user. We derive a theoretical convergence guarantee of RankGuard. To the best of our knowledge, this is the first formal convergence analysis of a decentralized OLTR algorithm. We evaluate RankGuard against four poisoning attacks, including a powerful adaptive attack, using four standard benchmarks and three click models. RankGuard outperforms all baselines in most settings while being up to 62x more efficient than its closest competitors.
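
The acceptance rule described above (aggregate an incoming model only if it explains the user's own click history better than the local model) can be sketched as follows. This is not RankGuard's actual test: the linear scoring model, the inverse-propensity weighting with an assumed `1 / (rank + 1) ** eta` examination model, the reciprocal-rank utility, and the plain averaging step are all illustrative assumptions, and every function name is hypothetical.

```python
def propensity(rank, eta=1.0):
    """Assumed probability that a user examines the displayed 0-based rank."""
    return 1.0 / (rank + 1) ** eta


def score(weights, features):
    """Linear ranking score of one document."""
    return sum(w * x for w, x in zip(weights, features))


def explained_clicks(weights, history):
    """Inverse-propensity-weighted reciprocal rank of clicked documents.

    history: list of (docs, clicked_idx), where docs holds the feature
    vectors in displayed order and clicked_idx is the clicked position.
    """
    if not history:
        return 0.0
    total = 0.0
    for docs, clicked in history:
        scores = [score(weights, d) for d in docs]
        # Rank the model would give the clicked document (0 = top).
        model_rank = sum(1 for s in scores if s > scores[clicked])
        total += (1.0 / propensity(clicked)) / (model_rank + 1)
    return total / len(history)


def maybe_aggregate(local, incoming, history):
    """Average in the incoming model only if it beats the local one."""
    if explained_clicks(incoming, history) > explained_clicks(local, history):
        return [(a + b) / 2 for a, b in zip(local, incoming)]
    return local
```

Because the test runs against click data that never leaves the user's device, a sender cannot check in advance whether a crafted update will pass.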

2606.12103 2026-06-11 cs.DC New submission

The PM-EdgeMap: Towards Real-Time Process Mining on the Edge-Cloud Continuum

Hendrik Reiter, Christian Imenkamp, Olaf Landsiedel, Andrea Maldonado, Patrick Rathje, Wilhelm Hasselbring

AI summary: Proposes the PM-EdgeMap framework for real-time process mining on the edge-cloud continuum, validates its feasibility with an edge-based conformance checking algorithm, and improves autonomous control in smart factories.

Abstract

Smart factories are evolving into Cyber-Physical Systems (CPS), demanding increased autonomy. This necessitates real-time decision making, facilitated by insights derived from sensor data. Process mining offers a valuable approach to gain such insights and guide actions. The edge computing paradigm supports this real-time requirement by enabling network communication between sensors and leveraging nearby computing resources. This paper investigates the implications of performing real-time process mining algorithms on the edge. Within this paper, we first propose a formalism to describe relevant datasets and the computing topology. We then evaluate the edge computing approach through a case study involving an edge-based conformance checking algorithm. The results demonstrate the feasibility and benefits of edge-based real-time process mining for enhanced autonomous control in smart factories.

2606.11974 2026-06-11 cs.DS cs.DC New submission

Near-Optimal Distributed 2-Ruling Sets on Graphs with Low Arboricity

Malte Baumecker, Rustam Latypov, Yannic Maus, Jara Uitto

AI summary: For low-arboricity graphs, presents a nearly optimal randomized algorithm in the LOCAL model that computes a 2-ruling set w.h.p. in O(log log n) rounds, an exponential improvement that nearly matches the lower bound.

Abstract

Given a graph $G=(V,E)$, a $\beta$-ruling set is a subset of nodes $S\subseteq V$ that is independent, and each node in $V$ is at distance at most $\beta$ from some node in $S$. In this paper, we present almost optimal distributed algorithms for finding $2$-ruling sets in the classical LOCAL model. Our main contribution is a randomized algorithm that w.h.p. computes a $2$-ruling set on any $n$-node graph with bounded arboricity in $O(\log \log n)$ rounds. In fact, the algorithm works up to arboricity $O(\log\log n)$, improves exponentially over the prior state of the art that can be achieved by combining [Barenboim, Elkin, Pettie, Schneider; JACM'16], [Ghaffari; SODA'16], and [Bisht, Kothapalli and Pemmaraju; PODC'14], and nearly matches the lower bound of $\Omega(\log \log n / \log \log \log n)$ [Balliu, Brandt, Kuhn, Olivetti; FOCS'20]. The domination parameter $\beta=2$ is optimal for algorithms with runtime $\log^{o(1)}n$: on graphs with arboricity $2$, there is a lower bound of $\Omega(\sqrt{\log n})$ rounds for MIS (i.e., $\beta = 1$) [Khoury, Schild; FOCS'25]. Additionally, we obtain improved algorithms for larger arboricity. For general graphs with arboricity $\alpha$, we present a randomized algorithm that computes a $2$-ruling set in $\widetilde{O}(\log^{5/8} \alpha +\log^{5/3} \log n)$ rounds. This improves exponentially over the state of the art for a large range of non-constant arboricity. Our techniques extend beyond distributed computing. We present an $O(\log \log \log n)$-round algorithm in the low-space Massively Parallel Computation (MPC) model that w.h.p. computes a $2$-ruling set on any graph with arboricity up to $2^{\mathrm{poly}(\log \log n)}$, improving exponentially over the state of the art from [Kothapalli, Pai, Pemmaraju; FSTTCS'20] combined with [Fischer, Giliberti, Grunau; SPAA'23].
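
The definition of a $\beta$-ruling set in the first sentence translates directly into a checker. The sketch below is a centralized verifier, not the distributed algorithm from the paper. It assumes an undirected graph given as an adjacency dictionary, and the function name `is_ruling_set` is chosen here for illustration.

```python
from collections import deque


def is_ruling_set(adj, S, beta):
    """Check that S is independent and every node is within distance beta of S.

    adj: dict mapping each node to an iterable of its neighbours (undirected).
    """
    S = set(S)
    # Independence: no two nodes of S may be adjacent.
    for u in S:
        if any(v in S for v in adj[u]):
            return False
    # Domination: multi-source BFS from S, truncated at depth beta.
    dist = {u: 0 for u in S}
    queue = deque(S)
    while queue:
        u = queue.popleft()
        if dist[u] == beta:
            continue  # do not expand beyond distance beta
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return all(v in dist for v in adj)
```

On the path 0-1-2-3-4, the set {2} is a 2-ruling set but not a 1-ruling set. A 1-ruling set is exactly a maximal independent set (MIS), which is why the abstract contrasts $\beta = 2$ with the MIS lower bound.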

2606.11937 2026-06-11 cs.DC cs.PF New submission

From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX

Alexander Strack, Alexander Van Craen, Dirk Pflüger

AI summary: Using the Cholesky-Bench benchmark, compares four parallel variants of the tiled Cholesky decomposition under the OpenMP and HPX runtimes. HPX outperforms OpenMP by 15%-30% at the optimal tile size, and its asynchronous task overhead is roughly 3.8x lower.

Comments
15 pages, 8 figures, accepted paper at AMTE held in conjunction with PPAM 2026
Abstract

Fork-join parallelism, popularized by OpenMP, remains the dominant model for shared-memory parallel programming, but its implicit synchronization barriers can penalize algorithms with inhomogeneous workloads. Asynchronous many-task (AMT) runtimes sidestep these barriers by expressing work as a dependency graph of fine-grained tasks. Yet, the actual performance benefit over a carefully written fork-join baseline is rarely quantified. In this work, we introduce Cholesky-Bench and use it to revisit the tiled Cholesky decomposition, a canonical irregular kernel, comparing four parallelization variants of the right-looking algorithm across two runtimes: the OpenMP implementations shipped with GCC and LLVM, and the HPX AMT runtime. The variants span classical fork-join, a collapsed fork-join that exposes additional inner-loop parallelism, synchronous tasking, and asynchronous tasking with explicit data dependencies. We benchmark all eight combinations on a dual-socket 128-core AMD Zen 2 node across multiple tile sizes and problem sizes. Our results show that across all variants, HPX outperforms OpenMP at the optimal tile size by 15%-30%. Specifically, asynchronous HPX tasks are up to 26% faster than their OpenMP counterparts, and exhibit roughly 3.8x smaller task overhead. Furthermore, the collapsed fork-join variants close most of the gap to synchronous tasking. Removing redundant synchronization barriers yields an additional improvement of 7% (OpenMP) to 14% (HPX). A GCC-versus-LLVM comparison further reveals compiler-specific differences in fork-join scheduling and task-creation overheads.
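
For readers unfamiliar with the kernel, here is a minimal sequential sketch of a right-looking tiled Cholesky factorization in pure Python. It is not the Cholesky-Bench code. The function names follow common BLAS/LAPACK naming (`potrf`, `trsm`) but are chosen here for illustration, and the sketch assumes the matrix size is a multiple of the tile size `b`. The comments mark the two loops whose iterations are mutually independent. A fork-join version would typically end each of them with a barrier, whereas a task-based version could let a tile's update start as soon as its own inputs are ready.

```python
import math


def potrf(A, k0, b):
    """Unblocked Cholesky of the diagonal tile at (k0, k0), in place."""
    for j in range(k0, k0 + b):
        A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k0, j)))
        for i in range(j + 1, k0 + b):
            s = sum(A[i][p] * A[j][p] for p in range(k0, j))
            A[i][j] = (A[i][j] - s) / A[j][j]


def trsm(A, i0, k0, b):
    """Solve X * L^T = A_ik for tile (i0, k0); L is the factored tile (k0, k0)."""
    for i in range(i0, i0 + b):
        for j in range(k0, k0 + b):
            s = sum(A[i][p] * A[j][p] for p in range(k0, j))
            A[i][j] = (A[i][j] - s) / A[j][j]


def update(A, i0, j0, k0, b):
    """Trailing update A_ij -= A_ik * A_jk^T (SYRK if i0 == j0, else GEMM)."""
    for i in range(i0, i0 + b):
        for j in range(j0, j0 + b):
            A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k0, k0 + b))


def tiled_cholesky(A, b):
    """Overwrite the lower triangle of SPD matrix A with its Cholesky factor."""
    n = len(A)
    assert n % b == 0, "sketch assumes n is a multiple of the tile size"
    for k0 in range(0, n, b):
        potrf(A, k0, b)
        for i0 in range(k0 + b, n, b):       # panel: independent TRSM tiles
            trsm(A, i0, k0, b)
        for i0 in range(k0 + b, n, b):       # trailing matrix: independent tiles
            for j0 in range(k0 + b, i0 + b, b):
                update(A, i0, j0, k0, b)
    return A
```

Only the lower triangle of the result is meaningful; entries above the diagonal are left in an intermediate state.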

2606.11867 2026-06-11 cs.DC New submission

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao, Tong Zhao, Xupeng Miao, Jie Jiang, Fangcheng Fu, Bin Cui

AI summary: To address micro-step-level load fluctuations in RL post-training of MoE models, proposes ForeMoE, which uses foreseeable routing information from the rollout stage to proactively guide load balancing and employs a hierarchical planner and a transfer engine for micro-step-level reconfiguration, achieving up to 1.45x speedup on 64 GPUs.

Abstract

Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training by relying on historical step-level statistics. However, these methods fail under the unique workload dynamics of RL post-training: the step-level load is stable, but the tiny batch sizes processed during micro-steps cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update) by using foreseeable routing information from the rollout stage to proactively guide load balancing in the remaining stages. To support frequent per-micro-step reconfiguration, ForeMoE employs a hierarchical planner that decomposes the NP-hard load balancing problem into tractable sub-components, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves up to a 1.45$\times$ speedup over state-of-the-art RL post-training systems.

2606.11824 2026-06-11 cs.DC New submission

Optimizing Cloud Deployment: Blending of IaaS and FaaS for Microservice Architecture

Nikhil Kapoor, Sougata Mukherjea

AI summary: Proposes a metrics-driven approach in which an automated framework analyzes performance metrics to migrate microservices from pure IaaS to a hybrid IaaS + FaaS model, optimizing resource utilization and scalability.

Abstract

The rapid evolution of cloud computing has resulted in the adoption of hybrid deployments that blend Infrastructure-as-a-Service (IaaS) and Function-as-a-Service (FaaS) service models to optimize resource utilization, scalability, and operational efficiency. This paper presents a comprehensive study and practical implementation of a metrics-driven approach for migrating microservices from a traditional IaaS service model to a hybrid IaaS + FaaS model, using two microservice applications as case studies. The research develops an automated framework to analyze service-level performance metrics to identify microservices that are best suited for serverless execution. The findings of our research highlight the benefits and limitations of different cloud service models and provide a scalable and replicable automated methodology for optimized deployment of cloud-native applications.

2606.11778 2026-06-11 cs.DC New submission

Consensus Time in 3-Majority and 2-Choices Is Determined by the Maximum Initial Opinion Density

Niccolò D Archivio (COATI, I3S, UniCA)

AI summary: Studies the consensus time of the 3-Majority and 2-Choices dynamics on the complete graph in the synchronous model, shows that it is determined by the maximum initial opinion density, and gives tight bounds.

Abstract

We establish the correct parameter governing the convergence time of the 3-Majority and 2-Choices dynamics on the complete graph in the synchronous model. Recent work [Shimizu and Shiraga, PODC'25] provides matching upper and lower bounds on the number of rounds to consensus, but only in a weak sense: the bounds are shown to coincide for some initial opinion configuration. In contrast, we obtain tight bounds in a strong sense, with upper and lower bounds matching up to logarithmic factors for every initial configuration. Let $\alpha(0)$ be the initial opinion-frequency vector, and denote by $\|\alpha(0)\|_\infty$ its maximum entry. We show that 3-Majority reaches consensus in $\Theta(\min\{\|\alpha(0)\|_\infty^{-1}, \sqrt{n}\})$ rounds w.h.p., while 2-Choices reaches consensus in $\Theta(\|\alpha(0)\|_\infty^{-1})$ rounds w.h.p. Our results demonstrate that the convergence time of both dynamics is governed not by global parameters such as the number of opinions $k$ or the squared $\ell_2$ norm of the initial opinion distribution, but rather by the "local" parameter $\|\alpha(0)\|_\infty$, the maximum initial opinion density.
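
The two dynamics are easy to simulate, which helps build intuition for why the maximum initial opinion density matters. The sketch below uses the usual textbook definitions, which the abstract does not restate and which are therefore assumptions here: in 3-Majority every node samples three nodes uniformly at random with replacement and adopts the majority opinion among them (the first sample if all three differ), and in 2-Choices every node samples two nodes and adopts their opinion only if they agree. Function names are illustrative.

```python
import random
from collections import Counter


def three_majority_step(opinions, rng):
    """One synchronous round of 3-Majority on the complete graph."""
    n = len(opinions)
    new = []
    for _ in range(n):
        a, b, c = (opinions[rng.randrange(n)] for _ in range(3))
        # If b and c agree they form a majority; otherwise either a matches
        # one of them (a is the majority) or all differ (tie-break to a).
        new.append(b if b == c else a)
    return new


def two_choices_step(opinions, rng):
    """One synchronous round of 2-Choices on the complete graph."""
    n = len(opinions)
    new = []
    for own in opinions:
        a = opinions[rng.randrange(n)]
        b = opinions[rng.randrange(n)]
        new.append(a if a == b else own)  # keep own opinion on disagreement
    return new


def rounds_to_consensus(opinions, step, seed=0, max_rounds=100_000):
    """Number of rounds until all nodes agree, or None if the cap is hit."""
    rng = random.Random(seed)
    for t in range(max_rounds):
        if len(set(opinions)) == 1:
            return t
        opinions = step(opinions, rng)
    return None


def max_density(opinions):
    """Largest entry of the opinion-frequency vector."""
    return max(Counter(opinions).values()) / len(opinions)
```

Running `rounds_to_consensus` on initial configurations with the same number of opinions but different `max_density` values gives an empirical feel for the dependence stated in the abstract. A simulation of this kind does not, of course, establish the bounds.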

2606.11736 2026-06-11 cs.CR cs.DC cs.ET New submission

MHOT: Height-Optimized Authenticated Data Structure for Blockchain State Commitment

Sipeng Xie, Qianhong Wu, Minghang Li, Qiyuan Gao, Bo Qin, Qin Wang

AI summary: To address tree-height growth and Nurgle attacks in the Merkle Patricia Trie, proposes MHOT, which indexes by discriminative bits to achieve adaptive fanout and minimal height and introduces hierarchical proofs to reduce proof overhead, achieving 9x higher write throughput and a 0% attack success rate on Ethereum mainnet workloads.

Comments
Usenix Sec'26
Abstract

State root computation dominates blockchain block processing time (78%). Ethereum's canonical authenticated data structure, the Merkle Patricia Trie (MPT), suffers from severe tree-height growth and is vulnerable to Nurgle attacks (SP'24), where adversaries inflate path depth via hash collisions and degrade system performance at negligible cost. Existing defenses increase node fanout (span) to bound tree height, but higher span inflates proof size exponentially. Prior work mitigates this trade-off using vector commitments, at the cost of trusted setup or expensive verification. We present MHOT, a height-optimal authenticated data structure for blockchain state commitment that preserves standard hash-based verification without trusted setup. Unlike MPT's fixed-prefix indexing, which couples span and fanout exponentially, MHOT indexes by discriminative bits that actually distinguish keys, achieving adaptive span with linear fanout coupling and provably minimal height. To prevent high fanout from inflating proofs, we introduce hierarchical proofs, a two-layer Merkle construction that reduces per-node proof overhead from O(k) to O(log k). On Ethereum mainnet workloads, MHOT achieves up to 9x higher write throughput, 4x lower write amplification, and 2x smaller proofs than MPT. Under Nurgle attacks, even when the adversary consumes an entire block's gas budget, MHOT maintains a 0% attack success rate (vs. 99.97% for MPT). Our results, somewhat surprisingly, show that height optimality (not new crypto primitives!) is the key abstraction for scalable and attack-resilient blockchain state commitment.
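
The O(k) versus O(log k) contrast for per-node proofs can be made concrete. If a node with k children commits to them with one flat hash over all child hashes, proving one child requires the other k - 1 hashes. If the children are instead arranged under a small binary Merkle tree inside the node, a proof needs only about log2(k) sibling hashes. The sketch below shows that generic inner-tree idea with SHA-256. It is an illustration under those assumptions, not MHOT's actual construction, and the function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root_and_proof(leaves, index):
    """Return (root, proof) for leaves[index].

    proof is a list of (sibling_hash, sibling_is_left) pairs, bottom-up.
    Odd-sized levels duplicate their last entry (a simplification).
    """
    level = list(leaves)
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1                      # neighbour within the pair
        proof.append((level[sibling], sibling < index))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify(leaf, proof, root):
    """Recompute the root from a leaf and its sibling path."""
    acc = leaf
    for sibling, is_left in proof:
        acc = h(sibling + acc) if is_left else h(acc + sibling)
    return acc == root
```

For a node with 16 children, the proof returned here contains 4 sibling hashes, compared with 15 for a flat hash over all children.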

2606.11690 2026-06-11 cs.DC cs.PF New submission

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

Chitral Patil

AI summary: Existing cost calculators treat GPU utilization as a fixed input, which causes severe errors. Proposes a concurrency-aware cost estimation method based on the measured request rate λ, open-sources the vllm-cost-meter tool, and shows that cost is underestimated by 2.5-36.3x at low load.

Comments
26 pages, 9 figures. Code: this https URL
Abstract

Every public LLM cost calculator we surveyed treats GPU utilization as a fixed input -- entered by the user, baked in as a preset, or silently assumed at 100% -- never measured against the operator's actual load. We show that this assumption is the dominant source of error: on identical H100 hardware, effective cost spans $0.21 to $15.25 per million output tokens, an underutilization penalty of 2.5-24x across low-to-moderate enterprise loads (1-10 rps) and up to 36.3x near idle -- driven by one operator-controlled variable, offered request rate lambda, which sets in-flight concurrency via Little's Law and which no open-source calculator exposes. Because calculators take utilization as a user-supplied input, any utilization-naive estimate understates true cost by exactly 1/U, systematically mispricing self-hosting -- most severely over-selling it for low-traffic workloads. We propose a measurement methodology that parameterizes the relationship as C_eff = f(H, M, Q, lambda, L), validate it with 42 benchmarks across dense, ultra-sparse MoE, and sparse MoE models, and release vllm-cost-meter, an open-source cost meter that attaches to a live vLLM server and reports real $/M-tokens against the operator's own traffic. We further show that FP8 quantization benefits the MoE architectures we tested roughly 2.2-2.4x more than the dense model (+69 to +74% vs. +31% peak throughput; n=3, broader validation needed), and our data are consistent with active parameter count, not total model size, being a primary predictor of saturation economics. To rule out single-hardware confounding we repeat the core sweep on A100 80GB PCIe (56 runs): the load-driven spread reproduces at 7.0-11.4x, the active-parameters ordering survives at FP8, and the dense-FP8 advantage inverts on silicon without native FP8 tensor cores -- a hardware-conditional caveat the framework already accommodates.
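
The arithmetic behind the argument is short enough to write down. The sketch below shows Little's Law for in-flight concurrency, the conversion from a GPU hourly price and a measured token throughput to dollars per million tokens, and the 1/U factor by which a full-utilization estimate understates cost. It does not reproduce the paper's C_eff model or the vllm-cost-meter tool. The function names are hypothetical, and the prices and throughputs in the usage note are made-up inputs, not measurements from the paper.

```python
def in_flight_concurrency(request_rate_rps, mean_latency_s):
    """Little's Law: mean requests in flight = arrival rate * mean latency."""
    return request_rate_rps * mean_latency_s


def cost_per_million_tokens(gpu_usd_per_hour, tokens_per_second):
    """Effective $/M output tokens at the throughput actually achieved."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def naive_understatement(utilization):
    """Factor by which a 100%-utilization estimate understates true cost."""
    return 1.0 / utilization
```

With a hypothetical GPU price of $2.00 per hour, a server sustaining 1000 tokens/s costs about $0.56 per million tokens, while the same server at 50 tokens/s costs about $11.11, a 20x gap caused by load alone.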

2606.11632 2026-06-11 cs.CR cs.AI cs.DC cs.MA New submission

Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure

Jun He, Deying Yu

AI summary: For high-stakes operations on production resources by non-deterministic reasoning systems in agentic infrastructure, proposes the Sovereign Assurance Boundary (SAB), a certificate-bound runtime admission layer that compiles agent proposals into execution contracts bound to cryptographic evidence, enabling verifiable and revocable authorization control.

Comments
12 pages, 1 figure, 13 tables
Abstract

Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms -- such as identity and access management (IAM), policy engines, consensus protocols, and audit logs -- either enforce static, context-unaware permissions or merely record actions post-execution. This paper introduces the Sovereign Assurance Boundary (SAB), a certificate-bound runtime admission layer for autonomous execution authority. SAB intercepts agent proposals at an assurance airlock, compiles them into typed execution contracts $C$, and binds these contracts to cryptographic evidence digests $H(E)$ and policy versions. The contracts are then routed through consequence-aware certification paths. Upon successful admission, the system emits a signed Sovereign Assurance Certificate ($\Omega$) that is strictly scoped to a specific execution identity, revocation epoch, and validity window. Finally, a sovereign execution broker verifies $\Omega$ and performs fresh pre-execution revocation and drift checks before invoking infrastructure APIs. We detail the airlock-broker architecture, formalize its admission and revocation invariants, and report preliminary feasibility measurements from a Go prototype evaluated over 2,500 admission attempts. Ultimately, this broker-enforced model prevents autonomous reasoning from directly mutating state, transforming delegated execution authority into a cryptographically verifiable, evidence-bound, revocable, and replayable runtime artifact.

2606.11579 2026-06-11 quant-ph cs.DC physics.atm-clus physics.atom-ph physics.chem-ph New submission

Tensor-Network-Based Distributed Quantum Dynamics on Independent Quantum Computers

Anurag Dwivedi, Melissa C. Revelle, Daniel S. Lobser, Brian K. McFarland, Edward C. Tortorici, Christopher G. Yale, Susan M. Clark, Philip Richerme, Srinivasan S. Iyengar

AI summary: Proposes a tensor-network-based distributed quantum computing method that decomposes the multidimensional time-evolution operator into independent lower-dimensional propagations executed asynchronously on heterogeneous quantum-classical architectures. Validated experimentally on a trapped-ion quantum computer by computing the vibrational spectrum of a protonated water cluster to within 4 cm⁻¹.

Abstract

We present an approach based on tensor networks for distributed quantum computing simulation of chemical wavepacket dynamics in a continuous variable representation. The central idea is that the tensor-network representation of the multidimensional time-evolution operator naturally induces an elevated Hilbert space where the dynamics decomposes into a set of independent lower-dimensional propagations. This transformation converts an entangled quantum evolution into a set of parallel computational tasks that can be executed asynchronously across heterogeneous quantum and classical computing architectures. The resulting formalism establishes a direct connection between tensor-network decompositions, uniformly controlled quantum circuits, and asynchronous distributed quantum computing. The approach is developed with a goal towards hybrid quantum/classical implementation, and is appropriate for a general heterogeneous mixture of quantum hardware systems. The experimental realization of the asynchronously distributed quantum processes that arise from the tensor-network decomposition is carried out on the Sandia National Laboratories' trapped-ion quantum computer, where the circuits are compiled using native partial-entangling $XX(\theta)$ gates, reducing the expected two-qubit gate infidelity by more than 30% relative to conventional fully entangling decompositions. We demonstrate the methodology by quantum computing the vibrational spectra of a small protonated water cluster that shows critical quantum nuclear behavior. Such water cluster systems have been found to be challenging for experimental action spectroscopy and for theory, and here, for the first time, we provide results for vibrational spectroscopy that are in agreement with the respective classical results to within 4 cm$^{-1}$, thus allowing for the potential for spectroscopic accuracy from quantum computations.

2606.11390 2026-06-11 cs.CV cs.DC cs.GR cs.LG New submission

A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting

Matthew Cong, Francis Williams, Jonathan Swartz, Mark Harris, Sanja Fidler, Ken Museth

Affiliations: NVIDIA, University of Toronto, Vector Institute

AI summary: Proposes a multi-GPU Gaussian splatting method that distributes parameters at the operator level via CUDA unified memory and NVLink, enabling large-scale scene reconstruction with over 1 billion Gaussian splats.

Comments
14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary material
Abstract

Gaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.

2606.11357 2026-06-11 cs.DC cs.AI cs.AR cs.PF New submission

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

Wesley Pang, Gregory Hyegang Jun, Feiyang Liu, Deming Chen

AI summary: To address the difficulty of deploying quantized LLMs on edge NPUs, proposes the TileFuse library, which fuses unpacking, dequantization, and GEMM/GEMV kernels and designs an interleaved pre-tiling layout and dataflow, natively supporting the AWQ format on XDNA2 with up to 281% higher performance and 64.6% lower energy consumption.

Comments
13 pages excluding reference, 11 figures
Abstract

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.
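
What "fusing unpacking, dequantization, and GEMV" means can be shown on a single weight row. The sketch below assumes a generic group-wise asymmetric 4-bit scheme (two weights per byte, low nibble first, one scale and zero point per group) with floating-point activations. This is a simplification for illustration: the real AWQ packing order and TileFuse's interleaved tile layout differ, and the function names are hypothetical.

```python
def pack_int4(values):
    """Pack unsigned 4-bit ints (0..15), two per byte, low nibble first."""
    assert len(values) % 2 == 0
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )


def fused_gemv_row(packed, scales, zeros, x, group_size):
    """Dot product of one quantized weight row with the activation vector x.

    Each weight is unpacked and dequantized right before it is used, so the
    full-precision row is never materialized in memory.
    """
    acc = 0.0
    for j in range(len(x)):
        byte = packed[j // 2]
        q = (byte >> 4) if j % 2 else (byte & 0xF)  # unpack one nibble
        g = j // group_size                         # quantization group index
        acc += (q - zeros[g]) * scales[g] * x[j]    # dequantize and accumulate
    return acc
```

An unfused pipeline would first write the dequantized row back to memory and then run a separate GEMV over it. Fusing avoids that extra round trip, which matters for memory-bound kernels such as GEMV.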

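2606.11356 2026-06-11 physics.ao-ph cs.DC cs.SE physics.comp-ph New submission

An Ocean Model Ported by a Large Language Model: Experience and Lessons from FESOM2 (Fortran to C to C++/Kokkos)

A Large Language Model Ports an Ocean Model: Experience and Lessons from FESOM2 (Fortran to C to C++/Kokkos)

Nikolay V. Koldunov, Suvarchal K. Cheedela, Sergey Danilov, Dmitry Sidorenko, Sebastian Beyer, Thomas Jung

AI summary: This paper shows how an LLM was used to port the FESOM2 ocean model from Fortran to C and then to C++/Kokkos; through two-stage translation, strictly literal conversion, and stage-by-stage validation, the port preserved physical accuracy and achieved GPU acceleration within weeks.

Details
AI abstract (translated from Chinese)

Large language models (LLMs) can translate and modify source code, and have been shown to do so for codes of varying complexity. Whether they can port a complete, production-grade geophysical model to another language without degrading its physical fidelity has not been established. We show that LLM-assisted code translation can preserve the physics of a complete production ocean model while migrating it to a modern performance-portable form. We report our experience using an agentic LLM coding assistant, guided by domain experts, to port the FESOM2 unstructured-mesh ocean and sea-ice model (about 74,000 lines of core Fortran) first to C and then to C++/Kokkos for performance portability across CPUs and GPUs. We describe the practices that proved necessary, what worked and what did not, and the failure modes we encountered. Three practices mattered most: translating in two stages, separating reproduction of the numerics (Fortran to a clean C reference implementation) from the introduction of parallelism (C to Kokkos); requiring strictly literal translation, with the assistant not allowed to "improve" the source; and validating each stage against a suitable acceptance criterion. The C port reproduces the original Fortran results at the level of five-year long-term simulation statistics. The Kokkos version is bit-for-bit identical to the C reference on CPU and statistically close on GPU over multi-year runs. On eddy-rich meshes with up to 7.4 million surface vertices, a single A100 GPU node is 1.6-3.7 times faster than a CPU node, reaching the 1-2 simulated years per day needed for production integrations. The result is more than a GPU port: by following a clear validation procedure, an LLM migrated a complete Fortran ocean model to another language and onto accelerators within weeks while preserving its physics.

English abstract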

Large language models (LLMs) can translate and modify source code, and have been shown to do so for codes of different complexity. Whether they can port a complete, production geophysical model to a different language without degrading its physics has not been established. We demonstrate that LLM-assisted code translation can preserve the physics of a complete production ocean model while moving it into a modern performance-portable form. We report our experience using an agentic LLM coding assistant, directed by domain experts, to port the FESOM2 unstructured mesh ocean--sea-ice model (about 74000 lines of core Fortran) first to C and then to C++/Kokkos for performance portability across CPUs and GPUs. We describe the practices that proved necessary, what worked and what did not, and the failure modes that we encountered. Three practices mattered most: translating in two stages that separate reproducing the numerics (Fortran to a clean C reference) from introducing parallelism (C to Kokkos); requiring a strictly literal translation in which the assistant was not permitted to ``improve'' the source; and validating each stage against an acceptance criterion suited to it. The C port reproduces the original Fortran at the level of long-term simulation statistics over five years. The Kokkos port is bit-for-bit identical to the C reference on CPU and statistically close on GPU over multi-year runs. On eddy-rich meshes up to 7.4 million surface vertices a single A100 GPU node runs 1.6--3.7 times faster than a CPU node, reaching the 1-2 simulated-years-per-day required for production integrations. The result is more than a single GPU port: by following a clear validation procedure, an LLM moved a full Fortran ocean model into another language and onto accelerators while preserving its physics in a matter of weeks.

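2606.10982 2026-06-11 cs.DC Updated version

FairWave: A Fairness-Aware Asynchronous DAG-BFT Consensus

FairWave: A Fairness-Aware Asynchronous DAG-BFT Consensus

Syariful Mujaddiq

AI summary: Proposes the FairWave protocol, whose dual-channel design separates anchor selection from reward distribution to resolve the fairness trilemma that arises when asynchronous BFT is combined with PoS, achieving a low Gini coefficient and resistance to rich-get-richer dynamics.

Details
Comments
20 pages, 36 figures, preprint version
AI abstract (translated from Chinese)

Combining asynchronous Byzantine Fault Tolerant (BFT) consensus with Proof-of-Stake (PoS) creates a trilemma: a trade-off among Sybil resistance, reward distribution fairness, and resistance to persistent plutocracy. Existing DAG-BFT approaches (Narwhal+Tusk, Bullshark, and Mysticeti) prioritize liveness over the fairness implications of stake-based selection, leading to persistent longitudinal inequality. This paper proposes a dual-channel DAG BFT protocol that separates anchor selection from reward distribution. The selection channel is super-linear in stake, ensuring Sybil gain < 1 for all split factors K > 1. The reward channel is sub-linear, using square-root stake normalization to mitigate rich-get-richer effects. The finalized DAG structure provides deterministic uptime and latency factors, so honest validators can agree on operational quality without any external oracle. To avoid a circular dependency between selection outcomes and selection weights, reputation is used in lagged form: the active value in epoch e equals the final value of the previous epoch. We derive closed-form constraints for both channels and validate them against eight baselines through nine empirical analyses (about 550,000 Monte Carlo rounds). FairWave achieves a Gini coefficient of 0.149 (versus 0.488 for Pure-PoS), an HHI that falls monotonically from 0.039 to 0.021 over 50,000 epochs, an optimal-adversary Sybil split of K* = 1, and a success-rate coefficient of variation of 5.2% under ±25% input perturbation. Safety (agreement and validity) is a formal consequence of the 2f+1 strong-support commit rule and holds unconditionally for f < n/3; the empirical differentiator is a monotone, continuous liveness-degradation curve, with a commit rate of 99.6% at b=0.20 falling to 71.1% at the theoretical bound b=1/3, without the discontinuous cliff characteristic of view-change-driven leader-based BFT.

English abstract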

Combining asynchronous Byzantine Fault Tolerant (BFT) consensus with Proof-of-Stake (PoS) creates a trilemma between Sybil resistance, reward distribution fairness, and protection against persistent plutocracy. Existing DAG-BFT approaches (Narwhal+Tusk, Bullshark, and Mysticeti) prioritize liveness over the fairness implications of stake-based selection, resulting in persistent longitudinal centralization. FairWave is a dual-channel DAG BFT protocol that separates anchor selection from reward distribution. The selection channel is super-linear in stake, guaranteeing Sybil gain < 1 for all split factors K > 1. The reward channel is sub-linear, using square-root stake normalization to mitigate rich-get-richer dynamics. The finalized DAG structure provides deterministic uptime and latency factors, allowing honest validators to agree on operational quality without any external oracle. To avoid circular dependency between selection outcomes and selection weights, reputation is used in a lagged form: the active value at epoch e equals the prior epoch's final value. We derive closed-form constraints for both channels and validate them through nine empirical analyses (approximately 550,000 Monte Carlo rounds) against eight baselines. FairWave achieves a Gini coefficient of 0.149 (vs. Pure-PoS's 0.488), a monotone HHI reduction from 0.039 to 0.021 over 50,000 epochs, an optimal-adversary Sybil split of K* = 1, and a success-rate coefficient of variation of 5.2% under +/-25% input perturbation. Safety (agreement and validity) is a formal consequence of the 2f+1 strong-support commit rule, holding unconditionally for f < n/3; the empirical differential is the monotone-continuous liveness-degradation curve, which decreases from 99.6% commit rate at b=0.20 to 71.1% at the theoretical bound b=1/3 without the discontinuous cliff characteristic of view-change-driven leader-BFT.
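
As a rough numerical illustration of the two channels described in the abstract, the Python sketch below computes a Gini coefficient, applies square-root stake normalization to a reward split, and evaluates the Sybil gain of a power-law selection weight. The exponent `alpha = 2.0`, the stake values, and all function names are hypothetical; the abstract specifies only that selection is super-linear and that rewards use square-root normalization, so this is not FairWave's actual weighting.

```python
import math

def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard sorted-sample formula with 1-based rank i.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

def sqrt_reward_shares(stakes):
    """Sub-linear reward split: shares proportional to sqrt(stake)."""
    roots = [math.sqrt(s) for s in stakes]
    total = sum(roots)
    return [r / total for r in roots]

def sybil_gain(k, alpha):
    """Total selection weight after splitting one stake into k equal
    identities, relative to not splitting, for weight = stake ** alpha
    (an assumed form). The ratio is k ** (1 - alpha)."""
    return k ** (1.0 - alpha)

stakes = [1, 4, 9, 16, 100]                      # hypothetical validator stakes
linear_shares = [s / sum(stakes) for s in stakes]
print(gini(linear_shares), gini(sqrt_reward_shares(stakes)))
print(sybil_gain(2, 2.0))                        # below 1 whenever alpha > 1 and k > 1
```

With these made-up stakes, the square-root split is less concentrated than the stake-proportional split, and splitting a stake lowers total selection weight whenever the exponent exceeds one.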

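2606.04145 2026-06-11 cs.LG cs.AI cs.DC Updated version

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

Guilin Zhang, Chuanyi Sun, Shahryar Sarkani, John M. Fossaceca

AI summary: Proposes EvalStop, a scheduling primitive that corrects reward overoptimization by detecting consecutive declines in evaluation scores, terminating the job, releasing its GPUs, and preserving the best checkpoint; on RLHF workloads it achieves high-precision detection and improves JCT.

Details
AI abstract (translated from Chinese)

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, in which a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, under sustained optimization pressure this proxy diverges from world feedback (downstream evaluation metrics), a phenomenon called reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that decreases monotonically and can be driven down by hacking), and classical per-job early stopping requires human monitoring and does not release shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates a job after k consecutive declines in evaluation score, releases its GPUs, preserves the best checkpoint, and delegates to any base scheduler. We treat scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking runs with structurally healthy runs, with ground-truth labels hidden from the scheduler. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves 98% precision, 99% recall, and a 1.5% false-positive rate, while improving JCT by 9% and reducing wasted compute by 22% relative to SRTF-Est (p<0.05). Simple fixed-progress and loss-plateau competitors either incur a 65% false-positive rate on healthy RLHF runs or miss more than half of the true hacking cases. The gains hold across all tested base schedulers (9-25% JCT improvement), and detection quality remains stable under evaluation noise (precision at least 91% for noise standard deviation ≤ 0.05) and across hacking base rates (precision at least 89% for hacking fractions of 20-80%).

English abstract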

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
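
The stopping rule named in the abstract (terminate after k consecutive eval-score declines and keep the best checkpoint) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `evalstop_decision` is hypothetical, and "decline" is assumed to mean a score strictly lower than the immediately preceding evaluation, which the abstract does not spell out.

```python
def evalstop_decision(eval_scores, k=3):
    """Scan evaluation scores in order.

    Returns (stop_index, best_index):
      stop_index  index of the evaluation at which the job would be
                  terminated, or None if k consecutive declines never occur
      best_index  index of the highest score seen up to that point
                  (the checkpoint to preserve), or None for an empty list
    """
    best_index = None
    declines = 0
    for i, score in enumerate(eval_scores):
        if best_index is None or score > eval_scores[best_index]:
            best_index = i
        # Assumed definition: a decline is a drop relative to the previous eval.
        if i > 0 and score < eval_scores[i - 1]:
            declines += 1
        else:
            declines = 0
        if declines >= k:
            return i, best_index
    return None, best_index

# Hypothetical run whose eval score peaks at index 2 and then degrades.
print(evalstop_decision([0.50, 0.60, 0.70, 0.68, 0.66, 0.61, 0.60], k=3))
```

In a scheduler, a non-None `stop_index` would be the point at which the job's GPUs are released and control returns to the base scheduler.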

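2606.03077 2026-06-11 cs.LG cs.AI cs.DC Updated version

Libra: Efficient Resource Management for Agentic RL Post-Training

Libra: Efficient Resource Management for Agentic Reinforcement Learning Post-Training

Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu

AI summary: To address the resource-management challenges posed by long-tailed, non-stationary workloads in agentic reinforcement learning, proposes Libra, which uses a periodic global resource planner and a causality-driven multi-level feedback queue scheduler to optimize GPU allocation and request scheduling, improving throughput by up to 3x and convergence speed by up to 2.5x.

Details
Comments
19 pages, 12 figures
AI abstract (translated from Chinese)

Reinforcement learning (RL) has become the standard post-training paradigm for large language models (LLMs), extending from preference alignment to complex reasoning and multi-turn agentic behavior. In agentic RL, the rollout stage generates trajectories and invokes tools, producing long-tailed and non-stationary workloads that challenge traditional resource-management assumptions. Three fundamental challenges arise. First, because of the long-tail distribution, a small fraction of trajectories dominates rollout completion time. Second, rollout and training are strongly asymmetric in compute patterns, memory demands, and sensitivity to sequence length. Third, as the RL policy evolves, the trajectory length distribution drifts over time, so any static resource allocation becomes progressively suboptimal. We propose Libra, which introduces two core mechanisms. The first is a periodic global resource planner that jointly optimizes GPU allocation across the rollout and training clusters. It uses an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes rather than relying on fragile length predictions. Evaluation on 48 A800 GPUs shows that, compared with the baselines, Libra achieves up to 3.0x higher throughput and up to 2.5x faster reward convergence.

English abstract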

Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training are subject to cross-stage imbalance, as they exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, rendering any static resource split progressively suboptimal. We present Libra, a resource management system to address both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward compared to the baselines.

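2606.01183 2026-06-11 cs.DC cs.DB cs.DS cs.PF Updated version

The World's Fastest Matching Engine Algorithm

The World's Fastest Matching Engine Algorithm

Jake Yoon

AI summary: Proposes two data-structure techniques, the Priority-Indicated Node (PIN) and neighbor-aware tree operations, that eliminate the latency of pointer chasing and root-to-leaf search in order books, achieving sub-microsecond tail latency and throughput of tens of millions of messages per second.

Details
Comments
20 pages, 5 figures, 7 tables
AI abstract (translated from Chinese)

Every electronic exchange relies on an order book whose storage layer determines matching latency. The dominant implementation, linked lists chained through a balanced tree, imposes two costs on every operation: a pointer-chasing traversal to reach the insertion point, and a root-to-leaf search to locate the target price level. Under micro-burst conditions these costs produce tail-latency spikes that degrade market quality exactly when liquidity is most needed. We present two data-structure contributions that eliminate these costs. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, each carrying a per-slot indicator of the entry's global priority. Unlike a heap, which needs O(log n) comparisons per operation, the PIN resolves the insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. The second addresses a broader inefficiency: balanced search trees perform a root-to-leaf search on every insertion and deletion even when the caller already knows the key's in-order neighbors, as in ordered event streams, incremental index maintenance, and electronic trading. Neighbor-aware insertion and deletion use the known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, and apply uniformly to red-black trees, AVL trees, and B/B+-tree variants. A single CPU core sustains 32 million order messages per second with sub-microsecond tail latency under micro-bursts of millions of messages per second, 5-11 times faster than the best open-source matching engine on the same hardware. Scaled out to a single 96-core instance, the engine sustains 640 million messages per second across 10,000 symbols.

English abstract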

A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
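
The idea of attaching a node next to an already-known neighbor with a constant number of reference writes can be illustrated on a much simpler structure than the paper's trees. The Python sketch below uses a sorted doubly linked list of price levels; it is an analogy only, with hypothetical names (`Level`, `insert_after`, `remove`), and it omits the single-path rebalancing that the abstract describes for red-black, AVL, and B+-tree variants.

```python
class Level:
    """One price level in a sorted doubly linked list (toy stand-in for a tree node)."""
    def __init__(self, price):
        self.price = price
        self.prev = None
        self.next = None

def insert_after(pred, price):
    """Attach a new level directly after the known predecessor `pred`.
    No search is performed: only a fixed number of references are written."""
    node = Level(price)
    succ = pred.next
    node.prev, node.next = pred, succ
    pred.next = node
    if succ is not None:
        succ.prev = node
    return node

def remove(node):
    """Detach `node` using its own neighbor references, again without searching."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None

def prices(head):
    """Walk the list from `head` and collect prices in order."""
    out = []
    while head is not None:
        out.append(head.price)
        head = head.next
    return out

head = Level(100)
top = insert_after(head, 102)   # caller knows 100 precedes 102
mid = insert_after(head, 101)   # caller knows 100 precedes 101
print(prices(head))
remove(mid)
print(prices(head))
```

The caller is responsible for passing a correct predecessor; the sketch does not verify ordering, which mirrors the abstract's premise that neighbors are already known.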

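2605.26418 2026-06-11 cs.LG cs.AI cs.DC Updated version

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

When Does Deep Reinforcement Learning Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

Guilin Zhang, Chuanyi Sun, Kai Zhao, Shahryar Sarkani, John Fossaceca

AI summary: Using the RLScale-Bench benchmark, finds that a calibrated rule-based autoscaler achieves lower cost than six mainstream deep reinforcement learning algorithms on every workload, and shows that the bottleneck lies in baseline calibration, reward engineering, and evaluation protocols rather than in algorithm selection.

Details
AI abstract (translated from Chinese)

A properly calibrated rule-based autoscaler can beat six mainstream deep reinforcement learning (DRL) algorithms on cost on every workload we test, so when, if ever, does DRL actually help? We study this question in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL in adaptive resource control, where an agent allocates compute resources to a dynamic workload under cost and service-level constraints. Under matched architectures, training budgets, and reward functions, we evaluate PPO, DQN, A2C, SAC, TD3, and DDPG against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe generalization under distribution shift. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, although it trails the best RL agents on bursty and flash traffic; (ii) because of action-space mismatch, discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations; (iii) no single algorithm dominates across all workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.

English abstract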

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.

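2509.20241 2026-06-11 cs.LG cs.DC Updated version

Energy Use of AI Inference, Efficiency Pathways, and Test-Time Scaling

Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, Juan M. Lavista Ferres

AI summary: Proposes a bottom-up method based on token throughput for estimating the per-query energy use of large language models at scale, revealing how energy use changes under test-time scaling and the potential for efficiency gains.

Details
Comments
A preprint version with DOI is available at Zenodo: this https URL
AI abstract (translated from Chinese)

As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are essential for capacity planning, emissions accounting, and prioritizing efficiency work. Many public estimates are inconsistent and overstate energy use because they extrapolate from limited benchmarks and fail to reflect efficiency gains at scale. This paper introduces a bottom-up method based on token throughput for estimating the per-query energy use of large-scale LLM systems. For models running on H100 nodes, under realistic workloads and constraints on GPU utilization and PUE, we estimate a median per-query energy use of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements of production-scale configurations and suggest that non-production estimates may overstate energy use by 4-20x. Extending to test-time scaling scenarios, where the number of tokens per typical query increases 15-fold, the median energy rises to 4.32 Wh, indicating that targeting efficiency in this regime will yield the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving-platform, and hardware levels, finding median per-query energy reductions of 1.5-3.5x for an individual model, while combined improvements could yield reductions of 8-20x. To illustrate system-level impact, we estimate a baseline energy use of 0.8 GWh/day for a deployment handling one billion queries. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions it falls to 0.9 GWh/day, comparable to the energy use of web search at that scale. This echoes the history of data centers containing energy growth through efficiency gains.

English abstract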

As AI inference scales to billions of queries, estimates of per-query energy use are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates assume non-production settings, leading to systematic overestimation. We introduce a bottom-up framework estimating inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16-0.60), indicating widely cited estimates are overstated by 4-20x. In test-time scaling scenarios 15x longer than typical queries, the median energy rises 13x to 3.91 Wh (IQR 2.15-7.05). Across models, serving systems, and hardware, we estimate 8-20x line-of-sight energy reductions. At datacenter scale, serving 1 billion queries/day requires 0.7 GWh; if 10% are long queries, demand rises to 1.7 GWh/day. With efficiency interventions, it falls to 0.8 GWh/day, mitigating the energy impact of test-time scaling.
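
To make the bottom-up idea concrete (energy derived from token throughput, node power, and overhead), the Python sketch below uses a deliberately simple model: node power times PUE times the node-seconds a query occupies. The formula, the function names, and every input value are assumptions for illustration; they are not the paper's parameters, and the paper's framework may include further factors such as utilization.

```python
def wh_per_query(node_power_w, pue, tokens_per_query, node_tokens_per_s):
    """Energy per query in watt-hours under an assumed throughput model.

    node_power_w       average electrical power of one serving node (W)
    pue                power usage effectiveness (facility overhead factor)
    tokens_per_query   tokens generated for one query
    node_tokens_per_s  aggregate token throughput of the node (tokens/s)
    """
    node_seconds = tokens_per_query / node_tokens_per_s
    joules = node_power_w * pue * node_seconds
    return joules / 3600.0  # 1 Wh = 3600 J

def fleet_gwh_per_day(queries_per_day, wh_query):
    """Daily fleet energy in GWh for a given per-query energy in Wh."""
    return queries_per_day * wh_query / 1e9

# Hypothetical inputs: 10 kW node, PUE 1.2, 500 tokens/query, 5000 tokens/s.
per_query = wh_per_query(10_000, 1.2, 500, 5_000)
print(per_query, fleet_gwh_per_day(1e9, per_query))
```

Because the model is linear in tokens per query, a query that generates 15 times as many tokens costs 15 times the energy under these assumptions; the abstract's 13x figure comes from the paper's own, more detailed accounting.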

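2605.06057 2026-06-11 cs.DC cs.MS Updated version

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

FalconGEMM: Surpassing Hardware Limits with Lower-Complexity Matrix Multiplication

Honglin Zhu, Jiaping Cao, Jiang Shao, Siyuan Feng, Qian Qiu, Peng Chen, Xu Zhang, Yixian Zhou, Man Lung Yiu, Guang Ji, Minwen Deng, Jintao Meng, Wenxi Zhu

AI summary: FalconGEMM automates the deployment and optimization of lower-complexity matrix multiplication algorithms to improve DL performance, outperforming traditional GEMM libraries and competitors such as AlphaTensor on both GPUs and CPUs.

Details
AI abstract (translated from Chinese)

Peak-breaking matrix multiplication is a promising technique for improving deep learning performance, especially in large language model training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) for diverse hardware. It has three key innovations: (1) a deployment module that uses code generation to enable portable execution across hardware and input configurations; (2) an execution module with group-parallel optimizations that maximize on-chip data reuse, exploit parallel resources, and reduce bandwidth overhead; and (3) a decision module with a lightweight analytical performance model that selects the best strategy from matrix shapes and hardware configurations. We evaluate LLM workloads extensively on GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM achieves peak-breaking performance, improving on GEMM libraries (such as cuBLAS, CUTLASS, and Intel MKL) by 7.59%-17.85% and on LCMA competitors such as AlphaTensor by 12.41%-55.61%. Our framework turns the theoretical promise of LCMAs into reality for production deployment on modern heterogeneous hardware.

English abstract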

Peak-breaking matrix multiplication is a promising technique to improve the performance of DL, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. There are three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model to select the optimal strategy based on matrix shapes and hardware profiles. Extensive evaluation is conducted on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM succeeds in delivering peak-breaking performance and outperforms GEMM libraries (e.g., cuBLAS, CUTLASS, and Intel MKL) by 7.59%-17.85% and LCMA competitors like AlphaTensor by 12.41%-55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.

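2605.05727 2026-06-11 cs.DC Updated version

LLM-Enhanced Deep Reinforcement Learning for Task Offloading in Collaborative Edge Computing

LLM-Enhanced Deep Reinforcement Learning for Task Offloading in Collaborative Edge Computing

Hao Guo, Kaixiang Xu, Ziwu Ge, Lei Yang

AI summary: Proposes the LeDRL framework, which combines a lightweight LLM with self-attention-enhanced DRL and uses structured prompts and semantic feedback for real-time task offloading, outperforming baseline methods in success rate, convergence speed, and real-time responsiveness.

Details
AI abstract (translated from Chinese)

Collaborative edge computing uses edge nodes at different locations to execute tasks and requires dynamic task offloading decisions to maintain low latency and high reliability, especially under unpredictable node failures. Although deep reinforcement learning (DRL) and large language models (LLMs) have shown promise for task offloading, DRL often suffers from low sample efficiency and local optima, while LLMs are hard to use directly because of inference overhead and output uncertainty. To address these limitations, we propose \textbf{LeDRL}, a hybrid decision framework that combines a \emph{lightweight LLM} with self-attention-enhanced DRL for real-time task offloading. LeDRL builds structured, context-aware prompts that capture node state, task semantics, and link dynamics to derive high-level strategy priors. These priors are selectively processed by a self-attention-based alignment module for context-aware policy optimization. A reflective evaluator further distills semantic feedback from past trajectories to refine subsequent prompts and provide consistent guidance. Extensive experiments show that LeDRL outperforms representative baselines in task success rate, convergence speed, and real-time responsiveness across network scales, improving success rate by more than 17%. In addition, we deploy LeDRL on Jetson-based edge devices using the prototype system \textit{CoEdgeSys}, demonstrating its robustness and feasibility under resource constraints. Our code is available at: this https URL.

English abstract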

Collaborative edge computing uses edge nodes in different locations to execute tasks, necessitating dynamic task offloading decisions to maintain low latency and high reliability, especially under unpredictable node failures. Although deep reinforcement learning (DRL) and large language models (LLMs) have shown promise for task offloading, DRL often suffers from poor sample efficiency and local optima, while LLMs are difficult to use directly due to inference overhead and output uncertainty. To address these limitations, we propose \textbf{LeDRL}, a hybrid decision framework that couples a \emph{lightweight LLM} with self-attention-enhanced DRL for real-time task offloading. LeDRL constructs structured, context-aware prompts capturing node status, task semantics, and link dynamics to derive high-level strategy priors. These are selectively processed by a self-attention-based alignment module for context-aware policy optimization. A reflective evaluator further distills semantic feedback from past trajectories to refine subsequent prompts and provide consistent guidance. Extensive experiments show that LeDRL outperforms representative baselines in task success rate, convergence speed, and real-time responsiveness across diverse network scales, achieving over 17\% improvement in success rate. Furthermore, we deploy LeDRL on Jetson-based edge devices using our prototype system \textit{CoEdgeSys}, demonstrating its robustness and feasibility under resource constraints. Our code is available at: this https URL.

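2604.25018 2026-06-11 cs.ET cs.AI cs.DC cs.NI Updated version

Internet of Everything in the 6G Era: Paradigms, Enablers, Potentials and Future Directions

The Internet of Everything in the 6G Era: Paradigms, Enabling Technologies, Potential, and Future Directions

Driss Choukri, Essaid Sabir, Elmahdi Driouch, Abdelkrim Haqiq

AI summary: This survey reviews the concept, core components, architectural foundations, enabling technologies, and research challenges of the Internet of Everything (IoE), and discusses open research directions toward 6G intelligent IoE systems, with a focus on scalability, security, privacy, and energy efficiency.

Details
Comments
48 pages, 15 figures, 6 tables, 272 references
AI abstract (translated from Chinese)

The Internet of Everything (IoE) represents an evolution of the Internet of Things (IoT) that integrates people, data, processes, and things into a unified intelligent ecosystem. IoE aims to improve automation, decision-making, and service efficiency across application domains such as smart cities, healthcare, industry, and next-generation wireless networks. This paper gives a structured overview of the IoE concept, its core components, architectural foundations, enabling technologies, and main research challenges. Finally, it discusses open research directions toward 6G-enabled intelligent IoE systems, focusing on scalability, security, privacy, and energy efficiency.

English abstract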

The Internet of Everything (IoE) represents an evolution of the Internet of Things (IoT) by integrating people, data, processes, and things into a unified intelligent ecosystem. IoE aims to enhance automation, decision-making, and service efficiency across multiple application domains such as smart cities, healthcare, industry, and next-generation wireless networks. This paper provides a structured overview of the IoE concept, its core components, architectural foundations, enabling technologies, and major research challenges. Finally, open research directions toward 6G-enabled intelligent IoE systems are discussed, with emphasis on scalability, security, privacy, and energy efficiency.

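2510.18058 2026-06-11 cs.NI cs.DC Updated version

A New Broadcast Model for Several Network Topologies

A New Broadcast Model for Several Network Topologies

Hongbo Lu, Junsung Hwang, Bernard Tenreiro, Nabila Jaman Tripti, Darren Hamilton, Yuefan Deng

AI summary: Proposes the Broadcast by Balanced Saturation (BBS) algorithm, which uses tree-based pipelining to optimize communication efficiency for large-message broadcast and outperforms existing algorithms on Mesh, Butterfly, Dragonfly, and Fat-Tree topologies.

Details
Comments
30 pages, 7 figures
AI abstract (translated from Chinese)

We introduce Broadcast by Balanced Saturation (BBS), a general class of tree-based pipelined broadcast algorithms designed to optimize communication efficiency across network topologies, with particular attention to large message sizes. By addressing two fundamental theoretical challenges in broadcasting, spanning tree construction and communication task scheduling, BBS provides a unified and flexible framework that operates effectively under a variety of network constraints. The algorithm maximizes aggregate throughput while handling topology constraints, synchronization overhead, bandwidth limits, and contention. Under standard assumptions, including full-duplex and one-port communication, several algorithms were evaluated with SimGrid on Mesh, Butterfly, Dragonfly, and Fat-Tree topologies. The results show that BBS consistently outperforms both general-purpose and topology-aware broadcast algorithms across topologies and message sizes, making it a robust, high-performance solution for large-scale systems.

English abstract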

We introduce Broadcast by Balanced Saturation (BBS), a general class of tree-based pipelined broadcast algorithms that optimizes communication efficiency across diverse network topologies, with a particular emphasis on large message sizes. By addressing spanning tree construction and communication task scheduling, two fundamental theoretical challenges in broadcasting, BBS offers a unified and flexible framework that operates effectively under varied network constraints. The algorithm maximizes aggregated throughput while simultaneously addressing topology constraints, synchronization overhead, bandwidth limitations and contention. Using SimGrid under standard assumptions, including full-duplex and one-port communication, various algorithms were evaluated on Mesh, Butterfly, Dragonfly, and Fat-Tree topologies. Results demonstrate that BBS consistently outperforms both general-purpose and topology-aware broadcast algorithms across a wide range of topologies and message sizes, establishing it as a robust and high-performance solution for large-scale systems.

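2603.09738 2026-06-11 cs.OS cs.DC Updated version

Ensuring Data Freshness in Multi-Rate Task Chains Scheduling

Ensuring Data Freshness in Multi-Rate Task Chain Scheduling

José Luis Conradi Hoffmann, Antônio Augusto Fröhlich

AI summary: Addresses the trade-off between data freshness and deterministic execution in safety-critical systems by proposing a task-based scheduling framework with data freshness constraints; it decomposes the data dependency graph and uses an offset search algorithm to synchronize multi-rate task chains, guaranteeing end-to-end data freshness without the added latency of LET.

Details
AI abstract (translated from Chinese)

In safety-critical autonomous systems, data freshness is a fundamental design challenge. Although the Logical Execution Time (LET) paradigm guarantees compositional determinism, it usually does so at the cost of injected latency, which can degrade data age in high-frequency control loops. In addition, heterogeneous, multi-rate task dependencies are usually guaranteed inefficiently through oversampling. This paper proposes a task-based scheduling framework extended with data freshness constraints. Unlike traditional models, scheduling decisions are driven by the lifetime of the data. We introduce a formal method that decomposes the data dependency graph into dominant paths by tracing the strictest data freshness constraints backward from the actuators. Building on this decomposition, we propose an offset search algorithm for synchronizing multi-rate, multi-dependency task chains. The method enforces end-to-end data freshness without the artificial latency of LET buffering, which is a trade-off between data freshness and execution determinism. We formally prove that this offset-based alignment guarantees data freshness while preserving the 100% schedulability capacity of global EDF.

English abstract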

In safety-critical autonomous systems, data freshness presents a fundamental design challenge. While the Logical Execution Time (LET) paradigm ensures compositional determinism, it often does so at the cost of injected latency, possibly degrading the age of data on high-frequency control loops. Furthermore, heterogeneous, multi-rate task dependencies are typically guaranteed inefficiently through oversampling. This paper proposes a task-based scheduling framework extended with data freshness constraints. Unlike traditional models, scheduling decisions are driven by the lifespan of data. We introduce a formal methodology to decompose Data Dependency Graphs into dominant paths by tracing the strictest data freshness constraints backward from the actuators. Based on this decomposition, we propose an offset search algorithm that synchronizes multi-rate, multi-dependency task chains. This approach enforces end-to-end data freshness without the artificial latency of LET buffering; this is a trade-off between data freshness and execution determinism. We formally prove that this offset-based alignment preserves the 100% schedulability capacity of Global EDF while addressing data freshness guarantees.

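2603.09555 2026-06-11 cs.LG cs.AI cs.DC cs.PF Updated version

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

Cosmo Santoni, Anmol Thapar

AI summary: Proposes an inference approach built on the compiler-friendly structure of state space duality (SSD), using standard JAX primitives to obtain a single-source inference path with no custom kernels; it reaches high hardware utilization on TPU and GPU, and cached decoding is 27-36x faster than full-prefix recomputation.

Details
Comments
21 pages, 6 figures. Code available at: this https URL
AI abstract (translated from Chinese)

High-throughput Mamba-2 inference usually relies on fused CUDA and Triton kernels, which limits portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives yields a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches about 140 TFLOPS, or 15% model FLOP utilization (MFU), which is the roofline ceiling for this regime; cached decode reaches up to 64% hardware bandwidth utilization (HBU). At a 4096-token context, cached decode is 27-36x faster than full-prefix recomputation across five Mamba-2 checkpoints (130M to 2.7B parameters). The same source runs unmodified on an NVIDIA L40S, where cached decode remains independent of sequence length at all model scales. WikiText-103 validation perplexity is within ±0.0005 of the Triton reference implementation mamba_ssm v2.2.2, and hidden states agree to within float32 rounding tolerance. Code is available at this https URL.

English abstract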

High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives gives a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches approximately 140 TFLOPS, or 15% model FLOP utilisation (MFU), the roofline ceiling for this regime, and cached decode reaches up to 64% hardware bandwidth utilisation (HBU). At a 4096-token context, cached decode is 27x--36x faster than full-prefix recomputation across five Mamba-2 checkpoints from 130M to 2.7B parameters. The same source runs unmodified on NVIDIA L40S, where cached decode remains sequence-length independent across all model scales. WikiText-103 validation perplexity matches the Triton reference mamba_ssm v2.2.2 within +/-0.0005 points, and hidden states agree to float32 rounding tolerance. Code is available at this https URL.
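
Why a cached recurrent state makes each decode step independent of sequence length can be seen on a toy diagonal linear recurrence. The plain-Python sketch below is not Mamba-2, not SSD, and not the paper's JAX code: the recurrence `h' = a*h + b*x`, `y = c . h'`, the fixed (input-independent) coefficients, and all names are simplifying assumptions chosen only to contrast cached decoding with full-prefix recomputation.

```python
def step(state, a, b, c, x):
    """One step of a diagonal linear recurrence: every state channel is
    updated independently, then the output is a weighted sum of channels."""
    new_state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
    y = sum(c_i * h_i for c_i, h_i in zip(c, new_state))
    return new_state, y

def decode_cached(xs, a, b, c):
    """Carry the fixed-size state forward: constant work per new token."""
    state = [0.0] * len(a)
    ys = []
    for x in xs:
        state, y = step(state, a, b, c, x)
        ys.append(y)
    return ys

def decode_recompute(xs, a, b, c):
    """Rebuild the state from the whole prefix for every token:
    work per token grows with the prefix length."""
    ys = []
    for t in range(1, len(xs) + 1):
        state = [0.0] * len(a)
        for x in xs[:t]:
            state, y = step(state, a, b, c, x)
        ys.append(y)
    return ys

# Hypothetical two-channel example.
a, b, c = [0.9, 0.5], [1.0, 0.5], [1.0, 2.0]
xs = [1.0, 0.0, 2.0, -1.0]
print(decode_cached(xs, a, b, c) == decode_recompute(xs, a, b, c))
```

Both functions perform the same arithmetic in the same order for each prefix, so their outputs agree; only the amount of repeated work differs.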

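2602.02340 2026-06-11 cs.DC Updated version

LCLs Beyond Bounded Degrees

LCLs Beyond Bounded Degrees

Gustav Schmid

AI summary: Studies locally checkable labeling (LCL) problems on trees of unbounded degree, introduces locally finite labelings (LFLs), and proves that the polynomial gaps are restored for LFLs.

Details
AI abstract (translated from Chinese)

The study of Locally Checkable Labelings (LCLs) has produced a remarkably precise characterization of the distributed time complexities that can occur on bounded-degree trees. A central feature of this complexity landscape is the existence of gap results, which rule out large ranges of intermediate complexities. Although it was initially hoped that these gaps might extend to more general graph classes, this turned out not to be the case. In this work we study a different direction: we stay within the class of trees but allow arbitrarily large degrees. We focus on the polynomial regime, i.e., complexities of the form $\Theta(n^{1/k})$ for $k \in \mathbb{N}$, and show that whether polynomial gap results persist in the unbounded-degree setting depends crucially on how LCLs are generalized beyond bounded degrees. A complex construction already exists showing that the polynomial gaps also vanish for LCLs on unbounded-degree trees. Rather than stopping at this negative result, we give a simpler set of problems whose existence already rules out any polynomial gap. The insight from this cleaner construction is that, for gap results to exist, problem definitions cannot be allowed to distinguish infinitely many local cases. Motivated by this, we introduce Locally Finite Labelings (LFLs), which formalize the intuition that every node must fall into one of finitely many local cases. Our main result shows that this restriction suffices to restore the polynomial gaps: for any LFL $\Pi$ on unbounded-degree trees, the deterministic LOCAL complexity of $\Pi$ is either $\Theta(n^{1/k})$ for some integer $k \geq 1$, or $O(\log n)$. Moreover, which case applies, and the corresponding value of $k$, can be determined from the description of $\Pi$ alone.

English abstract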

The study of Locally Checkable Labelings (LCLs) has led to a remarkably precise characterization of the distributed time complexities that can occur on bounded-degree trees. A central feature of this complexity landscape is the existence of gap results, which rule out large ranges of intermediate complexities. While it was initially hoped that these gaps might extend to more general graph classes, this has turned out not to be the case. In this work, we investigate a different direction: we remain in the class of trees, but allow arbitrarily large degrees. We focus on the polynomial regime, i.e. complexities of the form $\Theta(n^{1/k})$ for $k \in \mathbb{N}$, and show that whether polynomial gap results persist in the unbounded-degree setting crucially depends on how LCLs are generalized beyond bounded degrees. There already exists a complex construction that shows that the polynomial gaps also vanish for LCLs on unbounded-degree trees. Rather than stopping at this negative result, we give a much simpler set of problems that already contradicts the existence of any polynomial gaps. The insight obtained from this cleaner construction is that for gap results to exist, we cannot allow problem definitions to distinguish infinitely many local cases. Inspired by this, we introduce Locally Finite Labelings (LFLs), which formalize the intuition that every node must fall into one of finitely many local cases. Our main result shows that this restriction is sufficient to restore the polynomial gaps: for any LFL $\Pi$ on trees with unbounded degrees, the deterministic LOCAL complexity of $\Pi$ is either $\Theta(n^{1/k})$ for some integer $k \geq 1$, or $O(\log n)$. Moreover, which case applies, and the corresponding value of $k$, can be determined solely from the description of $\Pi$.

2601.21824 2026-06-11 cs.LG cs.DC

DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo

Details
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
English abstract

Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.

2512.22219 2026-06-11 cs.DC cs.LG cs.PL Version update

MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

MPK: A Compiler and Runtime System for Transforming Tensor Programs into Mega-Kernels

Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

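AI summary: Presents MPK, the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. Its SM-level graph representation enables cross-operator software pipelining and fine-grained overlap of computation and communication, significantly reducing inference latency.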

Details
Comments
14 pages
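AI-generated Chinese abstract (translated)

We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained overlap of computation and communication, and other optimizations that are infeasible under the conventional kernel-per-operator execution model. The MPK compiler lowers tensor programs into optimized SM-level task graphs and generates a fast CUDA implementation for each task, while the MPK in-kernel parallel runtime executes these tasks inside a single persistent mega-kernel through decentralized scheduling across SMs. Together, these components deliver end-to-end kernel fusion with minimal development effort while retaining the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems, achieving up to 1.7$\times$ lower end-to-end inference latency and pushing LLM inference performance close to the limits of the underlying hardware. MPK is publicly available at this https URL.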

English abstract

We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained overlap of computation and communication, and other optimizations that are infeasible under the conventional kernel-per-operator execution model. The MPK compiler lowers tensor programs into optimized SM-level task graphs and generates fast CUDA implementations for each task, while the MPK in-kernel parallel runtime executes these tasks within a single persistent mega-kernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems, achieving up to 1.7$\times$ lower end-to-end inference latency and pushing LLM inference performance close to the limits of the underlying hardware. MPK is publicly available at this https URL.
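
To make the idea of a fine-grained task graph more concrete, here is a minimal single-threaded simulation of dependency-counter scheduling. It is not MPK's runtime and does not model decentralized scheduling across SMs or any GPU detail; the task names (`matmul0`, `comm0`, `next_op`), the worker count, and the function `run_task_graph` are hypothetical and chosen only for illustration. It shows how tile-level dependencies let a communication task start while compute tiles of the same operator are still pending.

```python
from collections import deque


def run_task_graph(deps, num_workers=2):
    """Simulate workers draining a task graph; returns the tasks run per round."""
    # Number of unfinished predecessors for every task.
    remaining = {task: len(d) for task, d in deps.items()}
    successors = {task: [] for task in deps}
    for task, d in deps.items():
        for pred in d:
            successors[pred].append(task)

    # Tasks with no predecessors are ready immediately.
    ready = deque(sorted(t for t, n in remaining.items() if n == 0))
    rounds = []
    while ready:
        # Each worker picks up at most one ready task per round.
        batch = [ready.popleft() for _ in range(min(num_workers, len(ready)))]
        rounds.append(batch)
        for task in batch:
            # Finishing a task releases successors whose counter hits zero.
            for succ in successors[task]:
                remaining[succ] -= 1
                if remaining[succ] == 0:
                    ready.append(succ)
    return rounds


# Hypothetical graph: three compute tiles, one communication task per tile,
# and a following operator that needs all communicated tiles.
deps = {
    "matmul0": [],
    "matmul1": [],
    "matmul2": [],
    "comm0": ["matmul0"],
    "comm1": ["matmul1"],
    "comm2": ["matmul2"],
    "next_op": ["comm0", "comm1", "comm2"],
}

rounds = run_task_graph(deps, num_workers=2)
for i, batch in enumerate(rounds):
    print(i, batch)
```

In the second round of this toy run, `matmul2` and `comm0` execute side by side. A kernel-per-operator model would instead wait for all three compute tiles before starting any communication.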