arXivDaily: arXiv daily academic digest, updated Monday through Friday
cs.PF (Performance): 10 papers
2606.12154 2026-06-11 cs.PF New submission

The Brain That Goes Quiet: Serving a Large Model's Knowledge at 131 Tokens per Second on an 8 GB Laptop by Removing the Large Model from the Runtime Path

Myeong Jun Jo

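AI summary: This paper proposes an offline knowledge-store approach. A large model (35B MoE) is used to build a structured knowledge store, and at runtime only a lightweight router and a 1B small model are used. On an 8 GB laptop, end-to-end response time drops from 4.4 s to 0.5 s and throughput rises to 131 tokens/s.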

Comments
17 pages, 5 figures
Abstract

In earlier work I showed that a 35B-class Mixture-of-Experts model can be loaded and executed on a consumer laptop with 8 GB of GPU memory. That result solved a placement problem and immediately exposed a different one: even correctly placed, the large model needed roughly four seconds to answer, because it was still being invoked at every query. This paper documents what happened when I stopped invoking it. During an offline phase, the large model reads source documents and writes verified answer entries into a structured knowledge store; at runtime, only a lightweight router, a deterministic renderer, and a 1B-class model are active. On the same 8 GB laptop, end-to-end response time fell from approximately 4,465 ms to 518 ms, effective end-to-end throughput rose from 15.7 to 131 tokens per second, and the small model's streaming decode rate held at 226-237 tokens per second with a time-to-first-token of 29-62 ms. The bottleneck is structural: three different large models (Qwen, Gemma, and GLM class) all showed the same multi-second runtime cost, and all three produced usable knowledge stores offline. On a 563-entry store built from seventeen real documents, keyword routing collapsed to 1.5% top-1 accuracy while BM25-based routing reached 92.8% (99.4% top-3), and a confidence gate raised effective top-1 to 98.0% by escalating 12.3% of queries. Exact-match fidelity of the small model ranged from 9/9 to 0/9 across envelope formats carrying identical content. A 16-case verification gate blocked all ten corrupted entries while admitting all six supported ones.
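
The routing step in the abstract (BM25 scoring plus a confidence gate that escalates uncertain queries) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `BM25Router`, the whitespace tokenizer, the `k1`/`b` defaults, and the relative-margin gate with `min_margin` are all assumptions, since the abstract does not specify how the gate is computed.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


class BM25Router:
    """Okapi BM25 over knowledge-store entries (hypothetical helper)."""

    def __init__(self, entries, k1=1.5, b=0.75):
        self.docs = [tokenize(e) for e in entries]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        df = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        dl = len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def route(self, query, min_margin=0.2):
        """Return the index of the best entry, or None to escalate."""
        ranked = sorted(((self.score(query, i), i) for i in range(len(self.docs))),
                        reverse=True)
        best, best_i = ranked[0]
        runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
        # Assumed gate: escalate when nothing matches or the lead is too thin.
        if best <= 0.0 or (best - runner_up) < min_margin * best:
            return None
        return best_i
```

A query that matches one entry clearly is answered from the store; a query that matches nothing, or ties between two entries, returns `None` so the caller can escalate it.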

2606.11937 2026-06-11 cs.DC cs.PF New submission

From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX

Alexander Strack, Alexander Van Craen, Dirk Pflüger

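AI summary: Using the Cholesky-Bench benchmark, this paper compares four parallel variants of tiled Cholesky decomposition under the OpenMP and HPX runtimes. It finds that HPX outperforms OpenMP by 15%-30% at the optimal tile size, with roughly 3.8x lower asynchronous task overhead.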

Comments
15 pages, 8 figures, accepted paper at AMTE held in conjunction with PPAM 2026
Abstract

Fork-join parallelism, popularized by OpenMP, remains the dominant model for shared-memory parallel programming, but its implicit synchronization barriers can penalize algorithms with inhomogeneous workloads. Asynchronous many-task (AMT) runtimes sidestep these barriers by expressing work as a dependency graph of fine-grained tasks. Yet, the actual performance benefit over a carefully written fork-join baseline is rarely quantified. In this work, we introduce Cholesky-Bench and use it to revisit the tiled Cholesky decomposition, a canonical irregular kernel, comparing four parallelization variants of the right-looking algorithm across two runtimes: the OpenMP implementations shipped with GCC and LLVM, and the HPX AMT runtime. The variants span classical fork-join, a collapsed fork-join that exposes additional inner-loop parallelism, synchronous tasking, and asynchronous tasking with explicit data dependencies. We benchmark all eight combinations on a dual-socket 128-core AMD Zen 2 node across multiple tile sizes and problem sizes. Our results show that across all variants, HPX outperforms OpenMP at the optimal tile size by 15%-30%. Specifically, asynchronous HPX tasks are up to 26% faster than their OpenMP counterparts, and exhibit roughly 3.8x smaller task overhead. Furthermore, the collapsed fork-join variants close most of the gap to synchronous tasking. Removing redundant synchronization barriers yields an additional improvement of 7% (OpenMP) to 14% (HPX). A GCC-versus-LLVM comparison further reveals compiler-specific differences in fork-join scheduling and task-creation overheads.
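
For readers unfamiliar with the kernel, the right-looking tiled Cholesky algorithm can be sketched sequentially as below. This is an illustrative pure-Python version, not code from Cholesky-Bench: the function names borrow the usual BLAS/LAPACK labels (`potrf`, `trsm`) but are plain loops, and `update` stands in for both SYRK and GEMM. Each tile-kernel call is the unit that a fork-join loop or a task runtime would schedule.

```python
import math


def potrf(A, k0, k1):
    """Factor the diagonal tile A[k0:k1, k0:k1] in place (lower triangle)."""
    for j in range(k0, k1):
        A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k0, j)))
        for i in range(j + 1, k1):
            A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p] for p in range(k0, j))) / A[j][j]


def trsm(A, k0, k1, i0, i1):
    """Solve X * L_kk^T = A_ik in place for a tile below the diagonal tile."""
    for i in range(i0, i1):
        for j in range(k0, k1):
            A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p] for p in range(k0, j))) / A[j][j]


def update(A, k0, k1, i0, i1, j0, j1):
    """Trailing update A_ij -= L_ik * L_jk^T, restricted to the lower triangle."""
    for i in range(i0, i1):
        for j in range(j0, min(j1, i + 1)):
            A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k0, k1))


def tiled_cholesky(A, tile):
    """Right-looking tiled Cholesky; returns lower-triangular L with A = L L^T."""
    n = len(A)
    bounds = [(s, min(s + tile, n)) for s in range(0, n, tile)]
    for k, (k0, k1) in enumerate(bounds):
        potrf(A, k0, k1)                      # one task per step
        for i0, i1 in bounds[k + 1:]:         # independent of each other
            trsm(A, k0, k1, i0, i1)
        for bi in range(k + 1, len(bounds)):  # each depends on two trsm results
            i0, i1 = bounds[bi]
            for j0, j1 in bounds[k + 1:bi + 1]:
                update(A, k0, k1, i0, i1, j0, j1)
    for i in range(n):                        # clear the untouched upper triangle
        for j in range(i + 1, n):
            A[i][j] = 0.0
    return A
```

In a fork-join version, the `trsm` loop and the `update` loop would each be a parallel region followed by a barrier. An asynchronous task version would instead declare the tile-level data dependencies and let the runtime overlap steps.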

2606.11690 2026-06-11 cs.DC cs.PF New submission

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

Chitral Patil

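AI summary: Existing cost calculators treat GPU utilization as a fixed input, which causes severe error. The paper proposes a concurrency-aware cost-estimation method based on the measured request rate λ, open-sources the vllm-cost-meter tool, and shows that cost is underestimated by 2.5-36.3x at low load.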

Comments
26 pages, 9 figures. Code: this https URL
Abstract

Every public LLM cost calculator we surveyed treats GPU utilization as a fixed input -- entered by the user, baked in as a preset, or silently assumed at 100% -- never measured against the operator's actual load. We show that this assumption is the dominant source of error: on identical H100 hardware, effective cost spans $0.21 to $15.25 per million output tokens, an underutilization penalty of 2.5-24x across low-to-moderate enterprise loads (1-10 rps) and up to 36.3x near idle -- driven by one operator-controlled variable, offered request rate λ, which sets in-flight concurrency via Little's Law and which no open-source calculator exposes. Because calculators take utilization as a user-supplied input, any utilization-naive estimate understates true cost by a factor of exactly 1/U, systematically mispricing self-hosting -- most severely over-selling it for low-traffic workloads. We propose a measurement methodology that parameterizes the relationship as C_eff = f(H, M, Q, λ, L), validate it with 42 benchmarks across dense, ultra-sparse MoE, and sparse MoE models, and release vllm-cost-meter, an open-source cost meter that attaches to a live vLLM server and reports real $/M-tokens against the operator's own traffic. We further show that FP8 quantization benefits the MoE architectures we tested roughly 2.2-2.4x more than the dense model (+69 to +74% vs. +31% peak throughput; n=3, broader validation needed), and our data are consistent with active parameter count, not total model size, being a primary predictor of saturation economics. To rule out single-hardware confounding we repeat the core sweep on A100 80GB PCIe (56 runs): the load-driven spread reproduces at 7.0-11.4x, the active-parameters ordering survives at FP8, and the dense-FP8 advantage inverts on silicon without native FP8 tensor cores -- a hardware-conditional caveat the framework already accommodates.
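
The arithmetic behind the 1/U claim can be illustrated with a short sketch. The function names and every number in the usage note are hypothetical and are not taken from the paper or from vllm-cost-meter. The sketch only assumes that cost per token is the GPU's hourly price divided by the tokens actually produced in that hour, and that Little's Law (L = λW) relates request rate to in-flight concurrency.

```python
def in_flight_concurrency(request_rate_rps, mean_latency_s):
    """Little's Law: average requests in flight = arrival rate * time in system."""
    return request_rate_rps * mean_latency_s


def cost_per_million_tokens(gpu_dollars_per_hour, output_tokens_per_second):
    """Dollars per one million output tokens at a given sustained throughput."""
    tokens_per_hour = output_tokens_per_second * 3600.0
    return gpu_dollars_per_hour / tokens_per_hour * 1e6


def utilization(measured_tokens_per_second, peak_tokens_per_second):
    """Fraction of saturated throughput actually used (U in the abstract)."""
    return measured_tokens_per_second / peak_tokens_per_second
```

With an assumed price of $2.50 per GPU-hour, a saturated throughput of 5,000 tokens/s, and a measured throughput of 500 tokens/s, `utilization` gives U = 0.1. The cost computed from measured throughput is then 10 times the cost computed from peak throughput, which is the 1/U gap a utilization-naive calculator would miss.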

2606.11529 2026-06-11 cs.GR cs.CV cs.PF New submission

XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer

Steve Rhyner, Sankeerth Durvasula, Aleksandr Kovalev, Hansel Jia, Adrian Zhao, Mrutunjayya Mrutunjayya, Nilesh Ahuja, Selvakumar Panneer, Christina Giannoula, Nandita Vijaykumar

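AI summary: Proposes the XPR framework, whose high-level programming interface and modular rendering pipeline let new methods such as 3DGS be implemented in little code, and which runs across platforms via the XLA compiler.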

Abstract

Point-based differentiable rendering underpins modern 3D reconstruction, novel-view synthesis, and learning-based graphics pipelines, but developing new rendering methods often requires extensive low-level implementation, hardware-specific kernels, and manually written backward passes. This limits rapid prototyping, reproducibility, exploration, and deployment, especially across diverse hardware platforms. This paper presents XPR, an extensible cross-platform framework for point-based differentiable rendering. XPR introduces a high-level programming interface that separates method-specific logic from the shared rendering pipeline, allowing users to implement new methods in a few lines of code. Its pipeline decomposes rendering into modular, statically shaped parallel operations that can be lowered by a cross-platform compiler to GPUs, TPUs, CPUs, and other ML accelerators. We demonstrate implementations of 3DGS, 3DGUT, and LinPrim with only a few hundred lines of Python code, each of which can be compiled to a range of hardware platforms with the XLA compiler. These results show that XPR enables fast experimentation and portable execution for emerging point-based differentiable rendering systems.

2606.11357 2026-06-11 cs.DC cs.AI cs.AR cs.PF New submission

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

Wesley Pang, Gregory Hyegang Jun, Feiyang Liu, Deming Chen

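AI summary: To address the difficulty of deploying quantized LLMs on edge NPUs, the paper proposes the TileFuse library. It fuses unpacking, dequantization, and GEMM/GEMV kernels and designs an interleaved pre-tiling layout and dataflow, giving native AWQ-format support on XDNA2 with up to 281% higher performance and 64.6% lower energy consumption.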

Comments
13 pages excluding reference, 11 figures
Abstract

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.
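
The idea of fusing unpacking, dequantization, and GEMV into one pass can be shown with a toy scalar sketch. It does not reproduce TileFuse's kernels, tiling, or the actual AWQ packing order. The nibble layout (low nibble first), the per-row group scales and zero points, and both function names are assumptions made for illustration, and Python floats stand in for 16-bit activations.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values pairwise into bytes, low nibble first (assumed layout)."""
    assert len(values) % 2 == 0 and all(0 <= v < 16 for v in values)
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def fused_w4_gemv(packed_rows, scales, zeros, x, group_size):
    """Compute y = W x without materializing W.

    Each weight is unpacked and dequantized as (q - zero) * scale right before
    it is multiplied, so the full-precision matrix never exists in memory.
    """
    y = []
    for r, packed in enumerate(packed_rows):
        acc = 0.0
        for byte_idx, byte in enumerate(packed):
            for half, q in enumerate((byte & 0x0F, byte >> 4)):   # unpack
                col = 2 * byte_idx + half
                g = col // group_size                              # quantization group
                w = (q - zeros[r][g]) * scales[r][g]               # dequantize
                acc += w * x[col]                                  # multiply-accumulate
        y.append(acc)
    return y
```

An unfused pipeline would first write the dequantized matrix to memory and then run a separate GEMV over it. The fused loop trades that extra memory traffic for a little arithmetic per weight, which is the general motivation the abstract describes.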

2606.11257 2026-06-11 cs.CL cs.LG cs.PF New submission

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

Zhiyuan Cheng, Longying Lai

Institutions (as listed): Qualcomm; Snapdragon X Elite; Dell XPS 13 laptop; Qualcomm Hexagon NPU; Adreno X1-85

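AI summary: This paper presents what the authors describe as the first end-to-end RAG pipeline on the Hexagon NPU of the Snapdragon X Elite. Compared against CPU and GPU baselines, the NPU delivers 9.1x higher embedding throughput, 12.3x lower system energy, and 4.0x lower query latency, with comparable answer quality.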

Comments
9 pages, 2 figures, 6 tables
Abstract

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

2606.01183 2026-06-11 cs.DC cs.DB cs.DS cs.PF Updated version

The World's Fastest Matching Engine Algorithm

Jake Yoon

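AI summary: Proposes two data-structure contributions, the Priority-Indicated Node (PIN) and neighbor-aware tree operations, that eliminate the latency of pointer chasing and root-to-leaf search in order books, achieving sub-microsecond tail latency and throughput of tens of millions of messages per second.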

Comments
20 pages, 5 figures, 7 tables
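AI abstract (translated from Chinese)

Every electronic exchange relies on an order book whose storage layer determines matching latency. The dominant implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to reach the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate these costs. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, each carrying a per-slot indicator of the entry's global priority. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. The second addresses a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, as in ordered event streams, incremental index maintenance, and electronic trading. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, and apply uniformly to red-black, AVL, and B/B+-tree variants. A single CPU core sustains 32 million order messages per second at sub-microsecond tail latency under micro-bursts of millions of messages per second, 5-11 times faster than the best open-source matching engines on the same hardware. Scaled out to a single 96-core instance, the engine sustains 640 million messages per second across 10,000 symbols.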

Abstract

A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
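
The neighbor-aware insertion idea can be illustrated on a plain, unbalanced binary search tree. The sketch below is not the paper's implementation and omits the single-path rebalancing step. It relies on a standard property of search trees: for two nodes adjacent in key order, either the predecessor has no right child or the successor has no left child, so one free link always exists. The names `Node`, `insert_between`, and `in_order` are hypothetical.

```python
class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def insert_between(root, pred, succ, key):
    """Attach `key` given its in-order neighbors (either may be None).

    Performs one reference write and no root-to-leaf search.
    Returns (root, new_node). Rebalancing is deliberately left out.
    """
    node = Node(key)
    if pred is None and succ is None:
        return node, node                 # empty tree: the new node is the root
    if pred is not None and pred.right is None:
        pred.right = node                 # new node becomes pred's successor
    else:
        # pred is absent or has a right subtree, so succ is the leftmost node
        # of that region and its left link must be free.
        assert succ is not None and succ.left is None
        succ.left = node
    return root, node


def in_order(node):
    """Return keys in sorted order (used here only to check the result)."""
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)
```

The caller must already hold the neighbor references, which the abstract says are available at no extra cost in electronic trading. In other settings, finding them could cost as much as the search this technique avoids.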

2603.09555 2026-06-11 cs.LG cs.AI cs.DC cs.PF Updated version

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

Cosmo Santoni, Anmol Thapar

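AI summary: Proposes a compiler-first inference approach built on the structure of state space duality (SSD). Standard JAX primitives give a single-source inference path with no custom kernels. Cached decode reaches up to 64% hardware bandwidth utilisation on TPU, the same code runs unmodified on GPU, and cached decode is 27-36x faster than full-prefix recomputation.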

Comments
21 pages, 6 figures. Code available at: this https URL
Abstract

High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives gives a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches approximately 140 TFLOPS, or 15% model FLOP utilisation (MFU), the roofline ceiling for this regime, and cached decode reaches up to 64% hardware bandwidth utilisation (HBU). At a 4096-token context, cached decode is 27x-36x faster than full-prefix recomputation across five Mamba-2 checkpoints from 130M to 2.7B parameters. The same source runs unmodified on NVIDIA L40S, where cached decode remains sequence-length independent across all model scales. WikiText-103 validation perplexity matches the Triton reference mamba_ssm v2.2.2 within ±0.0005 points, and hidden states agree to float32 rounding tolerance. Code is available at this https URL.
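
Why a recurrent state makes decoding cost independent of context length can be seen in a toy single-head recurrence. The sketch below is pure Python rather than the paper's JAX code. It assumes a scalar decay `a` per step, input and output projections `b` and `c`, and the update h_t = a_t h_{t-1} + b_t x_t with y_t = c_t · h_t, and all names are illustrative.

```python
def step(h, a, b, c, x):
    """Advance the fixed-size state by one token and emit one output."""
    h_new = [a * hn + bn * x for hn, bn in zip(h, b)]
    y = sum(cn * hn for cn, hn in zip(c, h_new))
    return h_new, y


def recompute_last(params, xs, state_dim):
    """Full-prefix recomputation: replay every token to get the latest output."""
    h = [0.0] * state_dim
    y = 0.0
    for (a, b, c), x in zip(params, xs):
        h, y = step(h, a, b, c, x)
    return y


def cached_decode(params, xs, state_dim):
    """Cached decode: keep h between tokens, so each token costs one step."""
    h = [0.0] * state_dim
    ys = []
    for (a, b, c), x in zip(params, xs):
        h, y = step(h, a, b, c, x)      # work per token is O(state_dim)
        ys.append(y)
    return ys
```

Calling `recompute_last` after every new token repeats work that grows with the prefix length, while `cached_decode` carries only the fixed-size list `h` forward. Both produce the same outputs, which is the equivalence a cache relies on.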

2506.21960 2026-06-11 cs.PF Updated version

Redundant Array Computation Elimination

Zixuan Wang, Liang Yuan, Xianmeng Jiang, Kun Li, Junmin Xiao, Yunquan Zhang

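AI summary: Proposes RACE, which uses a two-level hashing scheme to identify data reuse between array references and computation redundancy between expressions, enabling hierarchical redundancy detection, and supports expression reassociation to increase redundancy opportunities.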

Abstract

Redundancy elimination is a key optimization direction, and loop nests are the main optimization target in modern compilers. Previous work on redundancy elimination of array computations in loop nests either targets specific computation patterns or fails to recognize redundancies with complex structures. This paper proposes RACE (Redundant Array Computation Elimination), a hash-based technique that utilizes a novel two-level scheme to identify the data reuse between array references and the computation redundancies between expressions, enabling hierarchical redundancy detection beyond pattern-specific methods. It traverses the expression trees in loop nests to detect redundancies hierarchically in linear time and generates efficient code with optimized auxiliary arrays that store redundant computation results. Furthermore, RACE supports the expression reassociation with various aggressive strategies to improve the redundancy opportunities. Experimental results demonstrate the effectiveness of RACE.
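
A much-simplified illustration of hash-based redundancy detection across loop iterations follows. It is not RACE's two-level scheme. It assumes a one-dimensional loop over `i`, binary expression trees written as nested tuples, and array references of the form `a[i + offset]`. Each subexpression is shifted so that its smallest offset is zero, and the shifted form serves as a dictionary key. Subexpressions that share a key but differ in shift compute the same values at different iterations. All names are hypothetical.

```python
from collections import defaultdict


def min_offset(expr):
    """Smallest index offset among the array references in an expression tree."""
    if expr[0] == "ref":
        return expr[2]
    return min(min_offset(expr[1]), min_offset(expr[2]))


def shift_expr(expr, delta):
    """Return a copy of the tree with every reference offset moved by delta."""
    if expr[0] == "ref":
        return ("ref", expr[1], expr[2] + delta)
    return (expr[0], shift_expr(expr[1], delta), shift_expr(expr[2], delta))


def find_redundancies(expr):
    """Map each normalized subexpression to the shifts at which it occurs.

    Only keys seen more than once are returned. This sketch recomputes
    offsets per node, so it is not linear time, and it does not handle
    commutativity or reassociation.
    """
    groups = defaultdict(list)

    def visit(e):
        if e[0] == "ref":
            return
        visit(e[1])
        visit(e[2])
        shift = min_offset(e)
        groups[shift_expr(e, -shift)].append(shift)

    visit(expr)
    return {key: shifts for key, shifts in groups.items() if len(shifts) > 1}
```

For the stencil `(a[i-1] + a[i]) + (a[i] + a[i+1])`, both inner sums normalize to `a[0] + a[1]` with shifts -1 and 0. A compiler could therefore compute `aux[i] = a[i] + a[i+1]` once and rewrite the body as `aux[i-1] + aux[i]`.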

2505.17623 2026-06-11 cs.CR cs.AI cs.ET cs.LG cs.PF Updated version

Range-Arithmetic: Verifiable Deep Learning Inference on an Untrusted Party

Ali Rahimi, Babak H. Khalaj, Mohammad Ali Maddah-Ali

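AI summary: Proposes the Range-Arithmetic framework, which converts non-arithmetic operations into verifiable arithmetic steps to enable efficient verification of deep neural network inference, reducing computation and communication overhead.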

Abstract

Verifiable computing (VC) has gained prominence in decentralized machine learning systems, where resource-intensive tasks like deep neural network (DNN) inference are offloaded to external participants due to blockchain limitations. This creates a need to verify the correctness of outsourced computations without re-execution. We propose Range-Arithmetic, a novel framework for efficient and verifiable DNN inference that transforms non-arithmetic operations, such as rounding after fixed-point matrix multiplication and ReLU, into arithmetic steps verifiable using sum-check protocols and concatenated range proofs. Our approach avoids the complexity of Boolean encoding, high-degree polynomials, and large lookup tables while remaining compatible with finite-field-based proof systems. Experimental results show that our method not only matches the performance of existing approaches, but also reduces the computational cost of verifying the results, the computational effort required from the untrusted party performing the DNN inference, and the communication overhead between the two sides.
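
The general idea of turning rounding into an arithmetic relation plus a range condition can be sketched as follows. This is a generic illustration over plain integers, not the paper's protocol: no sum-check or range proof is implemented, the range condition is checked in the clear, and the function names are hypothetical. The fact used is that q equals x divided by 2^k rounded down exactly when x = q * 2^k + r with 0 <= r < 2^k.

```python
def prover_rescale(x, k):
    """Untrusted party: rescale a fixed-point product by 2^k, rounding down.

    Returns the claimed quotient q and the remainder r as a witness.
    """
    q = x >> k                  # floor division by 2^k, also for negative x
    r = x - (q << k)
    return q, r


def verifier_check(x, q, r, k):
    """Accept only if the arithmetic identity and the range condition hold.

    In a real proof system the identity would be checked over a finite field
    and the condition on r by a range proof; both are done directly here.
    """
    return x == q * (1 << k) + r and 0 <= r < (1 << k)
```

Because the remainder is confined to one interval of width 2^k, exactly one quotient passes the check for a given x, so a wrong rounding result cannot be paired with any valid witness. Round-to-nearest can be handled the same way by adding 2^(k-1) to x before rescaling.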