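arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.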

2606.12259 2026-06-11 cs.CR cs.AR New submission

Partitioned Tags, Shared Data: Reconciling Strict Cache Isolation with Write-Shared Coherence

Kartik Ramkrishnan, Stephen McCamant, Antonia Zhai, Pen Chung Yew

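AI summary: Proposes SCP, which partitions only the tags, shares a single data pool sized to avoid capacity-driven eviction, and adds timing obfuscation and a write-leakage threshold. It achieves write-shared coherence under strict isolation and defends against Prime+Probe and Flush+Reload attacks.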

Abstract

Cache partitioning is among the strongest structural defenses against eviction-based cache side channels, yet a decade-old design issue has blocked its widespread deployment in secure shared-OS settings. The issue is that write-shared coherence collapses under strict partitioning. We present SCP (Secure and Coherent Partitioning), which combines strict eviction isolation with write-shared coherence by partitioning only the tags, sharing a single data pool, and sizing the data pool so capacity-driven cross-partition eviction cannot occur. Timing obfuscation extends protections to the inter-partition lookup path. Coherence-based leakage on shared-writeable lines is mitigated by routing those writes through to the LLC once a leakage threshold is crossed, which makes attacker write probe latency independent of victim activity. Using gem5 for implementation, SCP mitigates Prime+Probe and Flush+Reload, which are the basis for more sophisticated cache attacks. We also demonstrate that a shared-writeable-line attack is mitigated. All these attacks yield results no better than random guessing. SCP's hardware cost is a modest +2.8% LLC SRAM. Performance matches DAWG within 0.3% IPC on the SPEC CPU2017 benchmarks that we evaluated. Sharing-intensive microbenchmarks demonstrate a tunable security-performance tradeoff based on a system-specified leakage threshold.

2606.12235 2026-06-11 cs.AR New submission

BenDi: An Energy-Efficient Quasi-Stochastic Systolic Architecture for Edge Bioelectronics

Bochen Ye, Yihan Pan, Shady Agwa, Themis Prodromakis

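AI summary: Proposes the BenDi architecture, which uses low supply voltage, quasi-stochastic multiplication, a systolic dataflow, and hardware-aware quantization to run CNNs efficiently on edge devices. Compared with binary weight-stationary systolic architectures, it is 3.35x smaller in area and 5x more energy-efficient, at an accuracy loss of only 1% to 3.3%.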

Comments: Accepted for presentation as a short paper at International Conference on Application-specific Systems, Architectures and Processors (ASAP 2026)

Abstract

Continuous long-term monitoring and diagnosis of biomedical signals, such as electrocardiograms (ECGs), can help mitigate an increasing threat to public health. Artificial Intelligence (AI) models, such as Convolutional Neural Networks (CNNs), provide accurate monitoring and classification for relevant diseases; however, they require more computational resources than conventional AI hardware can typically afford, especially in resource-constrained edge environments. In this work, we present BenDi, an energy-efficient quasi-stochastic systolic architecture for bioelectronic systems on the edge. BenDi leverages multiple levels of energy and power optimization, ranging from circuits to software quantization, including low supply voltage, the Bent-Pyramid data format for quasi-stochastic multiplication, the DiP systolic dataflow, and hardware-aware quantization, to handle CNNs with high accuracy on the edge within limited hardware budgets. The hardware implementation results, using a commercial 22nm technology, show that the BenDi architecture, at 0.5 V and 100 MHz, offers 3.35x smaller area and 5x higher energy efficiency compared to state-of-the-art binary-based weight-stationary systolic architectures. For bioelectronic edge systems, BenDi achieves an order-of-magnitude improvement in energy efficiency and another order-of-magnitude improvement in area efficiency compared to its counterparts. This significant improvement comes at the cost of 1% to 3.3% accuracy loss on the MIT-BIH and Apnea-ECG benchmarks, respectively, compared with conventional computing using the 32-bit floating-point format.

2606.11718 2026-06-11 cs.AR New submission

Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs

Euijun Chung, Jae Hyung Ju, Hyesoon Kim

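AI summary: Addresses the incompatibility between locality-aware data placement and fixed page-granularity interleaving in the non-uniform memory systems of multi-chiplet GPUs. Proposes Chiplet-Contiguous Layout, a global memory layout that stores chiplet-local data contiguously, making locality-aware placement compatible with page-granularity placement without operating-system or hardware changes and substantially reducing remote HBM traffic.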

Abstract

Multi-chiplet GPUs scale compute throughput and high-bandwidth memory (HBM) capacity, but their non-uniform memory system makes locality between chiplets and their data critical to the GPU's performance and energy efficiency. Locality-aware scheduling and data placement identify which data should reside near each chiplet. However, in general matrix multiplication (GEMM), locality-aware data placement often becomes incompatible with a fixed page-granularity data interleaving, since the optimal granularity for mapping data across chiplets varies widely across workloads. We propose Chiplet-Contiguous Layout, a global memory layout that stores chiplet-local data contiguously. Chiplet-Contiguous Layout enables locality-aware placement compatible with page-granularity placement across diverse large language model (LLM) GEMM shapes, without changes to the operating system or hardware. On representative LLM inference and training GEMMs from Qwen 3 30B and Llama 3.1 70B, Chiplet-Contiguous Layout on average reduces remote HBM traffic by 24.7x on Qwen and 19.2x on Llama over 4KB interleaving, and by 4.1x and 2.1x over coarse locality-aware placement.
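
The incompatibility described in the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation: the column-slice ownership rule, the round-robin page interleaving, and both function names are assumptions made for illustration. It counts how many elements of a row-major matrix land on a chiplet other than the one that uses them, first with 4KB round-robin interleaving and then with each chiplet's slice stored contiguously and placed page by page.

```python
PAGE_BYTES = 4096


def remote_fraction_interleaved(k_rows, n_cols, elem_bytes, chiplets):
    """Row-major matrix, pages dealt round-robin across chiplets.

    Assumption: chiplet c consumes column slice c of the matrix.
    """
    slice_cols = n_cols // chiplets
    remote = 0
    for k in range(k_rows):
        for n in range(n_cols):
            page = (k * n_cols + n) * elem_bytes // PAGE_BYTES
            home = page % chiplets      # chiplet whose HBM holds this page
            owner = n // slice_cols     # chiplet that reads this element
            remote += (home != owner)
    return remote / (k_rows * n_cols)


def remote_fraction_contiguous(k_rows, n_cols, elem_bytes, chiplets):
    """Each chiplet's column slice is stored back to back in global memory.

    Assumption: the pages covering slice c are placed on chiplet c.
    """
    slice_cols = n_cols // chiplets
    slice_bytes = k_rows * slice_cols * elem_bytes
    assert slice_bytes % PAGE_BYTES == 0, "toy model assumes page-aligned slices"
    remote = 0
    for owner in range(chiplets):
        for k in range(k_rows):
            for n_local in range(slice_cols):
                offset = ((owner * k_rows + k) * slice_cols + n_local) * elem_bytes
                home = offset // slice_bytes
                remote += (home != owner)
    return remote / (k_rows * n_cols)
```

In this toy setting a 4 x 4096 matrix of 2-byte elements on 4 chiplets is 75% remote under interleaving and fully local under the contiguous layout. The figures say nothing about the traffic reductions reported in the abstract, which come from the authors' own evaluation.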

2606.11716 2026-06-11 cs.AR New submission

A Fast Locality Simulator for GEMM Design-Space Exploration on Multi-Chiplet GPUs

Euijun Chung, Hyesoon Kim

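AI summary: Proposes a fast, functional, tile-level locality simulator for evaluating how GEMM kernel choices affect inter-chiplet traffic on multi-chiplet GPUs. Finds that remote traffic varies by up to 90x and that a 2D block-swizzle CTA traversal reduces remote traffic by up to 5.1x.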

Abstract

Multi-chiplet GPUs split memory into local and remote HBM regions across a silicon interposer, and reducing the remote HBM traffic is crucial for the performance and energy efficiency of multi-chiplet GPUs. For general matrix multiplication (GEMM), the dominant operator in large language models (LLMs), the resulting inter-chiplet traffic depends strongly on kernel choices such as operand layout, CTA traversal order, and data placement, and the optimal strategy to minimize remote accesses is nontrivial. We present a fast, functional, tile-level locality simulator that models CTA scheduling, per-chiplet L2 caches, and local/remote HBM accesses to evaluate a full-size LLM GEMM configuration. Across representative LLM GEMMs, the simulator shows that remote traffic varies by up to 90x across the design space for the same GEMM dimensions. Moreover, using the simulator as feedback, an agentic AI discovers that a 2D block-swizzle CTA traversal reduces remote traffic over the best 1D traversal by up to 5.1x under round-robin placement, identifying CTA traversal order as a first-order, GEMM-dependent design knob for inter-chiplet traffic.
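
To make the contrast between 1D and 2D block-swizzle CTA traversal concrete, here is a minimal sketch of one common grouped ordering, similar in spirit to the grouped program ordering used in GEMM tutorials. It is not necessarily the traversal found in the paper: the function names, the `group_m` parameter, and the simple "operand tiles touched" metric are all illustrative assumptions.

```python
def linear_order(num_m, num_n):
    """Row-major 1D traversal: CTA id -> (tile_m, tile_n)."""
    return [(cta // num_n, cta % num_n) for cta in range(num_m * num_n)]


def block_swizzle_order(num_m, num_n, group_m):
    """Grouped traversal: cover group_m tile-rows at a time, column by column."""
    order = []
    ctas_per_group = group_m * num_n
    for cta in range(num_m * num_n):
        group_id = cta // ctas_per_group
        first_m = group_id * group_m
        rows_in_group = min(num_m - first_m, group_m)  # last group may be short
        within = cta % ctas_per_group
        tile_m = first_m + within % rows_in_group
        tile_n = within // rows_in_group
        order.append((tile_m, tile_n))
    return order


def operand_tiles_touched(order, window):
    """Distinct A row-tiles plus B column-tiles needed by the first `window` CTAs."""
    first = order[:window]
    return len({m for m, _ in first}) + len({n for _, n in first})
```

On a 4 x 4 tile grid, the first four CTAs of the linear order need 5 distinct operand tiles (one row of A, four columns of B), while the grouped order with `group_m=2` needs 4 (two of each). A smaller working set per scheduling window is the kind of effect a locality simulator would then translate into cache hits and remote HBM accesses.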

2606.11357 2026-06-11 cs.DC cs.AI cs.AR cs.PF New submission

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

Wesley Pang, Gregory Hyegang Jun, Feiyang Liu, Deming Chen

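AI summary: To ease deployment of quantized LLMs on edge NPUs, proposes the TileFuse library, which fuses unpacking, dequantization, and GEMM/GEMV kernels and designs an interleaved pre-tiling layout and dataflow. It natively supports AWQ formats on XDNA2, with performance gains of up to 281% and energy reductions of more than 64.6%.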

Comments: 13 pages excluding references, 11 figures

Abstract

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.
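
The idea of fusing unpacking, dequantization, and GEMV into one pass can be sketched in plain Python. This is a conceptual illustration only, not TileFuse's kernel or the actual AWQ packing: the low-nibble-first byte layout, the per-row group parameters, and the names `pack_nibbles` and `fused_w4_gemv` are assumptions chosen for readability.

```python
def pack_nibbles(q_values):
    """Pack 4-bit integers (0..15) two per byte, low nibble first (assumed layout)."""
    assert len(q_values) % 2 == 0
    return bytes(q_values[i] | (q_values[i + 1] << 4)
                 for i in range(0, len(q_values), 2))


def fused_w4_gemv(packed_rows, scales, zeros, x, group_size):
    """Compute y = W x while unpacking and dequantizing inside the dot-product loop.

    packed_rows[r] holds row r of W as packed nibbles. scales[r][g] and
    zeros[r][g] are per-group parameters with w = (q - zero) * scale.
    No dequantized copy of W is ever materialized.
    """
    y = []
    for r, row in enumerate(packed_rows):
        acc = 0.0
        for byte_idx, byte in enumerate(row):
            for half, q in enumerate((byte & 0x0F, byte >> 4)):
                col = 2 * byte_idx + half
                g = col // group_size
                acc += (q - zeros[r][g]) * scales[r][g] * x[col]
        y.append(acc)
    return y
```

The point of the fusion is that the 4-bit weights stay packed in memory and are expanded only at the moment they are multiplied. A separate dequantization pass would first write out a full-precision weight matrix and then read it back.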

2606.11247 2026-06-11 cs.LG cs.AI cs.AR New submission

Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction

Yaser Mike Banad, Sarah Sharif

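AI summary: Argues that generative models for semiconductor manufacturing must satisfy hard physical constraints by construction (for example, through physics-informed diffusion and PDE-constrained variational models) rather than through post-hoc filtering, and identifies four integration patterns and future research directions.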

Abstract

Generative models are increasingly used to propose designs, data, and control actions for physical systems, yet many such systems are governed by hard physical constraints rather than by perceptual plausibility. Semiconductor manufacturing provides a demanding test case: generated masks, layouts, synthetic defect data, and process recipes must obey lithography, transport, reaction, and device-physics constraints, because physically invalid samples are not merely low quality but unusable. This Perspective argues that semiconductor manufacturing exposes a broader computational-science challenge, namely that generative AI for constrained physical domains must be physics-informed by construction, not corrected only through post-hoc filtering. We survey the emerging architectural toolkit, including physics-informed diffusion, PDE-constrained variational models, neural-operator priors, and conservation-law-respecting generative networks, and show how it connects to differentiable lithography, TCAD, process simulation, and autonomous experimentation. We identify four integration patterns between generative models and physics-based simulators, and we propose a research agenda centered on physics-fidelity benchmarks, differentiable simulator infrastructure, and multimodal foundation models for physical design and manufacturing. The central claim is analytical rather than rhetorical: where physical validity is the binding criterion of success, architectures that enforce it by construction should be expected to outperform those that filter for it after the fact, and the fab is the setting where this distinction is sharpest.

2606.11244 2026-06-11 cs.AR cs.AI New submission

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

Hongyuan Liu, Yawei Li, Zhiqiang Que, Qinli Yang, Junming Shao, Guosheng Hu

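AI summary: To counter the quality loss caused by low-bit quantization of LLMs, proposes SPEAR, which selectively corrects high-error layers with input-aware gated Error Compensators (ECs), combined with adaptive kernel-fusion dispatch and an SLO-aware scheduler. It recovers 56-75% of the perplexity gap between W4 and FP16 at under 1% memory overhead.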

Abstract

Efficient large language model (LLM) serving is increasingly constrained by deployment cost. Quantization is a key technique for reducing serving cost, yet even state-of-the-art 4-bit quantizers exhibit a noticeable quality gap from FP16, particularly for smaller models where low-bit serving is most beneficial. We identify a fundamental cause of this gap: quantization error is highly input-dependent and varies substantially across tokens, while existing post-quantization compensation methods are static and apply identical corrections to all inputs. As a result, easy tokens are over-corrected while hard tokens remain under-corrected. We present SPEAR, a system for post-quantization error-adaptive recovery that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers identified through a CKA-guided entropy-aware diagnostic. This focuses a small parameter budget where it is most effective. Efficient deployment of ECs presents several systems challenges, including additional computation, tensor-parallel synchronization caused by input-dependent gating, and latency instability across configurations. SPEAR addresses these issues through adaptive kernel-fusion dispatch, combining an epilogue-integrated peer-reduction kernel with P2P dual-write to fuse the post-EC computation into low-bit GEMMs, and an SLO-constrained EC-aware scheduler for predictable serving performance. Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and maintaining latency comparable to a widely used 4-bit serving deployment.

2606.06527 2026-06-11 cs.AR cs.LG Version update

Characterizing the Impact of NVFP4 Quantization for Low-Power Edge AI Deployment

An Ablation Study of Block Size, Weight Precision, and Scale Precision in NVFP4 Inference for Low-Power, Edge-Efficient Neural Networks

Ovishake Sen, Venkata Nithin Kamineni, Daniel Lobo, Swarup Bhunia, Rickard Ewetz, Baibhab Chatterjee

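AI summary: An ablation study of the NVFP4 LUT inference framework, which combines 4-bit activations, two-level scaling, and voltage-scaled storage, achieving up to 26.85x lower energy and up to 2.21x smaller area on edge-efficient models.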

Comments: 7 Pages

AI abstract (translated from Chinese)

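Energy-efficient edge inference requires reducing arithmetic cost, memory traffic, and hardware overhead. This paper presents an ablation study of NVFP4 LUT-based inference for edge-efficient neural networks. The proposed NVLUT framework combines 4-bit NVFP4 activations, two-level scaling, LUT-based mantissa computation, voltage-scaled storage, and selective ECC protection. Multiplication is decomposed into sign, exponent, and mantissa paths: the sign uses XOR logic, the exponent uses integer addition, and mantissa multiplication is replaced by compact LUT accesses. NVFP4 activations use FP4 data with an FP8 block scale and an FP32 tensor scale. Across six edge-efficient models, the block-size ablation shows that B = 16 provides a practical accuracy/storage trade-off, requiring only 4.5078 bits per input for N = 4096. The weight-precision ablation shows that FP8 and FP16 weights provide only modest gains over FP4 weights under the same NVFP4 activation path. Compared with plain unscaled FP4, NVFP4 without retraining recovers much of the accuracy by restoring activation dynamic range, while NVFP4 with retraining achieves the best accuracy across the models. Hardware analysis shows that NVLUT achieves up to 26.85x lower energy than a conventional LUT under ECC plus voltage scaling, and up to 22.85x under mixed-voltage operation. Area is reduced by up to 2.21x and 1.52x, respectively. These results indicate that NVFP4 two-level scaling combined with selective reliability protection enables robust, low-energy edge inference.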

Abstract

Energy-efficient neural-network inference at the edge requires reducing arithmetic cost, memory traffic, computation energy, and storage overhead while maintaining acceptable accuracy. This paper presents an ablation-focused study of NVFP4 quantization for edge-efficient neural networks, with emphasis on the relationship between activation precision, weight precision, block-size scaling, retraining, and model accuracy. NVFP4 activations are represented using 4-bit FP4 data, an FP8 block scale, and an FP32 tensor scale, enabling ultra-low precision inference while preserving activation dynamic range. A block-size ablation over six edge-efficient models shows that block size B = 16 provides a practical accuracy/storage trade-off, requiring only 4.5078 bits per input for N = 4096. A weight precision ablation further shows that FP8 and FP16 weights provide only modest gains over FP4 weights under the same NVFP4 activation path, suggesting that activation quantization and scaling dominate much of the accuracy behavior. To isolate the benefit of the NVFP4 data type, this work compares conventional unscaled FP4 activation inference and NVFP4 activation inference with and without retraining. The results show that conventional FP4 inference collapses accuracy for most compact models, while NVFP4 without retraining already recovers substantial accuracy by restoring activation dynamic range through FP8 block scaling and FP32 tensor scaling. When combined with retraining, NVFP4 achieves the best accuracy across the evaluated models, demonstrating the effectiveness of scaling-aware FP4 (NVFP4) inference. These findings provide general design guidance for hardware-software co-design of low power edge inference across a broad range of accelerator platforms, including GPUs, Tensor Cores, FPGAs, domain-specific AI accelerators, near-memory computing systems, and emerging edge-computing architectures.
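
The quoted storage cost of 4.5078 bits per input can be reproduced with simple amortization arithmetic. The sketch below assumes one FP8 scale is shared by each block of B inputs and one FP32 scale is shared by all N inputs; this accounting matches the stated figure, but the helper name and its exact breakdown are an interpretation rather than something the abstract spells out.

```python
def nvfp4_bits_per_input(n, block_size, data_bits=4,
                         block_scale_bits=8, tensor_scale_bits=32):
    """Average storage per input: FP4 payload plus amortized scale metadata.

    Assumes one block scale per `block_size` inputs and one tensor scale
    per `n` inputs.
    """
    return data_bits + block_scale_bits / block_size + tensor_scale_bits / n
```

For N = 4096 and B = 16 this gives 4 + 8/16 + 32/4096 = 4.5078125 bits, which rounds to the 4.5078 quoted above. Under the same assumptions, halving the block size to 8 would raise the block-scale term from 0.5 to 1 bit per input.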

2601.23278 2026-06-11 cs.LG cs.AR cs.CL Version update

FOCUS: DLLMs Know How to Tame Their Compute Bound

Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini

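AI summary: Observing that most computation in diffusion LLM decoding is wasted on non-decodable tokens, proposes the FOCUS inference system, which dynamically focuses computation on decodable tokens and evicts non-decodable ones, increasing the effective batch size and improving throughput by up to 3.52x.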

Comments: ICML 2026 camera-ready version

Abstract

Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52x throughput improvement over the production-grade engine LMDeploy in large-batch settings, while preserving or improving generation quality across multiple benchmarks.

2505.06470 2026-06-11 cs.CR cs.AR

"vcd2df" -- Leveraging Data Science Insights for Hardware Security Research

Calvin Deutschbein, Jimmy Ostler, Hriday Raj

2503.19180 2026-06-11 cs.AR

"Test, Build, Deploy" -- A CI/CD Framework for Open-Source Hardware Designs

Calvin Deutschbein, Aristotle Stassinopoulos