arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20414 2026-06-19 cs.AR new submission

ExSpike: A General Full-Event Neuromorphic Architecture for Exploiting Irregular Sparsity with Event Compression

Yuehai Chen, Farhad Merchant

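AI summary Proposes ExSpike, a general full-event neuromorphic architecture that achieves pure event-driven execution through dataflow optimizations and introduces adjacent-position event compression to reduce redundant accumulations, delivering energy-efficient SNN acceleration on an FPGA.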

Comments Accepted by the 36th International Conference on Field-Programmable Logic and Applications (FPL 2026); 9 pages, 9 figures

English abstract

Spiking neural networks (SNNs) promise energy-efficient computing due to their sparse spatio-temporal activity. However, effectively translating such irregular sparsity into practical performance and energy gains remains challenging, as full-event computing architectures are still underexplored. This paper proposes ExSpike, a general full-event neuromorphic architecture that fully exploits irregular sparsity in SNNs. To realize pure event-driven execution, we first propose a set of dataflow optimizations to ensure that the inputs to each SNN layer remain spike-based, thereby enabling full-event execution throughout the network. We then design a hardware-efficient full-event architecture, named ExSpike, which supports the optimized pure event-driven dataflow and an additional Attention Core for spike-driven self-attention. To further improve computing efficiency, we introduce adjacent-position event compression to reduce redundant accumulations across spatially adjacent spike sequences. ExSpike is implemented on an AMD Xilinx Virtex-7 FPGA and evaluated on both classification and segmentation workloads. Experimental results show that ExSpike achieves high normalized energy efficiency across diverse SNN models while maintaining competitive accuracy, delivering up to 479.15 GOPS, 281.85 GOPS/W, and 0.80 GOPS/W/PE. In particular, ExSpike achieves up to 10x higher PE-normalized energy efficiency than the SOTA FPGA-based SNN accelerator (FireFly-T). The code for ExSpike is available at https://github.com/xiaoyuehai/ExSpike.
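As a rough illustration of why event-driven execution benefits from sparsity, the following Python sketch compares a dense accumulation over a binary spike vector with an event-driven one that touches only the weight rows of inputs that actually fired. The function names and the tiny weight matrix are hypothetical, and the sketch does not attempt to reproduce ExSpike's adjacent-position event compression or its hardware dataflow.

```python
def dense_layer(spikes, weights):
    """Accumulate over every input, even those that did not spike."""
    n_out = len(weights[0])
    out = [0] * n_out
    for i, spike in enumerate(spikes):
        for j in range(n_out):
            out[j] += spike * weights[i][j]
    return out


def spikes_to_events(spikes):
    """Convert a binary spike vector into a list of firing input indices."""
    return [i for i, spike in enumerate(spikes) if spike]


def event_driven_layer(events, weights):
    """Accumulate only the weight rows addressed by spike events."""
    n_out = len(weights[0])
    out = [0] * n_out
    for i in events:
        row = weights[i]
        for j in range(n_out):
            out[j] += row[j]
    return out


# Hypothetical 5-input, 2-output layer with two active inputs.
example_spikes = [0, 1, 0, 0, 1]
example_weights = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
example_events = spikes_to_events(example_spikes)
print(dense_layer(example_spikes, example_weights))
print(event_driven_layer(example_events, example_weights))
```

Both functions return the same accumulations; the event-driven version performs work proportional to the number of spikes rather than the number of inputs.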

2606.19913 2026-06-19 cs.AR new submission

Design and Evaluation of Energy-Efficient Whisper Dot-Product Kernel Offloading on a CGLA Architecture

Takuto Ando, Yu Eto, Ayumu Takeuchi, Yasuhiko Nakashima

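AI summary Offloads the Whisper dot-product kernel to IMAX, a CGLA architecture; with kernel mapping, local-memory sizing, and burst-scheduling optimizations, it achieves a power-delay product (PDP) on Whisper tiny that is 2.35x lower than Jetson AGX Orin and 10.48x lower than RTX 4090, offering a programmable architecture for low-power local speech recognition.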

Comments This paper is accepted at Concurrency and Computation: Practice and Experience (Wiley)

English abstract

In this paper, we implement and evaluate Whisper dot-product kernel offloading on IMAX, a programmable Coarse-Grained Linear Array (CGLA) architecture. Whisper-tiny.en profiling on an ARM Cortex-A72 shows that dot-product operations account for 90.6% of FP16 execution time and 87.1% of Q8_0 execution time. To address this kernel bottleneck, we combine kernel mapping, local-memory sizing, and burst scheduling. The implementation uses inline FP16-to-FP32 conversion, 2-way SIMD FMA on a 64-bit datapath, column-wise multithreading, and mixed execution in which aligned vector segments run on IMAX and residual segments run concurrently on the host CPU. We evaluate the design with an FPGA prototype and a 28 nm ASIC projection at 840 MHz. For Whisper-tiny.en, 32 KB local memory and burst length 16 jointly minimize PDP and EDP. Under a TDP-based cross-platform comparison, the projected IMAX records a PDP of 11.58 J for Whisper-tiny.en Q8_0, 2.35x lower than Jetson AGX Orin (27.16 J) and 10.48x lower than RTX 4090 (121.38 J). The same design extends to Whisper-base.en and Whisper-small.en, where the PDP gap narrows as 32 KB local-memory coverage drops from 93.8% for tiny to about 66.5% for base and small. These results position IMAX as a programmable architecture for lower-PDP local ASR in the tiny-model regime.
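The mixed-execution idea in this abstract (aligned segments on the accelerator, residual segments on the host) can be sketched in Python as follows. This is only a structural illustration under stated assumptions: `SEGMENT = 16` is a hypothetical alignment unit, two threads stand in for IMAX and the host CPU (and do not run truly in parallel under the Python GIL), and accumulation happens in Python's 64-bit floats rather than FP32. The inline FP16 decoding uses the standard `struct` half-precision format code `e`.

```python
import struct
from concurrent.futures import ThreadPoolExecutor

SEGMENT = 16  # hypothetical alignment unit, not taken from the paper


def fp16_bytes_to_floats(raw):
    """Decode little-endian IEEE 754 half-precision values into Python floats."""
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}e", raw))


def partial_dot(a, b):
    """Plain multiply-accumulate over one segment."""
    return sum(x * y for x, y in zip(a, b))


def mixed_dot(raw_weights, activations, segment=SEGMENT):
    """Split a dot product into an aligned part and a residual part.

    The aligned part stands in for the accelerator path, the residual part
    for the host path; both are submitted before either result is awaited.
    """
    weights = fp16_bytes_to_floats(raw_weights)
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    aligned = len(weights) - len(weights) % segment
    with ThreadPoolExecutor(max_workers=2) as pool:
        accel = pool.submit(partial_dot, weights[:aligned], activations[:aligned])
        host = pool.submit(partial_dot, weights[aligned:], activations[aligned:])
        return accel.result() + host.result()


# 20 weights of 0.5 (exactly representable in FP16) against activations of 2.0.
example_raw = struct.pack("<20e", *([0.5] * 20))
print(mixed_dot(example_raw, [2.0] * 20))
```

With 20 elements and a segment size of 16, the first 16 products go down the "accelerator" path and the remaining 4 down the "host" path.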

2606.19533 2026-06-19 cs.AR cs.AI new submission

A Tool for the Synthesis of Adaptive Probabilistic Processors Based on the Ising Model

Jonathan Juracy Carneiro da Silva, Leonardo R. Gobatto, Jose Rodrigo Azambuja

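AI summary Proposes a tool that automatically synthesizes and simulates probabilistic architectures by mapping combinatorial optimization problems to the Ising model and adaptively selecting the update algorithm, improving convergence behavior and supporting hardware implementation.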

Comments ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026

English abstract

This work presents a tool for the synthesis and simulation of probabilistic architectures for solving combinatorial optimization problems by mapping them to the Ising model. The proposed approach automatically constructs the Ising Hamiltonian and determines the number of probabilistic elements (p-bits) based on problem characteristics such as size and topology. Furthermore, the tool introduces an adaptive strategy for selecting the most suitable update algorithm among Gibbs Sampling, Simulated Annealing (SA), Simulated Quantum Annealing (SQA), and cluster-based methods. Experimental results using benchmark problems demonstrate improved convergence behavior and flexibility compared to fixed approaches. The proposed framework enables systematic evaluation of probabilistic computing strategies and supports the development of future hardware implementations based on MTJs and p-bits.
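To make the mapping concrete, here is a minimal Python sketch of one textbook instance: Max-Cut expressed as an Ising Hamiltonian H = -sum_{i<j} J_ij s_i s_j with J_ij = -1 on every edge, solved with p-bit style updates under a linear annealing schedule. It is not the tool described above; the function names, the schedule, and the example graph are assumptions chosen for illustration, and the adaptive algorithm selection is not modeled.

```python
import math
import random


def maxcut_to_ising(n_nodes, edges):
    """Build couplings J so that minimizing H maximizes the cut."""
    couplings = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        couplings[i][j] = couplings[j][i] = -1.0
    return couplings


def energy(couplings, spins):
    """Ising energy H = -sum_{i<j} J_ij * s_i * s_j."""
    n = len(spins)
    return -sum(couplings[i][j] * spins[i] * spins[j]
                for i in range(n) for j in range(i + 1, n))


def cut_value(edges, spins):
    """Number of edges whose endpoints carry opposite spins."""
    return sum(1 for i, j in edges if spins[i] != spins[j])


def anneal_pbits(couplings, sweeps=200, beta_max=3.0, seed=0):
    """Sequential p-bit updates: s_i = sign(tanh(beta * I_i) + uniform(-1, 1))."""
    rng = random.Random(seed)
    n = len(couplings)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    best, best_energy = spins[:], energy(couplings, spins)
    for sweep in range(sweeps):
        beta = beta_max * (sweep + 1) / sweeps  # linear inverse-temperature ramp
        for i in range(n):
            local_field = sum(couplings[i][j] * spins[j] for j in range(n))
            noisy = math.tanh(beta * local_field) + rng.uniform(-1.0, 1.0)
            spins[i] = 1 if noisy > 0 else -1
        current = energy(couplings, spins)
        if current < best_energy:
            best, best_energy = spins[:], current
    return best, best_energy


# Hypothetical benchmark: a 4-node ring, whose maximum cut is 4.
ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
ring_couplings = maxcut_to_ising(4, ring_edges)
solution, solution_energy = anneal_pbits(ring_couplings)
print(solution, solution_energy, cut_value(ring_edges, solution))
```

The update rule sets each spin to +1 with probability (1 + tanh(beta * I_i)) / 2, which is the Gibbs conditional for this Hamiltonian; raising beta over time turns plain Gibbs sampling into simulated annealing.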

2606.19526 2026-06-19 cs.AR new submission

SPINE: A Fault Injection Profiler for Quantized Neural Networks under Accumulated Faults

Nathan Guimarães, Ian Kersz, Leonardo R. Gobatto, Fabio Benevenuti, Michael G. Jordan, Antonio Carlos S. Beck, Fernanda L. Kastensmidt, Jose Rodrigo Azambuja

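AI summary Proposes SPINE, a GDB-driven profiling framework that injects cumulative weight bit-flips into the target binary of edge CPUs and generates per-layer fault profiles without retraining or code modification, guiding selective hardening strategies.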

Comments ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026

English abstract

Deploying deep neural networks at the edge demands efficient inference under strict cost and power constraints. Quantized neural networks address these demands by replacing floating-point parameters with low-precision integers, yet their weights remain continuously exposed to radiation-induced bit-flips during inference. Fault Injection can be used to simulate those environments, but existing studies fail to characterize how accumulated upsets translate into mispredictions under realistic memory layouts. This paper presents a GDB-driven profiling framework that injects cumulative weight bit-flips directly onto the target binary of edge CPUs, generating per-layer fault profiles without requiring model retraining or code modification. Evaluated across multiple topologies, quantization efforts, and memory layouts, the results indicate how selective hardening strategies should be applied to effectively protect neural networks.
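The fault model behind this abstract, bit-flips that accumulate in stored int8 weights, can be illustrated with a short Python sketch. It operates on a plain `bytearray` rather than on a target binary through GDB, so it shows only the accumulation idea, not SPINE's injection mechanism; every name and the uniform random choice of bit positions are assumptions.

```python
import random


def as_int8(byte_value):
    """Interpret an unsigned byte as a two's-complement int8."""
    return byte_value - 256 if byte_value >= 128 else byte_value


def flip_bit(weights, byte_index, bit):
    """Flip one bit of one stored weight in place."""
    weights[byte_index] ^= 1 << bit


def inject_cumulative(weights, n_faults, seed=0):
    """Apply n_faults random bit-flips without resetting between them.

    Returns the list of (byte_index, bit) locations so a run can be replayed.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(n_faults):
        location = (rng.randrange(len(weights)), rng.randrange(8))
        flip_bit(weights, *location)
        history.append(location)
    return history


# Hypothetical layer of eight int8 weights, all equal to 1.
layer = bytearray([1] * 8)
faults = inject_cumulative(layer, n_faults=3, seed=42)
print(faults, [as_int8(b) for b in layer])
```

A single flip of the sign bit turns a weight of 1 into -127, which is why the position of an upset within the byte matters as much as the number of upsets.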

2606.17128 2026-06-19 cs.AR new submission

Shift-Left High-Level Synthesis Verification via Knowledge-Augmented LLM Agent

Zhihan Xiao, Hongbing Lang, Zhe Zhao, Luke Ztz Hu, Songping Mai

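AI summary Proposes a knowledge-augmented, agent-driven shift-left verification framework that uses dual-tier consistency checking, symbolic execution, and an HLS verification knowledge graph to automatically verify functional consistency between C and HLS-C before synthesis, reaching 98.26% coverage.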

English abstract

High-Level Synthesis (HLS) relies on transforming original C specifications into synthesizable HLS-oriented C (HLS-C) implementations. Functional consistency verification between original C specifications and HLS-C implementations is a critical yet labor-intensive task in HLS design flows. While Large Language Models (LLMs) have recently shown promise in automated testbench generation, their stochastic nature often leads to insufficient coverage, inconsistent verification environments, and unreliable equivalence checking results. To address these limitations, we propose a knowledge-augmented, agent-driven shift-left verification framework for automated functional consistency checking between original C and HLS-C implementations before synthesis. The framework introduces a Dual-Tier Consistency Checking mechanism that jointly enforces static structural alignment and dynamic behavioral equivalence between paired testbenches, while integrating symbolic execution and coverage-driven refinement to improve verification completeness. Furthermore, we construct a heterogeneous HLS Verification Knowledge Graph to provide topology-aware reasoning priors for testbench generation, and design an autonomous verification agent to orchestrate iterative refinement and failure diagnosis across heterogeneous toolchains. Experimental results on 107 HLS benchmark pairs demonstrate that the proposed framework achieves 0.9826 average coverage and 0.9533 dynamic consistency, outperforming representative AST-based, retrieval-augmented, and iterative agent-based baselines. Repository: https://github.com/cz-5f/HLS-LeVeri.git
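The dynamic half of such a consistency check is, at its simplest, differential testing: run a reference and a hardware-oriented rewrite on the same inputs and compare outputs. The Python sketch below stands in for the C and HLS-C pair with two hypothetical functions, where the rewrite uses a 16-bit wrapping accumulator. The `dynamic_agreement` metric is an assumption for illustration and is not claimed to match the paper's definition of dynamic consistency.

```python
def golden_dot(a, b):
    """Reference implementation with unbounded integer accumulation."""
    return sum(x * y for x, y in zip(a, b))


def wrap_int16(value):
    """Wrap an integer into the signed 16-bit range, as fixed-width hardware would."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def hls_style_dot(a, b):
    """Hardware-oriented rewrite: explicit loop and a 16-bit accumulator."""
    acc = 0
    for i in range(len(a)):
        acc = wrap_int16(acc + a[i] * b[i])
    return acc


def dynamic_agreement(golden, candidate, test_vectors):
    """Fraction of test vectors on which both implementations agree."""
    matches = sum(1 for args in test_vectors if golden(*args) == candidate(*args))
    return matches / len(test_vectors)


# Hypothetical test vectors; the last one overflows the 16-bit accumulator.
vectors = [
    ([1, 2], [3, 4]),
    ([100, 100], [100, 100]),
    ([200, 200], [100, 100]),
]
print(dynamic_agreement(golden_dot, hls_style_dot, vectors))
```

The third vector exposes the divergence, which is the kind of input that coverage-driven refinement or symbolic execution is meant to find systematically rather than by luck.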

2606.19964 2026-06-19 cs.LG cs.AR cross-list

Low-Energy Reduced RISC-V Instruction Subset Processor for Tsetlin Machine Inference at the Edge

Chanda Gupta, Sanidhya Bhatia, Shaurya Priyadarshi, Himani Panwar, Rishad Shafik, Sudip Roy

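AI summary For Tsetlin Machine inference, proposes a domain-specific RISC-V microprocessor architecture that, through instruction reduction and datapath simplification, retains programmability while cutting execution time by up to 98% and energy consumption by an average of 29.7x.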

Comments 6 pages, 6 Figures, Accepted in IEEE ISVLSI Conference 2026

English abstract

Tsetlin Machine (TM) is a logic-based machine learning approach that relies on simple bitwise operations and finite-state automata, which makes it attractive for edge AI deployments. Recent work has focused on co-processor and accelerator designs based on Tsetlin Machines (TMs). Although these designs achieve high performance, they typically depend on tightly coupled interfaces, microcode-style programming, and external host processors, limiting flexibility and ease of programming. In this work, we present a domain-specific RISC-V microprocessor architecture and design flow tailored for TM inference. Leveraging the modular structure of RISC-V, we design a reduced instruction subset processor that retains programmability while targeting improved performance and lower energy consumption for TM workloads. Instruction profiling is employed to guide instruction reduction, followed by datapath and control path simplifications tailored to TM inference. Both the baseline RV32IM core and the proposed reduced core are evaluated across multiple datasets and compared with Binarized Neural Networks (BNNs), which serve as a hardware-efficient baseline due to their reliance on bitwise operations during inference. Results show that TM achieves comparable or higher accuracy (e.g., up to 88.18% on CIFAR-2 compared to 60.0% for BNN) while reducing execution time by up to 98% across multiple datasets. Furthermore, the proposed design achieves an average 29.7x reduction in energy consumption, demonstrating its effectiveness for programmable and efficient edge AI systems.
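To show why TM inference reduces to a handful of bitwise instructions, the Python sketch below evaluates conjunctive clauses with AND and compare operations and then takes a vote. The bit layout (features in the low bits, their negations above them), the hand-written XOR clauses, and the threshold at zero are assumptions for illustration; they are not taken from the paper, and training is not shown.

```python
def make_literals(x_bits, n_features):
    """Pack features and their negations into one literal word.

    Bits 0..n-1 hold the features, bits n..2n-1 hold the negated features.
    """
    mask = (1 << n_features) - 1
    return (x_bits & mask) | ((~x_bits & mask) << n_features)


def clause_fires(include_mask, literals):
    """A clause is the AND of its included literals.

    Note: an empty clause (mask 0) fires here; real TM inference usually
    treats empty clauses as inactive.
    """
    return (include_mask & literals) == include_mask


def tm_predict(x_bits, n_features, positive_clauses, negative_clauses):
    """Positive clauses vote for the class, negative clauses vote against it."""
    literals = make_literals(x_bits, n_features)
    votes = sum(clause_fires(c, literals) for c in positive_clauses)
    votes -= sum(clause_fires(c, literals) for c in negative_clauses)
    return 1 if votes > 0 else 0


# Hand-written clauses for 2-input XOR (hypothetical, not learned):
# positive: x0 & ~x1, ~x0 & x1; negative: x0 & x1, ~x0 & ~x1.
XOR_POSITIVE = [0b1001, 0b0110]
XOR_NEGATIVE = [0b0011, 0b1100]

for x0 in (0, 1):
    for x1 in (0, 1):
        print(x0, x1, tm_predict(x0 | (x1 << 1), 2, XOR_POSITIVE, XOR_NEGATIVE))
```

Each clause evaluation is one AND and one comparison, which is consistent with the abstract's point that a reduced instruction subset can cover the inference workload.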

2606.16106 2026-06-19 cs.PF cs.AR cs.DC cross-list

Edge-Inference Governors Need Memory-Clock State

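Listed Chinese title (translated): Beyond CPU-GPU Frequency: Effects of Memory Clock and Tail Behavior on Edge-Inference Latency Estimation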

Jaehoon Kang

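AI summary Measurements on an NVIDIA Jetson Orin Nano show that the memory clock is a missing dimension, aggregate miss rates hide burstiness, and frequency switching incurs latency; these effects fall outside the scope of conventional frequency-aware latency models.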

Comments 20 pages, 13 figures, 11 tables. Code and data: https://github.com/dankang21/jetson-latency-lab ; traces: https://doi.org/10.5281/zenodo.20745228

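AI-generated Chinese abstract (translated)

Frequency-aware latency estimators enable deadline-aware DVFS for edge ML inference by modeling latency over CPU and GPU frequencies. We conduct a measurement study on an NVIDIA Jetson Orin Nano and show three phenomena outside that modeling scope. (1) The memory clock is a missing dimension: across the realistic upper EMC range (2133->3199 MHz), it shifts median latency by +11% to +48% depending on the workload, and at the highest GPU clock we observe a repeatable non-monotonic case (-9%) for a synthetic L2-resident kernel. A GPU-frequency estimator profiled under one power profile and deployed under another therefore underestimates latency by up to 32%; tabulating the four lockable EMC points fixes most workloads, whereas a parametric 1/f_emc term does not. (2) Aggregate miss rates hide burstiness: at fixed clocks, 100k-cycle runs show knife-edge distributions whose deadline-miss cliff spans about 1 ms, but misses cluster far beyond independence: at a 0.1% aggregate miss rate, the probability that the next cycle also misses is as high as 74% (740x the independent baseline). Gaussian mu+3sigma bounds exceed the 0.1% miss target by 13x to 29x, while out-of-sample generalized Pareto bounds stay within ~2x across all eight configurations. (3) Frequency switching is not free: per-domain transition stalls are below 100 microseconds, but a new operating point takes 1/5/8 ms (CPU/GPU/EMC) to take effect, which is a large fraction of a typical inference cycle for a per-inference governor. We release the full measurement harness and discuss implications for next-generation frequency-aware estimators and governors.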
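The burstiness finding in item (2) compares the aggregate miss rate with the probability that a miss is immediately followed by another miss. A minimal Python sketch of that comparison follows; the function name and the synthetic trace are hypothetical, and the numbers it prints are not the paper's measurements.

```python
def miss_clustering(missed):
    """Return (aggregate miss rate, P(miss at t+1 | miss at t)).

    `missed` is a sequence of booleans, one per inference cycle.
    """
    n = len(missed)
    aggregate = sum(missed) / n
    followers = [missed[i + 1] for i in range(n - 1) if missed[i]]
    conditional = sum(followers) / len(followers) if followers else 0.0
    return aggregate, conditional


# Synthetic trace: 1000 cycles with a single burst of four consecutive misses.
trace = [False] * 500 + [True] * 4 + [False] * 496
aggregate_rate, conditional_rate = miss_clustering(trace)
print(aggregate_rate, conditional_rate, conditional_rate / aggregate_rate)
```

If misses were independent, the conditional probability would equal the aggregate rate; the ratio of the two is the clustering factor, which is how a 74% conditional probability at a 0.1% aggregate rate corresponds to 740x.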

English abstract

Frequency-aware latency estimators let deadline-aware DVFS governors schedule edge ML inference by modeling latency over CPU and GPU clocks, but they cannot observe the memory clock (EMC) -- a missing deployment state that decides whether a governor meets its deadlines and at what energy. We show this with a deployed, measured governor on a Jetson Orin NX: an EMC-blind GPU-only fit misses 25-28% of cycles at tight deadlines, whereas an EMC-aware refit holds misses to at most 1.3% under a 2% QoS miss budget by selecting a budget-feasible clock -- the energy-minimal one for periodic vision (calibrated module-rail power). The failure generalizes across three workload classes -- MobileNetV2, a ViT transformer, and Qwen2.5 LLM token decode (where saturated decode makes the aware policy lower-energy than the infeasible blind choice): a CPUxGPU estimator sends the deployed governor to an infeasible operating point, and only an EMC-aware model identifies the feasible side of the energy frontier. The effect is real and outside the CPUxGPU state abstraction: across two Orin SKUs sharing the same lockable EMC points it shifts median latency by up to ~45%, replicates on both, and survives a fused TensorRT fp16 engine. CPUxGPU models do not absorb it: per-lockable-point EMC tables are needed, a scoped inversion shows monotone assumptions can pick the wrong direction, and clustered misses make aggregate QoS rates understate deployment risk. We release the harness; this complements, not rebuts, the state of the art within its CPUxGPU scope.
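The difference between an EMC-blind and an EMC-aware governor can be sketched as a table lookup in Python: keep one measured entry per (GPU clock, EMC clock) pair and choose the lowest-energy entry whose tail latency meets the deadline. All names and every number in the table below are invented for illustration; they are not measurements from the paper, and the real harness is in the repository the listing links to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    gpu_mhz: int
    emc_mhz: int
    tail_latency_ms: float  # hypothetical per-point measurement
    energy_mj: float        # hypothetical per-inference energy


def pick_emc_aware(table, deadline_ms):
    """Energy-minimal point among those that meet the deadline, or None."""
    feasible = [p for p in table if p.tail_latency_ms <= deadline_ms]
    return min(feasible, key=lambda p: p.energy_mj) if feasible else None


def pick_emc_blind(table, deadline_ms, profiled_emc_mhz):
    """GPU clock chosen from a model profiled at a single EMC setting."""
    profiled = [p for p in table if p.emc_mhz == profiled_emc_mhz]
    choice = pick_emc_aware(profiled, deadline_ms)
    return choice.gpu_mhz if choice else None


def deployed_latency(table, gpu_mhz, emc_mhz):
    """Latency actually observed when running at the given clocks."""
    for point in table:
        if point.gpu_mhz == gpu_mhz and point.emc_mhz == emc_mhz:
            return point.tail_latency_ms
    raise KeyError((gpu_mhz, emc_mhz))


# Invented table: two GPU clocks by two EMC clocks.
TABLE = [
    OperatingPoint(600, 2133, 12.0, 30.0),
    OperatingPoint(600, 3199, 9.0, 36.0),
    OperatingPoint(900, 2133, 9.5, 44.0),
    OperatingPoint(900, 3199, 7.0, 50.0),
]

blind_gpu = pick_emc_blind(TABLE, deadline_ms=10.0, profiled_emc_mhz=3199)
print("blind choice:", blind_gpu, "MHz, latency at EMC 2133:",
      deployed_latency(TABLE, blind_gpu, 2133), "ms")
print("aware choice:", pick_emc_aware(TABLE, deadline_ms=10.0))
```

In this toy table the blind model picks a GPU clock that looked feasible at the profiled EMC setting but misses the 10 ms deadline once the memory clock is lower, whereas the per-point table keeps the choice on the feasible side.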

2606.05017 2026-06-19 cs.AR cs.MS replacement

GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity

Dmitrii Vasilev

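AI summary Presents GoldenFloat, a static-split floating-point family generated by a single closed-form rule, together with three concrete artefacts: a multi-width RTL generator, a Lucas-exact accumulator path, and an FPGA codec.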

Comments 20 pages, single-file LaTeX, ASCII source. v2: peer-anchor updates. Adds Sarnoff P3109 (arXiv:2606.04028), AMD MXFP4 silicon (arXiv:2605.09825), NVIDIA GB10 NVFP4 measurement, companion catalog (arXiv:2606.09686), MixFP4 (arXiv:2605.31035). FL-002 expanded: (c1) GF256 bias, (c2) count drift, (g) static-split vs micro-mixing. TTSKY26a regeneration timeline added. No mathematical claims revised

English abstract

We present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). A format-conformance oracle (Corona) ships in the same repository and is used as the blackbox check in our continuous-integration audit.

The rule and its scope. For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats GF4, GF8, GF12, GF16, GF20, GF24, GF32, GF64, GF256 (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit (2022 Posit Standard), takum (Hunhold 2024, 2025), OCP-MX (Rouhani et al. 2023), and the IEEE P3109 multi-width float draft, all of which are width-spanning families under a parameterised rule. We make no per-rung accuracy or superiority claim against any of them.

What is open. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path: a matched-substrate FPGA experiment and a matched-budget software ablation. A falsification ledger (FL-002) records the open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported in Section 5.5; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline.
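The split rule quoted in the abstract, e = round((N-1)/phi^2) and f = N-1-e, is easy to evaluate directly. The Python sketch below does so for the nine widths the abstract lists. The function name is hypothetical, the use of Python's built-in `round` is an assumption about the rounding mode (no exact ties arise because phi^2 is irrational), and treating the one remaining bit as a sign bit is an assumption the listing does not spell out.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def golden_split(total_bits):
    """Return (exponent_bits, fraction_bits) for a given total width N >= 4."""
    if total_bits < 4:
        raise ValueError("the rule is stated for N >= 4")
    exponent_bits = round((total_bits - 1) / PHI ** 2)
    fraction_bits = total_bits - 1 - exponent_bits
    return exponent_bits, fraction_bits


# Widths named in the abstract.
for width in (4, 8, 12, 16, 20, 24, 32, 64, 256):
    exponent, fraction = golden_split(width)
    print(f"GF{width}: exponent={exponent} fraction={fraction}")
```

For example, N = 16 gives (16-1)/phi^2 of about 5.73, so the rule assigns 6 exponent bits and 9 fraction bits.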

2305.04122 2026-06-19 cs.AR replacement

Performance Analysis of Digital Processing-in-Memory through a Case Study on Convolutional-Neural-Network Acceleration

Orian Leitersdorf, Ronny Ronen, Shahar Kvatinsky

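AI summary Through theoretical analysis and a quantitative comparison with GPUs, systematically evaluates the performance of digital PIM architectures for CNN acceleration, revealing their limitations and guiding future application acceleration.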

Comments Revised and expanded version with additional evaluation, CNN training results, and broader architectural analysis

English abstract

Processing-in-Memory (PIM) architectures are evolving to minimize data movement by leveraging the same physical devices for both memory and logic functionalities. While analog PIM harnesses crossbar arrays for efficient approximate matrix-vector multiplication, digital PIM architectures facilitate massively-parallel bitwise operations for more general workloads. Recent works have extended digital PIM towards the full-precision acceleration of convolutional neural networks (CNNs), yet a comprehensive comparison with GPUs remains a gap in the literature that may illuminate the limitations of digital PIM. This paper aims to fill this void by conducting a thorough examination of CNN acceleration through an updated quantitative comparison with GPUs. Our approach begins with a theoretical investigation into various PIM architectures, shedding light on their performance characteristics and constraints. Subsequently, through a series of benchmarks spanning memory-bound vectored arithmetic to CNN acceleration, we provide insights into digital PIM performance that may guide the acceleration of applications in the future.
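One way to picture "massively-parallel bitwise operations" is bit-serial arithmetic over bit-planes: each step applies the same logic gate to one bit position of every element at once. The Python sketch below emulates this with integers as bit-vectors, one bit per lane. It is a generic illustration of the idea, not a model of any specific PIM architecture from the paper; the names and the lane layout are assumptions.

```python
def to_bit_planes(values, width):
    """Transpose values into planes: plane[b] holds bit b of every element."""
    planes = []
    for bit in range(width):
        plane = 0
        for lane, value in enumerate(values):
            plane |= ((value >> bit) & 1) << lane
        planes.append(plane)
    return planes


def from_bit_planes(planes, lanes):
    """Inverse of to_bit_planes."""
    return [
        sum(((plane >> lane) & 1) << bit for bit, plane in enumerate(planes))
        for lane in range(lanes)
    ]


def bit_serial_add(a_planes, b_planes):
    """Ripple-carry addition, one bit position per step, all lanes in parallel."""
    carry = 0
    result = []
    for a, b in zip(a_planes, b_planes):
        result.append(a ^ b ^ carry)             # sum bit of a full adder
        carry = (a & b) | (carry & (a ^ b))      # carry-out of a full adder
    result.append(carry)                         # final carry becomes the top bit
    return result


# Three lanes of 4-bit operands (hypothetical data).
left = to_bit_planes([3, 5, 7], width=4)
right = to_bit_planes([1, 2, 9], width=4)
print(from_bit_planes(bit_serial_add(left, right), lanes=3))
```

The number of steps depends on the operand width rather than on the number of elements, which is the trade-off that makes such designs attractive for wide, regular workloads and less so for high-precision arithmetic.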