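arXivDaily is a daily academic digest that syncs the full arXiv feed and adds AI-generated summaries and translations. It covers artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.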


2606.20195 2026-06-19 cs.PF cs.NA math.NA New submission

Randomized Sketching is Robust to Low-Precision Rounding on GPUs


Aryaman Jeendgar, Clément Flint, Hartwig Anzt

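AI summary: Studies the performance and accuracy of randomized sketching at low precision on GPUs, evaluates SparseStack as a generalization of CountSketch, and finds that the FP16 rounding mode has little effect on embedding quality; the sketch distribution matters more than the quantization rule.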

Comments 14 pages, 3 figures



English abstract

Randomized sketching is a core primitive in randomized numerical linear algebra. On modern hardware architectures, in particular on GPUs, the performance of sparse sketches is limited by memory traffic and atomic accumulation rather than floating-point throughput. This makes sketching a natural target for mixed precision, provided that low-precision accumulation does not degrade the embedding quality. We study mixed-precision GPU implementations of sparse oblivious subspace embeddings, focusing on a SparseStack generalization of the GPU CountSketch kernel of Higgins et al. SparseStack improves embedding quality relative to CountSketch on coherent inputs, but its additional nonzeros per column increase atomic-update contention and reduce throughput. We therefore implement FP16 SparseStack variants using deterministic round-to-nearest, exact stochastic rounding, and dithered rounding, and compare them with FP32 SparseStack, CountSketch, mixed-precision CountSketch, and FlashSketch. Our main empirical finding is that, for the tested regimes, SparseStack embedding quality is insensitive to the FP16 rounding rule. Deterministic, stochastic, and dithered rounding FP16 SparseStack produce nearly identical subspace distortion and sketch-and-solve least-squares accuracy across incoherent, coherent, and adversarial test problems. The dominant accuracy factor is the sketch distribution rather than the quantization rule: SparseStack variants substantially improve distortion on coherent inputs, while all methods behave similarly on incoherent inputs. Since deterministic rounding has the lowest overhead, it provides the best performance-accuracy tradeoff among the FP16 SparseStack variants.
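To make the rounding rules in the abstract concrete, here is a minimal CPU-side sketch in Python. It is an illustration only, not the paper's GPU kernel: it implements plain CountSketch (one nonzero per input row) rather than SparseStack, emulates an FP16 accumulator in software, and every function name in it is hypothetical. Values are assumed to stay within the finite FP16 range.

```python
import math
import random
import struct


def round_nearest_fp16(x):
    """Round a float to the nearest FP16 value (ties to even) via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_stochastic_fp16(x, rng):
    """Round x to one of its two FP16 neighbours, picking the upper one with
    probability proportional to proximity, so the result is unbiased."""
    if x == 0.0:
        return 0.0
    _, exp = math.frexp(abs(x))
    spacing = 2.0 ** max(exp - 11, -24)  # FP16 grid spacing around x
    low = math.floor(x / spacing) * spacing
    if low == x:
        return x  # already representable
    return low + spacing if rng.random() < (x - low) / spacing else low


def count_sketch(rows, m, seed, rounder=None):
    """Sketch equal-length rows down to m rows.

    Each input row is added, with a random sign, to one random output row.
    If `rounder` is given, every accumulation is rounded, which emulates a
    low-precision accumulator.
    """
    rng = random.Random(seed)
    out = [[0.0] * len(rows[0]) for _ in range(m)]
    for row in rows:
        bucket = rng.randrange(m)
        sign = rng.choice((-1.0, 1.0))
        for j, value in enumerate(row):
            total = out[bucket][j] + sign * value
            out[bucket][j] = rounder(total) if rounder else total
    return out


# Same seed means the same buckets and signs, so only the rounding differs.
data = [[0.1 * i, -0.3 * i] for i in range(1, 65)]
exact = count_sketch(data, m=8, seed=0)
nearest = count_sketch(data, m=8, seed=0, rounder=round_nearest_fp16)
noise = random.Random(1)
stochastic = count_sketch(
    data, m=8, seed=0, rounder=lambda v: round_stochastic_fp16(v, noise)
)
```

Comparing `nearest` and `stochastic` against `exact` entry by entry gives a rough feel for accumulation error; it does not reproduce the subspace-distortion or least-squares metrics reported in the abstract.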


2606.16106 2026-06-19 cs.PF cs.AR cs.DC New submission

Edge-Inference Governors Need Memory-Clock State

Translated Chinese title: Beyond CPU-GPU Frequency: How the Memory Clock and Tail Effects Affect Edge-Inference Latency Estimation

Jaehoon Kang

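AI summary: Measurements on an NVIDIA Jetson Orin Nano show that the memory clock is a missing dimension, that aggregate miss rates hide burstiness, and that frequency switching incurs latency. All three phenomena fall outside the scope of conventional frequency-aware latency models.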

Comments 20 pages, 13 figures, 11 tables. Code and data: https://github.com/dankang21/jetson-latency-lab ; traces: https://doi.org/10.5281/zenodo.20745228


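AI abstract (translated from Chinese)

Frequency-aware latency estimators enable deadline-aware DVFS for edge ML inference by modeling latency over CPU and GPU frequencies. We present a measurement study on the NVIDIA Jetson Orin Nano that shows three phenomena outside that modeling scope. (1) The memory clock is a missing dimension: across the realistic upper EMC range (2133->3199 MHz) it shifts median latency by +11% to +48% depending on the workload, and at the highest GPU clock we observe a repeatable non-monotonic case (-9%) for a synthetic L2-resident kernel. A GPU-frequency estimator profiled under one power profile and deployed under another therefore underestimates latency by up to 32%; tabulating the four lockable EMC points fixes most workloads, whereas a parametric 1/f_emc term does not. (2) Aggregate miss rates hide burstiness: at fixed clocks, 100k-cycle runs show knife-edge distributions whose deadline-miss cliff spans about 1 ms, yet misses cluster far beyond independence. At a 0.1% aggregate miss rate, the probability that the next cycle also misses is as high as 74% (740x the independent baseline). A Gaussian mu+3sigma bound overshoots the 0.1% miss target by 13x to 29x, while an out-of-sample generalized Pareto bound stays within ~2x in all eight configurations. (3) Frequency switching is not free: the per-domain transition stall is under 100 microseconds, but a new operating point takes 1/5/8 ms (CPU/GPU/EMC) to take effect, which is a large fraction of a typical inference cycle for a governor that acts every inference cycle. We release the full measurement harness and discuss the implications for next-generation frequency-aware estimators and governors.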

English abstract

Frequency-aware latency estimators let deadline-aware DVFS governors schedule edge ML inference by modeling latency over CPU and GPU clocks, but they cannot observe the memory clock (EMC) -- a missing deployment state that decides whether a governor meets its deadlines and at what energy. We show this with a deployed, measured governor on a Jetson Orin NX: an EMC-blind GPU-only fit misses 25-28% of cycles at tight deadlines, whereas an EMC-aware refit holds misses to at most 1.3% under a 2% QoS miss budget by selecting a budget-feasible clock -- the energy-minimal one for periodic vision (calibrated module-rail power). The failure generalizes across three workload classes -- MobileNetV2, a ViT transformer, and Qwen2.5 LLM token decode (where saturated decode makes the aware policy lower-energy than the infeasible blind choice): a CPUxGPU estimator sends the deployed governor to an infeasible operating point, and only an EMC-aware model identifies the feasible side of the energy frontier. The effect is real and outside the CPUxGPU state abstraction: across two Orin SKUs sharing the same lockable EMC points it shifts median latency by up to ~45%, replicates on both, and survives a fused TensorRT fp16 engine. CPUxGPU models do not absorb it: per-lockable-point EMC tables are needed, a scoped inversion shows monotone assumptions can pick the wrong direction, and clustered misses make aggregate QoS rates understate deployment risk. We release the harness; this complements, not rebuts, the state of the art within its CPUxGPU scope.
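The abstract's point that clustered misses make aggregate QoS rates understate deployment risk can be illustrated with a small calculation. The following is a minimal sketch, not the released harness: `miss_statistics` and its parameters are hypothetical names, the latency trace is made up, and the trace is assumed to be non-empty.

```python
def miss_statistics(latencies_ms, deadline_ms):
    """Return (aggregate miss rate, P(miss at cycle t+1 | miss at cycle t))."""
    misses = [lat > deadline_ms for lat in latencies_ms]
    aggregate = sum(misses) / len(misses)
    # Misses that have a following cycle, and how many of those are followed by a miss.
    with_successor = sum(misses[:-1])
    back_to_back = sum(1 for a, b in zip(misses, misses[1:]) if a and b)
    conditional = back_to_back / with_successor if with_successor else 0.0
    return aggregate, conditional


# Made-up trace: two consecutive slow cycles in an otherwise fast run.
trace = [1.0, 1.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
aggregate, conditional = miss_statistics(trace, deadline_ms=2.0)

# If misses were independent, `conditional` would match `aggregate`,
# so a ratio well above 1 indicates clustering.
clustering = conditional / aggregate
```

A governor that budgets only on the aggregate rate treats each cycle as an independent draw; the ratio above is one simple way to check whether that assumption holds on a measured trace.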


2606.20474 2026-06-19 cs.LG cs.AI cs.PF Cross-list

UltraQuant: 4-bit KV Caching for Context-Heavy Agents


Inesh Chakrabarti, David Limpus, Aditi Ghai Rana, Bowen Bao, Spandan Tiwari, Thiago Crepaldi, Ashish Sirasao

Affiliations: Advanced Micro Devices; University of California, Los Angeles; Purdue University

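AI summary: For context-heavy agent workloads, proposes UltraQuant, which combines 4-bit KV-cache compression, rotation and codebook quantization, and AMD GPU optimizations. On a long-context, multi-turn workload it cuts P50 time-to-first-token by 3.47x in the late rounds and raises output throughput by 1.63x over the FP8 KV baseline.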

Comments 11 pages, 9 figures



English abstract

Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.
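Two of the design choices named in the abstract, Walsh-Hadamard rotation and per-group scales, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the UltraQuant path: it uses signed 4-bit integer codes instead of FP4, picks a power-of-two scale per group (the kind of value an exponent-only scale format such as UE8M0 can represent), and all function names are hypothetical.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).
    With this scaling the transform is its own inverse."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_group(values, bits=4):
    """Symmetric integer quantization with one power-of-two scale per group.
    Returns (codes, exponent), where the scale is 2**exponent."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0] * len(values), 0
    exponent = math.ceil(math.log2(peak / qmax))  # smallest scale that avoids clipping
    scale = 2.0 ** exponent
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, exponent


def dequantize_group(codes, exponent):
    return [c * 2.0 ** exponent for c in codes]


# A single outlier is spread evenly across the group by the rotation,
# which lowers the peak value the shared scale has to cover.
vector = [8.0, 0.0, 0.0, 0.0]
rotated = fwht(vector)
codes, exponent = quantize_group(rotated)
restored = fwht(dequantize_group(codes, exponent))
```

Rotating before quantizing flattens outliers so that one shared scale wastes fewer of the 16 available code values. Real KV tensors will not round-trip exactly the way this hand-picked vector does.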


2606.01183 2026-06-19 cs.DC cs.DB cs.DS cs.PF Updated version

The World's Fastest Matching Engine Algorithm


Jake Yoon

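AI summary: Proposes two data-structure contributions, the Priority-Indicated Node (PIN) and neighbor-aware tree operations, that remove the cost of pointer chasing and root-to-leaf search in order books, reaching sub-microsecond latency and tens of millions of messages per second.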

Comments 20 pages, 5 figures, 7 tables


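AI abstract (translated from Chinese)

Every electronic exchange depends on an order book whose storage layer sets matching latency. The dominant implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to reach the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate these costs. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, each carrying a per-slot indicator that encodes the entry's global priority. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves the insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. The second addresses a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, as in ordered event streams, incremental index maintenance, and electronic trading. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, and apply uniformly to red-black, AVL, and B/B+-tree variants. A single CPU core sustains 32 million order messages per second at sub-microsecond tail latency under micro-bursts of millions of messages per second, 5-11 times faster than the best open-source matching engines on the same hardware. Scaled out to a single 96-core instance, the engine sustains 640 million messages per second across 10,000 symbols.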

English abstract

A single CPU core sustains 32 million order messages per second at sub-microsecond median end-to-end host-path response latency, 4.7-11 times faster than the best available open-source matching engines on identical hardware. Scaled out, a single 96-core commodity server (~$1,630/month) sustains ~640 million messages per second across 10,000 symbols, over 20 times the provisioned capacity of the U.S. consolidated quote feed. We reach these numbers by attacking the storage layer that sets matching latency. The dominant order-book implementation, linked lists chained through a balanced tree, imposes two costs on every operation: pointer-chased traversal to the insertion point, and root-to-leaf search to locate the target price level. Under micro-bursts these costs produce tail-latency spikes that degrade market quality precisely when liquidity is most needed. We present two data-structure contributions that eliminate them. The first is the Priority-Indicated Node (PIN), a priority queue in which entries occupy fixed-capacity, contiguously addressable slots, with indicators encoding the entry's global priority status. Unlike heaps, which require O(log n) comparisons per operation, the PIN resolves insertion position directly from the indicators without comparing entries; indicator updates are O(1), independent of queue size. A depth-aware capacity model sizes each PIN so hot entries fit within L1 residency. The second targets a broader inefficiency: balanced search trees search from root to leaf on every insertion and deletion, even when the caller already knows the key's in-order neighbors, which in electronic trading are available at zero cost. Neighbor-aware insertion and deletion use known neighbor references to attach or remove a node with O(1) reference writes, followed by single-path rebalancing, across red-black, AVL, and B+-tree variants.
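The neighbor-aware insertion idea in the abstract rests on a standard property of binary search trees: if two nodes are adjacent in order, either the predecessor has no right child or the successor has no left child. The sketch below uses that property to attach a node with two reference writes and no root-to-leaf search. It is an illustration on a plain, unbalanced BST under stated assumptions, not the paper's implementation: the rebalancing step is omitted, and `Node`, `insert_between`, and `in_order` are hypothetical names.

```python
class Node:
    __slots__ = ("key", "left", "right", "parent")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.parent = None


def insert_between(pred, succ, key):
    """Attach `key` between in-order neighbours `pred` and `succ`.

    The caller must already hold the neighbour references. Pass None for
    `pred` when the key is a new minimum, or for `succ` when it is a new
    maximum. No comparisons and no descent from the root are performed.
    """
    node = Node(key)
    if pred is not None and pred.right is None:
        pred.right = node   # node becomes the immediate successor of pred
        node.parent = pred
    elif succ is not None and succ.left is None:
        succ.left = node    # node becomes the immediate predecessor of succ
        node.parent = succ
    else:
        raise ValueError("pred and succ are not adjacent in order")
    return node


def in_order(node):
    """Return the keys of the subtree rooted at `node` in sorted order."""
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)


# Hypothetical price levels: each insert is handed its neighbours directly.
root = Node(50)
low = insert_between(None, root, 20)
high = insert_between(root, None, 80)
insert_between(root, high, 60)
insert_between(low, root, 30)
levels = in_order(root)
```

In a balanced tree the attach step would be followed by the usual fix-up along the path from the new node toward the root, which is the single-path rebalancing the abstract mentions.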
