arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15617 2026-05-18 cs.DC cs.AI

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM


Shaoke Xi, ChonLam Lao, Boyi Jia, Jiaqi Gao, Zhipeng Zhang, Jiamin Cao, Brian Sutioso, Erci Xu, Minlan Yu, Kui Ren, Yong Li, Zhengping Qian, Ennan Zhai, Jingren Zhou

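Affiliations: Alibaba Group; Harvard University; Shanghai Jiao Tong University; Zhejiang University

AI summary: PrismLLM builds a high-fidelity execution graph through a slicing-based approach, letting engineers emulate large-scale training behavior on a few GPUs, reproduce performance and memory behavior accurately, and avoid the cost of large-cluster access.

Comments 13 pages body, 21 pages total

AI abstract (translated from Chinese)

LLM training today relies on clusters of thousands of GPUs. Although this scale accelerates model development, it makes developing, debugging, and performance-tuning the training framework complex and costly. Engineers need frequent access to production clusters to reproduce behaviors or evaluate optimizations, yet most GPUs are already committed to production workloads. PrismLLM builds a high-fidelity execution graph through a slicing-based approach, decoupling large-scale execution from the need for large-cluster access so that engineers can run and observe a set of ranks of interest on a few GPUs. It performs hybrid emulation: some ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments show that PrismLLM accurately reproduces performance and memory behavior on large-scale LLM training workloads, with an average iteration-time error of only 0.58% and a peak GPU memory error below 0.01%. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs of the original deployment.

English abstract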

Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and costly. This is because engineers often need to reproduce production behaviors to diagnose failures or evaluate optimizations, thereby demanding frequent and even exclusive access to production-scale clusters -- which becomes increasingly hard given that the majority of GPUs are already committed to production workloads. Simulation relies on complex performance models that are difficult to maintain, and downscaled experiments often fail to capture scale-dependent behaviors. We present PrismLLM to decouple large-scale execution from the need to access large clusters, enabling engineers to run and observe ranks of interest under faithful large-scale behavior using only a few GPUs. PrismLLM constructs a high-fidelity execution graph via a slicing-based approach that captures computation, communication, and dependencies of the target scale. Then, PrismLLM performs hybrid emulation where selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, achieving only 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs required by the original deployment.

2605.15579 2026-05-18 eess.IV cs.CV

TVRN: Invertible Neural Networks for Compression-Aware Temporal Video Rescaling


Xinmin Feng, Li Li, Dong Liu, Feng Wu

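Affiliations: MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition; University of Science and Technology of China; Information Science and Technology Institution; MCC Lab

AI summary: This paper proposes TVRN, a framework that addresses compression-aware temporal video rescaling with an invertible architecture and a learning-to-rank strategy, improving reconstruction quality.

Comments Accepted by IEEE Transactions on Image Processing

AI abstract (translated from Chinese)

To fit diverse display and bandwidth constraints, high-frame-rate video is first temporally downscaled to a low frame rate (LFR) and later upscaled, which calls for joint optimization of the two steps. Existing methods usually link the two operations only through training objectives and do not fully exploit the fact that they are inverses of each other, which can cause loss of high-frequency information. They also ignore the effect of lossy codecs on LFR video, limiting practical use. This paper proposes TVRN, an end-to-end framework for compression-aware frame-rate rescaling. To regularize the high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a multi-input multi-output temporal wavelet transform with a high-frequency reconstruction module. To train end to end through non-differentiable lossy codecs, a surrogate network approximates their gradients. Finally, to improve robustness across compression levels, TVRN is extended to an asymmetric architecture using a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods under industrial video compression settings. Source code is publicly available at https://github.com/fengxinmin/TVRN_public.

English abstract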

To fit diverse display and bandwidth constraints, high-frame-rate videos are temporally downscaled to low-frame-rate (LFR) and later upscaled, requiring joint optimization for effective frame-rate rescaling. However, existing methods typically link the two operations via training objectives, without fully exploiting their reciprocal nature, which may cause high-frequency information loss. Moreover, they overlook the impact of lossy codecs on LFR videos, limiting real-world applicability. In this work, we propose an end-to-end framework for compression-aware frame-rate rescaling, named TVRN. To regularize high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module. To enable end-to-end training through non-differentiable lossy codecs, we design a surrogate network that approximates their gradients. Finally, to improve robustness under various compression levels, we extend TVRN to an asymmetric architecture by incorporating compression-aware features learned via a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods in reconstruction quality under industrial video compression settings. Source code is publicly available at https://github.com/fengxinmin/TVRN_public.
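
To make the "invertible temporal transform" idea concrete, here is a minimal sketch that uses a plain Haar transform along the time axis. This is a stand-in chosen for illustration, not the paper's Multi-Input Multi-Output Temporal Wavelet Transform, and the `Frame` type and function names are hypothetical. The low band plays the role of the low-frame-rate video and the high band is the temporal detail that downscaling would otherwise discard.

```python
from typing import List, Tuple

# Hypothetical toy type: a frame flattened to a list of pixel values.
Frame = List[float]


def haar_temporal_forward(frames: List[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Split an even-length frame sequence into low and high temporal bands."""
    if len(frames) % 2 != 0:
        raise ValueError("expected an even number of frames")
    low: List[Frame] = []
    high: List[Frame] = []
    for a, b in zip(frames[0::2], frames[1::2]):
        # Average of each frame pair: one frame of the half-rate sequence.
        low.append([(x + y) / 2 for x, y in zip(a, b)])
        # Difference of each frame pair: the temporal detail.
        high.append([x - y for x, y in zip(a, b)])
    return low, high


def haar_temporal_inverse(low: List[Frame], high: List[Frame]) -> List[Frame]:
    """Rebuild the full-rate sequence from the two bands."""
    frames: List[Frame] = []
    for mean, diff in zip(low, high):
        frames.append([m + d / 2 for m, d in zip(mean, diff)])  # first frame of the pair
        frames.append([m - d / 2 for m, d in zip(mean, diff)])  # second frame of the pair
    return frames
```

With both bands available the round trip is lossless; if the high band is dropped, it has to be estimated, which is the job the abstract assigns to the high-frequency reconstruction module.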

2605.15571 2026-05-18 stat.ML cs.LG

MaxSketch: Robust Distinct Counting in Streams via Random Projections


Nikos Tsikouras, Constantine Caramanis, Christos Tzamos

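Affiliations: National and Kapodistrian University of Athens; Archimedes, Athena Research Center; The University of Texas at Austin

AI summary: This paper proposes MaxSketch, which uses random Gaussian projections for robust distinct counting over high-dimensional, noisy data streams, and proves that under geometric structure the memory requirement drops to $\widetilde{O}(\log n / \varepsilon^2)$.

AI abstract (translated from Chinese)

Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, and repeated instances of the same object are only approximately similar; for example, different images of the same individual can differ substantially at the pixel level. Classical sketches such as HyperLogLog rely on consistent hash values for identical elements and break down in this regime. Recent work on robust distinct counting in general metric spaces achieves $\widetilde{\Theta}(\sqrt{n})$ memory, which is optimal in the worst case. This paper shows that substantially better memory guarantees are possible under geometric structure common in learned representations. It introduces MaxSketch, a simple max-linear sketch built from random Gaussian projections, and proves that it can estimate the number of latent objects. Specifically, under this assumption, $m = \widetilde{O}(\log n / \varepsilon^2)$ random projections (and hence $\widetilde{O}(\log n / \varepsilon^2)$ memory) suffice to recover the true distinct count within a $(1+\varepsilon)$ factor. Experiments on image streams confirm that MaxSketch estimates distinct counts accurately and generalizes beyond the training regime. The results connect classical streaming algorithms with modern representation learning and show how geometric structure can fundamentally reduce the complexity of distinct counting.

English abstract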

Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, so repeated instances of the same object are only approximately similar -- for example, different images of the same individual may vary significantly at the pixel level. Classical sketches such as HyperLogLog rely on consistent hash values for identical elements and break down in this regime. Recent work on robust distinct counting in general metric spaces achieves $\widetilde{\Theta}(\sqrt{n})$ memory, which is tight in the worst case. We show that substantially improved memory guarantees are possible under geometric structure common in learned representations. We introduce MaxSketch, a simple max-linear sketch built from random Gaussian projections, and prove that it succeeds in estimating the number of distinct latent objects. Concretely, we show that under this assumption $m = \widetilde{O} (\log n / \varepsilon^2)$ random projections (and hence $\widetilde{O} (\log n/\varepsilon^2)$ memory) suffice to recover the true distinct count within a $(1+\varepsilon)$ factor. Experiments on image streams confirm that MaxSketch accurately estimates distinct counts and generalizes beyond the training regime. Our results bridge classical streaming algorithms and modern representation learning, showing how geometric structure can fundamentally reduce the complexity of distinct counting.
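
The state of a max-linear sketch over random Gaussian projections can be illustrated as follows. This is a minimal sketch under assumptions: the class name, the seed handling, and the update rule are illustrative, and the paper's estimator that turns the stored maxima into a distinct count is not described in the abstract, so it is not reproduced here.

```python
import random
from typing import List, Sequence


class MaxLinearSketch:
    """For each of m random Gaussian directions, keep the largest projection seen so far."""

    def __init__(self, dim: int, m: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # m random directions with i.i.d. standard normal coordinates.
        self.directions: List[List[float]] = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)
        ]
        # The whole state is m numbers, independent of the stream length.
        self.maxima: List[float] = [float("-inf")] * m

    def update(self, x: Sequence[float]) -> None:
        """Fold one stream element (a feature vector) into the sketch."""
        for i, g in enumerate(self.directions):
            proj = sum(gj * xj for gj, xj in zip(g, x))
            if proj > self.maxima[i]:
                self.maxima[i] = proj
```

Because each coordinate is a running maximum, exact repeats leave the state unchanged and the arrival order does not matter; a noisy repeat can only shift a maximum by the size of its perturbation along that direction, which is one intuition for why such a sketch tolerates near-duplicates.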

2605.15569 2026-05-18 cs.CR cs.AI cs.SE

Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis


Penghui Li, Hong Yau Chong, Yinzhi Cao, Junfeng Yang

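Affiliations: Columbia University; Johns Hopkins University

AI summary: This paper proposes Neo, a framework that combines LLMs with classic program analysis to handle the complexity of detecting privilege escalation in microservices; it finds 24 zero-day vulnerabilities, with precision and recall better than existing methods.

Comments In Proceedings of the 47th IEEE Symposium on Security and Privacy (S&P)

AI abstract (translated from Chinese)

Microservices are widely adopted for their scalability and fault tolerance, but their architecture complicates privilege and permission control and creates privilege escalation risks. This paper presents Neo, a framework that combines large language models with classic program analysis; by dynamically generating analysis plans, adapting code search strategies, and validating semantics, it enables scalable code exploration across services and languages. Evaluated on 25 open-source microservice applications, Neo found 24 zero-day vulnerabilities and achieved 81.0% precision and 85.0% recall. Compared with existing methods, Neo significantly improves both detection accuracy and scalability, and its extensibility to other application domains and vulnerability types is shown by the discovery of 18 additional zero-day vulnerabilities.

English abstract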

Microservices are widely adopted in modern cloud systems due to their scalability and fault tolerance. However, microservice architectures introduce significant complexity in privilege and permission control, creating risks of privilege escalation where attackers can gain unauthorized access to resources or operations. Detecting such vulnerabilities is challenging due to complex cross-service interactions, polyglot codebases, and diverse privileged operations and permission checks. We present Neo, an agentic program analysis framework that combines large language models (LLMs) with classic program analysis to address these challenges. Neo leverages an LLM-based agent that dynamically generates analysis plans, adapts code search strategies, and validates semantics. We develop code search primitives that enable Neo to perform scalable and flexible code exploration across services and languages. We evaluated Neo on 25 open-source microservice applications spanning 7 programming languages and 6.2 million lines of code. Neo uncovered 24 zero-day privilege escalation vulnerabilities and achieved 81.0% precision and 85.0% recall on a ground-truth dataset. Compared to existing program analysis and agentic solutions, Neo demonstrated significant improvements in both detection accuracy and scalability. We further showcased Neo's extensibility by applying it to other application domains and vulnerability types, uncovering 18 additional zero-day vulnerabilities.

2605.15558 2026-05-18 eess.IV cs.CV

Text-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction


Hao Yang, Xianping Ma, Peifeng Ma, Man-On Pun

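Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Geosciences and Engineering, Southwest Jiaotong University; Institute of Space and Earth Information Science and the Department of Geography and Resource Management, The Chinese University of Hong Kong

AI summary: This paper proposes a text-guided remote sensing image transmission system that replaces high-resolution data with low-resolution images plus compact text descriptions to improve transmission efficiency. A text-conditioned image restoration model recovers fine details while maintaining semantic consistency.

Comments 15 pages, 8 figures, submitted to ISPRS JPRS

AI abstract (translated from Chinese)

High-resolution remote sensing imagery is essential for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hampered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, which is redundant and inefficient. This paper proposes a text-guided remote sensing image transmission system that replaces the complete high-resolution data with low-resolution images accompanied by compact text descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to about 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model uses cross-modal learning to recover fine spatial details and maintain semantic consistency. On the Alsat-2B, UC Merced Land Use, and Aerial Image datasets, the framework achieves reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient, information-preserving remote sensing image transmission. The implementation will be released publicly on GitHub.

English abstract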

High-resolution remote sensing imagery is critical for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hindered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, resulting in redundant and inefficient delivery. This paper proposes a text-guided remote sensing image transmission system that replaces complete high-resolution data with low-resolution images accompanied by compact textual descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to approximately 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model is introduced, which leverages cross-modal learning to recover fine spatial details and maintain semantic coherence. Experimental results on the Alsat-2B, UC Merced Land Use, and Aerial Image datasets demonstrate that the proposed framework achieves reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient and information-preserving image transfer for remote sensing applications. The implementation will be made publicly available on GitHub at https://github.com/haoyangofficial/textrssr.

2605.15543 2026-05-18 cs.GT cs.AI

Domain-Independent Game Abstraction using Word Embedding Techniques


Juho Kim, Tuomas Sandholm

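Affiliations: CMU; Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

AI summary: This paper proposes a game abstraction method based on word embedding techniques from natural language processing: actions are treated as words, and word-vector representations plus clustering yield a domain-independent abstraction. Experiments show the method is effective but does not match specialized algorithms.

AI abstract (translated from Chinese)

Many real-world games are very large and must be shrunk through game abstraction. Although game abstraction has advanced considerably over the past two decades, most of that work is tied to a specific domain (such as poker) and is hard to generalize to others. This paper proposes a domain-independent approach to game abstraction that uses word embedding techniques from natural language processing: each action is treated as a word, word vectors are trained, and the vectors are clustered to produce the abstraction. Experimental results show that the method is effective, although it does not perform as well as algorithms tailored to specific games.

English abstract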

Many real-world games of interest are intractably large, thereby necessitating the use of game abstraction to shrink them in size, typically by many orders of magnitude. Over the last two decades, there have been significant advances in game abstraction; however, the domain-specific nature (usually poker) of much of the prior work prevents those techniques from being easily generalized to other settings without extensively analyzing the game at hand. In this paper, we propose a domain-independent approach to game abstraction, which applies word embedding techniques from the field of natural language processing. Treating each action as a word and gameplay data as a corpus, word vectors can be trained to represent each action as a real-valued vector, which can then be clustered to facilitate game abstraction. We also explore the use of foundational embedding models and show that action embeddings obtained this way can capture a surprising amount of information about the underlying game. Experimental results demonstrate that our proposed game abstraction technique is effective, although it does not outperform specialized algorithms tailored to specific games.

2605.15507 2026-05-18 cs.IT cs.AI cs.LG math.IT

PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources


Bumsu Park, Chanho Park, Youngmok Park, Namyoon Lee

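Affiliations: Department of Electrical Engineering

AI summary: For Gaussian-mixture sources, PrismQuant approaches rate-distortion optimality by transmitting the component label and applying component-matched KLTs; combined with EM-driven learning and entropy-constrained quantization, it closely approaches the theoretical bound and is competitive with or better than transformer-based learned codecs.

AI abstract (translated from Chinese)

For a Gaussian source under mean-squared error, classical transform coding is rate-distortion (RD) optimal: the KLT diagonalizes the covariance, reverse water-filling allocates the bits, and scalar quantization closes the loop. For multimodal sources, however, no single covariance captures the heterogeneous local geometry, and the RD function loses its closed form. This paper revisits the problem through Gaussian-mixture sources and develops an RD theory for them. The key finding is that the mixture structure incurs only a component label cost. Conditioned on the active mixture component, each branch is Gaussian; the challenge is allocating bits across heterogeneous branches. The paper proves that the genie-aided conditional RD function is governed by a single global reverse water-filling level. Building on this, it proposes PrismQuant, which transmits the component label losslessly and encodes the residual with the component-matched KLT followed by scalar quantization, achieving a rate within H(C)/n bits per source dimension of the converse, with a vanishing asymptotic gap. It further develops a practical implementation based on EM-driven Gaussian-mixture learning, component-adaptive KLTs, and entropy-constrained scalar quantization (ECSQ). Experiments on synthetic Gaussian mixtures show that PrismQuant closely approaches the theoretical RD bound, and experiments on real-world channel-state-information (CSI) data show competitive or superior performance compared with transformer-based learned codecs at a model size more than an order of magnitude smaller.

English abstract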

For a Gaussian source under mean-squared error (MSE), classical transform coding is rate-distortion (RD) optimal: the Karhunen-Loève transform (KLT) diagonalizes the covariance, reverse waterfilling allocates the bits, and scalar quantization closes the loop. This elegant story breaks down for multimodal sources, where no single covariance can capture heterogeneous local geometries, and the RD function loses its closed form. We revisit this problem through Gaussian-mixture sources and develop a constructive RD theory for them. Our key finding is that the mixture structure incurs only a component label cost. Conditioned on the active mixture component, each branch is Gaussian; the challenge is allocating bits across heterogeneous branches. We prove that the genie-aided conditional RD function is governed by a single global reverse-waterfilling level shared across all components and eigenmodes. Building on this result, we introduce PrismQuant, which transmits the component label losslessly and encodes the residual using the component-matched KLT, followed by scalar quantization, achieving a rate within H(C)/n bits per source dimension of the converse, with a vanishing asymptotic gap. We further develop a practical implementation based on EM-driven Gaussian-mixture learning, component-adaptive KLTs, and entropy-constrained scalar quantization (ECSQ). Experiments on synthetic Gaussian mixtures show that PrismQuant closely approaches the theoretical RD bound, while experiments on real-world channel-state-information (CSI) data demonstrate competitive or superior performance compared with transformer-based learned codecs at more than one order of magnitude smaller model size.
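
The classical reverse water-filling step that the abstract starts from can be sketched as below for a single Gaussian source with independent eigenmodes. The function name, the bisection search, and the use of total (summed) distortion are illustrative choices; the paper's mixture-level result, where one level is shared across all components and eigenmodes, involves component weights that are not spelled out in the abstract and are not modeled here.

```python
import math
from typing import List, Tuple


def reverse_waterfilling(
    variances: List[float], total_distortion: float
) -> Tuple[float, List[float], float]:
    """Return (water level, per-mode distortions, total rate in bits).

    variances: eigenvalues of the covariance (all positive), i.e. the mode
    variances after the KLT. total_distortion: target sum of per-mode MSE.
    """
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("total_distortion must lie in (0, sum(variances)]")
    # Find theta such that sum_i min(theta, variance_i) equals the target.
    lo, hi = 0.0, max(variances)
    for _ in range(100):
        theta = (lo + hi) / 2
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = (lo + hi) / 2
    # Modes weaker than the water level get zero bits (distortion = variance).
    distortions = [min(theta, v) for v in variances]
    rate = sum(0.5 * math.log2(v / d) for v, d in zip(variances, distortions))
    return theta, distortions, rate
```

For variances 4 and 1 with a total distortion of 1, the level settles at 0.5 and the two modes receive 1.5 and 0.5 bits; raising the distortion budget to 3 lifts the level to 2, so the weaker mode is dropped entirely.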

2605.15460 2026-05-18 cs.IR cs.AI

Differentially Private Motif-Preserving Multi-modal Hashing


Zehua Cheng, Wei Dai, Jiahao Sun

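Affiliations: Department of Computer Science, University of Oxford, Oxford, United Kingdom

AI summary: This paper proposes DMP-MH, a Sanitize-then-Distill framework that preserves the structural features of multi-modal data under privacy guarantees; experiments show it improves retrieval performance while preserving privacy.

Comments 9 Pages

AI abstract (translated from Chinese)

Cross-modal hashing enables efficient retrieval by encoding images and text into compact binary codes. Existing methods rely on semantic similarity graphs derived from user interactions for supervision, but these graphs encode sensitive behavioral patterns and are vulnerable to link reconstruction attacks. Existing privacy-preserving approaches fail on graph-structured data: differentially private SGD destroys relational motifs by treating samples independently, while graph synthesis methods face unbounded local sensitivity in scale-free networks, where modifying a single edge at a hub node can change triangle counts by $\mathcal{O}(N)$ and therefore requires prohibitive noise injection. The authors call this phenomenon Hubness Explosion. The paper proposes DMP-MH, a Sanitize-then-Distill framework that decouples privacy from representation learning. The method first bounds sensitivity by deterministically clipping node degrees, capping the $L_2$-sensitivity of triangle motifs independently of dataset size. It then generates a sanitized synthetic graph under $(ε,δ)$-edge differential privacy. Finally, dual-stream hashing networks distill this topology through a holistic structural loss that enforces cross-modal alignment. Evaluated on MIRFlickr-25K and NUS-WIDE under a strict inductive protocol, DMP-MH outperforms private baselines by up to 11.4 mAP points while retaining up to 92.5% of non-private performance.

English abstract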

Cross-modal hashing enables efficient retrieval by encoding images and text into compact binary codes. State-of-the-art methods rely on semantic similarity graphs derived from user interactions for supervision, yet these graphs encode sensitive behavioral patterns vulnerable to link reconstruction attacks. Existing privacy-preserving approaches fail on graph-structured data: Differentially Private SGD destroys relational motifs by treating samples independently, while graph synthesis methods suffer from unbounded local sensitivity in scale-free networks, where hub nodes cause single-edge modifications to alter triangle counts by $\mathcal{O}(N)$, necessitating prohibitive noise injection. We term this phenomenon Hubness Explosion. We propose DMP-MH, a Sanitize-then-Distill framework that decouples privacy from representation learning. Our approach first bounds sensitivity by deterministically clipping node degrees, capping the $L_2$-sensitivity of triangle motifs independently of dataset size. A sanitized synthetic graph is then generated via Noisy Mirror Descent under $(ε,δ)$-Edge Differential Privacy. Finally, dual-stream hashing networks distill this topology using a holistic structural loss that enforces cross-modal alignment. Evaluated on MIRFlickr-25K and NUS-WIDE under a strict inductive protocol, DMP-MH outperforms private baselines by up to 11.4 mAP points while retaining up to 92.5% of non-private performance.
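
Why a degree cap tames triangle counts can be seen in a few lines. The clipping rule below (keep edges in sorted order while both endpoints are under the cap) is an assumed, illustrative rule; the abstract says only that clipping is deterministic, and the function names are hypothetical. In any graph whose maximum degree is d, a single edge lies in at most d - 1 triangles, because every such triangle needs a distinct common neighbor of the edge's endpoints.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def clip_degrees(edges: Iterable[Edge], max_degree: int) -> List[Edge]:
    """Keep edges in sorted order while both endpoints stay within max_degree."""
    degree: Dict[int, int] = {}
    kept: List[Edge] = []
    # Normalize to undirected, de-duplicated edges so the result is deterministic.
    unique = sorted({(min(u, v), max(u, v)) for u, v in edges if u != v})
    for u, v in unique:
        if degree.get(u, 0) < max_degree and degree.get(v, 0) < max_degree:
            kept.append((u, v))
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
    return kept


def count_triangles(edges: List[Edge]) -> int:
    """Count triangles in an undirected graph given as unique edges."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3
```

On a hub joined to six neighbors that also form a path, the raw graph has five triangles, all through the hub; with a cap of 3 only two survive and no edge sits in more than two triangles. How the paper accounts for the sensitivity of the clipping step itself is not described in the abstract.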

2605.15456 2026-05-18 eess.IV cs.CV math.OC

DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems


Romario Gualdrón-Hurtado, Roman Jacome, Leon Suarez, Henry Arguello

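Affiliations: Department of Computer Science, Universidad Industrial de Santander; Department of Electrical Engineering, Universidad Industrial de Santander

AI summary: This paper proposes DIPA, which improves reconstruction quality through teacher-guided distillation, supports both linear and non-linear preconditioning operators, and is validated on magnetic resonance imaging, compressed sensing, and super-resolution imaging.

Comments 17 pages, 8 figures, 8 tables

AI abstract (translated from Chinese)

Solving imaging inverse problems usually requires designing a suitable prior model, but minimizing the data-fidelity term is difficult because physical constraints make the sensing matrix ill-conditioned. Classical optimization theory addresses this with preconditioning, which transforms the algorithm's gradient step to speed up convergence and improve numerical stability. This paper extends preconditioning to improving reconstruction quality and introduces DIPA (Distilled Preconditioned Algorithms), in which a preconditioning operator (PO) is optimized with teacher-guided distillation criteria. The teacher and student differ in the sensing operator used during reconstruction: the teacher uses a simulated, better-conditioned, and more informative sensing matrix, while the student uses the physically feasible one. Different distillation losses are designed to transfer different properties of the teacher algorithm to the preconditioned student. The PO can be linear (L-DIPA), which allows interpretability, or non-linear (N-DIPA), parametrized by a neural network, which scales better. The proposed PO design is validated across several imaging modalities, including magnetic resonance imaging, compressed sensing, and super-resolution imaging.

English abstract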

Solving imaging inverse problems has usually been addressed by designing proper prior models of the underlying signal. However, minimizing the data fidelity term poses significant challenges due to the ill-conditioned sensing matrix caused by physical constraints in the acquisition system. Thus, preconditioning techniques have been adopted in classical optimization theory to address ill-conditioned data-fidelity minimization by transforming the algorithm gradient step to achieve faster convergence and better numerical stability. We extend the preconditioning concept beyond convergence acceleration and use it to improve reconstruction quality. We introduce DIPA: Distilled Preconditioned Algorithms, where a preconditioning operator (PO) is optimized using teacher-guided distillation criteria. Unlike standard model-compression KD, the teacher and student differ by the sensing operators available during reconstruction: the teacher uses a simulated, better-conditioned, and more informative sensing matrix, whereas the student uses the physically feasible sensing matrix. We design different distillation loss functions to transfer different properties of the teacher algorithm to the preconditioned student. The PO can be linear (L-DIPA), allowing interpretability, or non-linear (N-DIPA), parametrized by a neural network, offering better scalability. We validate the proposed PO design across several imaging modalities, including magnetic resonance imaging, compressed sensing, and super-resolution imaging.

2605.15425 2026-05-18 cs.SE cs.AI

Runtime-Structured Task Decomposition for Agentic Coding Systems


Shubhi Asthana, Bing Zhang, Chad DeLuca, Hima Patel, Ruchi Mahindru

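Affiliations: IBM Research

AI summary: This paper proposes runtime-structured task decomposition, which manages task decomposition and execution flow through executable control logic, lowering retry cost and improving the efficiency and reliability of agentic coding systems.

Comments Paper presented at ACM Conference on AI and Agentic Systems 2026 at the Agentic Software Engineering workshop

AI abstract (translated from Chinese)

Agentic coding systems increasingly use large language models (LLMs) for software engineering tasks such as debugging, root cause analysis, and code review. However, many existing systems encode task logic, execution flow, and output generation in a single prompt. This design leads to brittle behavior, difficult debugging, and high retry costs, because a failure often requires rerunning the entire workflow. The paper proposes runtime-structured task decomposition, an architectural approach that manages task decomposition and execution flow through executable control logic rather than relying on prompt structure alone. LLMs are used only for focused judgment tasks, and their outputs are validated against predefined schemas before downstream execution. The approach is evaluated on two software engineering workloads under three configurations: monolithic execution, static decomposition (fixed subtasks and no runtime branching), and runtime-structured decomposition. Each configuration is evaluated over 10 runs. The results show that decomposition alone does not necessarily reduce retry cost. In the Kubernetes root cause analysis workload, the static decomposition baseline had a retry cost of 1,632 ± 145 tokens versus 904 ± 17 tokens for the monolithic baseline, because failures forced downstream subtasks to rerun. A similar pattern appeared in the multi-file debugging workload, where the static baseline consumed 933 tokens versus 703 for the monolithic system. The runtime-structured approach reran only the failed subtasks, lowering retry cost to 436 ± 132 tokens (root cause analysis) and 460 tokens (debugging). Overall, the approach reduced retry cost by up to 51.7% relative to monolithic systems and 73.2% relative to static decomposition baselines, improving the efficiency, debuggability, and operational reliability of agentic coding systems.

English abstract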

Agentic coding systems increasingly use large language models (LLMs) for software engineering tasks such as debugging, root cause analysis, and code review. However, many existing systems encode task logic, execution flow, and output generation inside monolithic prompts. This design creates brittle behavior, limited debuggability, and high retry costs because failures often require rerunning the full workflow. We present runtime-structured task decomposition, an architectural approach in which task partitioning and execution flow are managed through executable control logic rather than prompt structure alone. LLMs are used only for focused judgment tasks, and outputs are validated against predefined schemas before downstream execution. We evaluate this approach on two software engineering workloads using three configurations: monolithic execution, static decomposition with fixed subtasks and no runtime branching, and runtime-structured decomposition. Each configuration was evaluated across 10 runs. Our results show that decomposition alone does not necessarily reduce retry cost. In the Kubernetes root cause analysis workload, the static decomposition baseline produced a retry cost of 1,632 +/- 145 tokens versus 904 +/- 17 tokens for the monolithic baseline because failures forced reruns of downstream subtasks. A similar pattern appeared in the multi-file debugging workload, where the static baseline consumed 933 tokens compared to 703 tokens for the monolithic system. The runtime-structured approach reran only failed subtasks, reducing retry costs to 436 +/- 132 tokens for root cause analysis and 460 tokens for debugging. Overall, the approach achieved up to 51.7% lower retry cost than monolithic systems and 73.2% lower retry cost than static decomposition baselines, improving efficiency, debuggability, and operational reliability in agentic coding systems.
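
The control-flow idea, validating each subtask's output and rerunning only the subtask that failed, might look like the following minimal sketch. Everything named here (`Subtask`, `required_keys`, `execute`) is hypothetical: each `run` callable stands in for one focused LLM call, a required-key check stands in for a real schema, and runtime branching is left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Subtask:
    name: str
    run: Callable[[Dict[str, Any]], Any]  # stand-in for one focused LLM call
    required_keys: List[str]              # minimal stand-in for a predefined schema


def is_valid(output: Any, required_keys: List[str]) -> bool:
    """Accept only dict outputs that contain every required key."""
    return isinstance(output, dict) and all(k in output for k in required_keys)


def execute(
    subtasks: List[Subtask], max_retries: int = 2
) -> Tuple[Dict[str, Any], Dict[str, int]]:
    """Run subtasks in order; on invalid output, rerun only that subtask."""
    context: Dict[str, Any] = {}
    attempts: Dict[str, int] = {}
    for task in subtasks:
        attempts[task.name] = 0
        while True:
            attempts[task.name] += 1
            output = task.run(context)
            if is_valid(output, task.required_keys):
                context.update(output)  # only validated output flows downstream
                break
            if attempts[task.name] > max_retries:
                raise RuntimeError(f"subtask {task.name!r} kept failing validation")
    return context, attempts
```

In this arrangement a malformed answer from a later step costs one extra call to that step, while earlier validated results stay in `context` and are not recomputed, which is the retry-cost behavior the abstract attributes to the runtime-structured configuration.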

2605.15412 2026-05-18 cs.CE cs.AI cs.CL

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery


Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Zixuan Xie, Chiming Duan, Minghua He, Philip S. Yu, Ying Li

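Affiliations: Peking University; Alibaba Group; Nanjing University; University of Illinois Chicago

AI summary: This paper proposes QuantEvolver, a framework that uses reinforcement fine-tuning to turn executable quantitative evaluation into policy updates, improving LLM performance on alpha factor discovery and producing high-quality, complementary factor pools.

AI abstract (translated from Chinese)

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, and alpha factor discovery is central to turning market observations into tradable signals. Recent LLM-based methods show promise in automating factor generation, but most still rely on prompt-level generation-evaluation-feedback loops for iterative optimization. As the loop grows longer, repeatedly appended historical candidates and feedback cause context explosion, raise inference cost, dilute useful information, and introduce feedback drift. These methods also tend to depend on very large LLMs whose stable generation preferences can lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, the paper proposes QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, so that a Miner LLM internalizes historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed–time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with a Diversity-Complementarity Reward. During training, high-quality factors accumulate in a Mined Factor Database, which becomes the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate QuantEvolver's effectiveness: it outperforms existing LLM-based alpha factor discovery baselines on the primary evaluation metric of each task and produces higher-quality, more complementary factor pools.

English abstract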

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation-evaluation-feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed–time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of QuantEvolver, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.

2605.15411 2026-05-18 stat.ML cs.LG math.OC

Harnessing Unimodality in Semiparametric Contextual Pricing via Oracle Price Map Learning


Yingying Fan, Yuxuan Han, Jinchi Lv, Xiaocong Xu, Zhengyuan Zhou

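Affiliations: Data Sciences and Operations Department, University of Southern California; Stern School of Business, New York University

AI summary: This paper studies contextual dynamic pricing in a semiparametric scalar-index valuation model. Through oracle price map learning, and under β-Hölder smoothness and a revenue-geometry condition, it proposes a modular coarse-to-fine policy that attains the optimal regret bound for nonparametric oracle-map learning.

AI abstract (translated from Chinese)

We study contextual dynamic pricing in a semiparametric scalar-index valuation model in which the latent value is $v_t=μ_\ast(\mathsf c_t)+ξ_t$, with an unknown utility map $μ_\ast$ and an unknown additive noise distribution. The key decision object is the one-dimensional oracle price map $u\mapsto p^\ast(u)$ induced by the scalar index $u=μ_\ast(\mathsf c)$ and the noise tail. Under $β$-Hölder smoothness ($β\geq 2$) and a revenue-geometry condition that gives a unique, stable, interior maximizer, the oracle map is itself $(β-1)$-smooth. This structure is exploited through $\mathsf{ORBIT}$, a modular coarse-to-fine policy that takes a scalar pilot index as input, localizes a benchmark price in each active bin, and learns a local polynomial approximation of the oracle map via bandit convex optimization. For the baseline linear utility model $μ_\ast(\mathsf c)=\mathsf c^\topθ_\ast$, an adaptive elliptical exploration scheme constructs the required scalar pilot online without assumptions on the context distribution. The resulting policy achieves regret $\widetilde{O}\big(T^{\frac{2β-1}{4β-3}}+\sqrt{dT}\big)$. For fixed $d$, a matching lower bound in the horizon dependence is established, showing that the nonparametric oracle-map learning term is minimax sharp. The same scalar-pilot interface also extends to sparse high-dimensional linear utility and nonparametric Hölder utility.

English abstract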

We study contextual dynamic pricing in a semiparametric scalar-index valuation model where the latent value is $v_t=μ_\ast(\mathsf c_t)+ξ_t$, with an unknown utility map $μ_\ast$ and an unknown additive noise distribution. The key decision object is the one-dimensional oracle price map $u\mapsto p^\ast(u)$ induced by the scalar index $u=μ_\ast(\mathsf c)$ and the noise tail. Under the $β$-Hölder smoothness of the tail function for $β\geq 2$ and a revenue-geometry condition that gives a unique, stable, interior maximizer, this oracle map is itself $(β-1)$-smooth. We exploit such structure through $\mathsf{ORBIT}$, a modular coarse-to-fine policy that takes a scalar pilot index as input, localizes a benchmark price in each active bin, and learns a local polynomial approximation of the oracle map inside a trust region via bandit convex optimization. For the baseline linear utility model $μ_\ast(\mathsf c)=\mathsf c^\topθ_\ast$, an adaptive elliptical exploration scheme constructs the required scalar pilot online without distributional assumptions on the contexts. The resulting policy achieves regret $\widetilde{O}\big(T^{\frac{2β-1}{4β-3}}+\sqrt{dT}\big)$. For fixed $d$, we establish a matching lower bound in the horizon dependence, unveiling that the nonparametric oracle-map learning term is minimax sharp. The same scalar-pilot interface also yields extensions to sparse high-dimensional linear utility and nonparametric Hölder utility.
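
To read the regret bound, it helps to evaluate the exponent of the nonparametric term at a few smoothness levels. The shorthand $\alpha(\beta)$ below is introduced here for convenience and does not appear in the abstract; the arithmetic uses only the stated exponent.

```latex
\[
  \alpha(\beta) = \frac{2\beta - 1}{4\beta - 3}, \qquad
  \alpha(2) = \frac{3}{5}, \quad
  \alpha(3) = \frac{5}{9}, \quad
  \lim_{\beta \to \infty} \alpha(\beta) = \frac{1}{2},
\]
\[
  \frac{d\alpha}{d\beta}
  = \frac{2(4\beta - 3) - 4(2\beta - 1)}{(4\beta - 3)^2}
  = \frac{-2}{(4\beta - 3)^2} < 0 .
\]
```

So for $\beta \geq 2$ the exponent decreases from $3/5$ toward $1/2$ without reaching it: a smoother tail brings the $T^{\alpha(\beta)}$ term closer to the $\sqrt{T}$ rate, and for fixed $d$ this term, not $\sqrt{dT}$, sets the growth in the horizon $T$.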

2605.15410 2026-05-18 quant-ph cs.AI cs.LG

Diagonal Adaptive Non-local Observables on Quantum Neural Networks


Huan-Hsin Tseng, Yan Li, Hsin-Yi Lin, Samuel Yen-Chi Chen

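Affiliations: AI & ML Department, Brookhaven National Laboratory, Upton, NY, USA; Department of Electrical Engineering, The Pennsylvania State University, University Park, PA, USA

AI summary: This paper proposes diagonal adaptive non-local observables: by considering only diagonal observables paired with quantum circuits, it reduces the parameter count and classical optimization cost while retaining the capability of full non-local observables.

Comments Accepted at ICCCN2026

AI abstract (translated from Chinese)

Adaptive Non-local Observables (ANOs) have shown that making quantum observables dynamic can substantially enlarge the function space of variational quantum algorithms, partly shifting hardware demands from circuit synthesis to measurement design. This advantage, however, comes with a steep increase in the number of parameters and in classical optimization cost. The paper proposes a special form of ANO that greatly reduces this burden by considering only diagonal observables paired with quantum circuits. Mathematically, this is equivalent to the full ANO over its large parameter space, because diagonal matrices are canonical representatives of the ANO space modulo unitary similarity. Diagonal ANO therefore retains the capability of full ANO while reducing the complexity of $k$-local observables from $O(4^k)$ to $O(2^k)$ and lowering the corresponding measurement-side classical computation. In this sense, diagonal ANO preserves much of the benefit of full ANO while covering conventional VQCs as a special case.

English abstract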

Adaptive Non-local Observables (ANOs) have shown that making quantum observables dynamic can substantially enlarge the function space of Variational Quantum Algorithms, partly shifting hardware demands from circuit synthesis to measurement design. However, this advantage is accompanied by a steep increase in the number of parameters, as well as the classical optimization cost for varying general Hermitian observables. We propose a special form of ANO that significantly reduces this burden by considering only diagonal observables paired with quantum circuits. Mathematically, this is equivalent to the full ANO of a large parameter space since diagonal matrices are canonical representatives of the ANO space modulo unitary similarity. As a result, Diagonal ANO retains the same capability of full ANO while reducing $k$-local observable complexity from $O(4^k)$ to $O(2^k)$ and lowering the corresponding measurement-side classical computation. In this sense, diagonal ANO preserves much of the benefit of full ANO while encompassing conventional VQCs as a special case.
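
The equivalence claim and the parameter counts follow from the spectral theorem. In the sketch below, $U(\theta)$ for the variational circuit, $|\psi\rangle$ for the input state, and $V$, $D$ for the eigendecomposition are notation assumed here rather than taken from the paper.

```latex
\[
  H = V D V^\dagger, \qquad
  D = \mathrm{diag}(d_1, \dots, d_{2^k}), \quad d_i \in \mathbb{R},
\]
\[
  \langle \psi | U(\theta)^\dagger H \, U(\theta) | \psi \rangle
  = \langle \psi | \big(V^\dagger U(\theta)\big)^\dagger D \, \big(V^\dagger U(\theta)\big) | \psi \rangle
  = \sum_{i=1}^{2^k} d_i \, \big| \langle i | V^\dagger U(\theta) | \psi \rangle \big|^2 ,
\]
\[
  \#\mathrm{params}(H) = (2^k)^2 = 4^k, \qquad
  \#\mathrm{params}(D) = 2^k .
\]
```

A general Hermitian observable on $k$ qubits thus acts like a diagonal one measured after the circuit $V^\dagger U(\theta)$. The reduction from $4^k$ to $2^k$ trainable real numbers holds to the extent that the circuit family can absorb $V^\dagger$, which is the role the abstract gives to pairing diagonal observables with quantum circuits.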

2605.15398 2026-05-18 cs.GR cs.CV

3DEditSafe: Defending 3D Editing Pipelines from Unsafe Generation


Nicole Meng, Zheyuan Liu, Meng Jiang, Yingjie Lao

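Affiliations: Tufts University; University of Notre Dame

AI summary: This paper proposes 3DEditSafe, a framework that uses safety regularization to constrain the propagation of unsafe semantics and reduce unsafe content in 3D editing, and it reveals a tradeoff between safety and quality.

AI abstract (translated from Chinese)

Recent advances in 3D generative editing, particularly pipelines based on 3D Gaussian Splatting (3DGS), enable high-fidelity, multi-view-consistent scene manipulation from text prompts. However, we find that when given unsafe prompts these pipelines produce unsafe edits that are propagated and optimized across views. This paper studies unsafe generation in 3D editing pipelines and shows that it can lead to coherent Not-Safe-For-Work (NSFW) content in the final 3D representation. To address this, it proposes 3DEditSafe, a safety-regularized 3D editing framework that combines generation-stage safety guidance with rendered-view 3D safety regularization, safe semantic projection, residue suppression, and mask-aware preservation to steer optimization away from unsafe editing directions. The approach is evaluated on EditSplat scenes with an object-compatible unsafe prompt benchmark, showing that 2D safety guidance alone is not enough to prevent unsafe 3D edits. 3DEditSafe reduces unsafe semantic alignment and view-level attack success rates, while revealing a safety-quality tradeoff: stronger unsafe suppression can introduce artifacts or reduce fidelity to the unsafe prompt. To the authors' knowledge, this is the first attempt to study and defend against unsafe generation in text-driven 3D editing pipelines, and it highlights the need for safety mechanisms that operate directly on optimized 3D representations.

English abstract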

Recent advances in 3D generative editing, particularly pipelines based on 3D Gaussian Splatting (3DGS), have achieved high-fidelity, multi-view-consistent scene manipulation from text prompts. However, we find that these pipelines also introduce new safety risks when unsafe prompts produce edits that are propagated and optimized across views. In this work, we study unsafe generation in 3D editing pipelines and show that such behavior can lead to coherent, undesirable Not-Safe-For-Work (NSFW) content in the final 3D representation. To address this, we propose 3DEditSafe, a safety-regularized 3D editing framework that constrains unsafe semantic propagation during optimization. 3DEditSafe combines generation-stage safety guidance with rendered-view 3D safety regularization, safe semantic projection, residue suppression, and mask-aware preservation to steer optimization away from unsafe editing directions. We evaluate our approach on EditSplat scenes using an object-compatible unsafe prompt benchmark and show that 2D safety guidance alone is not consistently sufficient to prevent unsafe 3D edits. 3DEditSafe reduces unsafe semantic alignment and view-level attack success rates, while revealing a safety-quality tradeoff in which stronger unsafe suppression can introduce artifacts or reduce unsafe-prompt fidelity. To our knowledge, this work is the first attempt to study and defend against unsafe generation in text-driven 3D editing pipelines, highlighting the need for safety mechanisms that operate directly on optimized 3D representations.

2605.15392 2026-05-18 physics.optics cs.CV

Frequency-domain Event-based Imaging for Selective Surveillance


Megan Birch, James Rick, Adrish Kar, Jason Zutty, Joseph L. Greene

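Affiliations * Georgia Tech Research Institute

AI summary This paper proposes the FRIES framework, which analyzes event data in the frequency domain to identify mechanical vibrations and rotating objects. Combined with the RTS visualization technique, it is validated against dynamic backgrounds in indoor and outdoor experiments.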

Comments 14 pages, 11 figures

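Details
AI abstract (translated from Chinese)

Event-based cameras (EBCs) are an attractive sensing modality for surveillance because they report pixel-level radiance changes with microsecond resolution and high dynamic range. Their asynchronous, sparse output, however, calls for algorithms that identify targets in event space. This paper introduces Frequency Rate Information for Event Space (FRIES), a neuromorphic processing framework that detects periodicity in events, such as rotor rotation and mechanical vibration, to discriminate and monitor man-made objects. FRIES first applies a time gate to suppress background and noise, then aggregates events into a pixel-wise activity (e.g., density) map and clusters pixels into regions of interest (ROIs). A localized spectral analysis of each ROI extracts dominant frequencies, which distinguish structured object signatures from unstructured background and noise. Discriminated targets are visualized with a Resonant Time Surface (RTS), a frequency-selective method that weights events by their phase coherence with the extracted frequencies, rewarding in-sync content and suppressing out-of-sync clutter. We demonstrate FRIES and RTS in a controlled indoor experiment, recovering the rotational frequency of a mechanical chopper and of drone rotors against a moving background. We further test the methods on outdoor data to detect a hovering drone against a realistic treeline. These preliminary results establish frequency-domain event processing as a promising front end for selective surveillance in neuromorphic pipelines, and as a complementary surveillance modality that uses high temporal resolution for spectral discrimination.

English abstract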

Event-based cameras (EBCs) are an attractive sensing modality for surveillance because they report pixel-level radiance changes with microsecond resolution and high dynamic range, enabling motion extraction while suppressing background. Their asynchronous, sparse output, however, necessitates algorithms that identify targets in event-space without processing full frames. We introduce Frequency Rate Information for Event Space (FRIES), a neuromorphic processing framework that detects periodicity in events, such as rotor rotation and mechanical vibrations, to discriminate and monitor man-made objects. FRIES first applies a time gate to suppress background and noise, then aggregates events into a pixel-wise activity (e.g., density) map and clusters pixels into regions-of-interest (ROIs). A localized spectral analysis is applied to each ROI to extract dominant frequencies used to distinguish structured object signatures from unstructured background and noise. Discriminated targets are visualized using a Resonant Time Surface (RTS), a frequency-selective method that weights events by their phase coherence with the extracted frequencies, rewarding in-sync content and suppressing out-of-sync clutter. We demonstrate FRIES and RTS in a controlled indoor experiment to recover the rotational frequency of a mechanical chopper and drone rotors against a moving background. We further test these methods on outdoor data to detect a hovering drone against a realistic treeline. These preliminary results establish frequency-domain event processing as a promising front-end for selective surveillance in neuromorphic pipelines and a complementary surveillance modality, leveraging the high temporal resolution to enable spectral discrimination.
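
The per-ROI spectral step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed-width binning, the direct DFT, and the cosine form of the phase-coherence weight are all hypothetical choices made for the example.

```python
import cmath
import math


def dominant_frequency(timestamps, duration, n_bins):
    """Return the strongest non-DC frequency (Hz) in a list of event timestamps.

    Assumption: events from one ROI are binned into n_bins equal time slots
    covering [0, duration) seconds before the spectrum is taken.
    """
    # Turn the asynchronous events into a regularly sampled activity signal.
    counts = [0.0] * n_bins
    for t in timestamps:
        idx = min(int(t / duration * n_bins), n_bins - 1)
        counts[idx] += 1.0
    mean = sum(counts) / n_bins
    signal = [c - mean for c in counts]  # remove the DC component

    # Direct DFT over the positive frequencies; adequate for a small sketch.
    best_k, best_power = 0, -1.0
    for k in range(1, n_bins // 2):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * n / n_bins)
                    for n, s in enumerate(signal))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k / duration


def phase_coherence_weight(t, freq):
    """Hypothetical RTS-style weight in [0, 1].

    Events in phase with `freq` score near 1, events half a period out of
    phase score near 0.
    """
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * freq * t))
```

With this weighting, an event that arrives exactly on the extracted period keeps its full contribution, while clutter arriving half a period late is suppressed. How the paper actually defines the weight is not stated in the abstract.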

2605.15370 2026-05-18 quant-ph cs.LG

Quantum Feature Pyramid Gating for Seismic Image Segmentation


Taha Gharaibeh, Jyotsna Sharma

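Affiliations * Louisiana State University, Baton Rouge, Louisiana, USA

AI summary This paper proposes quantum feature gating, which uses a parameterized quantum circuit to fuse features inside an encoder-decoder, improving seismic image segmentation accuracy and validating quantum feature fusion for dense prediction.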

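Details
AI abstract (translated from Chinese)

Accurate identification of salt bodies is essential for seismic interpretation because salt structures distort wave propagation, complicate velocity modeling, obscure reservoir geometry, and increase uncertainty in exploration and drilling decisions. Although hybrid quantum-classical models perform well on small-scale image-classification tasks, their value for dense, pixel-level geophysical prediction has not been adequately tested. This paper introduces quantum feature gating, a hybrid segmentation architecture that embeds a parameterized quantum circuit (PQC) at the feature-fusion points of an encoder-decoder pipeline. A 4-qubit, 2-layer PQC with data re-uploading computes a learned convex combination at each Feature Pyramid Network merge point. A global-average-pooling layer maps encoder features to a fixed 4-dimensional quantum input, decoupling the 72-parameter quantum budget from backbone size and image resolution. The method is evaluated on the 2018 TGS Salt Identification Challenge using 4,000 seismic images at 101 x 101 resolution, covering two integration topologies, eight circuit variants, and six encoders with 8M to 118M parameters, under five-fold cross-validation. In a controlled EfficientNetV2-L ablation at 256 x 256 resolution, replacing the three quantum FPN gates with element-wise addition while keeping the encoder, loss schedule, splits, and threshold search fixed lowers mean IoU from 0.9389 to 0.8404, a gap of 9.85 percentage points. Using the same circuit as a skip-connection attention module in a custom U-Net raises IoU by 0.88 points over the SolidUNet baseline, indicating that the PQC contribution depends on where and what it gates. These results provide controlled evidence that quantum feature fusion can improve dense seismic segmentation.

English abstract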

Accurate salt-body delineation is essential for seismic interpretation because salt structures distort wave propagation, complicate velocity-model building, obscure reservoir geometry, and increase uncertainty in exploration and drilling decisions. Although hybrid quantum-classical models have shown competitive performance on small-scale image-classification tasks, their value for dense, pixel-level geophysical prediction remains largely untested. This work introduces quantum feature gating, a hybrid segmentation architecture that embeds a parameterized quantum circuit (PQC) at feature-fusion points within an encoder-decoder pipeline. A 4-qubit, 2-layer PQC with data re-uploading computes a learned convex combination of lateral and top-down features at each Feature Pyramid Network merge point. A global-average-pooling layer maps encoder features to a fixed 4-dimensional quantum input, decoupling the 72-parameter quantum budget from backbone size and image resolution. The method is evaluated on the 2018 TGS Salt Identification Challenge using 4,000 seismic images at 101 x 101 resolution, across two integration topologies, eight circuit variants, and six encoders with 8M to 118M parameters under five-fold cross-validation. In a controlled EfficientNetV2-L ablation at 256 x 256 resolution, replacing the three Quantum FPN Gates with element-wise addition while holding the encoder, loss schedule, splits, and threshold search fixed reduces mean IoU from 0.9389 to 0.8404, a 9.85 percentage-point gap. Inserting the same circuit as skip-connection attention in a custom U-Net improves IoU by 0.88 points over the SolidUNet baseline, showing that the PQC contribution depends on where and what it gates. These results provide controlled evidence that quantum feature fusion can improve dense seismic segmentation.

2605.15350 2026-05-18 math.OC cs.LG

Stochastic Compositional Optimization via Hybrid Momentum Frank--Wolfe


El Mahdi Chayti

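Affiliations * Machine Learning and Optimization Laboratory (MLO)

AI summary This paper proposes the Hybrid Momentum Stochastic Frank-Wolfe algorithm, which needs no smoothness assumption on the outer function $F$ and combines a momentum-based Jacobian tracker with a Taylor-corrected function tracker to achieve an $\mathcal{O}(K^{-1/4})$ convergence rate for non-convex objectives.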

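Details
AI abstract (translated from Chinese)

Stochastic compositional optimization minimizes objectives of the form $\min_{x \in \mathcal{X}} F(f(x), x)$, where $f$ is accessible only through noisy stochastic queries. Existing methods assume that the outer function $F$ is continuously differentiable, which rules out important applications such as robust max-of-losses, Conditional Value-at-Risk, and norm regularizers. This paper proposes the Hybrid Momentum Stochastic Frank-Wolfe algorithm, which combines a momentum-based Jacobian tracker with a Taylor-corrected function tracker and feeds an entire stochastic linearization, rather than a single gradient, into a generalized linear minimization oracle. For non-convex objectives with $L_F$-Lipschitz outer functions, the algorithm attains an $\mathcal{O}(K^{-1/4})$ convergence rate in the generalized Frank-Wolfe gap, matching the optimal complexity of projection-free single-sample stochastic methods under expected smoothness. The analysis extends to heavy-tailed noise oracles with bounded $r$-th moments for $r \in (1, 2]$ and recovers the deterministic rates of Vladarean et al. (2023) as the noise vanishes.

English abstract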

Stochastic compositional optimization minimizes objectives of the form $\min_{\bm{x} \in \mathcal{X}} F(\bm{f}(\bm{x}), \bm{x})$, where $\bm{f}$ is accessible only through noisy stochastic queries. Existing methods for this problem assume that the outer function $F$ is continuously differentiable, which excludes many practically important applications such as robust max-of-losses, Conditional Value-at-Risk, and norm regularizers. We propose the Hybrid Momentum Stochastic Frank--Wolfe algorithm, which drops the smoothness assumption on $F$. By combining a momentum-based Jacobian tracker with a Taylor-corrected function tracker, the algorithm feeds an entire stochastic linearization -- rather than a single gradient -- into a generalized linear minimization oracle. We establish an $\mathcal{O}(K^{-1/4})$ convergence rate in the generalized Frank--Wolfe gap for non-convex objectives with $L_F$-Lipschitz outer functions, matching the optimal complexity for projection-free single-sample stochastic methods under expected smoothness. The analysis extends to heavy-tailed noise oracles with bounded $r$-th moments for $r \in (1, 2]$ and recovers the deterministic rates of Vladarean et al (2023) as the noise vanishes.
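
For readers unfamiliar with the building blocks named in this abstract, the sketch below shows classical momentum Frank-Wolfe for a smooth objective on the probability simplex. It is deliberately a simpler setting than the paper's non-smooth compositional one and is not the Hybrid Momentum algorithm itself. The function names, the step size 2/(k+2), and the momentum weight `rho` are assumptions chosen for illustration.

```python
def lmo_simplex(d):
    """Linear minimization oracle on the probability simplex.

    argmin over the simplex of <d, s> is the vertex at the smallest
    coordinate of d.
    """
    j = min(range(len(d)), key=lambda i: d[i])
    return [1.0 if i == j else 0.0 for i in range(len(d))]


def fw_gap(d, x):
    """Classical Frank-Wolfe gap <d, x - s>, with s returned by the LMO."""
    s = lmo_simplex(d)
    return sum(di * (xi - si) for di, xi, si in zip(d, x, s))


def momentum_fw(grad_oracle, x0, steps, rho=0.5):
    """Projection-free updates driven by a momentum-averaged gradient estimate."""
    x = list(x0)
    d = [0.0] * len(x)
    for k in range(1, steps + 1):
        g = grad_oracle(x)  # may be a noisy single-sample gradient
        # Momentum tracker: exponential average of past gradient estimates.
        d = [(1.0 - rho) * di + rho * gi for di, gi in zip(d, g)]
        s = lmo_simplex(d)
        gamma = 2.0 / (k + 2)
        # A convex combination keeps the iterate feasible without projection.
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x
```

The abstract's method replaces the single gradient `d` with a full stochastic linearization (a function tracker plus a Jacobian tracker) and uses a generalized oracle. That generalization is not reproduced here.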

2605.15320 2026-05-18 cs.GR cs.CV cs.LG

FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction


Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian, Jian Wang

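Affiliations * Snap Inc.; University of California, Santa Cruz; MBZUAI

AI summary FFAvatar fuses information from multiple source images through a Multi-View Query-Former to reconstruct high-fidelity 3D Gaussian head avatars, supporting real-time deployment and high-quality animation.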

Comments Project Page: https://ffavatar.github.io

Details

English abstract

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.

2605.15307 2026-05-18 cs.GR cs.CV cs.MM cs.SD

Sound Sparks Motion: Audio and Text Tuning for Video Editing


AmirHossein Naghi Razlighi, Aryan Mikaeili, Ali Mahdavi-Amiri, Daniel Cohen-Or, Yiorgos Chrysanthou

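Affiliations * University of Cyprus; Simon Fraser University; Tel Aviv University; CYENS Center Of Excellence

AI summary This paper proposes Sound Sparks Motion, a training-free framework that edits motion in video by tuning the multimodal conditioning signals of an audio-visual generation model at test time. Perturbing an audio latent and a text-conditioning residual encourages motion changes, while feedback from a vision-language model improves the edit.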

Comments Project Page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion

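Details
AI abstract (translated from Chinese)

Motion-centric video editing remains challenging for large generative video models, which usually respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing by tuning the internal multimodal conditioning signals of an audio-visual video generation model at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation of the text conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Because there is no direct way to evaluate temporal alignment between text and motion, we use a vision-language model to provide feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls transfer across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal motion controls embedded in the model's multimodal conditioning. Code and data are available via our project page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/

English abstract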

Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model's multimodal conditioning. Code and data are available via our project page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/

2605.15299 2026-05-18 cs.IR cs.AI

Fortress: A Case Study in Stabilizing Search Recommendations via Temporal Data Augmentation and Feature Pruning


Milind Pandurang Jagre, Jia Huang, Dayvid V. R. Oliveira, Zhinan Cheng, Babak Seyed Aghazadeh, Puja Das, Chris Alvino, Jinda Han, Kailash Thiyagarajan

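Affiliations * Apple

AI summary Fortress stabilizes search recommendation models through temporal data augmentation and feature pruning, improving prediction stability and accuracy, and is validated in a large-scale app marketplace.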

Details

English abstract

In search and recommendation systems, predictive models often suffer from temporal instability when certain input features introduce volatility in output scores. This instability can degrade model reliability and user experience, especially in multi-stage systems where consistent predictions are critical for downstream decision making. We introduce Fortress, a general framework for enhancing model stability and accuracy by identifying and pruning features that contribute to inconsistent prediction scores over time. Fortress leverages historical snapshots, i.e., temporally partitioned datasets capturing score fluctuations for the same entity across periods, and follows a four-step process: (1) collect historical snapshots, (2) identify samples with unstable predictions, (3) isolate and remove instability-inducing features, and (4) retrain models using only stable features. While semantic features from LLMs and BERT-based models improve generalization, they often lack full query or entity coverage. Engagement-based features offer strong predictive power but tend to introduce temporal instability. Fortress mitigates this trade-off by suppressing the volatility of engagement signals while retaining their predictive value, leading to more stable and accurate models. We validate Fortress on a query-to-app relevance model in a large-scale app marketplace. Offline experiments demonstrate notable improvements in prediction stability (measured by Coefficient of Variation) and classification performance (measured by PR-AUC).
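
Step (2) of the process above, together with the Coefficient of Variation metric, can be illustrated with a short sketch. The data layout (one list of scores per entity, one score per snapshot), the function names, and the threshold are assumptions for the example. The abstract does not say how Fortress flags unstable samples.

```python
import statistics


def coefficient_of_variation(scores):
    """Population standard deviation divided by the absolute mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        return float("inf")  # CV is undefined for a zero mean; treat as unstable
    return statistics.pstdev(scores) / abs(mean)


def unstable_entities(snapshots, threshold):
    """Return the entities whose score CV across snapshots exceeds the threshold.

    snapshots: dict mapping an entity id to its prediction scores, one per
    historical period (hypothetical layout).
    """
    return sorted(entity for entity, scores in snapshots.items()
                  if coefficient_of_variation(scores) > threshold)
```

For example, an entity scored 0.80, 0.81, 0.79 across three periods has a CV of about 0.01 and would pass a 0.1 threshold, while one scored 0.2, 0.9, 0.5 has a CV above 0.5 and would be flagged for feature analysis.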

2605.15281 2026-05-18 cs.CR cs.AI

Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance

Vinil Pasupuleti, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi

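Affiliations * International Business Machines (IBM); Salesforce Inc

AI summary This paper proposes an AI-driven autonomous testing framework for natural-language-driven web execution with integrated security validation. Through five strategies (navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning), the framework addresses the tendency of traditional web test suites to break. Experiments show that the approach substantially raises the script generation success rate, reduces navigation failures and timing-related race conditions, and greatly shortens test creation time. It can also take attack scenarios described in natural language and convert them automatically into security probes that detect a range of vulnerabilities, offering a novel approach to natural-language-driven security testing.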

Comments 6 pages, 4 figures, 5 tables, IEEE conference format

Details
English abstract

Modern web test suites rot. A UI refactor breaks locators, a timing change causes race conditions, and within weeks developers abandon the suite entirely. This paper presents an AI-driven autonomous testing framework that addresses these failure modes through five integrated strategies - navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning - implemented over a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated across four production applications and 176 scenarios, the framework improves script generation success from 55% to 93%, achieves an 8x reduction in navigation failures, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% compared to manual Selenium authoring. The framework extends naturally to security validation: testers describe attack scenarios in plain English - "try accessing another user's invoice" - which the agent converts to OWASP Top 10-aligned browser probes, detecting 85% of authentication bypass vulnerabilities and 95% of input validation flaws with false positive rates below 12%. Natural-language-driven security testing of this kind represents, to our knowledge, a novel contribution to the field.

2605.15249 2026-05-18 cs.CR cs.LG

Enabling Adversarial Robustness in AI Models through Kubeflow MLOps

Stavros Bouras, Ioannis Korontanis, Antonios Makris, Konstantinos Tserpes

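Affiliations * School of Electrical and Computer Engineering, National Technical University of Athens, Greece; Department of Informatics and Telematics, Harokopio University of Athens, Greece

AI summary This paper studies how to improve the adversarial robustness of AI models in Kubernetes environments. The authors propose a Kubeflow MLOps-based architecture that automatically detects adversarial attacks at inference time and triggers defense mechanisms, preserving the model's accuracy and reliability. Experiments indicate that the approach strengthens the model against adversarial attacks and recovers much of the performance lost to the attack.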

Comments Accepted at the 1st Workshop on Secure and Intelligent Data Spaces (SIDS 2026), co-located with the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

Details
English abstract

AI models are increasingly deployed in cloud-native environments to support scalable and automated services. However, while platforms such as Kubernetes provide strong infrastructure orchestration, security mechanisms specifically designed to protect deployed AI models remain limited. This paper presents security measures for AI models deployed in Kubernetes clusters. The proposed architecture integrates Kubeflow-based MLOps to automatically detect adversarial attacks during the inference phase and trigger defense mechanisms that preserve the model's accuracy and reliability. Specifically, a Fast Gradient Sign Method (FGSM) attack is applied at inference time, and a Projected Gradient Descent (PGD)-based adversarial training defense is automatically deployed when a degradation in accuracy is detected. The experimental results indicate that the deployed defense robustifies the model, significantly recovering accuracy relative to the degradation caused by the attack.
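
The FGSM attack and the accuracy-based trigger mentioned in this abstract can be sketched as follows. FGSM perturbs an input by `eps` times the sign of the loss gradient with respect to that input. The logistic-regression model, the function names, and the `max_drop` tolerance are stand-in assumptions, since the abstract does not describe the deployed model or the exact trigger rule.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Binary prediction of a logistic-regression stand-in model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(dLoss/dx).

    For logistic regression with cross-entropy loss, dLoss/dx = (p - y) * w.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def should_deploy_defense(baseline_acc, current_acc, max_drop):
    """Hypothetical trigger: fire when accuracy falls more than max_drop."""
    return baseline_acc - current_acc > max_drop
```

In a pipeline of the kind the abstract describes, a monitoring step would evaluate something like `should_deploy_defense` on recent inference accuracy and, when it fires, launch the adversarial-training job.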

2605.15241 2026-05-18 eess.IV cs.CV cs.LG

From Full and Partial Intraoral Scans to Crown Proposal: A Classification-Guided Restoration Assistance Pipeline

Rabin Kunwar, Dikshya Parajuli, Rujal Acharya, Romik Gosai, Prince Panta, Kundan Siwakoti, Shuvangi Adhikari, Saugat Kafley, Louis Digiorgio, Amit Regmi, Akio Tanaka, Masahiko Inada, Yuriko Komagamine, Kennta Kashiwazaki, Manabu Kanazawa

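Affiliations * Accelerated Komputing Pvt. Ltd.; University of Pittsburgh; Institute of Science Tokyo; Emium Co. Ltd.; GodelBlock Inc.; Carnegie Mellon University

AI summary This study proposes an end-to-end crown proposal pipeline that generates a patient-specific initial crown design from full-arch or partial-arch intraoral scans for clinicians to refine. The method combines a classification-guided segmentation strategy with context-aware retrieval and fitting, addressing the low segmentation accuracy on partial scans and the loss of detail in generated crowns. Experiments show strong results on several evaluation metrics, with high segmentation accuracy and practical value.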

Details
English abstract

Single-unit crown restoration is among the most common procedures in clinical dentistry, with CAD/CAM workflows now designing crowns directly from intraoral scans. Partial scans are often preferred over full-arch scans for single-unit cases due to fewer stitching errors, yet most segmentation networks trained on full arches fail on partial scans, while end-to-end generative crown methods often produce over-smoothed surfaces that lose occlusal detail. We propose an end-to-end pipeline that takes a raw intraoral scan and target FDI tooth number as input and outputs an initial, patient-specific crown proposal for clinician refinement. The pipeline has three phases: (I) data preparation and pose standardization; (II) segmentation routed by scan type; and (III) crown proposal generation via context-aware retrieval and Blender-based fitting. We address partial-scan segmentation through a classify-then-align strategy: a DGCNN classifier categorizes the scan into one of five anatomical types, then coarse-to-fine RANSAC+ICP registration standardizes the jaw coordinate frame, followed by graph-cut optimization to refine tooth-gingival boundaries. Trained on 1,958 partial scans, the pipeline achieves macro-average DSC 0.9249, Recall 0.8919, and Precision 0.9615 across 17 semantic classes; a fine-tuned full-arch model reaches DSC 0.9347. The prepared tooth and its mesial and distal neighbors achieve DSC 0.9468-0.9569 with sub-millimeter Centroid Errors (0.2666-0.2774 mm). These centroids anchor a retrieval module using DGCNN embeddings and cosine similarity over neighboring and opposing teeth, followed by spline-guided alignment and Blender Python API refinement. The pipeline produces a preliminary crown shell in 2.5-3.5 minutes, offering a practical alternative to end-to-end generative approaches.
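
Two quantities in this abstract, the Dice similarity coefficient (DSC) used to score segmentation and the cosine similarity used for crown retrieval, follow standard definitions. The sketch below shows both. The function names, the set-of-indices mask representation, and the toy embedding library are assumptions for illustration; the pipeline's actual data structures are not given in the abstract.

```python
import math


def dice(pred, truth):
    """Dice similarity coefficient between two sets of labeled vertex indices."""
    if not pred and not truth:
        return 1.0  # both empty: treat as a perfect match
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, library):
    """Return the key of the library embedding most similar to the query.

    library: dict mapping a crown id to its embedding vector (hypothetical).
    """
    return max(library, key=lambda k: cosine_similarity(query, library[k]))
```

A macro-average DSC, as reported in the abstract, would average `dice` over the semantic classes rather than over vertices.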

2605.15240 2026-05-18 stat.ML cs.LG

On Kernel Eigen-alignments of KRR: Reconstruction and Generalization

Yang Liu, Ernest Fokoue, Richard Lange, Daniel Krutz

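Affiliations * Golisano College of Computing and Information Science; Rochester Institute of Technology

AI summary This paper studies the key role that eigen-alignment between the kernel matrix and the learning targets plays in robust generalization, and establishes a direct link between the generalization performance of kernel methods and the estimation of matrix eigenvectors and eigenvalues. By analyzing how perturbations of the kernel matrix affect predictions, the authors derive an upper bound on generalization error based on the estimation stability of eigenvalues and eigenvectors, and note that for high-rank kernels the reconstruction error has limited predictive power for generalization. The resulting bound, framed in terms of eigenvalue estimation, shows that strong generalization requires stronger eigenvector alignment, larger eigenvalue magnitude, or larger gaps between consecutive eigenvalues.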

Details
English abstract

This paper investigates the critical role of eigenalignments between the kernel matrix and learning targets in achieving robust generalization in learning problems. We establish a direct connection between generalization performance in kernel methods and the estimation of eigenvectors and eigenvalues of matrices, offering a more intuitive understanding compared to prior work with minimal assumptions. We also show that, since the prediction task in KRR is essentially the weighted sum of eigenvectors/singular vectors, by analyzing how much error can be caused by perturbations to the kernel matrix, we can then derive a bound on this generalization error using the estimation stability of matrix eigenvalues and eigenvectors. Compared with previous work, our analysis concentrates on finite-sample settings and on the generalization error arising from having a suboptimal finite training set. Our findings reveal that in kernel methods, as long as the kernel is of high rank, the near-zero reconstruction error can be trivially obtained, implying that the reconstruction error will have limited predictive power for generalization. Finally, we establish a generalization bound from an eigenvalues/eigenvectors estimation perspective, showing that strong generalization requires increasing eigenvector alignment, eigenvalue magnitude, or gaps between consecutive eigenvalues.
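
The statement that KRR prediction is a weighted sum of eigenvectors can be made concrete with the standard spectral form of the KRR solution. The notation is an assumption, not taken from the paper: $K$ is the $n \times n$ kernel matrix with eigenpairs $(\sigma_i, u_i)$, $y$ is the target vector, and $\lambda \ge 0$ is the ridge parameter.

```latex
\[
\begin{aligned}
K &= \sum_{i=1}^{n} \sigma_i \, u_i u_i^\top, \\
\hat{\alpha} &= (K + \lambda I)^{-1} y
  = \sum_{i=1}^{n} \frac{u_i^\top y}{\sigma_i + \lambda} \, u_i, \\
\hat{y} &= K \hat{\alpha}
  = \sum_{i=1}^{n} \frac{\sigma_i}{\sigma_i + \lambda} \, (u_i^\top y) \, u_i, \\
y - \hat{y} &= \sum_{i=1}^{n} \frac{\lambda}{\sigma_i + \lambda} \, (u_i^\top y) \, u_i .
\end{aligned}
\]
```

The last line shows why a high-rank kernel makes training reconstruction error nearly uninformative: when every $\sigma_i > 0$, the residual vanishes as $\lambda \to 0$ regardless of how well the eigenvectors align with $y$.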

2605.15238 2026-05-18 cs.SE cs.AI cs.PL

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

Alexander Du, Jianjun Ou, Danyang Zhuo, Matthew Lentz

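Affiliations * Duke University

AI summary This paper presents Hydra, a system for efficient recovery from static errors during code generation. Through asynchronous checking and a checkpoint-and-rollback mechanism, Hydra avoids the high latency and token consumption of conventional approaches, detecting and repairing errors during generation without regenerating code that is already correct. Experiments show that on C/C++ code generation tasks Hydra substantially reduces latency and token usage compared with post-hoc repair.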

Details
English abstract

Large language models are increasingly used for code generation, but many generated programs fail to compile, a prerequisite for further correctness checks such as unit tests. Existing solutions for repairing static errors are costly in both latency and token consumption. Post-hoc repair delays error detection until generation completes and commonly regenerates large regions of previously valid code. Constrained semantic decoding checks after each token, incurring per-token overhead while limiting repair to the current token even when the root cause lies earlier. We present Hydra, a system for efficient recovery from static errors during code generation. Hydra allows checking to proceed asynchronously with generation, avoiding checker overhead when the generated code is semantically correct. In addition, it provides checkpoint-and-rollback support for targeted repair, avoiding regeneration and rechecking of valid prefixes. We retrofit the Clang C/C++ compiler to support Hydra with modest modifications. Paired with a token-efficient repair strategy, Hydra reduces latency by up to 71% and token consumption by up to 70% relative to post-hoc repair on C/C++ code generation tasks that encounter static errors.
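
The checkpoint-and-rollback idea in this abstract can be illustrated with a toy loop. This is a simplified sketch, not Hydra's design: checking here is synchronous at fixed chunk boundaries, whereas the abstract describes asynchronous checking, and `propose`, `is_valid`, `checkpoint_every`, and `max_attempts` are hypothetical names introduced for the example.

```python
def generate_with_rollback(propose, is_valid, n_chunks,
                           checkpoint_every=2, max_attempts=5):
    """Generate n_chunks chunks, rolling back to the last valid checkpoint.

    propose(i, attempt) returns a candidate for chunk i.
    is_valid(prefix) checks the list of chunks generated so far.
    Returns the final chunks and how many chunks had to be regenerated.
    """
    out = []
    checkpoint = 0                 # length of the longest prefix known valid
    attempts = [0] * n_chunks      # how often each position was proposed
    regenerated = 0
    while len(out) < n_chunks:
        i = len(out)
        if attempts[i] >= max_attempts:
            raise RuntimeError(f"chunk {i} could not be repaired")
        out.append(propose(i, attempts[i]))
        attempts[i] += 1
        if len(out) % checkpoint_every == 0 or len(out) == n_chunks:
            if is_valid(out):
                checkpoint = len(out)          # commit the valid prefix
            else:
                regenerated += len(out) - checkpoint
                del out[checkpoint:]           # discard only the suspect suffix
    return out, regenerated
```

If a six-chunk program has one bad chunk at position 3, this loop regenerates two chunks (positions 2 and 3) instead of all six, which is the kind of saving the abstract attributes to avoiding regeneration of valid prefixes.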

2605.15237 2026-05-18 cs.AR cs.AI

A3D: Agentic AI flow for autonomous Accelerator Design

Abinand Nallathambi, Christopher Knight, Shantanu Ganguly, Wilfried Haensch, Anand Raghunathan

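Affiliations * Purdue University; Argonne National Laboratory; University of Chicago

AI summary A3D is an agent-based AI flow for end-to-end automated design of hardware accelerators. It autonomously analyzes workloads, identifies performance bottlenecks, refactors code for compatibility with high-level synthesis tools, and generates micro-architectures, reducing the complexity of accelerator design and the need for human intervention. A3D also explores the speed-area tradeoff space automatically to produce diverse accelerator designs, offering an efficient, automated design solution for complex scientific applications.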

Details
English abstract

Accelerating applications through the design of hardware accelerators can significantly enhance system performance and energy efficiency. Despite advances, such as high-level synthesis (HLS), designing accelerators for complex applications still remains highly labor-intensive, demanding considerable expertise in understanding workloads to be accelerated, hardware design, micro-architecture, and EDA tool usage, posing challenges for application domain experts. Therefore, most accelerator solutions are limited to applications with a regular predictable dataflow. Advances in AI have enabled agents that perform autonomous planning, reasoning, execution and reflection, leading to unprecedented potential for automation through agentic AI. We present A3D, an Agentic AI flow for end-to-end Automation of hardware Accelerator Design. A3D automates workload analysis, performance bottleneck identification, code refactoring for HLS compatibility and micro-architecture generation. A3D also generates diverse accelerator designs by automatically exploring the speed-area tradeoff space. Recent efforts have explored the use of AI for specific tasks such as design space exploration in HLS, leaving several tasks to still be performed manually. A3D addresses the challenges in applying modern LLMs to accelerator design by judiciously partitioning tasks among specialist agents, orchestrating process loops with specialist and verifier agents, utilizing pre-existing and custom tools, and employing agentic RAG for codebase and proprietary EDA tool documentation exploration. Our implementation of A3D, using commercial components like Claude Sonnet 4.5 and the Catapult HLS tool, demonstrates its effectiveness by generating accelerator designs with no human intervention from complex scientific applications like LAMMPS (molecular dynamics simulation) and QMCPACK (quantum chemistry).

2605.15226 2026-05-18 cs.AR cs.AI cs.SE

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong

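Affiliations * National University of Singapore

AI summary This paper asks whether agentic AI systems built for software engineering are suited to realistic hardware engineering tasks, and introduces the Phoenix-bench benchmark, which contains 511 verified Verilator instances and supports evaluation across hardware design flows, bug fixing, and verification. The study finds that hardware and software engineering differ markedly in how bugs propagate and how they are fixed, and that localization granularity and feedback mechanisms strongly affect agent performance, providing a reference point for future use of agents in hardware engineering.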

Details
English abstract

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i) Software and hardware are fundamentally different engineering tasks: the same agent loses 37% to 58% from SWE-bench Verified to Phoenix-bench because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii) Failures concentrate on design control-flow / finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii) Localization granularity matters far more than localization itself: a perfect file-level oracle yields only +1.4% because the agent then breaks files that did not need editing, while a single round of test case feedback lifts resolved rate by 42% to 45% because the test case tells where the bug is and what the fix has to look like.
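
The resolved-rate metric referred to in this abstract can be sketched as below. The rule used is an assumption borrowed from the common SWE-bench-style convention (an instance counts as resolved only if every fail-to-pass test now passes and every pass-to-pass test still passes). The abstract does not spell out Phoenix-bench's exact criterion, and the field names are hypothetical.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """True if all required testbenches pass after the agent's patch.

    test_results: dict mapping a testbench name to True (pass) or False (fail).
    A missing result is counted as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


def resolved_rate(instances):
    """Fraction of benchmark instances that are resolved."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(inst["results"], inst["fail_to_pass"], inst["pass_to_pass"])
        for inst in instances
    )
    return resolved / len(instances)
```

Under this rule, a patch that fixes the target bug but breaks a previously passing testbench is not counted as resolved, which is consistent with finding (iii) about agents breaking files that did not need editing.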

2605.15223 2026-05-18 cs.AR cs.AI

GenAI-Driven Approach to RISC-V Supply Chain Exploration

Nenad Petrovic, Andre Schamschurko, Yingjie Xu, Alois Knoll

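Affiliations * Chair of Robotics, Artificial Intelligence and Real-Time Systems; Technical University of Munich

AI summary This paper proposes a large language model (LLM)-based workflow for analyzing the RISC-V supply chain, combining Vision-Language Models (VLMs) and Model-Driven Engineering (MDE) for multimodal, data-driven analysis of heterogeneous, unstructured supply chain data. LLMs interpret textual information, VLMs extract information from visual documents such as diagrams and tables, and the results are assembled into a supply chain knowledge graph. MDE techniques then validate dependencies, detect bottlenecks, and assess risks, supporting both exploratory and systematic analysis of supply chain resilience. Experiments indicate that the approach improves supply chain transparency and decision support in the RISC-V ecosystem.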

Details
English abstract

This paper presents an LLM-empowered workflow for RISC-V supply chain analysis, integrating Vision-Language Models (VLMs) and Model-Driven Engineering (MDE) to enable comprehensive, multimodal data-driven insights. The proposed approach addresses the challenges of heterogeneous and unstructured supply chain data by leveraging LLMs for textual understanding and VLMs for extracting information from visual artifacts such as diagrams, tables, and scanned documents. These models collaboratively identify key entities and relationships, which are then organized into a knowledge graph representing supply chain components and their interdependencies. For analytical reasoning, the workflow incorporates MDE techniques and constraint-based modeling to enable formal validation of dependencies, detection of bottlenecks, and assessment of risks. The synergy between LLM- and VLM-based semantic understanding and MDE-based formal analysis supports both exploratory and systematic evaluation of supply chain resilience. A human-in-the-loop mechanism further enables interactive querying and expert validation. The approach is evaluated in RISC-V ecosystem scenarios, demonstrating its effectiveness in generating actionable insights, enhancing transparency, and supporting decision-making in complex semiconductor supply chains.

2605.15222 2026-05-18 cs.SE cs.CL cs.PL

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

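Affiliations * HKUST; NYU; SWUPL

AI summary PerfCodeBench is an executable benchmark for evaluating the ability of large language models to perform system-level high-performance code optimization. It focuses on realistic systems tasks that require hardware-aware optimization and handling of performance bottlenecks. Each task includes correctness checks, a baseline implementation, and a reference optimized solution, so both correctness and runtime efficiency are evaluated. Experiments show that current mainstream LLMs still fall well short of expert implementations when generating efficient code, especially on parallel computing and GPU tasks, underscoring the importance of performance-oriented evaluation.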

Details
English abstract

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation is still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLM-generated code at https://anonymous.4open.science/r/perfcodebench-7CDE.

2605.15221 2026-05-18 cs.SE cs.AI cs.CL

Effective Harness Engineering for Algorithm Discovery with Coding Agents

Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

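Affiliations * Gemini Algorithmicsuperintelligence

AI summary This paper studies how to design an effective execution framework (harness) for algorithm discovery so as to improve automated algorithm generation based on large language models and evolutionary search. By examining the tradeoff between the number and depth of generated algorithms, the handling of evaluation hacks, and safe parallel execution, it proposes the improved Vesper framework and validates it on the Circle Packing problem. Experiments show that, under a fixed compute budget, generating fewer but more deeply considered algorithms gives better results, and that more capable models produce evaluation hacks more often, underscoring the importance of hack detection.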

Details
English abstract

AlphaEvolve and FunSearch have demonstrated the potential of combining large language models (LLMs) with evolutionary search for automated algorithm discovery. However, discovery success is shaped not only by model capability but also significantly by the design of the execution infrastructure, i.e., the harness. This paper investigates effective harness design through three questions: under a fixed token budget, is it better to produce many algorithms with brief thought or fewer algorithms with deeper thought? How should the harness handle evaluation hacks, where generated programs exploit the scoring function? And how can agents that require full filesystem access execute safely in parallel? Using Vesper, an algorithm discovery framework that incorporates harness improvements addressing these questions, we evaluate on Circle Packing under the same token budget. Interestingly, generating fewer algorithms while thinking more deeply about each one achieved higher scores. That is, scaling the quality of each individual is more budget-efficient than scaling the number of evolutionary generations. Surprisingly, more capable models produced evaluation hacks at higher rates, making hack detection increasingly necessary as models scale.