arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

代码大模型 / AI 编程

代码生成、软件工程智能体、程序修复、测试生成和开发者工具。

2026-06-19 至 2026-06-19 收录 7 信号源:cs.SE, cs.CL, cs.AI, cs.LG, cs.PL

1. 代码生成 4 篇

2604.11556 2026-06-19 cs.SE cs.AI 版本更新 90%

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

FM-Agent: 通过基于LLM的Hoare风格推理将形式化方法扩展到大型系统

Haoran Ding, Zhaoguo Wang, Haibo Chen

发表机构 * Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University(并行与分布式系统研究所,上海交通大学)

专题命中 代码生成 :LLM自动生成函数规范实现形式化推理

AI总结 提出FM-Agent框架,利用LLM自动生成函数级规范,实现大型系统的组合式推理,在143k行代码的系统中2天内发现522个新bug。

详情
AI中文摘要

LLM辅助的软件开发已日益普遍,并能生成如编译器这样的大型系统。增强生成代码的正确性变得至关重要。然而,由于代码复杂性,大型系统的自动推理仍然具有挑战性。Hoare逻辑提供了一种将大型系统分解为较小组件并分别推理(即组合式推理)的方法。然而,现有工作仍难以扩展,因为Hoare逻辑要求为每个函数编写形式化规范,给人类带来沉重负担。当代码由LLM生成时,问题更加严重,因为开发人员缺乏对每个函数预期行为的深入理解。本文提出FM-Agent,这是第一个实现大型系统自动化组合式推理的框架。利用LLM,FM-Agent引入了一种自顶向下的范式来自动生成函数级规范。具体来说,FM-Agent从调用者期望函数如何行为中推导出函数的规范,因此即使实现有缺陷,生成的规范也能反映开发者的意图。开发者的意图通常用自然语言表达,而现有的验证器只支持公式。因此,FM-Agent推广了Hoare风格推理,以针对自然语言规范推理函数。最后,为了确认错误存在并解释错误原因,FM-Agent自动生成测试用例以触发潜在错误。在我们的评估中,FM-Agent在2天内成功推理了大型系统,每个系统最多有143k行代码。这些系统已经由开发者测试过,但FM-Agent仍然发现了522个新错误。这些错误可能导致严重后果,包括系统崩溃和错误的执行结果。

英文摘要

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

2602.00510 2026-06-19 cs.AI cs.LG cs.SE 版本更新 85%

PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification

PCBSchemaGen: 奖励引导的LLM代码合成用于印刷电路板(PCB)原理图设计及结构化验证

Huanghaohe Zou, Peng Han, Emad Nazerian, Mafu Zhang, Zhicheng Guo, Alex Q. Huang

发表机构 * Semiconductor Power Electronics Center (SPEC)(半导体功率电子中心) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Arizona State University(亚利桑那州立大学)

专题命中 代码生成 :LLM生成PCB原理图代码合成

AI总结 提出PCBSchemaGen框架,通过结构化验证器引导冻结的LLM生成可修复的PCB原理图,在无单元测试的领域实现高准确率。

详情
AI中文摘要

大多数LLM代码合成基准依赖于单元测试作为奖励预言,但PCB原理图设计没有这样的测试:正确性由真实IC封装和引脚级分配的结构化物理约束定义,每个任务的金标准参考不可用,且SPICE仿真无法验证原理图级正确性。我们提出PCBSchemaGen,一个无需训练的推理时框架,将冻结的LLM转变为可验证、可修复的PCB原理图生成器。该框架从IC数据手册中提取领域模式以约束LLM解码,将其与一个具有引脚级错误定位的确定性5层连续奖励验证器配对,并通过汤普森采样臂获取赌博机优化候选方案。我们在两个PCB基准上评估,涵盖22个统一电路领域的227个真实IC任务,包括一个从公开原理图导出的套件,作为完全保留的泛化测试(验证器、知识图谱库和提示在评估前冻结)。在我们的框架下,一个开放权重的31B模型(Gemma-4-31B)平均通过PCBBench任务的81.3%,且同一框架在两个基准间迁移时无需更改验证器代码;而基于相同Gemma-4-31B骨干网络的Circuitron式推理时提示基线在困难的系统级设计上崩溃。这表明在确定性结构验证器下的推理时优化是在没有单元测试预言的领域中实现无参考LLM代码合成的一般方法。我们的基准和确定性验证器在此https URL公开可用。

英文摘要

Most LLM code-synthesis benchmarks rely on unit tests as the reward oracle, but PCB schematic design has none: correctness is defined by structured physical constraints over real IC packages and pin-level assignments, per-task golden references are unavailable, and SPICE simulation does not validate schematic-level correctness. We introduce PCBSchemaGen, a training-free inference-time framework that turns a frozen LLM into a verifiable, repairable PCB schematic generator. The framework induces a domain schema from IC datasheets to ground LLM decoding, pairs it with a deterministic 5-layer continuous-reward verifier with pin-level error localization, and refines candidates through a Thompson Sampling arm-acquiring bandit. We evaluate on 2 PCB benchmarks covering 227 real-IC tasks across 22 unified circuit domains, including a public-schematic-derived suite that serves as a fully held-out generalization test (verifier, KG library, and prompts frozen before any evaluation). Under our framework, an open-weight 31B model (Gemma-4-31B) passes 81.3% of PCBBench tasks on average, and the same framework transfers across both benchmarks with zero verifier code changes; a Circuitron-style inference-time prompting baseline on the same Gemma-4-31B backbone collapses on hard system-level designs. This suggests inference-time refinement under a deterministic structural verifier is a general recipe for reference-free LLM code synthesis in domains without unit-test oracles. Our benchmarks and deterministic verifier are publicly available at https://github.com/HZou9/PCBSchemaGen_v2.

2606.01338 2026-06-19 cs.CL 版本更新 80%

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

在生物制药制造中本地LLM的自然语言到SQL查询基准测试:消费级硬件上的实证基准

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta, Ambika Baniya Bhandari

发表机构 * Department of Computer Science, University of the Cumberlands(大学的计算机科学系) Department of Computer Science, DePaul University(德保罗大学计算机科学系) Youngstown State University(亚当斯州立大学)

专题命中 代码生成 :评估本地LLM在生物制药制造中的NL2SQL性能。

AI总结 本研究评估了四种本地部署的开源大语言模型在生物制药制造数据库上的自然语言到SQL生成性能,发现代码调优的通用模型优于领域特定模型,但当前性能仍需人工监督。

详情
AI中文摘要

生物制药制造组织在FDA指南、欧盟良好生产规范(GMP)和欧盟AI法案等监管框架下运营,这些框架可能限制基于云的人工智能系统的使用。本地部署的大语言模型(LLM)提供了一种保护隐私的替代方案,但它们在制药制造任务中的适用性仍未得到充分探索。本研究评估了四种通过Ollama本地部署的开源LLM(Qwen 2.5 Coder 7B、Llama 3.1 8B、Mistral 7B和Meditron 7B)在制药制造数据库上的自然语言到SQL生成能力。开发了一个基于FastAPI的评估平台PharmaBatchDB AI,使用一个包含约63,000条记录的合成Microsoft SQL Server数据库,涵盖批次、制造执行系统(MES)和在线清洗(CIP)模块。模型在60个领域特定的自然语言问题上进行了基准测试,使用的指标包括SQL提取率、SQL合规性、事实一致性、ROUGE-L、幻觉率、吞吐量和延迟。Qwen 2.5 Coder 7B、Llama 3.1 8B和Mistral 7B为所有评估任务生成了SQL,而Meditron 7B由于上下文窗口限制和SQL生成能力差,几乎在所有任务上失败。Llama 3.1 8B实现了最高的SQL合规性,而Qwen 2.5 Coder 7B在整体文本相似性和事实一致性方面最强。两个领先模型之间的性能差异在统计上不显著。结果表明,代码调优的通用LLM在制药制造数据的结构化查询生成上优于领域特定的生物医学模型。尽管完全本地化、符合GxP的NLQ系统在消费级硬件上是可行的,但当前性能水平在监管使用中仍需人工监督和下游验证。

英文摘要

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

2606.05017 2026-06-19 cs.AR cs.MS 版本更新 60%

GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity

GoldenFloat: 从GF4到GF256的基于Phi的静态拆分浮点系列及其Lucas精确整数恒等式

Dmitrii Vasilev

专题命中 代码生成 :提出GoldenFloat浮点系列RTL生成器。

AI总结 提出一种由单一闭式规则生成的静态拆分浮点系列GoldenFloat,并给出多宽度RTL生成器、Lucas精确累加器路径和FPGA编解码器三个具体实现。

Comments 20 pages, single-file LaTeX, ASCII source. v2: peer-anchor updates. Adds Sarnoff P3109 (arXiv:2606.04028), AMD MXFP4 silicon (arXiv:2605.09825), NVIDIA GB10 NVFP4 measurement, companion catalog (arXiv:2606.09686), MixFP4 (arXiv:2605.31035). FL-002 expanded: (c1) GF256 bias, (c2) count drift, (g) static-split vs micro-mixing. TTSKY26a regeneration timeline added. No mathematical claims revised

详情
AI中文摘要

我们提出一种面向硬件的GoldenFloat(GF)描述,这是一个由单一闭式规则生成的静态拆分浮点系列,以及三个具体成果:(i)一个开放的多宽度RTL生成器,覆盖GF4-GF256,并带有针对正确舍入参考的连续积分差分扫描;(ii)一个整数支持的Lucas精确累加器路径,在n=1,...,256时以500位精度验证;(iii)一个GF16 FPGA编解码器,在Artix-7(Xilinx XC7A35T)上以323 MHz通过35/35测试台。对于每个总宽度N>=4,指数宽度e=round((N-1)/phi^2),其中小数部分f=N-1-e,phi=(1+sqrt(5))/2。该规则复现了九种格式(9/9)的已实现指数宽度,并一致扩展到GF128、GF512、GF1024。该规则与posit、takum、OCP-MX以及IEEE P3109多宽度浮点草案并列。我们不对其中任何一种提出每级精度或优越性声明。广度/工具链一致性框架被记录为一个开放猜想,并带有预注册的证伪路径。证伪分类账(FL-002)记录了开放问题及解决它们的实验。报告了日期为2026-05-31的RTL正确性勘误;制造的TTSKY26b芯片带有缺陷的乘法器组合,修正后的生成器是再生基线。

英文摘要

We present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). A format-conformance oracle (Corona) ships in the same repository and is used as the blackbox check in our continuous-integration audit. The rule and its scope. For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats GF4, GF8, GF12, GF16, GF20, GF24, GF32, GF64, GF256 (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit (2022 Posit Standard), takum (Hunhold 2024, 2025), OCP-MX (Rouhani et al. 2023), and the IEEE P3109 multi-width float draft, all of which are width-spanning families under a parameterised rule. We make no per-rung accuracy or superiority claim against any of them. What is open. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path: a matched-substrate FPGA experiment and a matched-budget software ablation. A falsification ledger (FL-002) records the open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported in Section 5.5; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline.

2. 代码评测 1 篇

2511.18288 2026-06-19 cs.SE 版本更新 90%

Can Large Language Models Reason About Complex Execution Paths? An Empirical Study on Python

大型语言模型能否推理复杂执行路径?基于Python的实证研究

Wenhan Wang, Kaibo Liu, Zeyu Sun, An Ran Chen, Ge Li, Gang Huang, Lei Ma

专题命中 代码评测 :实证研究LLM在Python执行路径推理中的能力。

AI总结 本文实证研究大型语言模型在Python执行路径推理中的可行性,构建测试用例生成和缺陷分类任务,发现LLM能提升路径覆盖率,但强推理模型不一定优于弱模型。

Comments Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM)

详情
AI中文摘要

执行路径推理是理解程序语义的关键步骤,对于生成覆盖特定分支/路径的测试用例或检测由某些路径触发的缺陷(无需实际执行程序)至关重要。传统上,执行路径推理可通过符号执行技术实现,但现有的基于SMT的符号执行方法在处理复杂数据结构及外部API调用时面临困难。在具有高度灵活语法的语言(如Python)中,这一挑战更为突出,导致缺乏广泛采用的执行路径推理工具。因此,基于AI的方法进行执行路径推理成为一个有前景的方向。本文研究了采用大型语言模型(LLMs)进行Python执行路径推理的可行性,而传统的基于路径的符号执行工具在此环境中不可用。我们对两类路径推理任务进行了实证研究:用于测试用例生成的生成任务和用于缺陷检测的分类任务。我们从竞赛级程序和真实世界仓库中构建了新的评估流水线和基准。结果表明,最先进的LLMs能够正确推理执行路径,并提高真实世界软件的测试覆盖率,尽管推理能力更强的模型并不总是优于较弱的模型。这些发现凸显了利用LLMs作为路径感知代码推理的补充启发式方法的潜力,特别是在缺乏成熟符号执行工具的程序语言中。我们已在以下网址发布了基准和评估脚本:此 https URL。

英文摘要

Execution path reasoning is a key step towards program semantics understanding. It is crucial for generating test cases that cover certain branches/paths, or detecting bugs that are triggered by some paths without actually executing the program. Traditionally, execution path reasoning can be achieved by symbolic execution techniques, but existing SMT-based symbolic execution approaches struggle with complex data structures and external API calls. This challenge is even more pronounced in languages with highly flexible syntax, such as Python, resulting in a lack of widely adopted tools for reasoning on execution paths. Therefore, reasoning execution paths with AI-based approaches become a promising direction. In this paper, we investigate the feasibility of adopting large language models (LLMs) for execution path reasoning on Python, where traditional path-based symbolic execution tools are unavailable. We conduct an empirical study on two types of path reasoning tasks: generation tasks for test case generation and classification tasks for bug detection. We build new evaluation pipelines and benchmarks from both competition-level programs and real-world repositories. Our results show that state-of-the-art LLMs can perform correct reasoning on execution paths and improve test coverage on real-world software, though models with stronger reasoning abilities do not always outperform weaker ones. These findings highlight the potential of utilizing LLMs as a complementary heuristic for path-aware code reasoning, especially in program languages lacking mature symbolic execution tools. We have released our benchmark and evaluation scripts at https://github.com/jacobwwh/llm-path-study.

3. 软件智能体 1 篇

2512.00560 2026-06-19 cs.SE 版本更新 80%

SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models

SAGE: 基于语义的灰盒游戏回归测试与大型语言模型

Jinyu Cai, Jialong Li, Nianyu Li, Zhenyu Mao, Mingyue Zhang, Kenji Tei

专题命中 软件智能体 :利用LLM引导强化学习自动生成游戏测试套件。

AI总结 提出SAGE框架,利用LLM引导强化学习自动生成测试套件,通过语义多目标优化精简测试,并基于更新日志语义分析优先排序,在Overcooked Plus和Minecraft中实现高效回归测试。

Comments This paper has been accepted by Automated Software Engineering journal

详情
AI中文摘要

现代实时服务游戏的快速迭代周期使得回归测试对于维持质量和稳定性不可或缺。然而,现有的回归测试方法面临关键限制,特别是在无法完全访问源代码的常见灰盒设置中:它们严重依赖手动构建测试用例,难以维护因冗余而日益庞大的测试套件,并且缺乏有效的机制来优先排序相关测试。这些挑战导致测试成本过高、自动化程度有限以及缺陷检测不足。为了解决这些问题,我们提出了SAGE,一个面向灰盒游戏环境的语义感知回归测试框架。SAGE系统地解决了测试生成、维护和选择的核心挑战。它采用LLM引导的强化学习进行高效、目标导向的探索,以自动生成多样化的基础测试套件。随后,它应用基于语义的多目标优化,通过平衡成本、覆盖率和稀有性,将该套件精炼为紧凑、高价值的子集。最后,它利用基于LLM的更新日志语义分析,优先排序与版本变更最相关的测试用例,从而实现跨迭代的高效适应。我们在两个代表性环境Overcooked Plus和Minecraft上评估了SAGE,并与自动化基线和人工记录的测试用例进行了比较。在所有环境中,SAGE以显著更低的执行成本实现了更优的缺陷检测,并展现出对版本更新的强大适应性。

英文摘要

The rapid iteration cycles of modern live-service games make regression testing indispensable for maintaining quality and stability. However, existing regression testing approaches face critical limitations, especially in common gray-box settings where full source code access is unavailable: they heavily rely on manual effort for test case construction, struggle to maintain growing suites plagued by redundancy, and lack efficient mechanisms for prioritizing relevant tests. These challenges result in excessive testing costs, limited automation, and insufficient bug detection. To address these issues, we propose SAGE, a semanticaware regression testing framework for gray-box game environments. SAGE systematically addresses the core challenges of test generation, maintenance, and selection. It employs LLM-guided reinforcement learning for efficient, goal-oriented exploration to automatically generate a diverse foundational test suite. Subsequently, it applies a semantic-based multi-objective optimization to refine this suite into a compact, high-value subset by balancing cost, coverage, and rarity. Finally, it leverages LLM-based semantic analysis of update logs to prioritize test cases most relevant to version changes, enabling efficient adaptation across iterations. We evaluate SAGE on two representative environments, Overcooked Plus and Minecraft, comparing against both automated baselines and human-recorded test cases. Across all environments, SAGE achieves superior bug detection with significantly lower execution cost, while demonstrating strong adaptability to version updates.

4. 程序修复 1 篇

2601.22978 2026-06-19 cs.CR cs.PL 版本更新 60%

Triosecuris: Formally Verified Protection Against Speculative Control-Flow Hijacking

Triosecuris:针对推测控制流劫持的形式化验证防御

Jonathan Baumann, Yonghyun Kim, Yan Farba, Catalin Hritcu, Julay Leatherman-Brooks

专题命中 程序修复 :形式化验证防御推测控制流劫持

AI总结 提出Triosecuris,结合CET风格硬件辅助控制流完整性与编译器插入的推测加载硬化,通过形式化证明实现相对安全性,确保任意程序在推测执行下不泄露比源程序无推测时更多的信息。

Comments To appear at CSF'26; extended version with appendices. W.r.t. first revision: extended with concrete protection against Spectre RSB and renamed to Triosecuris

Journal ref 39th IEEE Computer Security Foundations Symposium (CSF) (2026) 544-559

详情
AI中文摘要

本文介绍了Triosecuris,一种针对Spectre BTB、RSB和PHT的形式化验证防御,它结合了CET风格的硬件辅助控制流完整性与编译器插入的推测加载硬化(SLH)。Triosecuris基于一个新颖的观察:在CET风格保护存在的情况下,我们可以精确检测间接调用的BTB误推测和返回的RSB误推测,并设置SLH误推测标志。我们在Rocq中将Triosecuris形式化为一种变换,并提供了机器检查的证明,表明它实现了相对安全性:任何经过变换的程序在推测执行下泄露的信息不超过源程序在无推测执行下泄露的信息。这一强安全保证适用于任意程序,即使那些不遵循密码学常数时间编程规范的程序。

英文摘要

This paper introduces Triosecuris, a formally verified defense against Spectre BTB, RSB, and PHT that combines CET-style hardware-assisted control-flow integrity with compiler-inserted speculative load hardening (SLH). Triosecuris is based on the novel observation that in the presence of CET-style protection, we can precisely detect BTB misspeculation for indirect calls and RSB misspeculation for returns and set the SLH misspeculation flag. We formalize Triosecuris as a transformation in Rocq and provide a machine-checked proof that it achieves relative security: any transformed program running with speculation leaks no more than what the source program leaks without speculation. This strong security guarantee applies to arbitrary programs, even those not following the cryptographic constant-time programming discipline.