代码大模型 / AI 编程 - arXivDaily 专题

2604.11556 2026-06-19 cs.SE cs.AI 版本更新 90%

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

FM-Agent: 通过基于LLM的Hoare风格推理将形式化方法扩展到大型系统

Haoran Ding, Zhaoguo Wang, Haibo Chen

发表机构 * Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University（并行与分布式系统研究所，上海交通大学）

专题命中代码生成：LLM自动生成函数规范实现形式化推理

AI总结提出FM-Agent框架，利用LLM自动生成函数级规范，实现大型系统的组合式推理，在143k行代码的系统中2天内发现522个新bug。

详情

AI中文摘要

LLM辅助的软件开发已日益普遍，并能生成如编译器这样的大型系统。增强生成代码的正确性变得至关重要。然而，由于代码复杂性，大型系统的自动推理仍然具有挑战性。Hoare逻辑提供了一种将大型系统分解为较小组件并分别推理（即组合式推理）的方法。然而，现有工作仍难以扩展，因为Hoare逻辑要求为每个函数编写形式化规范，给人类带来沉重负担。当代码由LLM生成时，问题更加严重，因为开发人员缺乏对每个函数预期行为的深入理解。本文提出FM-Agent，这是第一个实现大型系统自动化组合式推理的框架。利用LLM，FM-Agent引入了一种自顶向下的范式来自动生成函数级规范。具体来说，FM-Agent从调用者期望函数如何行为中推导出函数的规范，因此即使实现有缺陷，生成的规范也能反映开发者的意图。开发者的意图通常用自然语言表达，而现有的验证器只支持公式。因此，FM-Agent推广了Hoare风格推理，以针对自然语言规范推理函数。最后，为了确认错误存在并解释错误原因，FM-Agent自动生成测试用例以触发潜在错误。在我们的评估中，FM-Agent在2天内成功推理了大型系统，每个系统最多有143k行代码。这些系统已经由开发者测试过，但FM-Agent仍然发现了522个新错误。这些错误可能导致严重后果，包括系统崩溃和错误的执行结果。

英文摘要

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

URL PDF HTML ☆

赞 0 踩 0

2602.00510 2026-06-19 cs.AI cs.LG cs.SE 版本更新 85%

PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification

PCBSchemaGen: 奖励引导的LLM代码合成用于印刷电路板(PCB)原理图设计及结构化验证

Huanghaohe Zou, Peng Han, Emad Nazerian, Mafu Zhang, Zhicheng Guo, Alex Q. Huang

发表机构 * Semiconductor Power Electronics Center (SPEC)（半导体功率电子中心）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Arizona State University（亚利桑那州立大学）

专题命中代码生成：LLM生成PCB原理图代码合成

AI总结提出PCBSchemaGen框架，通过结构化验证器引导冻结的LLM生成可修复的PCB原理图，在无单元测试的领域实现高准确率。

详情

AI中文摘要

大多数LLM代码合成基准依赖于单元测试作为奖励预言，但PCB原理图设计没有这样的测试：正确性由真实IC封装和引脚级分配的结构化物理约束定义，每个任务的金标准参考不可用，且SPICE仿真无法验证原理图级正确性。我们提出PCBSchemaGen，一个无需训练的推理时框架，将冻结的LLM转变为可验证、可修复的PCB原理图生成器。该框架从IC数据手册中提取领域模式以约束LLM解码，将其与一个具有引脚级错误定位的确定性5层连续奖励验证器配对，并通过汤普森采样臂获取赌博机优化候选方案。我们在两个PCB基准上评估，涵盖22个统一电路领域的227个真实IC任务，包括一个从公开原理图导出的套件，作为完全保留的泛化测试（验证器、知识图谱库和提示在评估前冻结）。在我们的框架下，一个开放权重的31B模型（Gemma-4-31B）平均通过PCBBench任务的81.3%，且同一框架在两个基准间迁移时无需更改验证器代码；而基于相同Gemma-4-31B骨干网络的Circuitron式推理时提示基线在困难的系统级设计上崩溃。这表明在确定性结构验证器下的推理时优化是在没有单元测试预言的领域中实现无参考LLM代码合成的一般方法。我们的基准和确定性验证器在此https URL公开可用。

英文摘要

Most LLM code-synthesis benchmarks rely on unit tests as the reward oracle, but PCB schematic design has none: correctness is defined by structured physical constraints over real IC packages and pin-level assignments, per-task golden references are unavailable, and SPICE simulation does not validate schematic-level correctness. We introduce PCBSchemaGen, a training-free inference-time framework that turns a frozen LLM into a verifiable, repairable PCB schematic generator. The framework induces a domain schema from IC datasheets to ground LLM decoding, pairs it with a deterministic 5-layer continuous-reward verifier with pin-level error localization, and refines candidates through a Thompson Sampling arm-acquiring bandit. We evaluate on 2 PCB benchmarks covering 227 real-IC tasks across 22 unified circuit domains, including a public-schematic-derived suite that serves as a fully held-out generalization test (verifier, KG library, and prompts frozen before any evaluation). Under our framework, an open-weight 31B model (Gemma-4-31B) passes 81.3% of PCBBench tasks on average, and the same framework transfers across both benchmarks with zero verifier code changes; a Circuitron-style inference-time prompting baseline on the same Gemma-4-31B backbone collapses on hard system-level designs. This suggests inference-time refinement under a deterministic structural verifier is a general recipe for reference-free LLM code synthesis in domains without unit-test oracles. Our benchmarks and deterministic verifier are publicly available at https://github.com/HZou9/PCBSchemaGen_v2.

URL PDF HTML ☆

赞 0 踩 0

2606.01338 2026-06-19 cs.CL 版本更新 80%

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

在生物制药制造中本地LLM的自然语言到SQL查询基准测试：消费级硬件上的实证基准

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta, Ambika Baniya Bhandari

发表机构 * Department of Computer Science, University of the Cumberlands（大学的计算机科学系）； Department of Computer Science, DePaul University（德保罗大学计算机科学系）； Youngstown State University（亚当斯州立大学）

专题命中代码生成：评估本地LLM在生物制药制造中的NL2SQL性能。

AI总结本研究评估了四种本地部署的开源大语言模型在生物制药制造数据库上的自然语言到SQL生成性能，发现代码调优的通用模型优于领域特定模型，但当前性能仍需人工监督。

详情

AI中文摘要

生物制药制造组织在FDA指南、欧盟良好生产规范（GMP）和欧盟AI法案等监管框架下运营，这些框架可能限制基于云的人工智能系统的使用。本地部署的大语言模型（LLM）提供了一种保护隐私的替代方案，但它们在制药制造任务中的适用性仍未得到充分探索。本研究评估了四种通过Ollama本地部署的开源LLM（Qwen 2.5 Coder 7B、Llama 3.1 8B、Mistral 7B和Meditron 7B）在制药制造数据库上的自然语言到SQL生成能力。开发了一个基于FastAPI的评估平台PharmaBatchDB AI，使用一个包含约63,000条记录的合成Microsoft SQL Server数据库，涵盖批次、制造执行系统（MES）和在线清洗（CIP）模块。模型在60个领域特定的自然语言问题上进行了基准测试，使用的指标包括SQL提取率、SQL合规性、事实一致性、ROUGE-L、幻觉率、吞吐量和延迟。Qwen 2.5 Coder 7B、Llama 3.1 8B和Mistral 7B为所有评估任务生成了SQL，而Meditron 7B由于上下文窗口限制和SQL生成能力差，几乎在所有任务上失败。Llama 3.1 8B实现了最高的SQL合规性，而Qwen 2.5 Coder 7B在整体文本相似性和事实一致性方面最强。两个领先模型之间的性能差异在统计上不显著。结果表明，代码调优的通用LLM在制药制造数据的结构化查询生成上优于领域特定的生物医学模型。尽管完全本地化、符合GxP的NLQ系统在消费级硬件上是可行的，但当前性能水平在监管使用中仍需人工监督和下游验证。

英文摘要

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

URL PDF HTML ☆

赞 0 踩 0

2606.05017 2026-06-19 cs.AR cs.MS 版本更新 60%

GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity

GoldenFloat: 从GF4到GF256的基于Phi的静态拆分浮点系列及其Lucas精确整数恒等式

Dmitrii Vasilev

专题命中代码生成：提出GoldenFloat浮点系列RTL生成器。

AI总结提出一种由单一闭式规则生成的静态拆分浮点系列GoldenFloat，并给出多宽度RTL生成器、Lucas精确累加器路径和FPGA编解码器三个具体实现。

Comments 20 pages, single-file LaTeX, ASCII source. v2: peer-anchor updates. Adds Sarnoff P3109 (arXiv:2606.04028), AMD MXFP4 silicon (arXiv:2605.09825), NVIDIA GB10 NVFP4 measurement, companion catalog (arXiv:2606.09686), MixFP4 (arXiv:2605.31035). FL-002 expanded: (c1) GF256 bias, (c2) count drift, (g) static-split vs micro-mixing. TTSKY26a regeneration timeline added. No mathematical claims revised

详情

AI中文摘要

我们提出一种面向硬件的GoldenFloat（GF）描述，这是一个由单一闭式规则生成的静态拆分浮点系列，以及三个具体成果：（i）一个开放的多宽度RTL生成器，覆盖GF4-GF256，并带有针对正确舍入参考的连续积分差分扫描；（ii）一个整数支持的Lucas精确累加器路径，在n=1,...,256时以500位精度验证；（iii）一个GF16 FPGA编解码器，在Artix-7（Xilinx XC7A35T）上以323 MHz通过35/35测试台。对于每个总宽度N>=4，指数宽度e=round((N-1)/phi^2)，其中小数部分f=N-1-e，phi=(1+sqrt(5))/2。该规则复现了九种格式（9/9）的已实现指数宽度，并一致扩展到GF128、GF512、GF1024。该规则与posit、takum、OCP-MX以及IEEE P3109多宽度浮点草案并列。我们不对其中任何一种提出每级精度或优越性声明。广度/工具链一致性框架被记录为一个开放猜想，并带有预注册的证伪路径。证伪分类账（FL-002）记录了开放问题及解决它们的实验。报告了日期为2026-05-31的RTL正确性勘误；制造的TTSKY26b芯片带有缺陷的乘法器组合，修正后的生成器是再生基线。

英文摘要

We present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). A format-conformance oracle (Corona) ships in the same repository and is used as the blackbox check in our continuous-integration audit. The rule and its scope. For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats GF4, GF8, GF12, GF16, GF20, GF24, GF32, GF64, GF256 (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit (2022 Posit Standard), takum (Hunhold 2024, 2025), OCP-MX (Rouhani et al. 2023), and the IEEE P3109 multi-width float draft, all of which are width-spanning families under a parameterised rule. We make no per-rung accuracy or superiority claim against any of them. What is open. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path: a matched-substrate FPGA experiment and a matched-budget software ablation. A falsification ledger (FL-002) records the open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported in Section 5.5; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline.

URL PDF HTML ☆

赞 0 踩 0