Code LLMs / AI Programming - arXivDaily Topic

2606.19725 2026-06-19 cs.SE cs.AI cs.MA New submission 90%

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware


Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

Affiliations: School of Software Design and Data Science; Seneca Polytechnic; Advanced Micro Devices Canada

Topic match (test generation): LLM-guided multi-agent automated unit test generation and repair.

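AI summary: Unit tests for openSIL firmware often fail to build under strict build constraints. The paper proposes an LLM-guided multi-agent workflow for automated test generation and iterative repair. It generated compilable tests for 73 of 76 functions, and mean line coverage reached 98.8% on a 48-function subset when line-coverage guidance was used.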

Comments 20 pages, 10 figures



Abstract

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.
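The abstract describes an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. The following is a minimal sketch of how such a loop could be structured, not the authors' implementation. All names (`repair_loop`, `BuildResult`, `RunResult`, `Outcome`, the `build`/`run`/`repair` callbacks, the 0.9 coverage target and the iteration cap) are hypothetical, and "dispatch" is assumed here to mean running the built test binary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BuildResult:
    ok: bool   # True when compile and link both succeeded
    log: str   # raw build log, passed to the repair step on failure


@dataclass
class RunResult:
    ok: bool          # True when the test binary ran and passed
    coverage: float   # line coverage of the function under test, 0.0 to 1.0
    log: str          # test runner output


@dataclass
class Outcome:
    source: str       # final unit-test source text
    success: bool     # True if a build + run met the coverage target
    iterations: int   # number of loop iterations consumed


def repair_loop(
    test_source: str,
    build: Callable[[str], BuildResult],
    run: Callable[[str], RunResult],
    repair: Callable[[str, str], str],
    coverage_target: float = 0.9,   # assumed threshold, not from the paper
    max_iters: int = 5,             # assumed cap, not from the paper
) -> Outcome:
    """Build, run, and repair a generated test until it passes with enough coverage."""
    for i in range(1, max_iters + 1):
        built = build(test_source)
        if not built.ok:
            # Missing headers, unresolved symbols, etc.: repair from the build log.
            test_source = repair(test_source, built.log)
            continue

        result = run(test_source)
        if result.ok and result.coverage >= coverage_target:
            return Outcome(test_source, True, i)

        # The test builds but fails or under-covers: feed both signals back.
        feedback = f"{result.log}\nline coverage: {result.coverage:.1%}"
        test_source = repair(test_source, feedback)

    # Iteration budget exhausted; the last repaired source is returned unverified.
    return Outcome(test_source, False, max_iters)
```

In this sketch the `repair` callback stands in for whatever produces the next candidate (in the paper's setting, an LLM agent), while `build` and `run` would wrap the firmware toolchain and a coverage tool. Keeping them as plain callables lets the control flow be exercised with simple fakes before a real toolchain is attached.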
