QBugLM: An Agentic Benchmarking Framework for LLM-based Quantum Software Debugging
An B. B. Pham, Hoa T. Nguyen, Muhammad Usman
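AI summary: Proposes QBugLM, a multi-agent framework that automates the quantum software debugging pipeline, evaluates the debugging ability of LLMs through a case study, and finds that iterative feedback substantially improves the repair success rate.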
Comments: This paper was accepted at IEEE QSW 2026
Quantum software bugs often yield silent, incorrect outputs rather than explicit errors, making them particularly difficult to detect and repair with conventional techniques. Although large language models (LLMs) have shown strong performance on classical software engineering tasks, their ability to debug quantum code remains largely unexplored. To bridge this gap, we propose QBugLM, a multi-agent framework that automates the quantum software debugging pipeline, from taxonomy-driven bug injection to LLM-based detection and repair, and finally to simulation-based validation, for framework-agnostic OpenQASM 3.0 programs. We further conduct a comprehensive case study using QBugLM to benchmark two LLMs, Claude 4.6 Sonnet and Qwen3 Coder Next, across different prompting strategies, bug categories, and quantum programs. Our results show that iterative feedback is critical, as a single retry raises Pass@1 from below 25% to above 80%. Moreover, simpler structured prompting can even outperform Chain-of-Thought and ReAct for reasoning-capable models under fixed-resource constraints. Our work takes initial steps toward benchmarking LLM capabilities for debugging quantum programs and offers practical insights to support future efforts in automated quantum software repair.
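The repair-and-validate loop described in the abstract can be illustrated with a minimal sketch. This is not the QBugLM implementation. The names `repair_with_feedback`, `propose_fix`, and `simulate`, the total variation distance as the validation metric, and the 0.05 tolerance are all assumptions made for illustration. The sketch only shows why a silent bug needs output-distribution comparison and how one retry with feedback fits into the loop.

```python
from typing import Callable, Dict, Optional

# A measurement distribution maps bitstrings to probabilities.
Distribution = Dict[str, float]


def total_variation(p: Distribution, q: Distribution) -> float:
    """Total variation distance between two measurement distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def repair_with_feedback(
    buggy_program: str,
    reference: Distribution,
    propose_fix: Callable[[str, Optional[str]], str],  # hypothetical LLM call
    simulate: Callable[[str], Distribution],           # hypothetical simulator
    max_attempts: int = 2,
    tolerance: float = 0.05,                           # assumed threshold
) -> Optional[int]:
    """Return the 1-based attempt that passed validation, or None.

    A silent bug raises no exception, so each candidate is validated by
    comparing its simulated output distribution with the reference.
    """
    program = buggy_program
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Ask for a candidate fix, passing along feedback from the last failure.
        program = propose_fix(program, feedback)
        distance = total_variation(simulate(program), reference)
        if distance <= tolerance:
            return attempt
        feedback = (
            f"attempt {attempt} failed: "
            f"total variation distance {distance:.3f} from reference"
        )
    return None
```

With `max_attempts=1` the loop corresponds to a single-shot repair. With `max_attempts=2` it allows the one retry with feedback that the abstract reports as decisive. In practice `simulate` would wrap an OpenQASM 3.0 simulator and `propose_fix` would wrap an LLM call; both are left as plain callables here.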