VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation
Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo
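AI summary: This paper proposes VeriScale, a framework that uses adversarial implementations to expand and then reduce test suites, strengthening benchmarks for verifiable code generation. Experiments show that VerinaPlus substantially exposes model weaknesses, while VerinaLite retains discriminative power at low cost.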
As large language models (LLMs) are increasingly deployed for software engineering, constructing high-quality benchmarks is crucial for evaluating not only the functional correctness but also the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of their positive and negative test cases, which leads to an overestimation of model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a novel framework driven by adversarial implementations. It consists of two stages: test-suite expansion, which constructs diverse and challenging test cases, and test-suite reduction, which distills them into compact yet discriminative suites. While VeriScale is general, we instantiate it on Verina to construct VerinaPlus, which expands the original test suites by over 83$\times$, and VerinaLite, a lightweight 14$\times$ variant. Our experiments across eight state-of-the-art LLMs demonstrate that VerinaPlus exposes substantial model weaknesses hidden by the original benchmark, as evidenced by sharp score drops on both the SpecGen and CodeGen tasks, whereas VerinaLite maintains this discriminative power at a fraction of the evaluation cost. The enhanced benchmarks and source code are publicly available at https://github.com/XiaoyangLiu-sjtu/VeriScale.
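The abstract does not say how test-suite reduction is carried out, so the following is only an illustrative sketch of one common way to shrink a suite while keeping its discriminative power: greedy set cover over the adversarial implementations that each test rejects. The function name `reduce_test_suite`, the `kills` mapping, and the test and implementation identifiers are all hypothetical and are not taken from the VeriScale code base.

```python
from typing import Dict, List, Set


def reduce_test_suite(kills: Dict[str, Set[str]]) -> List[str]:
    """Greedy set-cover reduction (an assumed strategy, not the paper's).

    `kills` maps a test id to the set of adversarial implementations that
    the test rejects. The result is a subset of tests that still rejects
    every implementation rejected by the full suite.
    """
    # Every adversarial implementation that at least one test rejects.
    uncovered: Set[str] = set().union(*kills.values())
    selected: List[str] = []
    while uncovered:
        # Pick the test rejecting the most still-uncovered implementations;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(kills), key=lambda t: len(kills[t] & uncovered))
        selected.append(best)
        uncovered -= kills[best]
    return selected


# Hypothetical kill matrix: four tests, three adversarial implementations.
kills = {
    "t1": {"impl_a", "impl_b"},
    "t2": {"impl_b"},
    "t3": {"impl_c"},
    "t4": {"impl_a", "impl_b", "impl_c"},
}
reduced = reduce_test_suite(kills)
print(reduced)
```

Under this assumption, the reduced suite keeps only the tests needed to reject every adversarial implementation that the full suite rejects, which is one way a much smaller suite could remain discriminative at lower evaluation cost.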