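arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)

AI Large Models

Code LLMs / AI Programming

Code generation, software engineering agents, program repair, test generation, and developer tools.

Indexed today (current date): 36. Signal sources: cs.SE, cs.CL, cs.AI, cs.LG, cs.PL

1. Code Evaluation (11 papers)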

2606.20517 2026-06-19 cs.AI cs.PL New submission 95%

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

Affiliations: GigaCode Yandex School of Data Analysis, Applied AI Institute

Topic match (Code evaluation): Proposes Multi-LCB, a cross-language code-generation benchmark for evaluating LLM coding ability.

AI summary: Proposes the Multi-LCB benchmark, which extends LiveCodeBench's Python tasks to 12 programming languages to evaluate cross-language code generation by LLMs, and finds problems such as Python overfitting and language-specific contamination.

Comments ICLR 2026

Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
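The release-date filtering mentioned in the abstract can be illustrated with a small sketch. It assumes that each problem carries a release date and that the model under test has a known training cutoff; the `Problem` record, its field names, and the sample data are hypothetical and are not taken from LCB or Multi-LCB.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; real benchmark problems carry more fields.
    problem_id: str
    release_date: date
    language: str


def contamination_free(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


problems = [
    Problem("p1", date(2024, 3, 1), "python"),
    Problem("p2", date(2025, 2, 10), "rust"),
    Problem("p3", date(2025, 6, 5), "go"),
]
fresh = contamination_free(problems, training_cutoff=date(2025, 1, 1))
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```

Because the filter depends only on the date, the same cutoff can be applied to every target language once a task has been translated.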

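2606.19830 2026-06-19 cs.SE cs.CL New submission 90%

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

Affiliations: Nankai University, Shanghai Innovation Institute, Shanghai AI Laboratory

Topic match (Code evaluation): A project-level game code framework dataset and benchmark for evaluating code-generation models.

AI summary: Presents JamSet and JamBench, the first project-level code framework dataset and benchmark built on a professional game engine. A deterministic verification pipeline distills 8,133 verified projects from 240,000 repositories, and an evaluation of 9 frontier models shows capability dropping sharply as project scale grows.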

Abstract

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

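2606.20502 2026-06-19 cs.CR cs.AI cs.SE New submission 85%

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

Arastoo Zibaeirad, Marco Vieira

Affiliations: University of North Carolina at Charlotte

Topic match (Code evaluation): Evaluates LLM capability for vulnerability detection in systems software.

AI summary: Presents the CWE-Trace framework, which evaluates LLM vulnerability detection using 834 Linux kernel samples and two diagnostic metrics (DFI and HDD). It finds that data contamination gives no substantive help, that fine-tuning only shifts the output threshold rather than the decision policy, and that models lack genuine security reasoning.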

Abstract

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable--patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.

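2606.19613 2026-06-19 cs.SE cs.AI New submission 85%

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

Affiliations: AWS Agentic AI

Topic match (Code evaluation): Proposes StaminaBench to stress-test the stamina of coding agents.

AI summary: Proposes the StaminaBench benchmark, which tests coding-agent stamina with 100 consecutive change requests. All models fail within 5-6 turns, while test feedback and retries raise the number of passed turns by up to 12x.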

Abstract

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
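The stamina idea, counting consecutive passed turns before the first failure, can be sketched in a few lines. The abstract does not say how StaminaBench scores retries or aggregates across scenarios, so both functions below are illustrative assumptions rather than the benchmark's actual scoring code.

```python
def stamina(turn_results):
    """Number of consecutive passing turns before the first failing turn."""
    passed = 0
    for ok in turn_results:
        if not ok:
            break
        passed += 1
    return passed


def fraction_passed(turn_results):
    """Contrasting order-insensitive metric: share of turns that passed."""
    return sum(turn_results) / len(turn_results) if turn_results else 0.0


# Hypothetical session: one early failure, everything else passes.
session = [True, True, True, False, True, True, True, True]
print(stamina(session))          # 3
print(fraction_passed(session))  # 0.875
```

The example shows why the two views diverge: a single early failure caps stamina at 3 even though 7 of 8 turns passed.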

2606.19388 2026-06-19 cs.SE cs.CL cs.HC New submission 85%

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

Affiliations: Mila – Québec AI Institute, Concordia University, University of Toronto, McMaster University

Topic match (Code evaluation): Evaluates coding agents on mobile platforms.

AI summary: Challenges the GUI-dominated paradigm for mobile agents and argues that CLI deserves equal standing. Experiments show CLI agents surpassing GUI baselines on AndroidWorld and MobileWorld, and a CLI-Advantage task suite is introduced to demonstrate the advantage.

Abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.
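The oracle ceilings quoted in the abstract follow directly from the CLI-solvable task counts. The short check below reproduces that arithmetic; the helper name `solvable_rate` is hypothetical.

```python
def solvable_rate(solvable, total):
    """Share of tasks with an oracle CLI solution, as a percentage."""
    return 100.0 * solvable / total


android = solvable_rate(103, 116)  # AndroidWorld counts from the abstract
mobile = solvable_rate(101, 117)   # MobileWorld counts from the abstract
print(f"{android:.1f}% {mobile:.1f}%")  # 88.8% 86.3%
```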

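2606.06747 2026-06-19 cs.SE New submission 85%

Tensor Algebraic Property Skeletons: Amplifying Property-Based Testing for AI Compilers

Yuxin Qiu, Ben Limpanukorn, Seongmin Lee, Jiyuan Wang, Qian Zhang, Miryung Kim

Topic match (Code evaluation): LLM-generated property tests that detect semantic drift in AI compilers.

AI summary: Proposes the Propilot framework, which uses an LLM to represent tensor algebra knowledge as reusable property skeletons and automatically generates executable property-based tests that detect semantic drift in AI compilers.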

Comments v2 adds citations and fixes some typos

Abstract

Deep learning (DL) compilers such as TVM and ONNX-MLIR lower tensor computation graphs into optimized executables for target backends. Testing these compilers has made substantial progress in generating well-formed inputs in the context of fuzzing. However, such generation alone does not catch semantic drifts from algebraic invariants that graph transformations and optimizations are expected to preserve. While tensor algebra has been studied for decades, it has not been transformed into executable property-based tests (PBTs) for DL compilers because doing so requires the time-consuming and error-prone task of jointly constructing operators, tensors, and oracles. The central challenge is no longer generating well-formed inputs for fuzzing DL compilers, but bootstrapping executable PBTs with such inputs and correct oracles based on tensor algebra. We realize this vision in Propilot, an LLM-driven agentic property-based testing framework for DL compilers. First, Propilot represents tensor algebra knowledge as reusable property skeletons, each coupled with operator constraints and oracle templates. Second, given a target compiler, Propilot instantiates these skeletons into executable PBTs by generating paired tensor computation graphs, tensor inputs, and expected semantic relations as oracles. Third, to prevent generated tests from degenerating into invalid or uninformative PBTs, Propilot validates each PBT candidate before execution for applicability and safety. Validation feedback, execution results, and coverage signals guide subsequent generation. We evaluate Propilot on TVM with 212 operators and 20 property skeletons, generating 4,579 PBTs. Compared with direct LLM-based PBT generation, Propilot reduces redundancy by 49% and eliminates invalid tests through explicit property skeletons. This effectiveness translates into finding semantic errors and numerical discrepancies.
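To make the idea of an algebraic property used as a test oracle concrete, here is a minimal sketch. It checks one tensor-algebra identity, that the transpose of a product equals the product of the transposes in reverse order, against an operator under test. This is not Propilot: the operator is a plain Python function standing in for compiled code, and `clobbering_matmul` is a hypothetical miscompilation invented for illustration.

```python
import random


def matmul(a, b):
    """Reference matrix product on nested lists of integers."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def clobbering_matmul(a, b):
    """Hypothetical miscompilation: the last output column is zeroed."""
    out = matmul(a, b)
    for row in out:
        row[-1] = 0
    return out


def random_matrix(rng, rows, cols):
    return [[rng.randint(-9, 9) for _ in range(cols)] for _ in range(rows)]


def check_transpose_of_product(op, trials=200, seed=0):
    """Property: op(A, B)^T == op(B^T, A^T) for shape-compatible A and B.

    Returns the first counterexample (A, B), or None if every trial holds.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        n, k, m = (rng.randint(1, 4) for _ in range(3))
        a, b = random_matrix(rng, n, k), random_matrix(rng, k, m)
        if transpose(op(a, b)) != op(transpose(b), transpose(a)):
            return a, b
    return None


print(check_transpose_of_product(matmul) is None)                 # True
print(check_transpose_of_product(clobbering_matmul) is not None)  # True
```

The oracle needs no expected output for any single input, only the relation between two runs. Integer entries keep the comparison exact; a floating-point version would need tolerances.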

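2606.20436 2026-06-19 cs.CR cs.AI New submission 80%

Multi-View Decompilation for LLM-Based Malware Classification

Bercan Turkmen, Vyas Raina

Affiliations: Independent Researcher, SPARK

Topic match (Code evaluation): Uses LLMs to classify malware from decompiled code.

AI summary: Proposes multiple decompiler views to improve LLM malware classification, using complementary pseudo-C from Ghidra and RetDec to raise recall and F1.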

Abstract

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.

2606.20146 2026-06-19 cs.AI New submission 80%

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

Bharathi Kannan Nithyanantham, Clemens Kujat, Tobias Sesterhenn, Stefan Telgmann, Jörn Plönnigs, Stefan Lüdtke, Christian Bartelt

Affiliations: University of Rostock, Clausthal University of Technology

Topic match (Code evaluation): A benchmark evaluating LLMs on editing building information models.

AI summary: Proposes the BIM-Edit benchmark, which evaluates natural-language editing of IFC-format building information models by LLMs across 324 tasks. The best model averages only 49.5%, revealing a gap between current capabilities and engineering requirements.

Abstract

Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories (direct, spatial, and topological), covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves only 49.5% average score across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.

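2606.20128 2026-06-19 cs.SE cs.DC cs.LG New submission 80%

The Correctness Illusion in LLM-Generated GPU Kernels

Dipankar Sarkar

Affiliations: Arizona State University, USA

Topic match (Code evaluation): Evaluates the correctness of LLM-generated GPU kernels.

AI summary: Using a high-precision CPU reference and op-schema-aware fuzzing, shows that the fixed-shape allclose checks in existing benchmarks fail to detect LLM-style transcription errors, and proposes and validates a new protocol.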

Comments 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace

Abstract

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
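The gap between a single fixed-shape check and seeded fuzzing over shapes can be shown with a CPU-only toy. The sketch below is not the paper's protocol: the "kernel" is a plain Python reduction, the bug (dropping the tail that does not fill a block) is a made-up example of a transcription-style error, and all function names are hypothetical.

```python
import math
import random


def reference_sum(xs):
    """High-precision reference; math.fsum avoids accumulated rounding error."""
    return math.fsum(xs)


def buggy_blocked_sum(xs, block=4):
    """Transcription-style bug: elements past the last full block are dropped."""
    full = len(xs) - len(xs) % block
    return sum(xs[:full])


def fixed_shape_check(kernel, n=8, tol=1e-9):
    """Single-shape check, in the spirit of allclose on one fixed input."""
    xs = [float(i) for i in range(n)]
    return math.isclose(kernel(xs), reference_sum(xs), abs_tol=tol)


def seeded_fuzz(kernel, seed=0, trials=100, tol=1e-9):
    """Vary the input length under a stored seed; return a failing input or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(1.0, 2.0) for _ in range(rng.randint(1, 33))]
        if not math.isclose(kernel(xs), reference_sum(xs), abs_tol=tol):
            return xs
    return None


print(fixed_shape_check(buggy_blocked_sum))        # True: the bug is invisible at n=8
print(seeded_fuzz(buggy_blocked_sum) is not None)  # True: a non-multiple-of-4 length exposes it
print(seeded_fuzz(sum) is None)                    # True: the correct control stays clean
```

Because the generator is seeded, calling `seeded_fuzz` again with the same seed reproduces the same failing input, which mirrors the replay-from-seed property described in the abstract.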

2606.19710 2026-06-19 cs.CL cs.AI New submission 80%

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

Elijah Feldman, Dipak Meher, Carlotta Domeniconi

Affiliations: Thomas Jefferson High School for Science and Technology

Topic match (Code evaluation): Fine-tunes an LLM for NER and RE in knowledge graph construction.

AI summary: Proposes FineREX, a pipeline built on a fine-tuned LLM that extracts entities and relations from legal documents to build knowledge graphs, improving F1 by 15.50% and 31.46% respectively and cutting processing time by 50%.

Comments Code available at https://github.com/ElijahFeldman7/FineREX

Abstract

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of 512 text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.

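2606.20134 2026-06-19 cs.LO cs.PL New submission 70%

An MSO Framework for Weak-Memory Verification and Robustness

Giovanna Kobus Conrado, Andreas Pavlogiannis

Topic match (Code evaluation): An MSO framework for weak-memory verification and robustness.

AI summary: Studies monadic second-order logic as a metatheory for weak memory, proves that executions under sequential consistency have bounded treewidth while those under TSO do not, shows that several models are MSO-axiomatizable, and introduces reads-from robustness, yielding a uniform verification algorithm.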

Comments Accepted at CONCUR 2026

Abstract

Memory models are formal specifications of concurrent-program executions, accounting for weak behaviors introduced by compiler and architectural optimizations. The increase of their number and complexity has spawned efforts for uniform verification across whole classes of models, by axiomatizing the models in an adequate metatheory that admits a uniform treatment. In this work, we formally study Monadic Second-Order logic (MSO) as a metatheory for weak memory, by proving results on the treewidth and MSO-expressibility of various popular weak-memory models, as this combination allows us to uniformly tackle several verification problems. In summary, our results are as follows. First, we prove that executions under Sequential Consistency (SC) have bounded treewidth, while already those under Total Store Order (TSO) do not. Second, we prove that a broad range of models, including Release/Acquire and the full RC20, are MSO-axiomatizable, while others, such as Strong Release/Acquire and TSO, are not, unless the Orthogonal Vectors problem (which requires quadratic time under SETH) can be solved in linear time. Finally, we introduce the notion of reads-from robustness, as an extension to recent work on coarse robustness criteria. We show that our treewidth bounds (both upper and lower) have far-reaching algorithmic implications for any of our MSO-axiomatizable models MM: there is an algorithm that, for every program P, either verifies P under MM or reports that P is not reads-from robust against MM. Overall, our results establish a rich and versatile theoretical framework for weak-memory verification and robustness.

2. Software Agents (6 papers)

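2606.20512 2026-06-19 cs.SE cs.LG New submission 90%

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Asa Shepard, Jeannie Albrecht

Affiliations: Williams College

Topic match (Software agents): Proposes probe-and-refine tuning of repository guidance for coding agents.

AI summary: Proposes probe-and-refine tuning, which uses synthetic bug-fix probes to iteratively diagnose and patch repository guidance files. On SWE-bench Verified with Qwen3.5-35B-A3B it reaches a 33.0% resolve rate, ahead of 28.3% for the static knowledge base and 25.5% for the unguided baseline.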

Abstract

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

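2606.20243 2026-06-19 cs.SE cs.MA New submission 90%

Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

Kipngeno Koech, Muhammad Adam, Baimam Boukar Jean Jacques, Joao Barros

Topic match (Software agents): A multi-agent LLM system for resolving GitHub issues.

AI summary: Proposes Phoenix, a multi-agent LLM system with six specialized agents and seven layered safety controls, which resolves 75% of a SWE-bench Lite slice and achieves 100% correctness preservation on real issues.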

Abstract

We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents (planner, reproducer, coder, tester, failure analyst, and pull-request (PR) agent), all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite, run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
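One way to picture a baseline-aware test check is to compare per-test outcomes before and after a change, and to block only on tests that passed in the baseline and fail afterwards. The abstract does not describe Phoenix's gating logic in detail, so the function, its output keys, and the sample test names below are assumptions made for illustration.

```python
def evaluate_against_baseline(baseline, after):
    """Compare per-test outcomes (test name -> passed?) before and after a change."""
    regressions = sorted(
        t for t, ok in baseline.items() if ok and not after.get(t, False)
    )
    fixed = sorted(t for t, ok in after.items() if ok and not baseline.get(t, False))
    preexisting = sorted(
        t for t, ok in baseline.items() if not ok and not after.get(t, False)
    )
    return {
        "regressions": regressions,           # pass -> fail (or missing): blocks the PR
        "fixed": fixed,                       # fail or absent -> pass
        "preexisting_failures": preexisting,  # failed before and still failing
        "safe_to_open_pr": not regressions,
    }


# Hypothetical test outcomes for one change.
baseline = {"test_a": True, "test_b": False, "test_flaky": False}
after = {"test_a": True, "test_b": True, "test_flaky": False}
report = evaluate_against_baseline(baseline, after)
print(report["fixed"], report["safe_to_open_pr"])  # ['test_b'] True
```

Treating failures that already existed in the baseline separately keeps a change from being blamed for breakage it did not cause.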

2606.19380 2026-06-19 cs.SE cs.LG 新提交 90%

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

AgentArmor:编码代理失败的框架、评估与缓解

Kenneth Ge, Andre Assis

发表机构 * Anthropic Fellows Program(Anthropic Fellow 项目) Constellation

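Topic match (software agents): Studies the failure modes of coding agents and proposes a mitigation framework.

AI summary: Proposes the AgentArmor framework, which mitigates coding-agent failures caused by underspecification, capability errors, and harness errors through an extended system prompt, a command classifier, a "3 strikes" policy, and related mechanisms, significantly improving safety.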

详情
AI中文摘要

软件工程和部署正越来越多地委托给AI编码代理。它们的广泛采用暴露了罕见但极具破坏性的失败模式。在本文中,我们研究这些失败模式源于三种不同的机制:规范不足,即默认模型行为不安全;能力错误,即安全动作可用但模型因偏见或能力限制而未遵循;以及代理工具错误,即模型未能通过工具执行安全动作。我们在8个不同的评估中评估这些机制,每个评估都受实际部署失败的启发,总计20个编码环境和59个合成转录模板。基于此评估,我们提出AgentArmor,一种代理工具修改,以缓解这些错误。通过添加扩展的系统提示、单独的命令分类器、“三振”策略、确定性护栏以及代理编辑自身上下文的工具,我们证明AgentArmor在统计显著数量的样本上更安全。因此,我们为当前编码代理提出具体缓解措施,并为未来代理工具功能提出设计理念。

英文摘要

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
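The abstract does not spell out how the command classifier and the "3 strikes" policy interact, so the following is only one plausible reading: each blocked command counts as a strike, and the session halts at the third. The class name `StrikeGuard`, the deny-list, and the return values are all hypothetical and are not AgentArmor's actual design.

```python
class StrikeGuard:
    """Toy harness guard: a command classifier plus a strike counter."""

    # Hypothetical deny-list standing in for a real command classifier.
    RISKY_PREFIXES = ("rm -rf", "git push --force", "drop table")

    def __init__(self, max_strikes=3):
        self.max_strikes = max_strikes
        self.strikes = 0

    def is_risky(self, command):
        # str.startswith accepts a tuple of candidate prefixes.
        return command.strip().lower().startswith(self.RISKY_PREFIXES)

    def review(self, command):
        """Return 'allow', 'block', or 'halt' for a proposed shell command."""
        if self.strikes >= self.max_strikes:
            return "halt"  # the session is already halted
        if self.is_risky(command):
            self.strikes += 1
            return "halt" if self.strikes >= self.max_strikes else "block"
        return "allow"
```

A deterministic prefix check is used here purely to keep the sketch self-contained; a real classifier would be a separate model or rule engine.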

2606.14066 2026-06-19 cs.SE 新提交 90%

FastContext: Training Efficient Repository Explorer for Coding Agents

FastContext: 为编码智能体训练高效的仓库探索器

Shaoqiu Zhang, Maoquan Wang, Yuling Shi, Yuhang Wang, Xiaodong Gu, Yongqiang Yao, Tori Gong, Sheng Chen, Rao Fu, Anisha Agarwal, Spandan Grag, Gabriel Ryan, Colin Merkel, Yufan Huang, Shengyu Fu

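Topic match (software agents): A repository explorer for coding agents.

AI summary: Proposes FastContext, a dedicated exploration subagent that separates repository exploration from problem solving through parallel tool calls and focused context generation; on SWE-bench and related tasks it improves resolution rates by up to 5.5% and cuts coding-agent token consumption by up to 60%.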

Comments 34 pages, 7 figures

详情
AI中文摘要

大型语言模型(LLM)编码智能体在软件工程任务上取得了强劲成果,但仓库探索仍是主要瓶颈:定位相关代码消耗大量token预算,并用不相关的片段污染智能体的上下文。在大多数智能体中,同一个模型既探索仓库又解决问题,将探索性读取和搜索留在求解器的历史记录中。我们提出FastContext,一个专用的探索子智能体,将仓库探索与求解分离。按需调用时,FastContext发出并行工具调用,并返回简洁的文件路径和行范围作为聚焦上下文。FastContext由专门的探索模型驱动,参数规模从4B到30B。我们从强参考模型轨迹中引导这些模型,并使用任务导向的奖励进行细化,以实现广泛的首轮搜索、多轮证据收集和精确的引用生成。在SWE-bench Multilingual、SWE-bench Pro和SWE-QA上,将FastContext集成到Mini-SWE-Agent中,端到端修复率提升高达5.5%,同时编码智能体token消耗降低高达60%,且开销极小。这些结果表明,仓库探索可以与求解分离,并由专门模型有效处理。代码和数据:https://github.com/microsoft/fastcontext

英文摘要

Large Language Model (LLM) coding agents have achieved strong results on software engineering tasks, yet repository exploration remains a major bottleneck: locating relevant code consumes substantial token budget and pollutes the agent's context with irrelevant snippets. In most agents, the same model explores the repository and solves the task, leaving exploratory reads and searches in the solver's history. We present FastContext, a dedicated exploration subagent that separates repository exploration from solving. Invoked on demand, FastContext issues parallel tool calls and returns concise file paths and line ranges as focused context. FastContext is powered by specialized exploration models spanning 4B to 30B parameters. We bootstrap them from strong reference-model trajectories and refine them with task-grounded rewards for broad first-turn search, multi-turn evidence gathering, and precise citation generation. Across SWE-bench Multilingual, SWE-bench Pro, and SWE-QA, integrating FastContext into Mini-SWE-Agent improves end-to-end resolution rates by up to 5.5% while reducing coding-agent token consumption by up to 60%, with marginal overhead. These results show that repository exploration can be separated from solving and handled effectively by specialized models. Code and data: https://github.com/microsoft/fastcontext

2606.19616 2026-06-19 cs.SE cs.AI cs.MA 新提交 80%

Before the Pull Request: Mining Multi-Agent Coordination

在拉取请求之前:挖掘多智能体协调

Dipankar Sarkar

发表机构 * Arizona State University(亚利桑那州立大学)

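Topic match (software agents): Proposes the grite coordination substrate to reduce conflicts among multiple coding agents.

AI summary: To address insufficient coordination among autonomous coding agents around pull requests, proposes grite, a git-based coordination substrate whose event log reduces duplicate and conflicting work, raises throughput, and makes several failure modes automatically recoverable from the log.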

Comments 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite

详情
AI中文摘要

自主编码智能体现在可以开启数百万个拉取请求,然而大规模研究发现,它们的拉取请求虽然生成更快,但被接受的频率却更低——这是一个拉取请求级别的遥测无法解释的协调与信任差距。我们认为缺失的信号存在于拉取请求之前,即并发智能体如何声明、划分和碰撞共享工作。我们通过grite(我们的开源协调基板)来研究这一过程,它不需要中央服务器,并将其记录存储在git本身内部,因此其仅追加的、签名的事件日志直接捕获了协调过程。我们证明:(i) 这种共享基板以有限的开销减少了重复和冲突工作——仅重复队友任务的工作份额从78%降至0%,而有效吞吐量增加了三倍以上;(ii) 每个智能体的日志副本收敛到相同状态,没有写入被静默丢弃,而基于文件的跟踪器会丢失并发写入;(iii) 该日志是一个可挖掘的工件,从中可以自动恢复具体的故障模式——冲突编辑、锁饥饿、冗余发现、竞态关闭——并带有来源信息,其中一些在拉取请求历史中是不可见的。我们发布了数据集、测试平台和挖掘工具包。

英文摘要

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
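The convergence claim in point (ii) can be illustrated with a generic technique: merging append-only logs by set union and ordering the result deterministically, so that every replica ends in the same state regardless of merge order. This is a sketch of that general idea only; the event tuple layout and the `merge_logs` helper are hypothetical and do not describe grite's actual format or algorithm.

```python
def merge_logs(*replicas):
    """Merge append-only event logs by union, ordered deterministically.

    Each event is a (timestamp, agent_id, payload) tuple; this layout is
    hypothetical and is not grite's on-disk format.
    """
    merged = set()
    for replica in replicas:
        # Union never discards an event, so no write is silently dropped.
        merged.update(replica)
    # Sorting by the full tuple gives every replica the same total order.
    return sorted(merged)
```

Because set union is commutative, associative, and idempotent, two agents that exchange logs in either order reach the same merged log, which is the property a last-writer-wins file tracker lacks.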

2606.20487 2026-06-19 cs.CL 新提交 70%

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

超越全局重规划:跨设备智能体系统的分层恢复

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) Shanghai Innovation Institute(上海创新研究院) Southeast University(东南大学) Tsinghua University(清华大学)

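Topic match (software agents): Covers API-CLI-GUI execution and failure recovery.

AI summary: Proposes H-RePlan, a hierarchical replanning framework that uses unified API-CLI-GUI execution and a cross-layer failure abstraction to separate device-local strategy recovery from global replanning; on the HeraBench benchmark it significantly improves cross-device task completion and instruction adherence.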

详情
AI中文摘要

现实世界中的计算机使用任务通常跨越多个应用程序和设备,要求智能体在动态运行时故障下协调异构环境。现有的多设备智能体系统支持任务分解和跨设备分配,但恢复仍然粗粒度:当执行失败时,它们通常重试相同策略、重新分配子任务或修改全局计划,而没有系统地建模设备本地策略空间。这限制了它们区分可在当前设备内修复的故障与需要跨设备重规划的故障的能力。我们提出H-RePlan,一个用于具有统一API-CLI-GUI执行的多设备智能体的分层重规划框架。H-RePlan为每个设备配备可互换的执行策略,并通过紧凑的跨层失败抽象将设备本地策略恢复与编排器级全局重规划分离。为了评估这一能力,我们引入HeraBench,一个故障注入基准,它在Linux和Android设备上构建跨设备工作流,并注入策略级和设备级故障。实验表明,H-RePlan显著优于单策略和粗粒度多设备基线,实现了更高的完成率、指令遵循率和完美通过率,同时降低了可靠端到端成功所需的令牌成本。这些结果表明,范围感知的分层恢复对于鲁棒的多设备智能体执行至关重要。

英文摘要

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.

3. Code generation (10 papers)

2606.20158 2026-06-19 cs.SE 新提交 90%

N-Version Programming with Coding Agents

使用编码代理的N版本编程

Javier Ron, Benoit Baudry, Martin Monperrus

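Topic match (code generation): Uses coding agents to generate implementations and evaluates how diversity affects failure modes.

AI summary: Revisits N-version programming in the setting of contemporary AI coding agents. Using the Knight-Leveson experiment, it evaluates how diversity across agent systems, models, and implementation languages affects failure modes; it finds common-mode failures, but majority-voting three-version units significantly reduce failure counts, supporting the strategy's practical engineering value.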

详情
AI中文摘要

本文在当代AI编码代理背景下重新审视N版本编程这一经典概念。通过重访开创性的Knight-Leveson实验,我们研究了代理系统、模型和实现语言之间的多样性是否会产生多样化的故障模式。使用Knight-Leveson的发射拦截器程序规范,我们在共享的预言机和100万个随机测试输入的测试集上评估了48个代理生成的实现。结果显示,与Knight-Leveson的发现一致,存在大量的共模故障。进一步分析表明,许多这些同时发生的故障可以追溯到规范中特别困难或模糊的地方。我们还证明了编码代理的多样性带来了实际效益:在多数投票的三版本单元中,平均故障数从单版本的387.44下降到三版本的130.99,并且有11,844个N版本单元表现出零观测故障。我们的原始结果是迄今为止最强的证据,表明使用编码代理的N版本编程是一种有用的工程策略。

英文摘要

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using the Knight-Leveson Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.
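A majority-voting three-version unit, as referred to in the abstract, can be sketched as follows. The helper names `majority_vote` and `unit_fails`, and the string outputs used in the usage note, are illustrative assumptions; the paper's own harness is not described in the abstract.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output produced by a strict majority of versions, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def unit_fails(outputs, oracle):
    """A voting unit fails when its majority output is absent or differs from the oracle."""
    return majority_vote(outputs) != oracle
```

For example, `unit_fails(["launch", "launch", "hold"], "launch")` is `False` because one wrong version is outvoted, while `unit_fails(["hold", "hold", "launch"], "launch")` is `True`: when two versions fail in the same way, voting cannot mask the error, which is why common-mode failures matter.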

2606.19988 2026-06-19 cs.SE 新提交 90%

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

基于大语言模型的仓库级Solidity代码生成:从提示到微调

Shi Chen, Rongcun Wang, Yuan Tian, Xiaoyuan Xie, Wei Song, Rubing Huang

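Topic match (code generation): Evaluates LLMs on Solidity code generation.

AI summary: Proposes the SolidityBench benchmark and the SolidityScore metric, evaluates several LLM approaches on repository-level Solidity code generation, and finds supervised fine-tuning the most effective.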

Comments 33 pages

详情
AI中文摘要

大语言模型(LLMs)在通用代码生成方面表现出强大的能力,但其在专业软件领域的有效性仍未得到充分探索。Solidity智能合约代表了一个高风险领域,生成的代码必须满足严格的语言级、安全性和软件工程约束。现有的基准和指标对于仓库级Solidity生成仍然不足,其中模型必须从自然语言需求中合成完整的合约。为了解决这一差距,我们引入了SolidityBench,一个包含5,470个仓库级Solidity智能合约及其自然语言描述的基准。我们还提出了SolidityScore,一种基于Solidity的语义度量,强调领域关键结构,如安全修饰符、合约声明和Solidity特定关键词。使用该基准,我们评估了代表性的代码LLM,包括Qwen2.5-Coder、DeepSeek-Coder和CodeLlama,涵盖零样本提示、思维链推理、上下文学习、检索增强生成和监督微调。结果表明,通用模型在仓库级Solidity生成中表现出系统性的结构缺陷。在非参数方法中,检索增强生成表现最佳,而上下文学习在超过两个示例后因上下文饱和而性能下降。监督微调通过将Solidity特定约束内化到模型参数中实现了最大的改进。总体而言,我们的研究为仓库级Solidity代码生成提供了全面的基准,并表明高质量领域数据结合监督微调是提高LLM生成智能合约可靠性的最有效策略。

英文摘要

Large Language Models (LLMs) have shown strong capabilities in general-purpose code generation, but their effectiveness in specialized software domains remains underexplored. Solidity smart contracts represent a high-stakes domain where generated code must satisfy strict language-level, security, and software-engineering constraints. Existing benchmarks and metrics remain insufficient for repository-level Solidity generation, where models must synthesize complete contracts from natural language requirements. To address this gap, we introduce SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions. We also propose SolidityScore, a Solidity-aware semantic metric that emphasizes domain-critical constructs such as security modifiers, contract declarations, and Solidity-specific keywords. Using this benchmark, we evaluate representative code LLMs, including Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama, across zero-shot prompting, Chain-of-Thought reasoning, in-context learning, retrieval-augmented generation, and supervised fine-tuning. The results show that general-purpose models exhibit systematic structural deficiencies in repository-level Solidity generation. Among non-parametric methods, retrieval-augmented generation performs best, while in-context learning degrades beyond two examples due to context saturation. Supervised fine-tuning achieves the largest improvement by internalizing Solidity-specific constraints into model parameters. Overall, our study provides a comprehensive benchmark for repository-level Solidity code generation and shows that high-quality domain data combined with supervised fine-tuning is the most effective strategy for improving the reliability of LLM-generated smart contracts.

2606.19387 2026-06-19 cs.SE cs.AI 新提交 90%

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

可解释且可验证的硬件生成:基于LLM驱动的逐步细化

You Li, Samuel Mandell, David Z. Pan

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Fudan University(复旦大学) USA(美国)

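Topic match (code generation): Uses LLMs to generate RTL hardware code in combination with formal methods.

AI summary: Proposes a hardware generation framework that combines the creativity of LLMs with the explainability of formal methods, converting a design specification into an RTL program with guaranteed correctness by iteratively applying transformation rules.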

详情
AI中文摘要

大型语言模型(LLM)在软件开发中取得了显著成功。然而,它们容易产生幻觉,即可能引入微妙的语义和逻辑错误。由于芯片设计和制造的高风险,硬件工程师仍不愿依赖LLM进行寄存器传输级(RTL)生成。本文提出一种硬件生成框架,结合了LLM的创造力和广泛知识与形式化方法的可解释性和数学严谨性。具体而言,我们设计了一组覆盖各种设计决策和硬件特征的变换规则。通过迭代应用这些规则,LLM代理可以将设计规范转换为正确性有保证的RTL程序。实验结果证明了该框架的有效性和效率。

英文摘要

Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.

2606.19347 2026-06-19 cs.CL cs.AI cs.PL 新提交 90%

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

LLM在硬件设计的RTL编码中如何失败与泛化?

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

发表机构 * NVIDIA Research(英伟达研究院)

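Topic match (code generation): Analyzes how LLMs fail and generalize in RTL coding.

AI summary: Proposes an error taxonomy grounded in problem solvability and shows that LLM RTL coding is bounded by pretraining knowledge: alignment techniques merely teach models to compile, while reasoning ability is the key bottleneck.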

Comments Preview, under submission for EMNLP 2026

详情
AI中文摘要

将顺序编程先验转换为硬件设计的并行时序逻辑仍然是大型语言模型(LLM)的关键瓶颈。为了研究这一点,我们引入了一种新的错误分类法,该分类法基于问题可解性,受认知理论启发。我们的分类法将失败分为语法、语义、可解功能和不可解功能类型。评估揭示了VerilogEval基准上的严格经验上限,前沿模型初始通过率稳定在90.8%。这些平台期由不可解的功能错误定义,暴露出对测试时计算扩展免疫的持续知识差距。此外,我们揭示了一个显著的表面收敛差距:优化容易消除语法错误,但同时加剧了更深层次的功能失败。我们的发现表明,对齐技术仅仅教会模型编译。虽然重复采样策略可以修补可解错误,但寄存器传输级(RTL)编码能力仍然严格受限于预训练知识。解决当前基于LLM的硬件生成流水线中的挑战需要更多关于模型推理的研究,而不是对齐干预。

英文摘要

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

2606.20373 2026-06-19 cs.SE cs.AI 新提交 85%

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass:基于证据的LLM智能体用于编译器性能调优

Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

发表机构 * Shaanxi Normal University(陕西师范大学) Northwest University(西北大学) University of Leeds(利兹大学)

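Topic match (code generation): LLMs generate compiler options to optimize code performance.

AI summary: Proposes AutoPass, a multi-agent framework that queries compiler-internal state and the intermediate representation and iteratively refines compiler options using runtime feedback; it needs no training and achieves geometric-mean speedups over LLVM -O3 of 1.043x on x86-64 and 1.117x on ARM64.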

详情
AI中文摘要

大型语言模型(LLM)在代码编译任务中展现出潜力,但由于复杂的微架构效应和噪声运行时测量,将其应用于运行时性能调优较为困难。我们提出AutoPass,一个用于编译器性能调优的多智能体框架,它利用编译器和运行时证据来指导LLM生成的优化决策。与先前的自动调优方案将编译器视为黑盒不同,AutoPass向LLM开放编译器,使其能够查询编译器内部的优化状态并分析中间表示以编排编译器选项。搜索过程利用测量的运行时反馈迭代地优化配置,以诊断性能回退并指导延迟改进的编辑。AutoPass在仅推理、无需训练的环境下运行,无需离线训练或任务特定的微调,因此可轻松应用于新的基准测试和平台。我们在LLVM编译器上实现AutoPass,并在服务器级x86-64和嵌入式ARM64系统上进行评估。AutoPass优于专家调优的启发式方法和经典自动调优方法,在x86-64和ARM64上相对于LLVM -O3分别实现了1.043倍和1.117倍的几何平均加速。

英文摘要

Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
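The headline numbers are geometric-mean speedups, which average per-benchmark ratios multiplicatively so that no single benchmark dominates. A minimal sketch of that computation follows; the function name `geomean_speedup` and the sample timings in the usage note are made up for illustration and are not data from the paper.

```python
import math


def geomean_speedup(baseline_times, tuned_times):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_times) != len(tuned_times) or not baseline_times:
        raise ValueError("need two equal-length, non-empty lists")
    # Average the logs of the ratios, then exponentiate.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_times, tuned_times))
    return math.exp(log_sum / len(baseline_times))
```

With hypothetical timings `geomean_speedup([2.0, 8.0], [1.0, 2.0])`, the per-benchmark speedups are 2x and 4x, and the geometric mean is the square root of 8, roughly 2.83x, lower than the arithmetic mean of 3x.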

2606.19814 2026-06-19 cs.SE 新提交 85%

CoRaCommit: A VS Code Extension for Commit Message Generation with Exemplar Retrieval

CoRaCommit: 一种基于范例检索的提交消息生成的 VS Code 扩展

Chaoran Cai, Bo Xiong, Chong Wang, Lulu He, Peng Liang

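Topic match (code generation): A VS Code extension that generates commit messages using retrieved exemplars.

AI summary: Proposes CoRaCommit, a VS Code extension that retrieves similar commit exemplars as prompt context, invokes multiple LLMs in parallel to generate candidate messages, and dynamically recommends LLMs based on user feedback; it outperforms existing extensions on the ApacheCM dataset.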

Comments 17 pages, 6 images, 3 tables, Manuscript submitted to a Journal (2026)

详情
AI中文摘要

提交消息是描述代码变更意图的关键文本制品,在版本控制、代码审查和历史追踪中扮演重要角色。然而,实践中提交消息主要由人工编写,耗时且常导致质量不一致和表达不统一。现有的用于提交消息生成的 VS Code 扩展通常直接基于代码差异调用大语言模型,而不利用相似提交范例作为参考,且很少支持用户反馈驱动的大语言模型推荐。为解决这些局限,本文提出 CoRaCommit,一种 VS Code 扩展,通过检索相似提交范例作为提示上下文、并行调用多个大语言模型进行候选提交消息比较,并基于用户反馈动态推荐大语言模型,从而增强提交消息生成。在 ApacheCM 数据集的 945 个提交上的实验结果表明,CoRaCommit 在 BLEU、CIDEr、METEOR 和 ROUGE-L 指标上优于现有 VS Code 扩展,证明了检索增强上下文对提交消息生成的有效性。

英文摘要

Commit messages are essential textual artifacts that describe the intent behind code changes, and play a critical role in version control, code review, and historical tracking. However, in practice, commit messages are primarily authored manually, which is time-consuming and often results in inconsistent quality and non-uniform expression. Existing VS Code extensions for commit message generation typically directly invoke large language models based on the code diff, without leveraging similar commit exemplars as references, and rarely support user feedback-driven LLM recommendation. To address these limitations, this paper presents CoRaCommit, a VS Code extension that enhances commit message generation by retrieving similar commit exemplars as prompt context, invoking multiple LLMs in parallel for candidate commit message comparison, and dynamically recommending LLMs based on user feedback. Experimental results on 945 commits from the ApacheCM dataset show that CoRaCommit outperforms existing VS Code extensions across BLEU, CIDEr, METEOR, and ROUGE-L metrics, demonstrating the effectiveness of retrieval-augmented context for commit message generation.

2606.11537 2026-06-19 cs.AI cs.CE 新提交 85%

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

发表机构 * University of Innsbruck(因斯布鲁克大学) University of British Columbia(不列颠哥伦比亚大学) Toronto Metropolitan University(多伦多都会大学)

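Topic match (code generation): The system generates executable Python programs to answer tabular questions.

AI summary: Proposes MoCA-Agent, which addresses numerical reasoning errors in financial and tabular question answering through claim-level verification and code generation, achieving strong performance on ten benchmarks.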

详情
AI中文摘要

金融和表格问答不仅需要流畅的推理:答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了 MoCA-Agent,一种声明市场代码智能体,它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明,要求专业交易智能体买入或卖出这些声明,将其订单清算为置信度加权的接受/拒绝决策,并从市场支持的证据中合成可执行的Python程序。然后,一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误,最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上,MoCA-Agent 使用固定的 Qwen3.6-27B 骨干网络实现了强劲性能,包括在 FinQA 上达到 78.3%,在 FinanceMath 上达到 76.0%,在 MultiHiertt 上达到 71.2%,在 ESGenius 上达到 86.9%,以及在 FinChart-Bench 上平均达到 85.6%。这些结果表明,在原子声明级别聚合证据,而不是整个答案,提高了高风险数值推理的鲁棒性。代码和数据可在以下网址获取:https://github.com/UBC-NLP/MoCA-Agent。

英文摘要

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.

2606.20173 2026-06-19 cs.SE 新提交 80%

Qiskit Code Migration with LLMs

使用大语言模型进行Qiskit代码迁移

Jose Manuel Suarez, Luis Mariano Bibbo, Joaquin Bogado, Alenandro Fernandez

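Topic match (code generation): Automated Qiskit code migration with LLMs and RAG.

AI summary: To address the code maintenance burden caused by version evolution in quantum software development kits, proposes a hybrid approach that combines LLMs with retrieval-augmented generation (RAG) and uses an automatically generated taxonomy of migration scenarios to guide the models, automating cross-version Qiskit code migration while reducing hallucinations and improving the quality of migration suggestions.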

详情
AI中文摘要

量子开发套件(QDK)的快速演进引入了一种特定形式的技术债务,损害了代码可维护性并阻碍了软件复用。在量子软件工程(QSE)这一专业领域,高质量训练数据的稀缺和新兴框架的高波动性加剧了这一挑战,常导致通用大语言模型(LLM)产生不可靠或幻觉结果。本文提出一种将LLM与检索增强生成(RAG)相结合的混合方法,用于自动化Qiskit代码的跨版本迁移。所提方法通过利用自动生成的迁移场景分类体系作为结构化、版本特定的知识源来指导模型,从而提升迁移建议的精度和可靠性。该方法通过一个自动化、可扩展的工作流实现,评估了不同检索方案(无约束和限制性)下的LLM(Google Gemini Flash-2.5和OpenAI Gpt-oss-20b)。结果表明,基于分类体系的RAG架构,特别是在限制性方案下,显著减少了幻觉并提高了描述质量,其中Google Gemini Flash-2.5在检测复杂重构场景方面表现出更优性能。这些发现证实了这种以数据为中心的方法在促进技术独立性、提供缓解API过时问题的鲁棒智能助手方面的潜力,从而确保量子算法在快速变化的生态系统中的长期可用性,并降低量子软件工程(QSE)的学习曲线。

英文摘要

The rapid evolution of Quantum Development Kits (QDKs) introduces a specific form of technical debt that compromises code maintainability and hinders software reuse. In the specialized domain of Quantum Software Engineering (QSE), this challenge is intensified by the scarcity of high-quality training data and the high volatility of emerging frameworks, which often lead general-purpose Large Language Models (LLMs) to produce unreliable or hallucinated results. This paper proposes a hybrid approach integrating LLMs with Retrieval-Augmented Generation (RAG) to automate the migration of Qiskit code across versions. The proposed methodology enhances the precision and reliability of migration suggestions by leveraging an automatically generated taxonomy of migration scenarios as the structured, version-specific knowledge source to guide the models. The approach is implemented through an automated, extensible workflow evaluating LLMs (Google Gemini Flash-2.5 and OpenAI Gpt-oss-20b) under different retrieval schemes (unconstrained and restrictive). Results demonstrate that the taxonomy-based RAG architecture, particularly under the restrictive scheme, significantly reduces hallucinations and improves descriptive quality, with Google Gemini Flash-2.5 showing superior performance in detecting complex refactoring scenarios. These findings confirm the potential of this data-centric methodology to foster technological independence and provide robust, intelligent assistants that mitigate API obsolescence, ensuring the long-term availability of quantum algorithms within a rapidly shifting ecosystem and flattening the learning curve within Quantum Software Engineering (QSE).

2606.19474 2026-06-19 cs.CR cs.AI cs.SE 新提交 80%

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

LLM辅助后量子密码开发中的安全编码漂移:一种游戏化修复方案

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

发表机构 * University of Moratuwa(摩图瓦大学) University of Ruhuna(鲁胡纳大学) RMIT University(皇家墨尔本理工大学)

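Topic match (code generation): Studies secure coding drift in LLM-assisted post-quantum cryptography development.

AI summary: Proposes a model of secure coding drift in LLM-assisted PQC development and a gamified framework that turns LLMs into active security collaborators, mitigating the security degradation caused by long-term reliance on LLMs.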

Comments Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

详情
AI中文摘要

向后量子密码学(PQC)的过渡引入了相当大的实现复杂性,要求严格遵守恒定时间执行、侧信道抵抗和精确参数化。同时,大型语言模型(LLM)已深度嵌入软件开发工作流程,包括密码工程。虽然LLM提高了生产力,但证据表明它们经常生成不安全或次优的代码,特别是在安全关键领域。本文引入了PQC中的安全编码漂移,这是一种新颖的社会技术漏洞模型,捕捉了由于持续依赖LLM生成的代码而导致的安全编码实践逐渐退化。与先前关注静态漏洞的工作不同,我们将安全风险概念化为一种源于人机交互的纵向行为现象。为了缓解这一问题,我们提出了一种游戏化的、LLM增强的安全编码框架,将对抗性评估、行为反馈和安全评分嵌入开发工作流程。我们的方法将LLM从被动助手重新定义为主动安全协作者,为AI中介环境中的更安全PQC实现做出贡献。

英文摘要

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.

2606.19644 2026-06-19 cs.SE 新提交 75%

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development

提示质量与拉取请求结果:基于阶段的LLM辅助开发实证研究

Richard Sserunjogi, Daniel Ogenrwot, John Businge

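Topic match (code generation): Studies how prompt quality affects LLM-assisted code generation and PR outcomes.

AI summary: Analyzes 265 developer-ChatGPT interactions to study how prompt structure (context, specificity, verification) affects code generation, adoption, and integration depth in LLM-assisted development, finding that different dimensions matter at different stages.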

Comments 48 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLM)驱动的工具(如ChatGPT)越来越多地用于协作软件工程工作流,但提示结构如何影响下游拉取请求(PR)结果尚不清楚。先前的研究主要考察对话帮助性、生产力或粗粒度的采用指标,对提示结构在协作集成行为中的作用理解不足。我们分析了来自开源拉取请求中自我承认的ChatGPT使用的265个手动验证的开发者-ChatGPT交互。基于先前关于开发者面向工件和提示工程的研究,我们使用三个维度操作化提示结构:上下文、具体性和验证。我们首先评估LLM辅助注释是否能可靠地再现人类对提示结构的判断,发现在不同维度和工作流上下文中存在显著差异。具体性与人类判断的一致性最稳定;上下文被LLM系统性地低估;验证仍然难以一致评估,这促使采用人类-LLM混合注释策略。使用这个经过验证的框架,我们然后检查提示结构如何影响AI辅助PR工作流中的可操作代码生成、代码采纳和集成深度。具体性和上下文与可操作代码生成关联最强;验证成为代码采纳的主要预测因子;集成深度与上下文关联最强。总体而言,我们的发现表明,提示特征在AI辅助软件工程工作流中表现出不同的、阶段依赖的影响,通过上下文基础、任务具体性和可评估性线索影响下游采纳和集成。

英文摘要

Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.

4. Test generation (1 paper)

2606.19725 2026-06-19 cs.SE cs.AI cs.MA 新提交 90%

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

面向OpenSIL固件中大语言模型生成的单元测试的库感知测试替身与迭代修复

Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

发表机构 * School of Software Design and Data Science(软件设计与数据科学学院) Seneca Polytechnic(森纳学院) Advanced Micro Devices Canada(加拿大先进微器件公司)

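Topic match (test generation): LLM-guided multi-agent automated unit test generation and repair.

AI summary: To address OpenSIL firmware unit tests that often fail under build constraints, proposes an LLM-guided multi-agent workflow for automated test generation and iterative repair; it generates compilable tests for 73 of 76 functions, with mean line coverage reaching 98.8% on a 48-function subset.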

Comments 20 pages, 10 figures

详情
AI中文摘要

验证底层C固件中的变更成本高昂,因为单元测试(UT)在严格的构建约束下非常脆弱,缺失的头文件、未解析的符号和依赖不匹配经常阻止编译和链接。本研究为AMD维护的开源硅初始化库(openSIL)固件代码库引入了一种自动化的UT编写工作流程,通过大语言模型(LLM)引导的多智能体管道减少手动工作。该工作流程结合了测试框架的自动生成、库感知的桩、模拟和伪造的创建或重用,以及由构建日志和行覆盖率反馈驱动的迭代编译-分派修复循环。我们使用编译成功率、修复迭代次数、分派成功率和行覆盖率评估该方法,并以时间、成本和令牌使用量作为次要指标。在76个被测函数中,该工作流程为73个函数生成了可编译的UT。在没有行覆盖率指导或检索增强的配置下,平均行覆盖率达到73.9%。在两种配置下评估的48个函数子集中,仅使用行覆盖率指导时平均行覆盖率达到98.8%,与向量数据库检索结合时达到94.7%。结果表明,自动生成和修复管道可以显著提高受限固件环境中UT创建的效率和覆盖率,同时减少手动调试工作量。

英文摘要

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.
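The iterative repair loop described in the abstract above can be pictured with a minimal Python sketch. This is not the authors' pipeline: `BuildResult`, `repair_loop`, `fake_build`, and `fake_repair` are hypothetical names introduced only for illustration, the build step stands in for a real compiler invocation, and the repair step stands in for an LLM call that reads the build log.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class BuildResult:
    ok: bool   # True when the test compiled and linked
    log: str   # raw build output, fed back to the repair step


def repair_loop(
    test_source: str,
    build: Callable[[str], BuildResult],
    repair: Callable[[str, str], str],
    max_iters: int = 5,
) -> Tuple[Optional[str], int]:
    """Rebuild and repair until the test compiles or the budget runs out.

    Returns (compilable source or None, number of repair calls made).
    """
    source = test_source
    for repairs in range(max_iters + 1):
        result = build(source)
        if result.ok:
            return source, repairs
        if repairs == max_iters:
            break  # budget exhausted, give up on this function
        # Hypothetical repair step: in the paper's setting an LLM would
        # propose a fix from the build log; here it is any callable.
        source = repair(source, result.log)
    return None, max_iters


# Toy stand-ins so the sketch runs without a compiler or a model.
def fake_build(src: str) -> BuildResult:
    if "#include <stdint.h>" not in src:
        return BuildResult(False, "error: unknown type name 'uint32_t'")
    return BuildResult(True, "")


def fake_repair(src: str, log: str) -> str:
    # Add the missing header when the log points at an unknown fixed-width type.
    if "uint32_t" in log:
        return "#include <stdint.h>\n" + src
    return src


fixed, n_repairs = repair_loop("uint32_t x;", fake_build, fake_repair)
```

Bounding the loop with `max_iters` keeps cost predictable when a test never converges. For reference, the reported 73 compilable tests out of 76 functions corresponds to a compilation success rate of about 96.1%.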

5. 程序修复 2 篇

2606.19149 2026-06-19 cs.CR cs.LG 新提交 85%

OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing

OpenAnt:通过代码分解、对抗性验证和动态测试实现LLM驱动的漏洞发现

Nahum Korda, Gadi Evron

专题命中 程序修复 :LLM驱动漏洞发现,属于程序修复

AI总结 提出OpenAnt系统,结合静态分析与LLM推理,通过代码分解、对抗性验证和动态测试三阶段流水线,在降低误报率的同时发现未知漏洞。

AI中文摘要

在大型代码库中自动发现漏洞仍然具有挑战性:传统静态分析误报率高,而模糊测试等动态方法需要大量基础设施且通常针对狭窄的漏洞类别。大型语言模型(LLM)的最新进展使得对程序行为进行语义推理成为可能,但将LLM应用于仓库级安全分析会引入上下文管理、成本和验证方面的挑战。我们提出了OpenAnt,一个开源漏洞发现系统,它在多阶段流水线中集成了静态程序分析与基于LLM的推理。OpenAnt引入了三种关键技术。首先,代码库被分解为自包含的分析单元,并通过从外部入口点的可达性进行过滤,将分析面减少高达97%,同时保留与攻击相关的代码。其次,候选漏洞通过受限攻击者模拟进行对抗性验证,其中模型在现实攻击者能力下评估可利用性。第三,通过动态验证确认发现结果,其中自动生成利用环境,在沙箱容器中执行,并在使用后丢弃。在包括OpenSSL、WordPress和Flowise在内的广泛使用的开源项目上的评估表明,这种架构可以识别先前未知的漏洞,同时保持可管理的分析成本并大幅减少误报。我们的结果表明,结合语义推理与利用验证的闭环漏洞发现流水线,为可扩展的自动化安全分析提供了一条实用路径。OpenAnt已在Apache 2.0许可下开源,网址为https://github.com/knostic/OpenAnt。

英文摘要

Automated vulnerability discovery in large codebases remains challenging: traditional static analysis produces high false-positive rates, while dynamic approaches such as fuzzing require substantial infrastructure and often target narrow classes of bugs. Recent advances in large language models (LLMs) enable semantic reasoning about program behavior, but applying LLMs to repository-scale security analysis introduces challenges related to context management, cost, and verification. We present OpenAnt, an open-source vulnerability discovery system that integrates static program analysis with LLM-based reasoning in a multi-stage pipeline. OpenAnt introduces three key techniques. First, codebases are decomposed into self-contained analysis units filtered by reachability from external entry points, reducing the analysis surface by up to 97% while preserving attack-relevant code. Second, candidate vulnerabilities undergo adversarial verification through constrained attacker simulation, where the model evaluates exploitability under realistic attacker capabilities. Third, findings are validated through dynamic verification, in which exploit environments are generated automatically, executed in sandboxed containers, and discarded after use. Evaluation on widely used open-source projects including OpenSSL, WordPress, and Flowise shows that this architecture can identify previously unknown vulnerabilities while maintaining manageable analysis cost and substantially reducing false positives. Our results suggest that closed-loop vulnerability discovery pipelines, combining semantic reasoning with exploit validation, provide a practical path toward scalable automated security analysis. OpenAnt is released as open source under the Apache 2.0 license at https://github.com/knostic/OpenAnt.
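The first technique in the abstract above, filtering analysis units by reachability from external entry points, can be sketched as a breadth-first search over a call graph. This is only an illustration under assumptions: the graph, the unit names, and the `reachable_units` helper are hypothetical and are not taken from the OpenAnt codebase.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable_units(
    call_graph: Dict[str, List[str]],
    entry_points: Iterable[str],
) -> Set[str]:
    """Return every unit reachable from the given external entry points."""
    seen: Set[str] = {e for e in entry_points if e in call_graph}
    queue = deque(seen)
    while queue:
        unit = queue.popleft()
        for callee in call_graph.get(unit, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical call graph: keys are analysis units, values are their callees.
graph = {
    "http_handler": ["parse_request", "render"],
    "parse_request": ["decode"],
    "render": [],
    "decode": [],
    "internal_tool": ["decode_legacy"],
    "decode_legacy": [],
    "unused_helper": [],
}

kept = reachable_units(graph, ["http_handler"])
# Fraction of units dropped from the analysis surface in this toy graph.
reduction = 1 - len(kept) / len(graph)
```

In this toy graph only the units behind `http_handler` survive the filter; units that no external input can reach are skipped before any LLM call is spent on them, which is the intuition behind the reported reduction of up to 97%.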

2506.16136 2026-06-19 cs.SE 85%

Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing

视觉即修复:基于多模态大语言模型的视觉软件问题修复

Kai Huang, Jian Zhang, Xiaofei Xie, Chunyang Chen

专题命中 程序修复 :多模态LLM修复视觉软件问题,属于程序修复。

AI总结 本文提出GUIRepair方法,通过多模态推理解决视觉软件问题,结合图像到代码和代码到图像的组件提升故障理解和修复验证。

Journal ref 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)

AI中文摘要

基于大语言模型(LLM)的自动程序修复(APR)技术在解决真实世界GitHub问题任务中表现出有前景的结果。现有APR系统主要在单模态设置(例如SWE-bench)中进行评估。然而,这些自主系统在处理多模态问题场景(例如SWE-bench M)时面临困难,因为它们在解释和利用视觉信息方面存在局限。在多模态场景中,LLM需要依赖图形用户界面(GUI)中的视觉信息来理解故障并生成修复。为了弥合这一差距,我们提出了GUIRepair,一种用于解决多模态问题场景的跨模态推理方法,通过理解和捕捉视觉信息。具体而言,GUIRepair集成了两个关键组件,Image2Code和Code2Image,以增强故障理解和修复验证。Image2Code根据问题报告提取相关的项目文档,然后应用该领域知识生成负责视觉症状的重现代码,有效地将GUI图像转换为可执行上下文以更好地理解故障。Code2Image通过重现的代码回放视觉问题场景,并捕获修复程序的GUI渲染以评估修复是否在视觉上解决了问题,为修复验证提供反馈。我们评估了GUIRepair在SWE-bench M上的表现,该方法显示出显著的有效性。当使用GPT-4o作为基础模型时,GUIRepair解决了157个实例,优于最佳开源基线26个实例。此外,当使用o4-mini作为基础模型时,GUIRepair可以实现甚至更好的结果,解决了175个实例,优于顶级商业系统22个实例。这强调了我们新视角的成功,即通过理解和捕捉视觉信息来解决多模态问题。

英文摘要

Large language model (LLM)-based automated program repair (APR) techniques have shown promising results in resolving real-world GitHub issue tasks. Existing APR systems are primarily evaluated in unimodal settings (e.g., SWE-bench). However, these autonomous systems struggle to resolve multimodal problem scenarios (e.g., SWE-bench M) due to limitations in interpreting and leveraging visual information. In multimodal scenarios, LLMs need to rely on visual information in the graphical user interface (GUI) to understand bugs and generate fixes. To bridge this gap, we propose GUIRepair, a cross-modal reasoning approach for resolving multimodal issue scenarios by understanding and capturing visual information. Specifically, GUIRepair integrates two key components, Image2Code and Code2Image, to enhance fault comprehension and patch validation. Image2Code extracts relevant project documents based on the issue report, then applies this domain knowledge to generate the reproduced code responsible for the visual symptoms, effectively translating GUI images into executable context for better fault comprehension. Code2Image replays the visual issue scenario using the reproduced code and captures GUI renderings of the patched program to assess whether the fix visually resolves the issue, providing feedback for patch validation. We evaluate GUIRepair on SWE-bench M, and the approach demonstrates significant effectiveness. When utilizing GPT-4o as the base model, GUIRepair solves 157 instances, outperforming the best open-source baseline by 26 instances. Furthermore, when using o4-mini as the base model, GUIRepair can achieve even better results and solve 175 instances, outperforming the top commercial system by 22 instances. This emphasizes the success of our new perspective on incorporating cross-modal reasoning by understanding and capturing visual information to resolve multimodal issues.