Code LLMs / AI Programming - arXivDaily Topic

2606.20158 2026-06-19 cs.SE New submission 90%

N-Version Programming with Coding Agents

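Javier Ron, Benoit Baudry, Martin Monperrus

Topic match (code generation): uses coding agents to generate implementations and evaluates how diversity affects failure modes.

AI summary: Revisits N-version programming in the setting of contemporary AI coding agents. Using the Knight-Leveson experiment, it evaluates how diversity across agent systems, models, and implementation languages affects failure modes. It finds common-mode failures, but majority-voting three-version units substantially reduce the failure count, showing that the strategy is useful in engineering practice.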

Abstract

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using the Knight-Leveson Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.
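The majority-voting setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' harness: the toy oracle, the three stand-in versions `v_a`, `v_b`, `v_c`, and the helper names are assumptions made for illustration, and counting a missing majority as a failure is a choice of this sketch.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Optional, Sequence

Version = Callable[[int], Hashable]


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by a strict majority of versions, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def count_failures(versions: Sequence[Version], oracle: Version,
                   inputs: Iterable[int]) -> int:
    """Count inputs where the voted output disagrees with the oracle."""
    return sum(
        1 for x in inputs
        if majority_vote([v(x) for v in versions]) != oracle(x)
    )


# Hypothetical stand-ins for a specification oracle and three implementations.
def oracle(x: int) -> bool:
    return x % 7 == 0

def v_a(x: int) -> bool:
    return x % 7 == 0 and x != 14            # wrong on 14

def v_b(x: int) -> bool:
    return x % 7 == 0 and x != 21            # wrong on 21

def v_c(x: int) -> bool:
    return x % 7 == 0 and x not in (14, 28)  # wrong on 14 and 28


inputs = range(100)
single = [count_failures([v], oracle, inputs) for v in (v_a, v_b, v_c)]
triple = count_failures([v_a, v_b, v_c], oracle, inputs)
print(single, triple)  # [1, 1, 2] 1
```

In this toy run the voter masks the independent failures on 21 and 28 but not the failure on 14 that `v_a` and `v_c` share, which mirrors the abstract's point: common-mode failures limit the benefit, yet voting still lowers the failure count.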

2606.19988 2026-06-19 cs.SE New submission 90%

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

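Shi Chen, Rongcun Wang, Yuan Tian, Xiaoyuan Xie, Wei Song, Rubing Huang

Topic match (code generation): evaluates LLM performance on Solidity code generation.

AI summary: Proposes the SolidityBench benchmark and the SolidityScore metric, evaluates several LLM-based approaches to repository-level Solidity code generation, and finds supervised fine-tuning the most effective.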

Comments 33 pages

Abstract

Large Language Models (LLMs) have shown strong capabilities in general-purpose code generation, but their effectiveness in specialized software domains remains underexplored. Solidity smart contracts represent a high-stakes domain where generated code must satisfy strict language-level, security, and software-engineering constraints. Existing benchmarks and metrics remain insufficient for repository-level Solidity generation, where models must synthesize complete contracts from natural language requirements. To address this gap, we introduce SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions. We also propose SolidityScore, a Solidity-aware semantic metric that emphasizes domain-critical constructs such as security modifiers, contract declarations, and Solidity-specific keywords. Using this benchmark, we evaluate representative code LLMs, including Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama, across zero-shot prompting, Chain-of-Thought reasoning, in-context learning, retrieval-augmented generation, and supervised fine-tuning. The results show that general-purpose models exhibit systematic structural deficiencies in repository-level Solidity generation. Among non-parametric methods, retrieval-augmented generation performs best, while in-context learning degrades beyond two examples due to context saturation. Supervised fine-tuning achieves the largest improvement by internalizing Solidity-specific constraints into model parameters. Overall, our study provides a comprehensive benchmark for repository-level Solidity code generation and shows that high-quality domain data combined with supervised fine-tuning is the most effective strategy for improving the reliability of LLM-generated smart contracts.

2606.19387 2026-06-19 cs.SE cs.AI New submission 90%

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

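You Li, Samuel Mandell, David Z. Pan

Affiliations: The University of Texas at Austin; Fudan University; USA

Topic match (code generation): uses LLMs to generate RTL hardware code, combined with formal methods.

AI summary: Proposes a hardware generation framework that combines the creativity of LLMs with the interpretability of formal methods, converting a design specification into an RTL program with guaranteed correctness by iteratively applying transformation rules.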

2606.19347 2026-06-19 cs.CL cs.AI cs.PL New submission 90%

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

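Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

Affiliations: NVIDIA Research

Topic match (code generation): analyzes how LLMs fail and generalize in RTL coding.

AI summary: Proposes an error taxonomy grounded in problem solvability and shows that LLM performance in RTL coding is bounded by pretraining knowledge: alignment techniques merely teach models to compile, while reasoning ability is the key bottleneck.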

Comments Preview, under submission for EMNLP 2026

Abstract

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.
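The abstract refers to an initial pass rate and to repeated sampling, but does not say which estimator is used. A common choice for code benchmarks is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. The sketch below assumes that estimator; the function name and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generated samples with c correct,
    passes the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts: 20 samples per problem, 3 of which pass.
for k in (1, 5, 10):
    print(k, round(pass_at_k(20, 3, k), 4))
```

Under this estimator a problem with c = 0 stays at 0 for every k, which is one way to read the abstract's claim that unsolvable errors are immune to additional sampling.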

2606.20373 2026-06-19 cs.SE cs.AI New submission 85%

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

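Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

Affiliations: Shaanxi Normal University; Northwest University; University of Leeds

Topic match (code generation): LLMs generate compiler options to optimize code performance.

AI summary: Proposes AutoPass, a multi-agent framework that queries compiler-internal state and the intermediate representation and uses runtime feedback to iteratively refine compiler options. It requires no training and achieves geometric-mean speedups over LLVM -O3 of 1.043x on x86-64 and 1.117x on ARM64.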

Abstract

Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
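The headline numbers are geometric-mean speedups over a baseline. As a reminder of how such a figure is computed, here is a minimal sketch; the timings are made up, and the function name is an assumption of this example rather than anything from AutoPass.

```python
import math
from typing import Sequence


def geomean_speedup(baseline_s: Sequence[float], tuned_s: Sequence[float]) -> float:
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_s) != len(tuned_s) or not baseline_s:
        raise ValueError("need two non-empty sequences of equal length")
    # Averaging logarithms avoids overflow and is equivalent to the
    # n-th root of the product of the ratios.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_s, tuned_s))
    return math.exp(log_sum / len(baseline_s))


# Hypothetical runtimes in seconds for three benchmarks.
baseline = [2.00, 1.50, 4.00]   # e.g. built with the baseline flags
tuned = [1.60, 1.50, 3.20]      # e.g. built with a tuned configuration
print(f"{geomean_speedup(baseline, tuned):.3f}x")
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup on one benchmark and a 2x slowdown on another cancel to exactly 1.0x, which an arithmetic mean would not report.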

2606.19814 2026-06-19 cs.SE New submission 85%

CoRaCommit: A VS Code Extension for Commit Message Generation with Exemplar Retrieval

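Chaoran Cai, Bo Xiong, Chong Wang, Lulu He, Peng Liang

Topic match (code generation): a VS Code extension that generates commit messages using retrieved exemplars.

AI summary: Proposes CoRaCommit, a VS Code extension that retrieves similar commit exemplars as prompt context, invokes multiple LLMs in parallel to generate candidate messages, and dynamically recommends LLMs based on user feedback. It outperforms existing extensions on the ApacheCM dataset.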

Comments 17 pages, 6 images, 3 tables, Manuscript submitted to a Journal (2026)

Abstract

Commit messages are essential textual artifacts that describe the intent behind code changes, and play a critical role in version control, code review, and historical tracking. However, in practice, commit messages are primarily authored manually, which is time-consuming and often results in inconsistent quality and non-uniform expression. Existing VS Code extensions for commit message generation typically directly invoke large language models based on the code diff, without leveraging similar commit exemplars as references, and rarely support user feedback-driven LLM recommendation. To address these limitations, this paper presents CoRaCommit, a VS Code extension that enhances commit message generation by retrieving similar commit exemplars as prompt context, invoking multiple LLMs in parallel for candidate commit message comparison, and dynamically recommending LLMs based on user feedback. Experimental results on 945 commits from the ApacheCM dataset show that CoRaCommit outperforms existing VS Code extensions across BLEU, CIDEr, METEOR, and ROUGE-L metrics, demonstrating the effectiveness of retrieval-augmented context for commit message generation.
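The exemplar-retrieval idea can be sketched in a few lines. The abstract does not say how CoRaCommit measures similarity or formats its prompt, so everything below is an assumption: Jaccard overlap of identifier tokens stands in for the retriever, and `build_prompt` is a hypothetical prompt layout.

```python
import re
from typing import List, Set, Tuple

Exemplar = Tuple[str, str]  # (diff, commit message)


def tokens(text: str) -> Set[str]:
    """Lower-cased identifier-like tokens in a diff."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_exemplars(diff: str, history: List[Exemplar], k: int = 2) -> List[Exemplar]:
    """Return the k past commits whose diffs are most similar to `diff`."""
    query = tokens(diff)
    ranked = sorted(history, key=lambda ex: jaccard(query, tokens(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(diff: str, exemplars: List[Exemplar]) -> str:
    """Assemble a few-shot prompt from retrieved exemplars (hypothetical layout)."""
    shots = "\n\n".join(f"Diff:\n{d}\nMessage: {m}" for d, m in exemplars)
    return f"{shots}\n\nDiff:\n{diff}\nMessage:"


# Made-up commit history.
history = [
    ("- return cache.get(key)\n+ return cache.get(key, default)", "Return default on cache miss"),
    ("- timeout = 30\n+ timeout = 60", "Increase request timeout"),
    ("- import os\n+ import os, sys", "Import sys for argv handling"),
]
new_diff = "- timeout = 60\n+ timeout = 120"
top = retrieve_exemplars(new_diff, history, k=1)
print(build_prompt(new_diff, top))
```

A production retriever would more likely use embeddings or BM25 over a large commit corpus; token overlap is used here only to keep the example dependency-free.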

2606.11537 2026-06-19 cs.AI cs.CE New submission 85%

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

Affiliations: University of Innsbruck; University of British Columbia; Toronto Metropolitan University

Topic match (code generation): the system generates executable Python programs to answer table questions.

AI summary: Proposes MoCA-Agent, which addresses numerical reasoning errors in financial table question answering through claim-level verification and code generation, and reports strong performance on ten benchmarks.

Abstract

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.
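The abstract says trader orders are cleared into confidence-weighted accept/reject decisions but does not give the clearing rule. The sketch below shows one plausible rule, stated purely as an assumption: buy orders add their confidence, sell orders subtract it, and a claim is accepted when the net score exceeds a threshold. The `Order` type, `clear_market`, and the example claims are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Order:
    claim_id: str
    side: str          # "buy" (claim is supported) or "sell" (claim is doubted)
    confidence: float  # trader's confidence in [0, 1]


def clear_market(orders: Iterable[Order], threshold: float = 0.0) -> Dict[str, bool]:
    """Accept a claim when confidence-weighted buys outweigh sells by `threshold`."""
    net: Dict[str, float] = {}
    for order in orders:
        if order.side not in ("buy", "sell"):
            raise ValueError(f"unknown side: {order.side!r}")
        signed = order.confidence if order.side == "buy" else -order.confidence
        net[order.claim_id] = net.get(order.claim_id, 0.0) + signed
    return {claim_id: score > threshold for claim_id, score in net.items()}


# Made-up orders on two atomic claims extracted from a question.
orders = [
    Order("revenue_2023_is_in_millions", "buy", 0.9),
    Order("revenue_2023_is_in_millions", "sell", 0.4),
    Order("growth_uses_2022_as_base", "buy", 0.3),
    Order("growth_uses_2022_as_base", "sell", 0.8),
]
decisions = clear_market(orders)
print(decisions)
```

Only the accepted claims would then feed program synthesis; how the real system weights traders or resolves ties is not stated in the abstract.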

2606.20173 2026-06-19 cs.SE New submission 80%

Qiskit Code Migration with LLMs

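Jose Manuel Suarez, Luis Mariano Bibbo, Joaquin Bogado, Alenandro Fernandez

Topic match (code generation): LLM + RAG for automated Qiskit code migration.

AI summary: To address code maintenance problems caused by version evolution of quantum software development kits, proposes a hybrid approach combining LLMs with retrieval-augmented generation (RAG). An automatically generated taxonomy of migration scenarios guides the models in migrating Qiskit code across versions, reducing hallucinations and improving the quality of migration suggestions.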

Abstract

The rapid evolution of Quantum Development Kits (QDKs) introduces a specific form of technical debt that compromises code maintainability and hinders software reuse. In the specialized domain of Quantum Software Engineering (QSE), this challenge is intensified by the scarcity of high-quality training data and the high volatility of emerging frameworks, which often lead general-purpose Large Language Models (LLMs) to produce unreliable or hallucinated results. This paper proposes a hybrid approach integrating LLMs with Retrieval-Augmented Generation (RAG) to automate the migration of Qiskit code across versions. The proposed methodology enhances the precision and reliability of migration suggestions by leveraging an automatically generated taxonomy of migration scenarios as the structured, version-specific knowledge source to guide the models. The approach is implemented through an automated, extensible workflow evaluating LLMs (Google Gemini Flash-2.5 and OpenAI Gpt-oss-20b) under different retrieval schemes (unconstrained and restrictive). Results demonstrate that the taxonomy-based RAG architecture, particularly under the restrictive scheme, significantly reduces hallucinations and improves descriptive quality, with Google Gemini Flash-2.5 showing superior performance in detecting complex refactoring scenarios. These findings confirm the potential of this data-centric methodology to foster technological independence and provide robust, intelligent assistants that mitigate API obsolescence, ensuring the long-term availability of quantum algorithms within a rapidly shifting ecosystem and flattening the learning curve within Quantum Software Engineering (QSE).

2606.19474 2026-06-19 cs.CR cs.AI cs.SE New submission 80%

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

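R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

Affiliations: University of Moratuwa; University of Ruhuna; RMIT University

Topic match (code generation): studies secure coding drift in LLM-assisted post-quantum cryptography development.

AI summary: Proposes a model of secure coding drift in LLM-assisted PQC development and a gamified framework that turns LLMs into active security collaborators, to mitigate the security degradation caused by sustained reliance on LLMs.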

Comments Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

Abstract

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.
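To make the constant-time requirement concrete, the sketch below contrasts an early-exit byte comparison, a pattern a code assistant might plausibly produce, with the standard library's `hmac.compare_digest`. It is a generic illustration and not taken from the paper; the function names and the example tag are made up, and real PQC implementations are typically written in lower-level languages where timing can be controlled more tightly.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: running time depends on where the first
    mismatch occurs, which can leak information about a secret value."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner for earlier mismatches
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison designed so that timing does not reveal where the
    inputs differ."""
    return hmac.compare_digest(a, b)


# Made-up authentication tag and a forged candidate.
expected_tag = bytes.fromhex("a1b2c3d4e5f60718")
forged_tag = bytes.fromhex("a1b2c3d4e5f60719")
print(leaky_equal(expected_tag, forged_tag), constant_time_equal(expected_tag, forged_tag))
```

Both functions return the same answers; the difference is only in timing behaviour, which is exactly the kind of property that functional tests do not catch and that the drift model in the abstract is concerned with.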

2606.19644 2026-06-19 cs.SE New submission 75%

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development

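Richard Sserunjogi, Daniel Ogenrwot, John Businge

Topic match (code generation): studies how prompt quality affects LLM-assisted code generation and PR outcomes.

AI summary: Analyzes 265 developer-ChatGPT interactions to study how prompt structure (Context, Specificity, Verification) affects code generation, adoption, and integration depth in LLM-assisted development, and finds that each dimension matters at a different stage.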

Comments 48 pages, 2 figures

Abstract

Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.

2606.20072 2026-06-19 cs.CL New submission 70%

Source-Grounded Data Generation for Text-to-JSON Learning

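Sunghee Ahn, Guijin Son, Youngjae Yu

Affiliations: Seoul National University

Topic match (code generation): text-to-JSON data generation.

AI summary: Proposes STAGE, which uses spreadsheets as source data, generates reports and JSON schemas with LLMs, and validates ground-truth values, substantially improving training data quality for text-to-JSON tasks.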

Comments Preprint

Abstract

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
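The abstract reports exact match and value accuracy without defining them. One reasonable reading, assumed here and not confirmed by the source, is that exact match compares the whole parsed object while value accuracy is the fraction of gold leaf values reproduced at the same path. The helper names and the sample record are illustrative.

```python
import json
from typing import Any, Dict

_INVALID = object()  # sentinel for output that is not valid JSON


def parse(text: str) -> Any:
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return _INVALID


def leaves(obj: Any, path: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into {path: leaf value}."""
    out: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(leaves(value, f"{path}/{key}"))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            out.update(leaves(value, f"{path}/{index}"))
    else:
        out[path] = obj
    return out


def exact_match(pred: str, gold: str) -> bool:
    """True when the parsed prediction equals the parsed gold object."""
    return parse(pred) == parse(gold)


def value_accuracy(pred: str, gold: str) -> float:
    """Fraction of gold leaf values found at the same path in the prediction."""
    gold_leaves = leaves(parse(gold))
    pred_leaves = leaves(parse(pred))
    if not gold_leaves:
        return 1.0 if not pred_leaves else 0.0
    hits = sum(1 for p, v in gold_leaves.items() if p in pred_leaves and pred_leaves[p] == v)
    return hits / len(gold_leaves)


# Made-up gold record and a prediction with one wrong value.
gold = '{"company": "Acme", "revenue": {"2023": 120, "2024": 150}}'
pred = '{"company": "Acme", "revenue": {"2023": 120, "2024": 155}}'
print(exact_match(pred, gold), round(value_accuracy(pred, gold), 3))
```

The gap between the two metrics in the abstract (74.27% versus 90.69%) is consistent with this kind of definition, where a single wrong field fails exact match but costs only a fraction of value accuracy.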

2606.19419 2026-06-19 cs.RO cs.AI New submission 65%

Playful Agentic Robot Learning

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

Affiliations: University of California, Berkeley; Impossible Research

Topic match (code generation): a robot coding agent generates executable code policies.

AI summary: Proposes the RATs framework, which lets robots learn reusable skills through self-directed play, with gains of 20.6 and 17.0 percentage points on LIBERO-PRO and MolmoSpaces, respectively.

Comments Project page: https://playful-rats.github.io/

Abstract

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.
