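arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

Code LLMs / AI Programming

Code generation, software engineering agents, program repair, test generation, and developer tools.

Indexed for the current date: 7 papers. Source categories: cs.SE, cs.CL, cs.AI, cs.LG, cs.PL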

1. Code Generation (2 papers)

2606.06133 2026-06-18 cs.SE cs.AI cs.LG cs.LO version update, topic score 90

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

Eric Spencer, Arslan Bisharat, Brian Ortiz, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

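Affiliations: Department of Computer Science, Loyola University Chicago

Topic match: Code Generation. TLA+ formal specification synthesis; preference optimization improves the pass rate.

AI summary: Proposes TLA-Prover, a model that combines supervised fine-tuning with repair-based group-relative policy optimization to synthesize TLA+ specifications checked by the TLC model checker. Its pass rate at the Gold and Diamond tiers reaches 30%, roughly 3.5x the untuned baseline.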

Comments 12 pages, 5 tables, 3 figures. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026)

Abstract

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.
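
The four-tier grading and the pass@1 arithmetic described in this abstract can be sketched in a few lines. The `CheckResult` record, its field names, and the assumption that each tier requires all lower tiers are illustrative choices rather than the paper's implementation, and nothing here invokes TLC itself.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical record of checker outcomes for one generated spec."""
    parses: bool           # the spec parsed successfully
    warning_free: bool     # no warnings were reported
    tlc_passes: bool       # the TLC model check passed
    mutant_violated: bool  # TLC reported a violation after the property was perturbed


TIER_ORDER = ["none", "bronze", "silver", "gold", "diamond"]


def grade(result: CheckResult) -> str:
    """Return the highest tier reached, assuming tiers are cumulative."""
    if not result.parses:
        return "none"
    if not result.warning_free:
        return "bronze"
    if not result.tlc_passes:
        return "silver"
    if not result.mutant_violated:
        # The property survived perturbation, so it was trivially true.
        return "gold"
    return "diamond"


def pass_at_1(tiers, required):
    """Fraction of single-sample outputs that reach at least `required`."""
    if not tiers:
        raise ValueError("no graded outputs")
    threshold = TIER_ORDER.index(required)
    hits = sum(1 for tier in tiers if TIER_ORDER.index(tier) >= threshold)
    return hits / len(tiers)
```

With 9 of 30 outputs at Diamond, `pass_at_1` gives 9/30 = 0.30 at both Gold and Diamond, and 0.30 / 0.086 is about 3.5, matching the ratio quoted in the abstract.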

2511.00802 2026-06-18 cs.SE cs.CL cs.LG version update, topic score 85

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

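Affiliations: Michigan Technological University, Houghton; Birmingham City University; University of British Columbia, Kelowna

Topic match: Code Generation. LLM agents automatically modify code to optimize off-policy evaluation.

AI summary: Proposes the GrowthHacker benchmark, in which LLM agents automatically and iteratively modify code to optimize off-policy evaluation (OPE) implementations. Multiple frameworks are evaluated on Open Bandit Pipeline and Scope-RL, showing that LLM-based agents can serve as automated growth hackers that continuously improve OPE systems.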

Comments Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026

Abstract

With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment, and may negatively impact users (e.g., unsafe or unethical outcomes) while requiring weeks of data collection. To address this, the growing research area of off-policy evaluation (OPE), or offline A/B testing, assesses new technologies offline using previously collected logged data. OPE is also a fundamental problem in reinforcement learning and is important where online testing is expensive or risky, such as healthcare, recommender systems, education, and robotics. Despite advances in code-generation large language models (LLMs) and agentic workflows, little is known about whether and how LLMs and LLM-based agents can automatically optimize OPE implementations. We propose GrowthHacker, a benchmark that evaluates baseline LLMs and LLM-based agents on large-scale public datasets. GrowthHacker autonomously and iteratively modifies code, runs OPE, and uses the metrics to guide subsequent optimization. We evaluate methods on Open Bandit Pipeline (OBP) and Scope-RL, and develop a two_agent framework that addresses limitations of existing frameworks while reducing complexity. Across both libraries, two_agent shows the highest reliability (98.1%-100% success rate) and positive-outcome rate (78%), with a median improvement of 4.4% among positive outcomes; CrewAI achieves the highest average improvement (37.9%) and is the only framework with zero extreme-value failures. AutoGen and Default each reach 65% positive-outcome rates. These results establish the feasibility of using LLM-based agents as automated "growth hackers" to continuously improve OPE systems, with implications for scaling data-driven decision-making where manual optimization is expensive.
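
To make concrete what "assessing a new policy offline from logged data" means, here is a minimal sketch of inverse propensity scoring (IPS), one standard OPE estimator. The abstract does not say which estimators GrowthHacker optimizes, so the choice of IPS, the tuple layout of the logs, and the `target_policy` callable are assumptions for illustration only.

```python
def ips_estimate(logs, target_policy):
    """Estimate the average reward of `target_policy` from logged data.

    logs: iterable of (context, action, reward, logging_propensity), where
          logging_propensity is the probability with which the logging
          policy chose `action` in `context` (must be > 0).
    target_policy: callable (context, action) -> probability that the
          policy being evaluated would choose `action` in `context`.
    """
    total = 0.0
    count = 0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / propensity
        total += weight * reward
        count += 1
    if count == 0:
        raise ValueError("no logged data")
    return total / count
```

For example, if the logging policy chose uniformly between two actions (propensity 0.5), only action 1 ever earned reward 1.0, and the target policy always picks action 1, the estimate is 1.0: the logged rounds where action 1 was taken are up-weighted by a factor of 2 and the others contribute zero.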

2. Software Agents (2 papers)

2602.02690 2026-06-18 cs.SE version update, topic score 90

Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All

Chenxi Huang, Alex Mathai, Feiyang Yu, Aleksandr Nogikh, Petros Maniatis, Franjo Ivančić, Eugene Wu, Kostis Kaffes, Junfeng Yang, Baishakhi Ray

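Topic match: Software Agents. LLM agents repair kernel crashes; evaluation framework.

AI summary: Proposes Live-kBench and kEnv for continuously evaluating LLM agents on newly discovered Linux kernel crashes. Agents achieve an up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff, yet only about 20% of generated patches closely match developer fixes.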

Abstract

Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash resolution, their evaluation benchmarks are usually static and thus do not capture the evolving nature of the Linux kernel, and they suffer from potential data contamination due to LLM knowledge cutoffs. To address these problems, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes and evaluates agents on freshly discovered kernel bugs, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavy-weight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. To this end, we curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt (plausible patches); however, only ~20% of generated patches closely match developer fixes. Additionally, exposing crash resolution feedback improves crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time- and attribute-sensitive, complete with a public dashboard to track agent progress on Linux kernel bugs.
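
The contamination check described here amounts to splitting results by whether the developer fix landed before or after the model's knowledge cutoff and comparing rates on each side. The sketch below shows that split; the `(fix_date, resolved)` record layout and the function name are hypothetical, and it is not the Live-kBench implementation.

```python
from datetime import date


def rate_by_cutoff(results, cutoff):
    """Split per-bug outcomes at the knowledge cutoff and return both rates.

    results: list of (fix_date, resolved) pairs, where fix_date is the date
             the developer fix landed and resolved is True if the agent's
             patch counted as successful.
    cutoff:  the model's knowledge cutoff date.
    Returns (rate_before_cutoff, rate_on_or_after_cutoff).
    """
    before = [ok for fixed_on, ok in results if fixed_on < cutoff]
    after = [ok for fixed_on, ok in results if fixed_on >= cutoff]

    def rate(outcomes):
        # An empty side has no defined rate.
        return sum(outcomes) / len(outcomes) if outcomes else float("nan")

    return rate(before), rate(after)
```

A large gap between the two rates suggests that the agent benefits from having seen the fixes during training, which is the effect a continuously refreshed benchmark is meant to expose.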

2411.19099 2026-06-18 cs.SE version update, topic score 85

Enhancing Software Maintenance: A Learning to Rank Approach for Co-changed Method Identification

Yiping Jia, Safwat Hassan, Ying Zou

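Topic match: Software Agents. A learning-to-rank approach identifies co-changed methods to support software maintenance.

AI summary: Proposes a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. The Random Forest model outperforms other models by 2.5 to 12.8 percent in NDCG@5 and surpasses the baselines by 4.7 to 537.5 percent.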

Abstract

With the increasing complexity of large-scale software systems, identifying all necessary modifications for a specific change is challenging. Co-changed methods, which are methods frequently modified together, are crucial for understanding software dependencies. However, existing methods often produce large result sets with many false positives. Focusing on pull requests instead of individual commits provides a more comprehensive view of related changes, capturing essential co-change relationships. To address these challenges, we propose a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. Experiments on 150 open-source Java projects, totaling 41.5 million lines of code and 634,216 pull requests, show that the Random Forest model outperforms other models by 2.5 to 12.8 percent in NDCG@5. It also surpasses baselines such as file proximity, code clones, FCP2Vec, and StarCoder 2 by 4.7 to 537.5 percent. Models trained on longer historical data (90 to 180 days) perform consistently, while accuracy declines after 60 days, highlighting the need for retraining every two months. This approach provides an effective tool for managing co-changed methods, enabling development teams to handle dependencies and maintain software quality.
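
Since the headline numbers are reported in NDCG@5, a short sketch of how that metric is computed may help. This uses the linear-gain form of DCG; the abstract does not state which gain variant the authors use, so treat that choice as an assumption (with binary relevance the common variants agree).

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking.

    relevances: relevance of each candidate in the order the model ranked
                them, e.g. 1 if the method really co-changed, else 0.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the only truly co-changed method first scores 1.0; placing it second scores 1 / log2(3), roughly 0.63, which shows how strongly the metric rewards getting relevant methods to the very top of the list.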

3. Code Evaluation (2 papers)

2602.06774 2026-06-18 cs.AI version update, topic score 85

Towards Understanding What State Space Models Learn About Code

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

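Affiliations: TU Darmstadt; Hessian Center for Artificial Intelligence; National Research Center for Applied Cybersecurity ATHENE

Topic match: Code Evaluation. Analysis of how SSMs understand code.

AI summary: Presents the first systematic analysis of what state space models (SSMs) learn for code understanding. SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning. The paper introduces the SSM-Interpret framework and architectural modifications that improve NLCodeSearch performance by up to +6 MRR.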

Abstract

State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.
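
The reported gain is stated in MRR (mean reciprocal rank), the usual metric for code search. A minimal sketch of the computation follows; the function signature is illustrative, and reading "+6 MRR" as 6 points on a 0 to 100 scale is an assumption, since the abstract does not specify the scale.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Mean reciprocal rank over a set of queries.

    ranked_lists: one ranked list of candidate snippets per query.
    targets:      the single correct snippet for each query.
    A query whose target is absent from its list contributes 0.
    """
    if not ranked_lists:
        raise ValueError("no queries")
    total = 0.0
    for results, target in zip(ranked_lists, targets):
        for rank, item in enumerate(results, start=1):
            if item == target:
                total += 1.0 / rank  # reciprocal of the first matching rank
                break
    return total / len(ranked_lists)
```

For two queries whose correct snippets appear at ranks 1 and 3, the MRR is (1 + 1/3) / 2, roughly 0.67.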

2604.00730 2026-06-18 cs.CY cs.AI cs.LG cs.SE version update, topic score 75

A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles

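Affiliations: Universidad Rey Juan Carlos

Topic match: Code Evaluation. Fuzzy C-Means clustering to assess Scratch programming skills.

AI summary: Proposes a CEFR-based assessment framework for Scratch projects that uses Fuzzy C-Means clustering to grade more than 2 million projects, identifies a B2 bottleneck, and introduces a classification-certainty metric to balance automated feedback with human review.

Comments Best Paper Award, CSEDU 2026. Minor change: FPC fix.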

Abstract

Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic Synchronization and Data Representation, while providing certainty-based triggers for human intervention.
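
The Method step can be illustrated with the standard Fuzzy C-Means membership formula plus a simple ordinal mapping from clusters to CEFR levels. Everything specific here is an assumption for illustration: the one-dimensional score, the example cluster centers, ordering clusters by center value, and especially the certainty measure (gap between the two largest memberships), which stands in for the paper's own metric.

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def fcm_memberships(x, centers, m=2.0):
    """Standard FCM membership of scalar x in each cluster (fuzzifier m > 1)."""
    dists = [abs(x - c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); memberships sum to 1.
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


def classify(x, centers, m=2.0):
    """Return (CEFR level, certainty) for one score, given six cluster centers."""
    if len(centers) != len(CEFR_LEVELS):
        raise ValueError("expected one center per CEFR level")
    # Ordinal criterion (assumed): clusters sorted by center map to A1..C2.
    order = sorted(range(len(centers)), key=lambda i: centers[i])
    u = fcm_memberships(x, centers, m)
    best = max(range(len(centers)), key=lambda i: u[i])
    top_two = sorted(u, reverse=True)[:2]
    certainty = top_two[0] - top_two[1]  # hypothetical certainty measure
    return CEFR_LEVELS[order.index(best)], certainty
```

A score sitting on a cluster center gets certainty 1.0, while a score midway between two centers gets certainty 0.0; a low value of this kind is the sort of signal that could flag a transitional learner for instructor review.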

4. Other AI Programming (1 paper)

2602.15149 2026-06-18 cs.CE cs.NA math.NA version update, topic score 60

SoliDualSPHysics: An extension of DualSPHysics for solid mechanics with hyperelasticity, plasticity, and fracture

Mohammad Naqib Rahimi, George Moutsanidis

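Topic match: Other AI Programming. Open-source software extension; involves code but is not core AI programming.

AI summary: Presents SoliDualSPHysics, open-source SPH-based software that extends DualSPHysics to simulate hyperelastic, finite-strain plastic, and brittle fracture behavior. It uses a total Lagrangian formulation, supports crack initiation and propagation under dynamic loading, and is validated for accuracy and scaling.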

Abstract

We introduce SoliDualSPHysics, a novel open-source and GPU-accelerated software that extends DualSPHysics to enable the numerical simulation of hyperelastic, finite-strain plastic, and brittle fracture behavior in deformable solids within a unified smoothed particle hydrodynamics (SPH) software framework. The software implements a total Lagrangian formulation for solid mechanics that allows direct application of external loads and boundary conditions, enabling independent solid mechanics simulations. Brittle fracture is modeled through a phase-field approach coupled with SPH, allowing crack initiation, propagation, and branching under dynamic loading without explicit crack tracking, ad hoc crack-path criteria, or local refinement. The framework also supports user-defined mathematical expressions to prescribe time- and space-dependent quantities, complementing the solid and fracture extensions and enhancing flexibility across existing and future DualSPHysics applications. Leveraging DualSPHysics' native CPU/GPU parallel architecture, the software achieves substantial computational acceleration for large-scale simulations, and the implementation is verified and validated against benchmark numerical problems and experimental data, demonstrating accuracy, robustness, and favorable scaling performance. Comprehensive implementation details and user documentation are provided to ensure reproducibility and to support further development by the community. The framework and source code are freely available through a public GitHub repository.