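arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI large models

Code LLMs / AI programming

Code generation, software engineering agents, program repair, test generation, and developer tools.

Indexed today/current date: 20. Sources: cs.SE, cs.CL, cs.AI, cs.LG, cs.PL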

1. Software agents (9 papers)

2606.18733 2026-06-18 cs.SE cs.AI New submission, Topic 90

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

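Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun

Affiliations: Baidu Inc

Topic match: Software agents. Data synthesis for future-oriented software engineering agents.

AI summary: Proposes SWE-Future, which uses repository history to forecast future task types (such as feature implementation and bug fixes) and synthesizes 200 coding-agent tasks conditioned on those forecasts, reducing reliance on replaying historical PRs; the forecaster reaches 58.1% future-work relevance in an 80-repository study.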

Abstract

Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1\% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.
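
The temporal discipline described in this abstract (forecast from pre-$T_0$ evidence only, then score against later work) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the function names, and the exact-label matching rule are assumptions, and exact matching merely stands in for the paper's semantic matching metric.

```python
from datetime import date

def split_by_cutoff(events, t0):
    """Partition (date, label) records into pre-T0 evidence and later held-out items."""
    evidence = [e for e in events if e[0] < t0]
    held_out = [e for e in events if e[0] >= t0]
    return evidence, held_out

def relevance(predicted_families, observed_families):
    """Fraction of predicted task families that also occur in later work.

    Exact label matching is an assumption made for this sketch.
    """
    if not predicted_families:
        return 0.0
    hits = sum(1 for family in predicted_families if family in observed_families)
    return hits / len(predicted_families)

# Hypothetical repository history: (merge date, task family).
events = [
    (date(2025, 1, 10), "bugfix"),
    (date(2025, 2, 3), "feature"),
    (date(2025, 4, 20), "refactor"),
    (date(2025, 5, 1), "bugfix"),
]
evidence, held_out = split_by_cutoff(events, date(2025, 3, 1))
# The forecast is fixed before the held-out items are inspected.
score = relevance(["bugfix", "feature"], {label for _, label in held_out})
```

Keeping the split in one helper makes it harder for post-cutoff pull requests to leak into the forecasting step.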

2606.15828 2026-06-18 cs.SE New submission, Topic 90

Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents

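Helio Victor F. dos Santos, Vitor Costa, Joao Eduardo Montandon, Luciana Lourdes Silva, Marco Tulio Valente

Topic match: Software agents. Smell analysis of coding-agent configuration files; software engineering.

AI summary: Presents the first systematic catalog of smells in coding-agent configuration files (AGENTS.md/CLAUDE.md); a grey literature review and repository mining identify six smells, whose prevalence is assessed across 100 open-source repositories, with Lint Leakage the most common (62%).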

Abstract

Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS.md or CLAUDE.md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.

2606.19216 2026-06-18 cs.SE cs.HC New submission, Topic 85

No Two Developers Think Alike: How Problem-Solving Styles and Experience Shape Needs in Conversational Interaction with Copilot

Jonan Richards, Bruno Alves de Oliveira, Iury Oliveira, Igor Wiese, Mairieli Wessel

Topic match: Software agents. Studies developer interaction with Copilot; falls under AI programming.

AI summary: A mixed-methods think-aloud study identifies 5 interaction modes and 10 needs, builds a conceptual model, and shows how cognitive diversity shapes developers' interactions with GitHub Copilot.

Comments: Accepted at the International Conference on Software Maintenance and Evolution (ICSME), 2026

Abstract

Conversational LLM-based "programming assistants" provide a range of benefits to developers. However, recent studies demonstrate the variety in individual developers' needs regarding programming assistants, and challenges encountered by only specific groups of developers. In this study, we explore the role of cognitive diversity in shaping interactions with GitHub Copilot chat. Through a mixed-methods think-aloud study with 27 professional developers and students, we characterize 5 distinct "interaction modes" and 10 underlying needs in developers' interactions, forming a conceptual model. We characterize links between these modes, needs, and developers' problem-solving styles and experience profiles, showing how cognitive diversity may shape developers' interactions. We provide insights and recommendations for researchers and practitioners on how to design, research, and employ programming assistants to better account for diverse developer needs.

2606.19167 2026-06-18 cs.SE New submission, Topic 85

Teaching Software Engineering with LLM and MCP Integration: From Classroom to Industry Practice

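Kehui Chen, Jacky Keung, Weining Li, Xiangbing Shao, Yishu Li, Xiaoxue Ma

Topic match: Software agents. Integrates LLMs and MCP into software engineering teaching to improve programming and tool-use skills.

AI summary: Integrates LLMs and MCP into a collaborative teaching model for software engineering; embedding LLM- and MCP-driven tools into teaching, code assistance, and engineering simulations bridges the gap between traditional instruction and industrial workflows and improves students' programming, problem-solving, and intelligent-tool skills.

Comments: Accepted by the International Symposium on Educational Technology (ISET) 2026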

Abstract

The rapid integration of Large Language Models (LLMs) and the Model Context Protocol (MCP) into industrial software engineering has created a pressing need to update software engineering education to align with emerging technologies and evolving industry demands. This study investigates an innovative approach that integrates LLMs and MCP into a collaborative teaching model for software engineering education, aiming to build a practical learning framework closely connected to real-world engineering practices. By embedding LLM and MCP driven tools into daily teaching, code assistance, and engineering simulations, the model effectively bridges the gap between traditional instruction and industrial workflows. This integration enhances students' programming competence, practical problem-solving abilities, and proficiency in using intelligent engineering tools. Furthermore, through partnerships with industry internships, students can apply these technologies in real-world settings, further strengthening the connection between academic preparation and professional practice. Overall, this research offers a practical pathway for reforming and innovating software engineering education in the era of artificial intelligence.

2606.19191 2026-06-18 cs.CR New submission, Topic 80

PhantomSkill: Malicious Code Injection in Agent Skill Ecosystems

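Yu-Ting Lin, Chia-Mu Yu

Topic match: Software agents. Malicious code injection attacks targeting LLM coding agents.

AI summary: Proposes PhantomSkill, an attack framework whose VulMask technique hides malicious behavior in a skill's auxiliary resources and uses vulnerability-shaped implementations to evade detection, preserving benign utility while lowering warning and malware detection rates.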

Abstract

Agent skills allow LLM-based coding agents to acquire domain-specific capabilities from third-party packages, but they also introduce a new supply-chain attack surface. We present PhantomSkill, an attack framework that hides malicious behavior in a skill's auxiliary resources rather than in its textual description. Its core technique, VulMask, rewrites overt malicious scripts into vulnerability-shaped implementations whose malicious behavior is activated only under attacker-controlled trigger conditions. This design shifts the visible signal from explicit malicious intent to ordinary-looking insecure code. Across representative host skills, attack goals, coding agents, generation models, and automated reviewers, VulMask preserves benign utility while reducing warning and malware-level detection compared with overt malicious scripts. Our results show that skill ecosystems require resource-level vetting, execution-time containment, and security policies that treat exploitable vulnerabilities in agent skills as potential malicious payloads.

2602.04341 2026-06-18 cs.SE Topic 80

Model-Driven Legacy System Modernization at Scale

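Tobias Böhm, Jens Guan Su Tien, Mohini Nonnenmann, Tom Schoonbaert, Bart Carpels, Andreas Biesdorf

Topic match: Software agents. Model-driven legacy system modernization.

AI summary: Presents a model-driven approach to legacy system modernization that inserts an enriched intermediate model between the legacy codebase and the modern target platform, enabling semi-automatic migration of core UI components and page structures and improving maintainability and developer experience.

Comments: Accepted for publication at the 1st Workshop on Code Translation, Transformation, and Modernization (ReCode'26), co-located with ICSE 2026

Journal ref: Proc. ReCode '26, ACM, New York, NY, USA (2026) 13-18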

Abstract

This experience report presents a model-driven approach to legacy system modernization that inserts an enriched, technology-agnostic intermediate model between the legacy codebase and the modern target platform, and reports on its application and evaluation. The four-stage process of analysis, enrichment, synthesis, and transition systematically extracts, abstracts, and transforms system artifacts. We apply our approach to a large industrial application built on legacy versions of the .NET Framework and ASP.NET MVC and show that core user interface components and page structures can be migrated semi-automatically to a modern web stack while preserving functional behavior and essential non-functional qualities. By consolidating architectural knowledge into explicit model representations, the resulting codebase exhibits higher maintainability and extensibility, thereby improving developer experience. Although automation is effective for standard patterns, migration of bespoke layout composites remains challenging and requires targeted manual adaptation. Our contributions are: (i) an end-to-end model-driven process, (ii) an enriched intermediate model that captures structure, dependencies, and semantic metadata, (iii) transformation rules that preserve functional behavior and essential non-functional qualities, and (iv) application and evaluation of the approach in an industrial setting. Overall, model-based abstractions reduce risk and effort while supporting scalable, traceable modernization of legacy applications. Our approach generalizes to comparable modernization contexts and promotes reuse of migration patterns.

2606.19121 2026-06-18 cs.SE cs.CL cs.HC New submission, Topic 75

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

Hui Zhang, Shuren Song

Affiliations: Shenzhen Yunxi Technology Co., Ltd.; Information Technology Center, Tsinghua University

Topic match: Software agents. Index Sickness in long-horizon LLM collaboration, involving code engineering.

AI summary: Through action research in a real software project, finds that adding formal constraints in long-horizon LLM collaboration instead causes "Index Sickness", and proposes a "Baseline-Log Physical Separation" mechanism that effectively eliminates the problem.

Comments: 22 pages, 2 tables, 1 figure. Action research. Bilingual submission (Chinese companion version included as supplementary). Submitted to ICSE 2027 IOR track

Abstract

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

2606.18855 2026-06-18 cs.SE New submission, Topic 70

Toward Semantically-Seeded, Graph-Propagated Impact Analysis Across Software Artifacts: A Vision

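Momil Seedat

Topic match: Software agents. Impact analysis across software artifacts, fusing semantics and structure.

AI summary: Proposes a training-free, interpretable fusion approach that combines semantic similarity with structural dependencies; a heterogeneous artifact graph and a propagation mechanism cover the blind spots of both methods, enabling impact analysis across the requirement-config-service-test chain.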

Abstract

When a single software artifact changes - a requirement, a configuration value, or a function - engineers must determine what else is impacted. Existing change-impact-analysis (CIA) tooling tends to rely on one of two signals in isolation: semantic similarity recovered from text (information-retrieval traceability, code search, embeddings), or structural dependency following (call graphs, IDE "find usages", test-impact selection). Each has a characteristic blind spot. A semantically driven tool misses an impacted artifact whose text shares no vocabulary with the change; a structurally driven tool misses artifacts related in meaning but not joined by an edge, and most operate only over code rather than the Requirement-Config-Service-Test chain. We argue for a training-free and interpretable analyzer that fuses both signals over the same embeddings. We model the system as a heterogeneous artifact graph with typed edges recovered by static analysis, compute a semantic prior by cosine similarity to the changed artifact, propagate impact multi-hop with decay over a row-normalized propagation matrix, and blend the two with a single tunable weight lambda. A small but complete proof-of-concept on a payment subsystem (5 labelled change scenarios) shows the mechanism we care about: artifacts with zero textual overlap with the change are still recovered through propagation, and helper functions that propagation alone cannot reach are recovered through the semantic layer. The fusion is the only configuration that covers both blind spots, and lambda acts as an explicit precision/recall control. Drawing on four publicly documented production failures, we argue that the same formulation extends to operational artifacts (images, metrics, dashboards, data schemas) that code-only analysis cannot reach.
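
The fusion this abstract describes (a cosine-similarity prior, decayed multi-hop propagation over a row-normalized matrix, and a blend weighted by lambda) might look roughly like the sketch below. The toy graph, the two-dimensional embeddings, the hop count, the decay value, and the exact propagation sum are all assumptions chosen for illustration; the paper's formulation may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def row_normalize(adj):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    return [[x / sum(row) if sum(row) else 0.0 for x in row] for row in adj]

def propagate(P, seed, hops=3, decay=0.5):
    """Accumulate decay**k times the k-hop reach from the seed, for k = 1..hops."""
    n = len(P)
    current = [1.0 if i == seed else 0.0 for i in range(n)]
    total = [0.0] * n
    for k in range(1, hops + 1):
        current = [sum(current[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = [t + (decay ** k) * c for t, c in zip(total, current)]
    return total

def fuse(semantic, structural, lam):
    """Blend the two signals; lam = 1 is semantic only, lam = 0 is structural only."""
    return [lam * s + (1 - lam) * g for s, g in zip(semantic, structural)]

# Hypothetical artifacts: 0 requirement (changed), 1 config, 2 service, 3 helper.
adj = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.1]]
semantic = [cosine(emb[0], e) for e in emb]
structural = propagate(row_normalize(adj), seed=0)
fused = fuse(semantic, structural, lam=0.5)
```

In this toy setup the service (index 2) shares no vocabulary with the change and is reached only through propagation, while the helper (index 3) has no edge and is reached only through the semantic prior; the blended score is nonzero for both.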

2606.17510 2026-06-18 cs.SE cs.SY eess.SY New submission, Topic 70

OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem

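I-Ling Yen, Akeem Mohammed, Farokh Bastani, San-Yih Hwang

Topic match: Software agents. LLMs for service composition and code generation.

AI summary: Proposes OmniDroneX, a unified Drone-as-a-Service ecosystem that connects low-level physical primitives to high-level missions through the libUAV interface and the PT-SOA abstraction model, uses large language models to assist function identification, service composition, and natural-language mission definition, and supports multiple composition techniques for scalable, self-evolving drone systems.

Comments: This manuscript is a full version of a paper accepted in shortened form by IEEE International Conference on Joint Cloud Computing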

Abstract

Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically composable entities that can be integrated with external infrastructures to offer omni-capabilities. OmniDroneX bridges low-level physical primitives with high-level mission intent through a unified vendor-agnostic interface (libUAV) and a formal physical-service abstraction model (PT-SOA). A core innovation is the diverse application of large language models (LLMs) across multiple layers of the OmniDroneX architecture. LLMs are used to assist in identifying and formalizing primitive device functions and abstract service definitions, supporting automated service composition and workflow generation, and enabling interactive, natural-language mission specification and refinement. OmniDroneX also incorporates important categories of composition techniques that are essential in dynamic UAV systems, including physical layer composition for drone capability augmentation, as well as spatiotemporal, functional, collaborative, exception-aware, and QoS-based service compositions. Collectively, these features allow OmniDroneX to serve as a foundation for scalable, resilient, and self-evolving UAV ecosystems operating in complex and dynamic environments.

2. Program repair (1 paper)

2606.18619 2026-06-18 cs.CR cs.AI cs.SE New submission, Topic 85

Code-Augur: Agentic Vulnerability Detection via Specification Inference

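Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

Affiliations: National University of Singapore

Topic match: Program repair. Agentic vulnerability detection that finds vulnerabilities through specification inference.

AI summary: Proposes a security-specification-first paradigm that makes the agent's assumptions explicit and falsifies them at runtime, combined with guided fuzzing to strengthen vulnerability detection; on real-world projects it detects more vulnerabilities than existing agents.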

Abstract

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
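
The loop described here, where an invariant is committed as an in-source assertion and a fuzzer then tries to falsify it, can be illustrated with a toy example. Everything below is hypothetical: `read_field` and its invariant are invented for the sketch, and a plain random-input loop stands in for the guided fuzzer the abstract mentions.

```python
import random

def read_field(buf, offset, length):
    # Hypothetical invariant an agent might commit when judging this function
    # secure: the requested slice lies fully inside the buffer.
    assert 0 <= offset and offset + length <= len(buf), "spec violated: out-of-bounds read"
    return buf[offset:offset + length]

def fuzz(target, trials=1000, seed=0):
    """Random-input loop (not a guided fuzzer); returns the first input that trips an assertion."""
    rng = random.Random(seed)
    buf = bytes(16)
    for _ in range(trials):
        offset, length = rng.randint(0, 20), rng.randint(0, 20)
        try:
            target(buf, offset, length)
        except AssertionError:
            return offset, length
    return None

counterexample = fuzz(read_field)
```

A returned counterexample means either that callers really can request out-of-bounds reads (a candidate vulnerability) or that the committed invariant was too strong and needs refinement, which matches the two outcomes the abstract names.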

3. Code generation (6 papers)

2606.18286 2026-06-18 cs.LG New submission, Topic 85

CODEBLOCK: Learning to Supervise Code at the Right Granularity

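Zhijie Deng, Ling Li, Jinlong Pang, Kaiqin Hu, Qi Xuan, Zhaowei Zhu, Jiaheng Wei

Affiliations: Hong Kong University of Science and Technology (Guangzhou); UC Santa Cruz; Ant Group; BAIA, ZJUT; D5Data.ai

Topic match: Code generation. Proposes the CodeBlock framework, whose structure-aware sparse supervision improves fine-tuning for code generation.

AI summary: Proposes CodeBlock, a framework that applies sparse supervision to structure-complete code blocks rather than isolated tokens, outperforming full-token fine-tuning on six code-generation benchmarks while using only 1.9% of supervised tokens.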

Abstract

Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.
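
The training-time behavior described above, where the whole response stays in context but loss is applied only to selected blocks, amounts to a masked cross-entropy. The sketch below is a framework-free illustration under stated assumptions: the per-token probabilities, the span format, and both function names are invented here, and the paper's block selection and reranking steps are not reproduced.

```python
import math

def mask_from_blocks(num_tokens, selected_blocks):
    """Build a 0/1 loss mask from (start, end) token spans, end exclusive."""
    mask = [0] * num_tokens
    for start, end in selected_blocks:
        for i in range(start, end):
            mask[i] = 1
    return mask

def masked_cross_entropy(token_probs, loss_mask):
    """Mean negative log-likelihood over masked-in tokens; other tokens are context only."""
    selected = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(selected) / len(selected) if selected else 0.0

# Hypothetical probabilities the model assigns to the six reference tokens.
probs = [0.9, 0.5, 0.5, 0.8, 0.25, 0.25]
# Supervise only the block covering tokens 1 and 2.
mask = mask_from_blocks(len(probs), [(1, 3)])
loss = masked_cross_entropy(probs, mask)
```

Masking whole spans rather than individual tokens is what keeps each supervised unit syntactically intact, which is the contrast with token-level selection that the abstract draws.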

2606.19315 2026-06-18 cs.LG New submission, Topic 80

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation

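Ruida Wang, Rui Pan, Pengcheng Wang, Shizhe Diao, Tong Zhang

Affiliations: University of Illinois Urbana-Champaign; NVIDIA

Topic match: Code generation. Diffusion language models for formal theorem proving.

AI summary: Proposes Diffusion-Proof, which the authors describe as the first framework to apply diffusion language models to formal theorem proving; with whole-proof generation and local correction it gains 1.61% on ProofNet and 6.14% on MiniF2F, and solves one IMO problem that DeepSeek-Prover-V2-7B could not.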

Abstract

Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both the mathematical and computer science communities in recent years. While significant progress has been made in using state-of-the-art Auto-Regressive (AR) LLMs for formal theorem proving, these models suffer from inherent limitations. Their next-token prediction may yield suboptimal performance due to the challenges of long-range coherence and the compounding of errors over long sequences. Recent advances in diffusion LLMs (dLLMs), which generate text through iterative denoising of a multi-token block, offer a promising alternative. However, the application of dLLMs to formal mathematics, where maintaining long-range coherence is critical, remains largely understudied. To address these challenges, we propose **Diffusion-Proof**, to the best of our knowledge the first framework to train and apply dLLMs for formal theorem proving. Our framework contains training and inference methods for two models. The first is *dLLM-Prover-7B*, which performs whole-proof writing with long-range coherent tactic usage. The second is *dLLM-Corrector-7B*, a novel large-block diffusion-based correction model that leverages the in-filling capabilities of dLLMs to perform local proof correction using bidirectional information. Extensive experiments demonstrate that **Diffusion-Proof** significantly outperforms the AR LLM baseline trained on the same dataset, achieving absolute improvements of **1.61%** on ProofNet-Test and **6.14%** on MiniF2F-Test over the baseline. Notably, **Diffusion-Proof** successfully solves one IMO problem that the more advanced thinking model DeepSeek-Prover-V2-7B could not solve, showcasing the unique advantage of dLLMs in formal theorem proving.

2606.19042 2026-06-18 cs.SE cs.AI New submission, Topic 80

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

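Xhevahire Tërnava

Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

Topic match: Code generation. AI-driven programming; variability by regeneration.

AI summary: Studies the absence of variability in AI-driven programming (vibe coding) and proposes Variability by Regeneration (VbR), in which the LLM acts as the derivation engine and generates a variant binary free of dead code.

Comments: VARIABILITY 2026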

Abstract

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural-language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis of 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability at compile time and runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built binary, free of dead code, for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.
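
The variant dispatcher mentioned in this abstract routes a request to the binary generated for that feature selection. A minimal sketch of that routing idea follows; the class, the feature names, and the binary paths are hypothetical and are not taken from the paper's wc product family.

```python
def variant_key(features):
    """Canonical key for a feature selection: sorted, de-duplicated feature names."""
    return tuple(sorted(set(features)))

class VariantDispatcher:
    """Maps a feature selection to the binary generated for exactly that variant."""

    def __init__(self):
        self._binaries = {}

    def register(self, features, binary_path):
        self._binaries[variant_key(features)] = binary_path

    def route(self, requested_features):
        """Return the matching binary path, or None if no such variant has been generated."""
        return self._binaries.get(variant_key(requested_features))

dispatcher = VariantDispatcher()
# Hypothetical variants and paths.
dispatcher.register(["lines"], "bin/wc_lines")
dispatcher.register(["lines", "words"], "bin/wc_lines_words")
```

A `None` result is the natural hook for regeneration: under this sketch's assumptions, a missing variant would be derived from the specification and then registered.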

2606.18293 2026-06-18 cs.SE cs.AI New submission, Topic 80

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

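Callum Barbour

Affiliations: OpenAI

Topic match: Code generation. Evaluates the viability of AI programming (vibe coding) in software engineering.

AI summary: Evaluates the viability of "vibe coding" (programming through natural-language prompts) for greenfield software engineering tasks, analyzes existing benchmarks, and offers insight through an evaluation suite of simple, isolated Python programming tasks.

Comments: 10 pages, 2 figures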

Abstract

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed "vibe coding." It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

2606.19257 2026-06-18 cs.CL New submission Topic 70

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

DreamReasoner-8B:面向扩散推理模型的块大小课程学习

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

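Affiliations * The University of Hong Kong; Peking University

Topic match Code generation: evaluated on code reasoning benchmarks.

AI summary Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long chain-of-thought reasoning; DreamReasoner-8B reaches a level comparable to Qwen3-8B on math and code reasoning.

Details
AI Chinese abstract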

块扩散语言模型通过并行块级去噪加速解码,但其能否可靠地扩展到长思维链(CoT)推理仍未解决。为此,我们开发了开源块扩散推理模型DreamReasoner-8B,并系统研究了训练和推理块大小如何影响长CoT推理。我们的分析揭示了显著的性能差距:使用大块大小训练会导致推理性能极差,而小块大小则能保持有效的推理。为了弥合这一粒度差距,我们提出了块大小课程学习,逐步从细粒度块大小过渡到粗粒度块大小进行训练,从而克服了这一限制,并实现了在多种推理块大小上泛化的强大推理性能。在数学和代码推理基准测试中,DreamReasoner-8B取得了与领先的开源自回归模型(如Qwen3-8B)相竞争的结果。这项工作为高效、具备推理能力的扩散语言模型奠定了实践基础。我们在以下网址发布模型:https://github.com/DreamLM/DreamReasoner

English abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
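The fine-to-coarse idea behind block-size curriculum learning can be illustrated with a small schedule function. This is only a sketch under stated assumptions: the block sizes `(4, 8, 16, 32)`, the equal-length stages, and the name `block_size_at` are hypothetical, since the abstract does not specify the actual schedule.

```python
from typing import Sequence


def block_size_at(step: int, total_steps: int,
                  sizes: Sequence[int] = (4, 8, 16, 32)) -> int:
    """Return the training block size for a step, moving from fine to coarse.

    Assumption: training is split into len(sizes) equal-length stages.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    stage = step * len(sizes) // total_steps  # index of the current stage
    return sizes[stage]
```

A training loop would query this function at each step and build its denoising blocks with the returned size, so early updates see fine-grained blocks and later updates see coarse ones.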

2606.18425 2026-06-18 cs.SE cs.AI cs.DC New submission Topic 70

From Specification to Execution: AI Assisted Scientific Workflow Management

从规范到执行:AI辅助的科学工作流管理

Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

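Affiliations * RENCI, University of North Carolina at Chapel Hill, NC, USA; Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

Topic match Code generation: uses LLMs to generate workflow code.

AI summary Proposes an AI-assisted approach that combines specification-driven workflow generation, automated debugging, and distributed execution, integrating Pegasus with an MCP layer to manage large-scale scientific workflows end to end, starting from natural language.

Details
AI Chinese abstract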

科学工作流管理系统(WMS)支持复杂管道的可扩展和可重复执行,但工作流的设计、实现和调试仍然主要依赖人工,需要大量专业知识。最近使用大型语言模型(LLM)的方法在从自然语言生成工作流方面显示出潜力,但通常依赖于直接的代码合成,这限制了透明度、可重复性以及与工作流系统的集成。我们提出了一种AI辅助的科学工作流管理方法,结合了规范驱动的工作流生成、自动化调试和分布式执行。该方法引入了一个结构化的规范阶段,将工作流意图、设计和实现分离,允许在代码生成之前进行验证。我们还开发了一个基于LLM的调试代理,用于诊断和解决跨多个系统层的故障。为了支持分布式执行和用户交互,我们将广泛使用的WMS Pegasus与模型上下文协议(MCP)层集成,为工作流提交、监控和控制提供统一接口。我们使用一个用于医学影像的联邦学习工作流来评估该方法,该工作流具有并行、迭代和依赖密集的结构。该系统生成并执行了包含数千个作业的大规模工作流,减少了调试工作量,并允许非专家用户使用专家级设计模式构建工作流。这些结果表明,端到端的AI辅助工作流生成和执行是可行的,并指向了用于管理科学工作流生命周期的AI驱动平台。

English abstract

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

4. Code evaluation 4 papers

2606.18284 2026-06-18 cs.LG cs.AI cs.CL New submission Topic 75

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

打破求解器瓶颈:在可学习前沿训练任务生成器

Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent

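Affiliations * Vmax Goodfire AI

Topic match Code evaluation: proposes the PROPEL framework for optimizing task generators for code and software engineering.

AI summary Proposes PROPEL, a framework that trains a lightweight activation probe as a proxy for solve rate, so that task generators can be optimized without repeated solver evaluation; generated tasks concentrate at the learnable frontier, improving the effectiveness of math, code, and software-engineering tasks.

Comments 30 pages, 9 figures, 12 tables

Details
AI Chinese abstract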

通过强化学习训练智能体时,日益紧缺的资源是前沿任务的供给:有效、可求解且难度恰好足以训练当前模型的任务。随着推理和智能体模型的改进,固定任务分布趋于饱和,而朴素的合成生成会产生过于简单、无法求解或不适定的任务。用强化学习训练任务生成器以优化有效性和可学习性可以解决这一瓶颈,但直接优化需要对每个候选任务进行重复的求解器评估。对于软件工程任务,单次评估可能耗时数十分钟;求解器在环的生成器训练是不可行的。我们提出PROPEL,一个求解器摊销框架,用于在目标求解率下训练任务生成器。PROPEL在一次性标注的生成任务和求解器结果语料库上训练一个轻量级激活探针。该探针从冻结的生成器参考模型预测目标求解器的通过率,并在生成器优化期间作为求解率的代理,将生成器评估简化为单次前向传播。在多种模型规模下的数学、代码和软件工程任务中,PROPEL将生成任务转向目标求解率:对于编程,在可学习前沿生成的任务占比有所提升,变化为$10.1\% \rightarrow 20.0\%$(针对Qwen2.5-3B-Instruct求解器)和$5.3\% \rightarrow 12.6\%$(针对Qwen2.5-7B-Instruct求解器)。对于软件工程,PROPEL提升了处于目标求解率的生成份额,变化为$9.8\% \rightarrow 19.6\%$(针对Qwen3.5-27B,在探针和生成器训练期间未见过的仓库上)。

English abstract

The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\% \rightarrow 19.6\%$ for Qwen3.5-27B on repositories not seen during training of probe and generator.
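The probe-as-proxy idea can be sketched in a few lines: a probe maps an activation vector to a predicted pass rate, and the generator is rewarded for landing near a target solve rate. Everything concrete here is an assumption made for illustration: the logistic form of the probe, the linear reward shape, the default target of 0.5, and the function names do not come from the abstract.

```python
import math
from typing import Sequence


def probe_pass_rate(activation: Sequence[float],
                    weights: Sequence[float], bias: float) -> float:
    """Hypothetical logistic probe: activation vector -> pass rate in (0, 1)."""
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def frontier_reward(predicted: float, target: float = 0.5) -> float:
    """Assumed reward shape: 1 at the target solve rate, decreasing linearly
    with the absolute distance from it."""
    return 1.0 - abs(predicted - target)
```

The point of the sketch is the cost profile: scoring a candidate task needs one probe evaluation on activations from a single forward pass, instead of repeated solver rollouts.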

2606.16000 2026-06-18 cs.CL cs.LG New submission Topic 70

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

GRACE-DS:数据科学中的受保护奖励引导智能体修正环境

Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasiya Palienko

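Affiliations * ITMO University; HSE University

Topic match Code evaluation: evaluates the performance of code-generation and AutoML agents.

AI summary Proposes GRACE-DS, an isolated environment for pre-deployment evaluation of LLM-powered AutoML agents, in which hidden executable validators measure predictive performance, leakage avoidance, reproducibility, and other criteria; experiments show that its flexible iterative interaction regime outperforms the baselines.

Details
AI Chinese abstract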

我们介绍了GRACE-DS,一个数据科学中的受保护奖励引导智能体修正环境,用于对LLM驱动的AutoML智能体进行部署前评估。GRACE-DS是一组在隔离环境中的评估指标,可应用于特定组织的表格ML任务。它将智能体暴露于现实的工作流阶段,从规划和数据检查到特征工程、模型开发、验证、代码修复直至最终提交,同时隐藏的可执行验证器不仅衡量最终预测性能,还衡量泄漏避免、可重复性、协议有效性、修正行为和奖励对齐。最强的结构化机制——灵活迭代交互(我们的方法)——实现了比单次生成、非结构化交互和基于重启的基线更高的端到端归一化隐藏测试质量,同时提高了协议有效完成率。经过7000多个回合的验证,这些结果确立了GRACE-DS作为评估基于LLM的AutoML智能体在生产类条件下按照组织特定要求执行机器学习工作流能力的稳健平台。

English abstract

We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

2606.18536 2026-06-18 stat.AP cs.SE New submission Topic 60

Analytics for Quality Assurance for Item Pools (AQuAP): Monitoring and Maintaining Item Bank Health in AI-Driven Assessment Systems

题库质量保证分析(AQuAP):AI驱动评估系统中题库健康的监控与维护

Alina A. von Davier, Xiaowan Zhang, Yigal Attali, Yena Park, Jacqueline Church, Andrew Runge, Geoff T. LaFlair, Alexander Tsigler

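Topic match Code evaluation: monitoring item-bank quality in AI-driven assessment systems.

AI summary Presents the AQuAP dashboard environment, which monitors item-bank quality through metrics such as Effective Bank Size and supports large-scale item development that combines automation with human input, keeping item banks healthy for high-stakes tests.

Comments 11 pages, 4 figures

Details
AI Chinese abstract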

教育评估的大规模数字化使得题库的持续监督既必要又复杂。本文提出了题库质量保证分析(AQuAP),一个用于监控试题质量和题库健康的仪表盘环境。AQuAP支持高利害测试中大规模试题生成程序的操作实施,这些程序包含在试题工厂(一个自动化和人工支持的测试开发框架)中。本文描述了AQuAP与试题开发过程的关系,概述了题库质量保证的更广泛度量框架,并强调了有效题库规模(EBS)作为题库活力的核心指标。EBS量化了在内容重复发生之前可以构建的独立测试会话数量,当与曝光度和使用度量结合时,它提供了对题库安全性、多样性和效率的洞察。我们进一步引入了题库健康度量,如最大曝光度、最大条件曝光度、调整后的有效题库规模和极少施测比例,所有这些都扩展了试题利用情况的图景。AQuAP展示了操作分析如何将心理测量概念转化为高容量、AI驱动的测试程序的质量保证工具。本文以多邻国英语测试(DET)流程为例进行说明。

English abstract

The large-scale digitization of educational assessment has made the continuous oversight of item banks both essential and complex. This paper presents Analytics for Quality Assurance for Item Pools (AQuAP), a dashboard environment for monitoring item quality and item bank health. AQuAP supports the operational implementation of the large scale item generation procedures for high-stakes tests as included in the Item Factory, a framework for automated and human-supported test development. The paper describes AQuAP in relationship with the process of item development, outlines the broader metric framework for item-pool quality assurance, and highlights the Effective Bank Size (EBS) as one central indicator of pool vitality. EBS quantifies how many independent test sessions can be constructed before content repetition occurs and, when coupled with exposure and usage metrics, provides insight into item bank security, diversity, and efficiency. We further introduce bank-health metrics, such as maximum exposure, maximum conditional exposure, adjusted effective bank size, and the rarely-administered fraction, all of which extend this picture of item utilization. AQuAP illustrates how operational analytics can translate psychometric concepts into quality assurance tools for high-volume, AI-enabled testing programs. This work is illustrated with the Duolingo English Test (DET) processes.
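Two of the bank-health metrics named in the abstract, maximum exposure and the rarely-administered fraction, can be illustrated with a rough sketch. The definitions used here are assumptions, not the paper's: exposure is taken as the share of sessions in which an item appeared, and "rarely administered" means exposure below a hypothetical threshold of 0.05. Effective Bank Size is not computed, because the abstract gives no formula for it.

```python
from collections import Counter
from typing import Dict, List, Set


def exposure_rates(sessions: List[Set[str]], bank: Set[str]) -> Dict[str, float]:
    """Share of sessions in which each bank item was administered (assumed definition)."""
    if not sessions:
        raise ValueError("need at least one session")
    counts = Counter(item for session in sessions for item in session)
    return {item: counts[item] / len(sessions) for item in bank}


def max_exposure(rates: Dict[str, float]) -> float:
    """Exposure rate of the most frequently administered item."""
    return max(rates.values())


def rarely_administered_fraction(rates: Dict[str, float],
                                 threshold: float = 0.05) -> float:
    """Share of bank items whose exposure is below the (hypothetical) threshold."""
    rare = sum(1 for rate in rates.values() if rate < threshold)
    return rare / len(rates)
```

For example, with four sessions over a five-item bank in which one item appears every time and another never appears, the maximum exposure is 1.0 and the rarely-administered fraction is 0.2.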

2606.18421 2026-06-18 cs.SE New submission Topic 60

Finding Compiler-Platform Interaction Bugs in Deep Learning Pipelines via Cross-Layer Constraints

通过跨层约束发现深度学习流水线中的编译器-平台交互错误

Yuxin Qiu, Jiyuan Wang, Ronak Badhe, Ben Limpanukorn, Miryung Kim, Qian Zhang

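Topic match Code evaluation: tests for interaction bugs between deep learning compilers and platforms.

AI summary Proposes XCheck, an automated framework that extracts full-stack constraints to generate test models and uncovers bugs caused by interactions between compilers and hardware platforms, finding 2,034 bug-revealing cases across three compilers.

Details
AI Chinese abstract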

人工智能的日益部署需要鲁棒的深度学习编译器,如TVM和ONNX-MLIR。这些编译器以高级AI模型为输入,通过多层变换降低它们,并将其专门化到不同的硬件。测试此类编译器具有独特的挑战性,因为正确性取决于嵌入在整个编译栈中的隐式约束。现有的测试方法主要采用类型约束来限制输入模型生成,因此强调类型验证并监控编译崩溃或覆盖率增益。这种关注忽略了由编译和执行环境之间的交错效应引起的编译器-平台交互错误。在这项工作中,我们提出了一个可扩展的自动化DL编译器测试框架,用于同时(1)发现编译器-平台交互错误和(2)实现行为等价划分。我们的关键见解是,这些错误是由跨编译通道和硬件平台的交互引起的违反假设导致的。因此,我们超越了约束输入生成,并推导出全栈约束。我们的方法分为三步。首先,我们设计了一种自动化方法来提取全栈约束,这些约束共同指导模型生成并表征编译行为。其次,我们优先考虑暴露交互敏感行为的约束,以便我们生成的模型能够执行深度编译逻辑。第三,我们通过自动插入断言来监控覆盖率或通过/失败信号遗漏的不同编译症状,从而实现行为等价划分。我们在三个广泛使用的DL编译器上评估了我们的工具XCheck,发现了2034个揭示错误的案例,包括内存溢出、整数溢出以及根源于编译器-平台交互的静默意外编译。

English abstract

The growing deployment of artificial intelligence (AI) necessitates robust deep learning (DL) compilers, such as TVM and ONNX-MLIR. These compilers take as input high-level AI models, lower them through multi-layer transformations, and specialize them to diverse hardware. Testing such compilers is uniquely challenging as correctness depends on implicit constraints embedded throughout the compilation stack. Existing testing approaches largely take type constraints to restrict input model generation and therefore emphasize type validation and monitor compilation crashes or coverage gains. This focus overlooks compiler-platform interaction bugs that arise from interleaved effects across compilation and execution environments. In this work, we propose a scalable, automated DL compiler testing framework for, in tandem, (1) finding compiler-platform interaction bugs and (2) enabling behavior equivalence partitioning. Our key insight is that these bugs are caused by violated assumptions arising from interactions across compilation passes and hardware platforms. Therefore, we move beyond constraining input generation and derive full-stack constraints. Our approach is three-fold. First, we design an automated approach to extract full-stack constraints that jointly guide model generation and characterize compilation behaviors. Second, we prioritize constraints that expose interaction-sensitive behaviors, so our generated models are capable of exercising deep compilation logic. Third, we enable behavior equivalence partitioning by automatically inserting assertions to monitor distinct compilation symptoms that coverage or pass/fail signals miss. We evaluated our tool, XCheck, on three widely-used DL compilers and found 2,034 bug-revealing cases, including memory overflows, integer overflows, and silent unexpected compilations that were rooted in compiler-platform interactions.