Code LLMs / AI Programming - arXivDaily Topic

2602.06774 2026-06-18 cs.AI Version update 85%

Towards Understanding What State Space Models Learn About Code

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

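Institutions: TU Darmstadt; Hessian Center for Artificial Intelligence; National Research Center for Applied Cybersecurity ATHENE

Topic match (code evaluation): analysis of how SSMs understand code

AI summary: This paper presents the first systematic analysis of what state space models (SSMs) learn for code understanding. It finds that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning, and it proposes the SSM-Interpret framework and architectural modifications that improve MRR on NLCodeSearch by up to +6.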

Abstract

State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.
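
The headline result above is stated in MRR (mean reciprocal rank), the usual retrieval metric for code search. As a reminder of what that number measures, here is a minimal sketch of the standard definition. The function name, snippet IDs, and toy rankings are illustrative assumptions and do not come from the paper, and the abstract does not say whether "+6" is on a 0-100 scale.

```python
from typing import Sequence


def mean_reciprocal_rank(
    ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str]
) -> float:
    """Mean over queries of 1/rank of the correct result (0 if it is absent)."""
    total = 0.0
    for candidates, gold in zip(ranked_ids, gold_ids):
        for rank, cand in enumerate(candidates, start=1):
            if cand == gold:
                total += 1.0 / rank
                break
    return total / len(gold_ids)


# Toy data (hypothetical): one ranked list of code snippet IDs per query.
rankings = [["f1", "f7", "f3"], ["f9", "f2", "f4"], ["f5", "f6", "f8"]]
gold = ["f1", "f2", "f0"]  # the third query's answer is never retrieved

mrr = mean_reciprocal_rank(rankings, gold)
print(f"MRR = {mrr:.3f}")  # (1 + 1/2 + 0) / 3 = 0.500
```

If the reported scores are on a 0-100 scale, a gain of 6 would correspond to 0.06 in the units used here.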

2606.18284 2026-06-18 cs.LG cs.AI cs.CL New submission 75%

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent

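Institutions: Vmax; Goodfire AI

Topic match (code evaluation): proposes the PROPEL framework for optimizing task generators for code and software engineering

AI summary: Proposes PROPEL, which trains a lightweight activation probe as a solve-rate proxy so that task generators can be optimized without repeated solver evaluation, concentrating generated tasks at the learnable frontier across math, code, and software-engineering tasks.

Comments: 30 pages, 9 figures, 12 tables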

Abstract

The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes, so solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, the share of tasks generated at the learnable frontier increases from $10.1\%$ to $20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\%$ to $12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\%$ to $19.6\%$ for Qwen3.5-27B on repositories not seen during training of the probe and generator.
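
The idea of a probe standing in for solver rollouts can be illustrated with a small sketch. Everything here is an assumption made for illustration: the activations are synthetic, the probe is a closed-form ridge regression, and `frontier_reward` is a hypothetical reward shape. The abstract does not specify the probe architecture or the reward PROPEL uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): each row plays the role of an activation
# vector from the frozen generator reference model for one generated task, and
# each label plays the role of the solver's measured pass rate on that task.
n_tasks, dim = 500, 32
activations = rng.normal(size=(n_tasks, dim))
true_w = rng.normal(size=dim) / np.sqrt(dim)
pass_rate = 1.0 / (1.0 + np.exp(-(activations @ true_w)))  # values in (0, 1)


def fit_ridge_probe(x, y, l2=1.0):
    """Closed-form ridge regression with an unregularized bias term."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    reg = l2 * np.eye(xb.shape[1])
    reg[-1, -1] = 0.0  # do not shrink the bias
    return np.linalg.solve(xb.T @ xb + reg, xb.T @ y)


def predict(w, x):
    """Predicted pass rate, clipped to the valid range [0, 1]."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    return np.clip(xb @ w, 0.0, 1.0)


def frontier_reward(pred, target=0.5, width=0.25):
    """Hypothetical reward: 1 at the target solve rate, 0 beyond +/- width."""
    return np.clip(1.0 - np.abs(pred - target) / width, 0.0, 1.0)


train, held_out = slice(0, 400), slice(400, None)
w = fit_ridge_probe(activations[train], pass_rate[train])
pred = predict(w, activations[held_out])
mae = float(np.mean(np.abs(pred - pass_rate[held_out])))
rewards = frontier_reward(pred)
print(f"held-out MAE = {mae:.3f}, mean reward = {rewards.mean():.3f}")
```

Once fitted, scoring a candidate task costs one matrix-vector product on cached activations rather than a solver rollout, which is the amortization the abstract describes.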

2604.00730 2026-06-18 cs.CY cs.AI cs.LG cs.SE Version update 75%

A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles

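Institutions: Universidad Rey Juan Carlos

Topic match (code evaluation): Fuzzy C-Means clustering to assess Scratch programming skills

AI summary: Proposes a CEFR-based assessment framework for Scratch projects that uses Fuzzy C-Means clustering to grade more than 2 million projects, identifies a B2 bottleneck, and introduces classification-certainty metrics to balance automated feedback with human review.

Comments: Best Paper Award CSEDU 2026 -Minor change FPC fix-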

Abstract

Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic Synchronization and Data Representation, while providing certainty-based triggers for human intervention.
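
The Method step above can be sketched with the standard Fuzzy C-Means updates. The data are synthetic one-dimensional scores, the ordinal mapping simply sorts cluster centers, and the certainty measure (top membership minus runner-up) and the 0.2 review threshold are assumptions for illustration. The paper's actual features, ordinal criterion, and certainty metrics are not given in the abstract.

```python
import numpy as np

LEVELS = np.array(["A1", "A2", "B1", "B2", "C1", "C2"])


def fuzzy_c_means(x, n_clusters, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means; returns (centers, membership matrix u)."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(x), n_clusters))
    u /= u.sum(axis=1, keepdims=True)  # each row of memberships sums to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u


# Synthetic project scores (hypothetical data, not Dr.Scratch output).
rng = np.random.default_rng(1)
scores = np.concatenate(
    [rng.normal(mu, 0.6, size=40) for mu in (2, 5, 8, 11, 14, 17)]
)
x = scores[:, None]

centers, u = fuzzy_c_means(x, n_clusters=len(LEVELS))

# Assumed ordinal criterion: rank clusters by center value, lowest is A1.
order = np.argsort(centers[:, 0])
rank = np.empty(len(LEVELS), dtype=int)
rank[order] = np.arange(len(LEVELS))
levels = LEVELS[rank[u.argmax(axis=1)]]

# Assumed certainty: gap between the two largest memberships of each project.
top2 = np.sort(u, axis=1)[:, -2:]
certainty = top2[:, 1] - top2[:, 0]
needs_review = certainty < 0.2  # hypothetical trigger for instructor review
print(f"{int(needs_review.sum())} of {len(x)} projects flagged for review")
```

Projects with a small membership gap sit between two levels, which is one plausible way to operationalize the "transitional learners" the abstract mentions.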

2606.16000 2026-06-18 cs.CL cs.LG New submission 70%

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasiya Palienko

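Institutions: ITMO University; HSE University

Topic match (code evaluation): evaluates code-generation and AutoML agent performance

AI summary: Proposes GRACE-DS, an isolated environment for evaluating LLM-powered AutoML agents before deployment. Hidden executable validators measure predictive performance, leakage avoidance, reproducibility, and related criteria, and experiments show that its flexible iterative interaction regime outperforms the baselines.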

Abstract

We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

2606.18536 2026-06-18 stat.AP cs.SE New submission 60%

Analytics for Quality Assurance for Item Pools (AQuAP): Monitoring and Maintaining Item Bank Health in AI-Driven Assessment Systems

Alina A. von Davier, Xiaowan Zhang, Yigal Attali, Yena Park, Jacqueline Church, Andrew Runge, Geoff T. LaFlair, Alexander Tsigler

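Topic match (code evaluation): item bank quality monitoring in AI-driven assessment systems

AI summary: Proposes AQuAP, a dashboard environment that monitors item bank quality through metrics such as Effective Bank Size, supporting large-scale item development that combines automation with human review and keeping item banks healthy for high-stakes tests.

Comments: 11 pages, 4 figures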

Abstract

The large-scale digitization of educational assessment has made the continuous oversight of item banks both essential and complex. This paper presents Analytics for Quality Assurance for Item Pools (AQuAP), a dashboard environment for monitoring item quality and item bank health. AQuAP supports the operational implementation of the large-scale item generation procedures for high-stakes tests included in the Item Factory, a framework for automated and human-supported test development. The paper describes AQuAP in relation to the process of item development, outlines the broader metric framework for item-pool quality assurance, and highlights the Effective Bank Size (EBS) as one central indicator of pool vitality. EBS quantifies how many independent test sessions can be constructed before content repetition occurs and, when coupled with exposure and usage metrics, provides insight into item bank security, diversity, and efficiency. We further introduce bank-health metrics, such as maximum exposure, maximum conditional exposure, adjusted effective bank size, and the rarely-administered fraction, all of which extend this picture of item utilization. AQuAP illustrates how operational analytics can translate psychometric concepts into quality assurance tools for high-volume, AI-enabled testing programs. This work is illustrated with the Duolingo English Test (DET) processes.
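
Two of the bank-health metrics named above, maximum exposure and the rarely-administered fraction, can be sketched under simple working definitions. These definitions are assumptions: exposure is taken as the share of sessions in which an item appears, and "rarely administered" as exposure below a hypothetical 5% threshold. The abstract gives no formulas, so EBS itself is not computed here.

```python
from collections import Counter


def exposure_metrics(sessions, bank, rare_threshold=0.05):
    """Per-item exposure rates plus two summary metrics (assumed definitions)."""
    # Count each item at most once per session.
    counts = Counter(item for session in sessions for item in set(session))
    n_sessions = len(sessions)
    rates = {item: counts.get(item, 0) / n_sessions for item in bank}
    max_exposure = max(rates.values())
    rare_fraction = sum(r < rare_threshold for r in rates.values()) / len(bank)
    return rates, max_exposure, rare_fraction


# Toy item bank and administered sessions (hypothetical data).
bank = ["i1", "i2", "i3", "i4", "i5"]
sessions = [["i1", "i2"], ["i1", "i3"], ["i1", "i2"], ["i2", "i3"]]

rates, max_exposure, rare_fraction = exposure_metrics(sessions, bank)
print(rates)          # i1 and i2: 0.75, i3: 0.5, i4 and i5: 0.0
print(max_exposure)   # 0.75
print(rare_fraction)  # 0.4 (two of five items were never administered)
```

In this toy bank, two items carry most of the load while two are never used, the kind of imbalance that exposure and usage metrics are meant to surface.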

2606.18421 2026-06-18 cs.SE New submission 60%

Finding Compiler-Platform Interaction Bugs in Deep Learning Pipelines via Cross-Layer Constraints

Yuxin Qiu, Jiyuan Wang, Ronak Badhe, Ben Limpanukorn, Miryung Kim, Qian Zhang

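Topic match (code evaluation): testing for interaction bugs between deep learning compilers and platforms

AI summary: Proposes XCheck, an automated framework that extracts full-stack constraints to generate test models and uncovers bugs caused by interactions between compilers and hardware platforms, finding 2,034 bug-revealing cases across three compilers.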

Abstract

The growing deployment of artificial intelligence (AI) necessitates robust deep learning (DL) compilers, such as TVM and ONNX-MLIR. These compilers take as input high-level AI models, lower them through multi-layer transformations, and specialize them to diverse hardware. Testing such compilers is uniquely challenging as correctness depends on implicit constraints embedded throughout the compilation stack. Existing testing approaches largely take type constraints to restrict input model generation and therefore emphasize type validation and monitor compilation crashes or coverage gains. This focus overlooks compiler-platform interaction bugs that arise from interleaved effects across compilation and execution environments. In this work, we propose a scalable, automated DL compiler testing framework for, in tandem, (1) finding compiler-platform interaction bugs and (2) enabling behavior equivalence partitioning. Our key insight is that these bugs are caused by violated assumptions arising from interactions across compilation passes and hardware platforms. Therefore, we move beyond constraining input generation and derive full-stack constraints. Our approach is three-fold. First, we design an automated approach to extract full-stack constraints that jointly guide model generation and characterize compilation behaviors. Second, we prioritize constraints that expose interaction-sensitive behaviors, so our generated models are capable of exercising deep compilation logic. Third, we enable behavior equivalence partitioning by automatically inserting assertions to monitor distinct compilation symptoms that coverage or pass/fail signals miss. We evaluated our tool, XCheck, on three widely-used DL compilers and found 2,034 bug-revealing cases, including memory overflows, integer overflows, and silent unexpected compilations that were rooted in compiler-platform interactions.
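
To make the notion of a cross-layer constraint concrete, here is a hedged sketch of one check that such a framework might assert: a tensor is valid at the model level, yet its byte size no longer fits a 32-bit signed offset assumed by a lower layer. The function name, the 32-bit assumption, and the example shapes are hypothetical and are not taken from XCheck.

```python
from math import prod

INT32_MAX = 2**31 - 1


def check_index_constraints(shape, itemsize, index_max=INT32_MAX):
    """Return the list of violated constraints for a tensor of this shape.

    Assumed platform rule (hypothetical): element counts and byte offsets
    are both held in a signed integer bounded by index_max.
    """
    n_elems = prod(shape)
    n_bytes = n_elems * itemsize
    violations = []
    if n_elems > index_max:
        violations.append(f"element count {n_elems} exceeds {index_max}")
    if n_bytes > index_max:
        violations.append(f"byte size {n_bytes} exceeds {index_max}")
    return violations


# A float32 tensor with 2**29 elements: the count fits, the byte size does not.
borderline = check_index_constraints((1024, 1024, 512), itemsize=4)
# A float32 tensor with 2**32 elements: both constraints are violated.
too_large = check_index_constraints((65536, 65536), itemsize=4)
# A small tensor satisfies both.
fine = check_index_constraints((224, 224, 3), itemsize=4)

print(borderline)
print(too_large)
print(fine)
```

The borderline case is the interesting one: a type-level check on the input model would accept it, and only a constraint that combines the shape with a platform assumption flags it.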
