Code LLMs / AI Programming - arXivDaily Topic

2606.20517 2026-06-19 cs.AI cs.PL New submission 95%

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

Affiliations: GigaCode; Yandex School of Data Analysis, Applied AI Institute

Topic match (code evaluation): proposes Multi-LCB, a cross-language code generation benchmark for evaluating LLM coding ability

AI summary: Proposes the Multi-LCB benchmark, which extends LiveCodeBench's Python tasks to 12 programming languages to evaluate cross-language code generation in LLMs, and finds evidence of Python overfitting and language-specific contamination.

Comments ICLR 2026

Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
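
The release-date filtering that the abstract credits for contamination-aware evaluation can be sketched as follows. This is a minimal illustration only: the `Problem` record, its field names, and the cutoff date are assumptions made for the example, not the Multi-LCB or LiveCodeBench data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the field names are assumptions for illustration.
    problem_id: str
    release_date: date
    language: str


def contamination_free(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


problems = [
    Problem("p1", date(2024, 3, 1), "python"),
    Problem("p2", date(2025, 1, 15), "rust"),
    Problem("p3", date(2025, 6, 30), "go"),
]
fresh = contamination_free(problems, training_cutoff=date(2024, 12, 31))
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```

Because the filter depends only on the release date, the same cutoff can be applied unchanged to every target language derived from a given problem.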

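2606.19830 2026-06-19 cs.SE cs.CL New submission 90%

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

Affiliations: Nankai University; Shanghai Innovation Institute; Shanghai AI Laboratory

Topic match (code evaluation): a project-level game code framework dataset and benchmark for evaluating code generation models.

AI summary: Presents JamSet and JamBench, the first project-level code framework dataset and benchmark built on a professional game engine. A deterministic verification pipeline distills 8,133 verified projects from more than 240,000 repositories, and an evaluation of 9 frontier models shows capability dropping sharply as project size grows.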

Abstract

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

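2606.20502 2026-06-19 cs.CR cs.AI cs.SE New submission 85%

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

Arastoo Zibaeirad, Marco Vieira

Affiliations: University of North Carolina at Charlotte

Topic match (code evaluation): evaluates LLM ability to detect vulnerabilities in systems software

AI summary: Proposes the CWE-Trace framework, which uses 834 Linux kernel samples and two diagnostic metrics (DFI and HDD) to evaluate LLM vulnerability detection. It finds that data contamination gives no substantive advantage, that fine-tuning shifts only the output threshold rather than the decision policy, and that the models lack genuine security reasoning.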

Abstract

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable-patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and about 31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.

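2606.19613 2026-06-19 cs.SE cs.AI New submission 85%

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

Affiliations: AWS Agentic AI

Topic match (code evaluation): proposes StaminaBench to stress-test the stamina of coding agents.

AI summary: Proposes the StaminaBench benchmark, which tests coding-agent stamina with 100 consecutive change requests. All models fail within 5-6 turns, while test feedback and retries raise the number of passed turns by up to 12x.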

Abstract

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
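
The stamina metric described above (consecutive turns handled before the first failure) can be sketched in a few lines. The function names and the boolean-per-turn input are assumptions for illustration; the benchmark's actual scoring code, including how retries are counted, may differ.

```python
def stamina(turn_results):
    """Number of consecutive passed turns before the first failure."""
    passed = 0
    for ok in turn_results:
        if not ok:
            break
        passed += 1
    return passed


def mean_stamina(scenarios):
    """Average stamina over several scenarios (each a list of per-turn booleans)."""
    return sum(stamina(s) for s in scenarios) / len(scenarios)


# Two hypothetical 100-turn scenarios: one fails at turn 6, one at turn 7.
scenarios = [
    [True] * 5 + [False] + [True] * 94,
    [True] * 6 + [False] * 94,
]
print([stamina(s) for s in scenarios], mean_stamina(scenarios))  # [5, 6] 5.5
```

Unlike a fraction-of-tasks-solved score, later successes do not count once a turn has failed, which is why the first scenario scores 5 even though 99 of its turns pass.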

2606.19388 2026-06-19 cs.SE cs.CL cs.HC New submission 85%

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

Affiliations: Mila – Québec AI Institute; Concordia University; University of Toronto; McMaster University

Topic match (code evaluation): evaluates coding agents on mobile platforms.

AI summary: Challenges the GUI-dominated paradigm for mobile agents and argues that CLI deserves equal weight. Experiments show CLI agents surpassing GUI baselines on AndroidWorld and MobileWorld, and a new CLI-Advantage task suite demonstrates where CLI has the edge.

Abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

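2606.06747 2026-06-19 cs.SE New submission 85%

Tensor Algebraic Property Skeletons: Amplifying Property-Based Testing for AI Compilers

Yuxin Qiu, Ben Limpanukorn, Seongmin Lee, Jiyuan Wang, Qian Zhang, Miryung Kim

Topic match (code evaluation): LLM-generated property tests that detect semantic drift in AI compilers

AI summary: Proposes the Propilot framework, which uses an LLM to represent tensor algebra knowledge as reusable property skeletons and automatically generates executable property-based tests to detect semantic drift in AI compilers.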

Comments v2 adds citations and fixes some typos

Abstract

Deep learning (DL) compilers such as TVM and ONNX-MLIR lower tensor computation graphs into optimized executables for target backends. Testing these compilers has made substantial progress in generating well-formed inputs in the context of fuzzing. However, such generation alone does not catch semantic drifts from algebraic invariants that graph transformations and optimizations are expected to preserve. While tensor algebra has been studied for decades, it has not been transformed into executable property-based tests (PBTs) for DL compilers because doing so requires the time-consuming and error-prone task of jointly constructing operators, tensors, and oracles. The central challenge is no longer generating well-formed inputs for fuzzing DL compilers, but bootstrapping executable PBTs with such inputs and correct oracles based on tensor algebra. We realize this vision in Propilot, an LLM-driven agentic property-based testing framework for DL compilers. First, Propilot represents tensor algebra knowledge as reusable property skeletons, each coupled with operator constraints and oracle templates. Second, given a target compiler, Propilot instantiates these skeletons into executable PBTs by generating paired tensor computation graphs, tensor inputs, and expected semantic relations as oracles. Third, to prevent generated tests from degenerating into invalid or uninformative PBTs, Propilot validates each PBT candidate before execution for applicability and safety. Validation feedback, execution results, and coverage signals guide subsequent generation. We evaluate Propilot on TVM with 212 operators and 20 property skeletons, generating 4,579 PBTs. Compared with direct LLM-based PBT generation, Propilot reduces redundancy by 49% and eliminates invalid tests through explicit property skeletons. This effectiveness translates into finding semantic errors and numerical discrepancies.
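
To make the idea of an executable property-based test concrete, here is a minimal sketch of one algebraic invariant, transpose(A·B) = transpose(B)·transpose(A), checked on random inputs. It is not Propilot's implementation: the helper names are hypothetical, plain Python lists stand in for compiler-executed tensor graphs, and integer entries are used so the oracle can be exact equality rather than a numerical tolerance.

```python
import random


def matmul(a, b):
    """Naive matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def random_matrix(rng, rows, cols):
    return [[rng.randint(-5, 5) for _ in range(cols)] for _ in range(rows)]


def check_transpose_of_product(seed, trials=100):
    """Property: transpose(A @ B) == transpose(B) @ transpose(A).

    Returns a counterexample (A, B), or None if every trial satisfies the property.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        # Operator constraint: the inner dimensions of A and B must agree.
        m, k, n = (rng.randint(1, 4) for _ in range(3))
        a, b = random_matrix(rng, m, k), random_matrix(rng, k, n)
        lhs = transpose(matmul(a, b))              # first computation graph
        rhs = matmul(transpose(b), transpose(a))   # algebraically equivalent graph
        if lhs != rhs:                             # oracle: exact equality on integers
            return a, b
    return None


print(check_transpose_of_product(seed=0))  # None
```

The three ingredients the abstract names are all visible here: a shape constraint on the operators, generated inputs, and an oracle relating two graphs that should be semantically equal.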

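2606.20436 2026-06-19 cs.CR cs.AI New submission 80%

Multi-View Decompilation for LLM-Based Malware Classification

Bercan Turkmen, Vyas Raina

Affiliations: Independent Researcher; SPARK

Topic match (code evaluation): uses LLMs to classify malware from decompiled code

AI summary: Proposes multi-decompiler views to improve LLM malware classification, using complementary pseudo-C from Ghidra and RetDec to raise recall and F1.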

Abstract

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.

2606.20146 2026-06-19 cs.AI New submission 80%

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

Bharathi Kannan Nithyanantham, Clemens Kujat, Tobias Sesterhenn, Stefan Telgmann, Jörn Plönnigs, Stefan Lüdtke, Christian Bartelt

Affiliations: University of Rostock; Clausthal University of Technology

Topic match (code evaluation): a benchmark evaluating LLMs on editing building information models.

AI summary: Proposes the BIM-Edit benchmark, which evaluates natural-language editing of IFC-format building information models by LLMs across 324 tasks. The best model averages only 49.5%, exposing the gap between current capabilities and engineering requirements.

Abstract

Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories (direct, spatial, and topological), covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves only 49.5% average score across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.

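2606.20128 2026-06-19 cs.SE cs.DC cs.LG New submission 80%

The Correctness Illusion in LLM-Generated GPU Kernels

Dipankar Sarkar

Affiliations: Arizona State University, USA

Topic match (code evaluation): evaluates the correctness of LLM-generated GPU kernels.

AI summary: Using a high-precision CPU reference and op-schema-aware fuzzing, shows that the fixed-shape allclose checks in existing benchmarks fail to detect LLM-style transcription errors, and proposes and validates a new protocol.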

Comments 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace

Abstract

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
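
The gap between a single fixed-shape check and seeded fuzzing can be illustrated with a CPU-only toy. The sketch below is a simplified stand-in, not the paper's protocol: the "kernel" is a plain Python reduction, the bug (dropping the tail that does not fill a block) and the block size are hypothetical, and one absolute tolerance replaces the per-(op, dtype) tolerances.

```python
import math
import random

BLOCK = 8  # hypothetical block size for the toy kernel


def reference_sum(xs):
    """High-precision reference: Python floats are fp64 and fsum is exactly rounded."""
    return math.fsum(xs)


def buggy_block_sum(xs):
    """Transcription-style bug: elements past the last full block are silently dropped."""
    full = len(xs) // BLOCK * BLOCK
    return math.fsum(xs[:full])


def fixed_shape_check(kernel, n=64, seed=0, atol=1e-9):
    """Single-shape oracle: one input whose length happens to be a multiple of BLOCK."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return abs(kernel(xs) - reference_sum(xs)) <= atol


def seeded_fuzz(kernel, seed, trials=50, atol=1e-9):
    """Vary the shape per trial; return the first failing (trial, length) or None."""
    rng = random.Random(seed)
    for trial in range(trials):
        n = rng.randint(1, 100)
        xs = [rng.uniform(-1, 1) for _ in range(n)]
        if abs(kernel(xs) - reference_sum(xs)) > atol:
            return trial, n
    return None


print(fixed_shape_check(buggy_block_sum))     # True: the fixed shape hides the bug
print(seeded_fuzz(buggy_block_sum, seed=1))   # a failing (trial, length) pair
print(seeded_fuzz(reference_sum, seed=1))     # None: the correct control stays clean
```

Because every input is derived from the stored seed, rerunning `seeded_fuzz` with the same seed reproduces the same failing case, which mirrors the replay property the abstract mentions.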

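2606.19710 2026-06-19 cs.CL cs.AI New submission 80%

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

Elijah Feldman, Dipak Meher, Carlotta Domeniconi

Affiliations: Thomas Jefferson High School for Science and Technology

Topic match (code evaluation): fine-tunes an LLM for NER and RE in knowledge graph construction.

AI summary: Proposes FineREX, a pipeline built on a fine-tuned LLM that extracts entities and relationships from legal documents to build knowledge graphs. It improves entity and relationship F1 by 15.50% and 31.46%, respectively, and cuts processing time by 50%.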

Comments Code available at https://github.com/ElijahFeldman7/FineREX

Abstract

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of 512 text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.

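2606.20134 2026-06-19 cs.LO cs.PL New submission 70%

An MSO Framework for Weak-Memory Verification and Robustness

Giovanna Kobus Conrado, Andreas Pavlogiannis

Topic match (code evaluation): an MSO framework for weak-memory verification and robustness.

AI summary: Studies monadic second-order logic as a metatheory for weak memory. It proves that sequentially consistent executions have bounded treewidth while TSO executions do not, shows that several models are MSO-axiomatizable, and introduces reads-from robustness, yielding a uniform verification algorithm.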

Comments Accepted at CONCUR 2026

Abstract

Memory models are formal specifications of concurrent-program executions, accounting for weak behaviors introduced by compiler and architectural optimizations. The increase of their number and complexity has spawned efforts for uniform verification across whole classes of models, by axiomatizing the models in an adequate metatheory that admits a uniform treatment. In this work, we formally study Monadic Second-Order logic (MSO) as a metatheory for weak memory, by proving results on the treewidth and MSO-expressibility of various popular weak-memory models, as this combination allows us to uniformly tackle several verification problems. In summary, our results are as follows. First, we prove that executions under Sequential Consistency ($\mathsf{SC}$) have bounded treewidth, while already those under Total Store Order ($\mathsf{TSO}$) do not. Second, we prove that a broad range of models, including Release/Acquire and the full RC20, are MSO-axiomatizable, while others, such as Strong Release/Acquire and $\mathsf{TSO}$, are not, unless the Orthogonal Vectors problem (which requires quadratic time under SETH) can be solved in linear time. Finally, we introduce the notion of reads-from robustness, as an extension to recent work on coarse robustness criteria. We show that our treewidth bounds (both upper and lower) have far-reaching algorithmic implications for any of our MSO-axiomatizable models $\mathsf{MM}$: there is an algorithm that, for every program $\mathsf{P}$, either verifies $\mathsf{P}$ under $\mathsf{MM}$ or reports that $\mathsf{P}$ is not reads-from robust against $\mathsf{MM}$. Overall, our results establish a rich and versatile theoretical framework for weak-memory verification and robustness.

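2606.19654 2026-06-19 cs.CR cs.SE New submission 70%

PUFFERDOS: Efficient and Effective Attack String Generation for Regular Expression Denial of Service Vulnerabilities

Shangzhi Xu, Ziqi Ding, Xiao Cheng, Yuekang Li, Nan Sun, Benjamin Turnbull, Shuangxiang Kan, Siqi Ma

Topic match (code evaluation): generates attack strings for regular expression denial of service, involving program analysis

AI summary: Proposes PUFFERDOS, which defines three vulnerable patterns and combines a synthesis technique with compositional concolic execution to generate ReDoS attack strings that fit realistic length budgets and are validated at the program level.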

Comments Accepted by S&P'26

Abstract

ReDoS attacks constitute a critical class of resource-exhaustion vulnerabilities. In such attacks, adversaries exploit the pathological worst-case execution behavior of regular expression (regex) engines to induce highly asymmetric computational workloads, ultimately exhausting system resources and degrading service availability. To protect systems against ReDoS attacks, numerous detection techniques have been proposed that simulate the attack process by generating attack strings to proactively exploit ReDoS vulnerabilities at the early development stage and facilitate remediation. Existing techniques broadly fall into two classes: static analyses that search for pathological regex structures, and dynamic exploration methods that synthesize candidate attack strings. However, the generated attack strings are often impractical for real-world exploitation because they usually assume unrealistic input-length budgets and do not validate the effectiveness and efficiency of the attack at the program level. Therefore, many generated strings fail to trigger vulnerable regexes when applied to real-world programs, further limiting the practical utility. To address these shortcomings, we introduce an effective and efficient attack string generator, PUFFERDOS, designed to synthesize attack inputs that are both feasible within realistic length budgets and validated at the program level, enabling effective exploitation of ReDoS vulnerabilities in real-world programs. Specifically, we first define three vulnerable patterns based on our observation and formal verification. According to the patterns, PUFFERDOS conducts a synthesis technique to generate attack strings, and then refines and validates the strings with ReDoS-specific compositional concolic execution to guarantee real-world exploitability.
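
For readers unfamiliar with ReDoS, the sketch below shows the textbook case of a nested quantifier in a backtracking engine, using Python's standard `re` module. The pattern and the attack-string shape are a generic illustration chosen for this example; they are not taken from the paper and are not one of its three vulnerable patterns.

```python
import re
import time

# Classic nested-quantifier pattern, a standard ReDoS teaching example.
VULNERABLE = re.compile(r"^(a+)+$")


def attack_string(n):
    """n copies of the pumped character plus a suffix that forces the match to fail."""
    return "a" * n + "!"


def match_seconds(pattern, text):
    """Return (matched, elapsed_seconds) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


# Short inputs keep the demo fast; the time grows roughly exponentially with n.
for n in (10, 14, 18):
    matched, seconds = match_seconds(VULNERABLE, attack_string(n))
    print(n, matched, f"{seconds:.4f}s")
```

The length budget matters in practice: an input a few dozen characters long is enough to stall this pattern, whereas an attack string that needs megabytes of input would usually be rejected by request-size limits before it reaches the regex.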

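2606.20129 2026-06-19 cs.SE New submission 60%

Learning Critical Testing Literacy Through Puzzles: an Experience Report

Niels Doorn, Bart Th. Knaack, Tanja E. J. Vos, Beatriz Marín

Topic match (code evaluation): learning software testing literacy through puzzles.

AI summary: Reports experience from 13 workshops that used puzzles to teach critical testing literacy (CTL). Participants learned most from the full sequence of solving, debriefing, and reflecting, and the authors also developed an open-source analytics tool.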

Abstract

In this paper, we report our experiences and takeaways from workshops using puzzles to learn CTL. Background: Software testing is important yet difficult to teach. We introduced a BoK of puzzle-based learning activities to teach CTL, based on a model of critical tester's cognition, leading to the pedagogical framework P4TEST. We conducted thirteen workshops with students, testers, teachers, and primary school pupils to assess puzzle-based teaching of critical testing literacy. Experience: Across eleven workshops, we used a semi-structured approach, varying puzzles, materials, and timing. In two additional workshops, we introduced workbooks and think-aloud sessions to gather more data on the learning experience. Observations: Participants consistently perceived themselves as experimenting while solving puzzles. Students tended to converge on solutions, while professionals continued exploring. Emotions were visible in behaviour but hard to surface through written reflection alone. Think-aloud sessions revealed immediate reasoning; written reflections elicited more meta-cognitive reflection. The theme Sensemaking / reflection-in-action captured how participants framed problems, navigated dead ends, and shifted strategies. Reflections: Puzzles are not the intervention: the entire sequence of solving, debriefing, and reflecting is. Designing that sequence more deliberately is the work ahead. We also developed an open-source web application with built-in analytics to customise workshops.

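2606.20370 2026-06-19 astro-ph.IM astro-ph.GA New submission 60%

ELMA: ELlipse-based bar MAjor axis estimator

Bruna R. Bragança de Lima, Andressa Wille, Rafael S. de Souza, Ana L. Chies-Santos

Topic match (code evaluation): a Python package for automated estimation of galaxy bar length

AI summary: Presents the ELMA Python package, which estimates galaxy bar length automatically through iterative elliptical-isophote fitting, validated on JWST/NIRCam images of GOODS-South.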

Comments 4 pages, 1 figure, published in RNAAS

Journal ref Research Notes of the AAS, Volume 10, Number 6, 2026

DOI: 10.3847/2515-5172/ae7d2d

Abstract

Galactic bars are key non-axisymmetric structures in disk galaxies, driving angular-momentum redistribution and contributing to secular evolution, central mass build-up, and the formation of nuclear structures. Robust and homogeneous measurements of bar length, however, remain challenging, particularly for large imaging surveys, where manual estimates are time-consuming and sensitive to methodological choices. We introduce elma, a standalone, pip-installable Python package for automated bar-length estimation in galaxies already identified as candidate barred systems. The method operates directly on two-dimensional imaging data, using iterative elliptical-isophote fitting to trace the radial ellipticity profile and identify a projected bar-length estimate from the semi-major axis associated with the local maximum in ellipticity. Using the image WCS information and a user-supplied redshift, elma converts the angular measurement into a projected physical length. We demonstrate the package on JWST/NIRCam imaging of barred galaxies in the GOODS-South field. The code is released under the MIT license in a GitHub repository.
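
The two steps named in the abstract, picking the semi-major axis at a local ellipticity maximum and converting an angle to a projected length, can be sketched as below. This is not elma's API: both function names are hypothetical, taking the first local maximum is an assumption, and the angular-diameter distance is passed in directly instead of being derived from a redshift and a cosmology.

```python
import math


def bar_length_from_profile(semi_major_axes, ellipticities):
    """Semi-major axis at the first local maximum of ellipticity, or None if absent.

    Assumption: the first interior local maximum is taken as the bar signature.
    """
    for i in range(1, len(ellipticities) - 1):
        if ellipticities[i - 1] < ellipticities[i] > ellipticities[i + 1]:
            return semi_major_axes[i]
    return None


def projected_kpc(angle_arcsec, angular_diameter_distance_mpc):
    """Small-angle conversion: length = angle in radians * angular-diameter distance."""
    angle_rad = math.radians(angle_arcsec / 3600.0)
    return angle_rad * angular_diameter_distance_mpc * 1000.0  # Mpc -> kpc


# Toy radial profile: ellipticity rises along the bar and falls beyond its end.
sma_arcsec = [0.2, 0.4, 0.6, 0.8, 1.0]
ellipticity = [0.10, 0.30, 0.50, 0.40, 0.20]
bar_arcsec = bar_length_from_profile(sma_arcsec, ellipticity)
print(bar_arcsec)  # 0.6
```

A real profile is noisier than this toy one, so a production tool would typically smooth the profile or apply a minimum-prominence rule before accepting a peak.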
