arXivDaily: daily arXiv academic digest, updated Monday through Friday
cs.SE Software Engineering (27)
2606.12320 2026-06-11 cs.AI cs.CC cs.CR cs.SE New submission

A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

Krti Tallam

Affiliation: Kamiwaza

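AI summary: Production AI agents break the data-boundary assumption behind traditional governance. The paper proposes a five-plane reference architecture (one reasoning plane and four enforcement planes) that achieves runtime governance through composable primitives, forecloses seven threats, and states and argues for four correctness invariants.

Comments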
65 pages, 3 figures, 5 tables. Reference architecture with a reference implementation of the policy-engine core and microbenchmark results; full-system evaluation identified as future work
Abstract

Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access control, data-loss prevention, perimeter inspection -- governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context, calls tools, invokes connectors, and modifies systems of record on an enterprise's behalf, so risk moves inside the workflow, into sequences of individually-permitted actions that may transform a business process no one authorized. Existing policy engines do not extend to this regime: they evaluate request-time decisions against atomic principals, where agentic systems require stateful evaluation against composite principals whose authority attenuates through delegation chains. We present a reference architecture for the runtime governance of production agents, built from four composable primitives: a five-plane decomposition (a reasoning plane that adjudicates intent, and four enforcement planes -- network, identity, endpoint, data -- that realize the decision), stop-anywhere mediation, composite principals with capability attenuation, and audit as a structured evidence substrate. We define a taxonomy of six interruption primitives that generalize allow and deny, state and argue for four correctness invariants, and demonstrate the foreclosure of seven production-agent threats across five concrete workflows. A reference implementation of the policy-engine core supplies measured evidence: attenuation correctness and evidence reconstructability hold on every trial, adjudication runs in single-digit microseconds, and the audit substrate's tamper-evidence behaves exactly as designed. We are explicit about scope: the architecture governs delegated action, not model behavior, and a full-system evaluation against a live agent benchmark is the invited next step.
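
The two measured properties named in the abstract, attenuation correctness and tamper-evidence, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Principal`, `effective_capabilities`, `adjudicate`, and `AuditLog` are hypothetical, attenuation is modeled as set intersection along the delegation chain, and the audit substrate is modeled as a SHA-256 hash chain. The abstract describes six interruption primitives; this sketch shows only allow and deny, and it is not the paper's reference implementation.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """One link in a delegation chain (hypothetical model)."""
    name: str
    capabilities: frozenset  # actions this link holds and may pass on


def effective_capabilities(chain):
    """Authority of a composite principal: the intersection along the chain."""
    caps = chain[0].capabilities
    for link in chain[1:]:
        caps = caps & link.capabilities  # delegation can only narrow authority
    return caps


def adjudicate(chain, action):
    """Return 'allow' or 'deny' for an action requested at the end of the chain."""
    return "allow" if action in effective_capabilities(chain) else "deny"


class AuditLog:
    """Append-only log in which each record commits to the previous record's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.records = []

    def append(self, event):
        prev = self.records[-1]["hash"] if self.records else self.GENESIS
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.records.append({"event": event, "prev": prev, "hash": digest})

    def verify(self):
        """Recompute the chain; any edited or reordered record breaks it."""
        prev = self.GENESIS
        for rec in self.records:
            body = json.dumps(rec["event"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode()).hexdigest()
            if rec["prev"] != prev or rec["hash"] != expected:
                return False
            prev = rec["hash"]
        return True
```

With a chain of user, agent, and connector, an action is allowed only if every link holds it, so a connector cannot gain a capability that the delegating user never had. Editing any logged event afterwards makes `verify()` return `False`.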

2606.12231 2026-06-11 cs.SE cs.AI New submission

Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study

Guangzong Cai, Ruiyin Li, Peng Liang, Zengyang Li, Mojtaba Shahin

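AI summary: By mining 7,310 rules from 83 open-source projects and surveying 99 practitioners, the study builds a rule taxonomy with 5 primary and 25 secondary categories. Developers value architectural constraints, yet actual configurations mostly contain low-level workflow and code-formatting rules. Rule evolution is driven mainly by constructive context expansion and enrichment, and updating rules raises the average artifact compliance rate by 22.99 percentage points.

Comments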
52 pages, 21 images, 8 tables, Manuscript submitted to a Journal (2026)
Abstract

The adoption of AI-powered Integrated Development Environments (AI IDEs) has introduced "Rules" as a novel software artifact, allowing developers to persistently inject project-specific constraints and architectural guidelines into the context of Large Language Models (LLMs). Despite their role in aligning AI behavior with developer intent, the taxonomy, evolution, and practical impact of these rules remain largely unexplored. To bridge this gap, we conducted a mixed-methods empirical study on AI IDE rules. By mining 83 open-source projects and extracting 7,310 rules, we established a comprehensive taxonomy comprising 5 primary and 25 secondary categories. We then triangulated these artifacts with survey responses from 99 practitioners. Our analysis identified a contrast between developer priorities and actual configurations: while practitioners rate architectural constraints as highly important, rule files in repositories primarily consist of low-level workflow and code formatting constraints. Furthermore, our analysis of 1,540 rule evolution events revealed that rules are updated frequently. Repository data further indicate that rule evolution is primarily driven by constructive context expansions (29.17%) and enrichments (26.59%). In contrast, surveyed developers reported modifying rules primarily to correct AI errors (77.78%), typically by adding new negative constraints rather than editing existing ones. Finally, an artifact compliance assessment of 160 rule evolution events revealed that updating rules significantly improves the adherence of software artifacts, with the average artifact compliance rate increasing by 22.99 percentage points (from 49.14% to 72.13%) following an update. Our study provides empirical insights that can help developers optimize prompting strategies and guide tool builders in designing automated conflict-detection and context-management mechanisms for AI IDEs.

2606.12212 2026-06-11 cs.SE cs.CR New submission

Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps

Pinran Gao, Lingxiang Wang, Ying Zhang, Fan Yang

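AI summary: The first systematic study of LLM API key leakage in iOS apps. Using the dynamic analysis framework LLMKeyLens on 444 apps, it finds 282 with exploitable credential leaks, identifies three leakage patterns, and reports that only 28% had been fixed three months after disclosure.

Comments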
12 pages, 4 figures, 4 tables
Abstract

The rapid integration of large language models (LLMs) into mobile applications has introduced a new class of credential security risk: leaked credentials that grant unauthorized access to LLM inference services, causing financial damage to developers. Prior work on credential leakage has focused primarily on Android apps; to date, no empirical study has systematically investigated LLM API key leakage in iOS applications. We present the first in-depth empirical study of API key leakage in LLM-integrated apps. We construct a high-quality dataset of 444 iOS applications, filtered from 1092 candidates through a standardized process, and develop LLMKeyLens, a dynamic analysis framework that detects LLM API key leakage via traffic interception, provider-specific key extraction, and active validity confirmation, requiring neither source code access nor binary decryption. Our analysis reveals that 282 applications expose exploitable LLM API credentials in network traffic, spanning at least ten providers. We identify three leakage patterns: JWT-based token leakage (48%), unauthenticated backend proxy access (33%), and plaintext API key transmission (19%). To assess remediation, we re-analyzed the same 282 vulnerable applications three months after responsible disclosure; only 28% had remediated the reported vulnerability, while 72% remained exploitable, with persistent issues stemming from unauthenticated backends and broken JWT implementations. Our findings show that LLM API key leakage is both prevalent and persistent in the iOS ecosystem, exposing a systemic gap between developer practice and secure integration principles, and suggest that secure LLM integration requires not only developer awareness but also explicit security guidance from providers and platform-level enforcement.
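
To make the leakage patterns concrete, the following is a rough sketch of how a credential seen in intercepted traffic might be sorted into a JWT, an opaque key, or no credential at all. It is not LLMKeyLens: the function names are hypothetical, only the `Authorization: Bearer` header is inspected, and the JWT check relies solely on the standard compact form (three dot-separated base64url segments whose header is a JSON object with an `alg` field). Provider-specific extraction and active validity confirmation are out of scope here.

```python
import base64
import json


def _b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def looks_like_jwt(token):
    """True if the token has three segments and a JSON header carrying 'alg'."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    try:
        header = json.loads(_b64url_decode(parts[0]))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return False
    return isinstance(header, dict) and "alg" in header


def classify_authorization(headers):
    """Classify the credential in a captured request's headers (hypothetical labels)."""
    value = headers.get("Authorization", "")
    if not value:
        return "no-credential"
    scheme, _, credential = value.partition(" ")
    if scheme.lower() != "bearer" or not credential:
        return "other"
    return "jwt" if looks_like_jwt(credential) else "opaque-key"
```

A request to a backend proxy that carries no credential at all is only a hint of the unauthenticated-proxy pattern; confirming it would require checking that the endpoint actually serves LLM responses without authentication.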

2606.12064 2026-06-11 cs.SE cs.CR New submission

Undefined Behavior in C and C++: An Experiment With Desktop Use Cases

Jukka Ruohonen, Krzysztof Sierszecki

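AI summary: Using an undefined behavior sanitizer implemented in a compiler, the experiment finds that undefined behavior is common in C/C++ programs on a Linux desktop: 59 tasks produced nearly 11 thousand unique warnings, most of them associated with the Mesa graphics library and triggered by GUI interaction.

Comments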
Submitted
Abstract

Undefined behavior is idiomatic to C and C++ programming; such behavior is a use of an erroneous program construct for which the languages impose no requirements, such as integer overflows. The paper presents an empirical experiment seeking to probe the extent of undefined behavior executing underneath typical desktop use of a Linux distribution. The analysis is based on an undefined behavior sanitizer implemented in a compiler. According to the results, undefined behavior is common. By completing 59 simple experimental tasks, nearly 11 thousand unique undefined behavior warnings were generated by 32 unique programs and libraries written in C or C++. Of these warnings, most were associated with the Mesa graphics library and generated by interacting with graphical user interfaces. Merely logging into the GNOME desktop environment generated over 500 unique warnings. Of all warnings, the clear majority was about virtual table pointers. The associated stack traces were also lengthy in general. With these and other results, the paper contributes to the empirical literature on C and C++.

2606.11976 2026-06-11 cs.SE cs.AI New submission

Exploration Structure in LLM Agents for Multi-File Change Localization

Akeela Darryl Fattha, Kia Ying Chua, Lingxiao Jiang, Laura Wynter

AI summary: For changes spanning multiple subsystems, the paper proposes non-linear, domain-scoped parallel agentic exploration. On the SWE Bench Pro benchmark, a small Haiku-class model with parallel domain-agent spawning achieves a high micro F1 and outperforms linear sequential exploration.

Abstract

Software engineering tools increasingly rely on LLM-based agents to localize the files that must change to resolve a software issue. Most AI agents explore repositories linearly, that is, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear sequential exploration against non-linear, domain-scoped parallel agentic exploration. Using SWE Bench Pro as an initial benchmark, we focus on ansible as an exemplar. We construct an approach for persistent-session evaluation of GitHub issues anchored at a single base commit. We compare our non-linear domain-agent file traversal system against a base LLM without direct repository access, a single-agent Recursive Language Model (RLM) baseline with a persistent Python REPL, and an external CLI baseline using Codex 5.5 High. Domain-scoped parallel agent spawning with a small Haiku-class model achieves the highest micro F1 among Haiku-class models by a large margin. On our own expanded benchmark, which includes more recent PRs from 2025 and 2026, the domain-agent system ranks second, behind only the much larger Codex 5.5 High. On the original, curated, 2020 SWE-bench Pro benchmark, a larger Sonnet plain LLM baseline attains higher micro F1 by predicting few files, leading to higher precision, but at significantly lower all-gold recall. We also present three additional findings. First, documentation evolution is a latent dependency unresolved by any approach. Second, naive file system access can degrade localization through test-file over-prediction. Lastly, forced multi-agent consultation does not measurably help and raises token cost substantially.
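
The trade-off described above, where predicting few files raises precision but lowers recall, is easier to see with the metrics written out. The sketch below computes micro-averaged precision, recall, and F1 over per-issue file sets. The function names are hypothetical, and `all_gold_recall` encodes one plausible reading of "all-gold recall" (the share of issues for which every gold file was predicted); the abstract does not define the term, so treat that definition as an assumption.

```python
def micro_f1(predictions, gold):
    """Micro-averaged precision, recall, and F1 over per-issue sets of file paths."""
    tp = fp = fn = 0
    for issue_id, gold_files in gold.items():
        predicted = predictions.get(issue_id, set())
        tp += len(predicted & gold_files)   # correctly predicted files
        fp += len(predicted - gold_files)   # predicted but not in the gold patch
        fn += len(gold_files - predicted)   # gold files that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def all_gold_recall(predictions, gold):
    """Share of issues whose gold files are all predicted (assumed definition)."""
    if not gold:
        return 0.0
    hits = sum(1 for i, g in gold.items() if g <= predictions.get(i, set()))
    return hits / len(gold)
```

Because counts are pooled across issues before dividing, a model that predicts a single safe file per issue can score well on micro precision while still failing to cover any multi-file change completely.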

2606.11916 2026-06-11 cs.SE cs.AI New submission

Characterizing Software Aging in GPU-Based LLM Serving Systems

Domenico Cotroneo, Bojan Cukic

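AI summary: Proposes an empirical methodology for studying software aging in GPU-based LLM serving systems. A 216-hour campaign finds statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and configuration, and provides a reproducible framework.

Comments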
7 pages
Abstract

This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on CPU-centric software with relatively regular workloads; LLM serving is different, spanning a Python host and a CUDA device, handling requests whose cost varies by orders of magnitude, and relying on rapidly evolving software stacks. We run a 216-hour campaign across six co-located deployments under identical stress conditions, monitor host, device, and client metrics in parallel, and apply a statistical pipeline that accounts for autocorrelation and multiple testing. Our results reveal statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and deployment configuration. Beyond these findings, we provide a reproducible framework that opens a research direction at the intersection of the software aging and rejuvenation and LLM serving communities.
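
The abstract does not name the estimator behind its leak rates. As an illustration only, the sketch below uses Sen's slope (the median of all pairwise slopes), a robust trend estimator often seen in the software aging literature; its use here is an assumption, not a description of the paper's pipeline, and it does not address autocorrelation or multiple testing.

```python
from itertools import combinations
from statistics import median


def sens_slope(times, values):
    """Median of pairwise slopes: a trend estimate that tolerates outliers."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    if not slopes:
        raise ValueError("need at least two samples at distinct times")
    return median(slopes)


# Memory samples (arbitrary units) at hourly intervals, with one spike at the end.
hours = [0, 1, 2, 3, 4]
memory = [10, 12, 14, 16, 100]
leak_rate = sens_slope(hours, memory)  # units per hour
```

A least-squares fit would be pulled upward by the final spike, whereas the median of pairwise slopes stays at the underlying growth of 2 units per hour.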

2606.11869 2026-06-11 cs.SE cs.AI New submission

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

Marc Alier Forment, Juanan Pereira, Francisco José García-Peñalvo, María José Casañ Guerrero

AI summary: Proposes a framework-free methodology for building custom AI agents end to end, built on two preconditions (the LLM as a software component, and building blocks) and three practices (prototyping, packaging as a CLI, and agent-tests-agent).

Abstract

Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the practice that chains them lives in podcasts, blogs, and leaked system prompts. This paper writes that practice down as a methodology, Agents All the Way Down: two preconditions crossed once and kept, then three practices repeated for the agent's life. The preconditions are (P1) Substrate, the LLM as a software component, framed as tools, then system, then messages under prompt-caching; and (P2) Building blocks: function calling, MCP, CLI orchestration, the liteshell pattern, the agent loop, skills, characters, hooks, and scaffolding. The practices are (P3) prototype with a general-purpose agent; (P4) harvest, fold, and ship the result as a CLI, the Turtle pattern; and (P5) agent-tests-agent, in which a general-purpose agent drives it through behavioural scenarios, a complement to classical testing, not a replacement. The working loop is P3 to P4 to P5 and back, and one corollary falls out for free: multi-agent orchestration is just CLI composition. The methodology is framework-free by construction. It was distilled from the AAC, a custom agent for the open-source LAMB platform, built in about ten days by one developer with an AI pair-programmer and now in production. We present it as a transferable practice, independent of any language or framework.

2606.11863 2026-06-11 cs.SE New submission

Enhancing LLM-Based Code Translation with Verified Multi-Semantic Representations

Yufu Wang, He Jiang, Hao Lin, Peiyu Zou, Ang Jia, Xiaochen Li, Zhilei Ren

AI summary: Proposes Multisage, a framework that extracts and verifies multiple semantic representations (data-flow graphs, type constraints, and more) to make LLM code translation more accurate and reliable, improving translation success rates on HumanEval-X by up to 2.22 times.

Abstract

Large language models (LLMs) have shown great promise for automated code translation, yet existing approaches often rely on token-level statistical patterns rather than sufficient understanding of program semantics. As a result, translated programs may still contain logical and semantic errors. Although high-quality semantic guidance, such as functional descriptions and test cases, can help mitigate these errors, such resources are often unavailable in real-world scenarios. This raises two key challenges: how to construct rich semantic information directly from source code, and how to ensure that such semantics are accurate and reliable enough to guide translation. To address these challenges, we propose Multisage, a multi-semantic augmentation and self-calibration framework for LLM-based code translation. Multisage consists of three modules. First, a semantic representation parsing module extracts structured base semantics from source code, including data-flow graphs, type constraints, and external API information. Second, a multi-semantic augmentation module builds on these representations to generate diverse augmented semantics, including code summaries, function-level test cases, and API-oriented descriptions and tests. Third, a semantic consistency calibration module uses semantics-preserving mutations and cross-semantic consistency verification to filter, calibrate, and refine the generated semantics. Experiments on the HumanEval-X code translation benchmark show that Multisage improves translation success rates by up to 2.22 times across diverse backbone models. It consistently outperforms vanilla prompting, instruction-tuned LLMs, and Chain-of-Thought reasoning, with the largest gains observed on smaller models. These results demonstrate that explicit semantic augmentation can substantially improve the reliability of LLM-based code translation.

2606.11834 2026-06-11 cs.SE New submission

How Requirements Quality Makes (or Breaks) Traceability Link Recovery

Tobias Hey, Julian Frattini

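AI summary: Studies how requirements quality defects affect automated traceability link recovery (TLR). After annotating 28 defect types in 189 use case descriptions and running five TLR approaches, it finds that some defects harm performance while others help, and that different approaches respond differently.

Comments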
to be published in "2026 IEEE 34th International Requirements Engineering Conference (RE)"
Abstract

Traceability information between requirements and source code greatly benefits the maintenance of a software system. Since manually establishing trace links is cumbersome and error-prone, previous research explored automated traceability link recovery (TLR) approaches to support this task. However, quality defects in requirements impact subsequent activities such as TLR, yet evidence about this remains scarce. Our objective is to contribute empirical evidence on this impact. At the same time, we aim to understand how the performance of TLR approaches varies given these quality defects. To this end, we annotated 28 types of quality defect in 189 use case descriptions from two datasets. Then, we executed five distinct TLR approaches on the dataset and measured their performance in recovering trace links. Finally, we performed statistical tests to quantify the defects' effect strength on this performance. Our results show that some quality defects harm TLR performance, e.g., sentences that do not start with noun phrases, while others actually benefit performance, e.g., use cases that include implementation details. Moreover, different types of approaches respond differently to these defects. As a consequence, the performance-optimizing choice of a TLR approach depends on the quality of the dataset.

2606.11817 2026-06-11 cs.CR cs.AI cs.CL cs.SE New submission

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Yitong Zhang, Shiteng Lu, Jia Li

AI summary: Shows that grammar-constrained decoding (GCD) can be exploited for a jailbreak attack, CodeSpear, that makes LLMs generate malicious code, and proposes CodeShield, a safety alignment approach that defends against the attack by generating honeypot code.

Abstract

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
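
For readers unfamiliar with grammar-constrained decoding, the toy sketch below shows the general mechanism of a single constrained decoding step: tokens the grammar does not permit are excluded before the next token is chosen. It is a generic illustration with made-up scores and a hypothetical function name, not CodeSpear or CodeShield. One plausible reading of the abstract, consistent with its remark that refusals are preserved "when natural language is available", is that a code-only grammar leaves no natural-language token to choose from.

```python
def constrained_greedy_step(logits, allowed):
    """Pick the highest-scoring token among those the grammar allows at this step."""
    candidates = {tok: score for tok, score in logits.items() if tok in allowed}
    if not candidates:
        raise ValueError("grammar allows no token in the vocabulary")
    return max(candidates, key=candidates.get)


# Made-up next-token scores for a three-token toy vocabulary.
logits = {"Sorry": 5.0, "def": 2.0, "import": 1.5}

unconstrained = max(logits, key=logits.get)                  # free choice over the vocabulary
constrained = constrained_greedy_step(logits, {"def", "import"})  # code-only grammar
```

In the unconstrained case the top-scoring token wins; under the code-only grammar the choice is restricted to tokens that can start a program, regardless of how the model scored the alternatives.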

2606.11815 2026-06-11 cs.SE New submission

Understanding and Detecting Scalability Faults in Large-Scale Distributed Systems

Hao-Nan Zhu, Goodness Ayinmode, Cesar A. Stuardo, Haryadi S. Gunawi, Cindy Rubio-González

AI summary: The first systematic study of scalability faults in large-scale distributed systems. It finds that most faults arise from the synergy between dimensional code fragments and anti-patterns, and proposes ScaleLens, which combines dynamic and static analysis to detect such faults and outperforms the baseline.

Abstract

Scalable distributed systems form the backbone of modern computing infrastructure. However, as scale grows, system complexity may lead to scalability faults. Scalability faults are challenging to uncover and diagnose, as they are often latent and only manifest at large-scale deployment. In this paper, we present the first comprehensive study on scalability faults and propose an approach for their detection. First, we systematically investigate 444 scalability issue reports from 10 large-scale distributed systems to understand the common anti-patterns and root causes of scalability faults. We found that the majority of these faults are caused by the synergy between dimensional code fragments and anti-patterns associated with them. Second, based on our findings, we design and implement ScaleLens, a novel approach to detect scalability faults. ScaleLens combines dynamic and static analyses to pinpoint dimensional code fragments and match them with anti-patterns. Our evaluation shows that ScaleLens detects 4.2x more dimensional code fragments associated with known scalability faults compared to the baseline. On the latest stable versions of Cassandra, HDFS, and Ignite, ScaleLens detects 334 dimensional code fragments with confirmed problematic behavior.

2606.11755 2026-06-11 cs.SE New submission

Acoda: Adversarial Code Obfuscation for Defending against LLM-based Analysis

Hongzhou Rao, Zikan Dong, Yanjie Zhao, Haodong Li, Haoyu Wang

AI summary: Proposes Acoda, a genetic algorithm-based adversarial code obfuscation framework that iteratively optimizes 8 semantics-preserving obfuscation methods to induce LLMs to refuse or misinterpret code analysis, with an attack success rate of up to 70% on 7 state-of-the-art LLMs.

Abstract

With the widespread adoption of Large Language Models (LLMs) in software engineering (SE) tasks such as code understanding, debugging, and vulnerability detection, their powerful semantic reasoning ability has also introduced new security and privacy risks. LLMs can analyze, reconstruct, or even reverse-engineer source code logic, potentially leading to the leakage of intellectual property. To address this issue, we propose Acoda, a genetic algorithm-based adversarial code obfuscation framework that defends against LLM-based code analysis. Acoda leverages two key mechanisms of LLMs, namely safety alignment and token-based information processing, to design 8 semantics-preserving obfuscation methods. It iteratively optimizes obfuscation strategies through a genetic algorithm to generate adversarial samples that maximize defensive effectiveness. In addition, we propose a quantitative evaluation framework based on LLM responses, which combines an auxiliary LLM and four evaluation metrics to assess how target LLMs analyze obfuscated code comprehensively. Experimental results show that Acoda can effectively induce LLMs to refuse or misinterpret code analysis. On 7 state-of-the-art LLMs, including GPT-4o, DeepSeek, Qwen, Llama, and Gemma, Acoda achieves an attack success rate (ASR) of up to 70%, with strong cross-model transferability and minimal runtime overhead, while ensuring that the semantics of the original code remain unchanged. Overall, this study provides a new perspective for code protection and LLM security defense in the era of LLMs.

2606.11543 2026-06-11 cs.AI cs.SE New submission

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang

Affiliations: Tongji University; Shanghai Innovation Institute; Sun Yat-sen University; Shanghai Jiao Tong University

AI summary: Proposes the SkillJuror framework and compares Progressive Disclosure against a normalized flat baseline. Skill organization changes how agents search and apply procedural knowledge, and across 82 tasks Progressive Disclosure raises the verifier pass rate by 4.1 percentage points (17 additional passing trials out of 410).

Abstract

Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at this https URL.

2606.11476 2026-06-11 cs.SE New submission

SentTrack: Sentiment-Driven Bottleneck Detection in GitHub Issue Repositories

Xinyu Hu, Ali Behbahani, Daniel Moon, Yaren Dogan, Nasir U. Eisty

AI summary: Proposes SentTrack, a dual-pipeline framework that combines LLM summarization with ABCDE interaction classification to detect socio-technical bottlenecks in approximately 9,000 GitHub issue threads. It finds that 49% of threads stagnate and only 13% reach resolution, and prioritizes high-friction discussions with a weighted scoring engine.

Abstract

Software engineering teams increasingly depend on GitHub issue threads to coordinate work, report bugs, and negotiate technical decisions, yet most repository health tools focus on code metrics and ignore the conversational dynamics that drive or stall development. This paper presents SentTrack, a dual-lens framework for detecting socio-technical bottlenecks from GitHub issue discussions. Applied to the AvaloniaUI open-source repository across approximately 9,000 issue threads, the framework addresses three questions: how to automate workflow-inefficiency detection from real-time conversational data, whether sentiment signals can surface risk earlier than traditional label-based methods, and how to isolate human narrative from machine-generated noise in mixed-media issue text. SentTrack combines two complementary pipelines. A horizontal pipeline translates raw issue reports into clean summaries using a large language model, extracts mid-level concern phrases, and clusters them through UMAP and HDBSCAN, producing 613 semantic clusters from the first 3,608 issues processed. A vertical pipeline applies the ABCDE collaborative interaction framework to classify each comment and infer thread-level outcomes. Across the full corpus, 49% of threads ended in stagnation and only 13% reached resolution, with the resolution gap identified as the dominant bottleneck signal. A weighted scoring engine that combines negativity, stagnation, resolution gap, and thread length gives maintainers an interpretable prioritization tool for high-friction discussions before they stall development.
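
A weighted scoring engine over the four signals named above could look roughly like the sketch below. The weights, the length cap, the signal encodings, and all names (`ThreadSignals`, `friction_score`, `prioritize`) are hypothetical placeholders chosen for illustration; the abstract does not give SentTrack's actual weights or normalization.

```python
from dataclasses import dataclass


@dataclass
class ThreadSignals:
    negativity: float      # share of comments classified as negative, in [0, 1]
    stagnation: float      # 1.0 if the thread ended in stagnation, else 0.0
    resolution_gap: float  # 1.0 if no resolution was reached, else 0.0
    length: int            # number of comments in the thread


# Hypothetical weights that sum to 1 so the score stays in [0, 1].
WEIGHTS = {"negativity": 0.30, "stagnation": 0.25, "resolution_gap": 0.30, "length": 0.15}
LENGTH_CAP = 50  # hypothetical: threads longer than this count as maximally long


def friction_score(t):
    """Weighted sum of the four signals, with thread length capped and scaled to [0, 1]."""
    length_norm = min(t.length, LENGTH_CAP) / LENGTH_CAP
    return (
        WEIGHTS["negativity"] * t.negativity
        + WEIGHTS["stagnation"] * t.stagnation
        + WEIGHTS["resolution_gap"] * t.resolution_gap
        + WEIGHTS["length"] * length_norm
    )


def prioritize(threads):
    """Sort (thread_id, ThreadSignals) pairs from highest to lowest friction."""
    return sorted(threads, key=lambda item: friction_score(item[1]), reverse=True)
```

Keeping the score a plain weighted sum is what makes it interpretable: a maintainer can read off which signal contributed most to a thread's rank.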

2606.11462 2026-06-11 cs.SE 新提交

Defeater Cards: Characterizing and Managing Safety Assurance Case Defeaters

Defeater Cards: 表征和管理安全保证案例的击败者

Usman Gohar, Michael C. Hunter, Salil Purandare, Jordan J. Rios, Myra B. Cohen, Robyn R. Lutz

AI总结 提出Defeater Cards结构化文档,基于5W1H框架系统表征、推理和管理安全案例中的击败者,通过跨领域案例研究验证其暴露隐藏假设、发现推理缺口和支持持续演化的能力。

详情
AI中文摘要

安全保证案例提供结构化的论证,证明安全关键系统满足其安全要求。最近,击败者的概念作为挑战安全论证有效性的严格手段出现。击败者的例子可能包括过于严格的主张、不可靠的证据或推理缺口。然而,击败者仍然是临时的,缺乏对批判性反思的结构化支持,描述不一致,难以审查,并且缺乏文档标准。为了解决这个问题,我们提出了Defeater Cards,一种新的结构化文档工件,用于系统地表征、推理和管理安全案例中的击败者。通过文献调查和主题分析,我们基于5W1H框架确定了为卡片结构提供信息的文档标准。Defeater Cards旨在支持有根据的分析和演化,提高可追溯性和可审计性,并实现跨系统和产品变体的击败者知识重用。我们通过两个跨领域案例研究展示了它们的适用性,展示了它们如何暴露隐藏的假设、揭示推理缺口并支持持续的安全保证案例演化。为了支持采用和社区重用,我们还发布了一个开源击败者卡片库,作为研究人员和从业者可以构建和描述经验教训的基线。

英文摘要

Safety assurance cases provide structured justifications that safety-critical systems meet their safety requirements. Recently, the notion of defeaters has emerged as a rigorous means of challenging the validity of safety arguments. Examples of defeaters might include overly strict claims, unreliable evidence, or reasoning gaps. However, defeaters remain ad hoc, lack structured support for critical reflection, are inconsistently described, are difficult to review, and lack documentation standards. To address this, we propose Defeater Cards, a new structured documentation artifact for systematically characterizing, reasoning about, and managing defeaters in safety cases. Drawing on a literature survey and thematic analysis, we identify documentation criteria that inform the card's structure, based on the 5W1H framework. Defeater Cards are designed to support informed analysis and evolution, improve traceability and auditability, and enable the reuse of defeater knowledge across systems and product variants. We demonstrate their applicability through two cross-domain case studies, showing how they expose hidden assumptions, surface reasoning gaps, and support ongoing safety assurance case evolution. To support adoption and community reuse, we also release an open-source repository of defeater cards as a baseline upon which researchers and practitioners can build and describe lessons learned.

2606.11442 2026-06-11 cs.SE cs.PL 新提交

Web-Native Graphical EMF Model Editors

Web原生图形化EMF模型编辑器

Susanne Göbel, Ralf Lämmel

AI总结 提出纯Web框架EMFular,基于Ecore模型自动生成图形编辑器,支持EMF一致性操作与Angular扩展,实现低代码生成、高可定制与无后端部署。

详情
AI中文摘要

图形化模型编辑正从桌面应用转向基于Web的工具。我们分析了现有框架的特点,并基于此推导出一组设计原则,这些原则意味着低成本的生成、广泛的定制可能性以及所生成编辑器的直接部署。在此基础上,我们引入了EMFular,一个纯基于Web的框架,用于管理EMF模型而无需任何后端。配套的EMFular生成器将给定的Ecore模型(一个EMF元模型)映射为即用型且可定制的图形编辑器。EMFular编辑器提供“EMF一致性”,即它们不仅支持标准建模操作,如创建、检查、导航、编辑和撤销/重做,而且还以与EMF紧密对齐的方式处理包含和反向引用;它们还通过与EMF兼容的序列化/反序列化提供与现有EMF工具的互操作性。生成的编辑器是一个Angular项目,具有指定的扩展点,允许开发人员利用Angular及其生态系统的表达能力,在EMFular扩展点的指导下,定制和扩展编辑器的所有方面。我们从编辑器充分性(可用的编辑能力)、适应性(定制机制和所需工作量)以及生成的鲁棒性三个方面评估了EMFular。

英文摘要

Graphical model editing is shifting from desktop applications to web-based tools. We analyze the characteristics of existing frameworks and, based on this analysis, we derive a set of design principles that imply low-effort generation, extensive customization possibilities, and straightforward deployment of the resulting editors. On these grounds, we introduce EMFular, a purely web-based framework for managing EMF models without any backend. The accompanying EMFular generator maps a given Ecore model (an EMF metamodel) to a ready-to-use and ready-to-customize graphical editor. EMFular editors provide 'EMF consistency', that is, they not only support standard modeling operations such as creation, inspection, navigation, editing, and undo/redo, but they also handle containment and inverse references in close alignment with EMF; they also provide interoperability with existing EMF tooling through compatible de-/serialization. A generated editor is an Angular project with designated extension points, which allows developers to customize and extend all aspects of the editor using the expressive power of Angular and its ecosystem, guided by the extension points of EMFular. We evaluate EMFular in terms of editor adequacy (available editing capabilities), adaptability (customization mechanisms and required effort), and robustness of the generation.
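
The "EMF consistency" requirement for containment and inverse references can be illustrated with a small sketch. It assumes the usual EMF semantics (an element has at most one container, and the inverse reference is kept in sync automatically); the `Node` class is hypothetical and is not EMFular's API, which is an Angular/TypeScript framework:

```python
class Node:
    """Model element with a containment reference and its inverse."""

    def __init__(self, name):
        self.name = name
        self.parent = None   # inverse of `children`
        self.children = []   # containment reference

    def add_child(self, child):
        # An element has at most one container: detach it from the old one first.
        if child.parent is not None:
            child.parent.children.remove(child)
        # Update both directions together so they never disagree.
        child.parent = self
        self.children.append(child)
```

Moving an element to a new container therefore removes it from the previous one, which is the behavior an editor has to reproduce to stay compatible with existing EMF tooling. Cycle detection is omitted from this sketch.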

2606.11356 2026-06-11 physics.ao-ph cs.DC cs.SE physics.comp-ph 新提交

An Ocean Model Ported by a Large Language Model: Experience and Lessons from FESOM2 (Fortran to C to C++/Kokkos)

大型语言模型移植海洋模型:FESOM2(Fortran到C再到C++/Kokkos)的经验与教训

Nikolay V. Koldunov, Suvarchal K. Cheedela, Sergey Danilov, Dmitry Sidorenko, Sebastian Beyer, Thomas Jung

AI总结 本文展示利用LLM将FESOM2海洋模型从Fortran移植到C再到C++/Kokkos,通过两阶段翻译、严格字面转换和逐级验证,在数周内保持物理准确性并实现GPU加速。

详情
AI中文摘要

大型语言模型(LLM)能够翻译和修改源代码,并且已被证明可以对不同复杂度的代码进行此类操作。然而,它们是否能够将完整的、生产级的地球物理模型移植到另一种语言而不降低其物理保真度,尚未得到证实。我们证明,LLM辅助的代码翻译可以在将完整的生产级海洋模型迁移到现代性能可移植形式的同时,保持其物理特性。我们报告了在领域专家指导下,使用代理式LLM编码助手将FESOM2非结构化网格海洋-海冰模型(约74000行核心Fortran代码)首先移植到C,然后移植到C++/Kokkos以实现跨CPU和GPU的性能可移植性的经验。我们描述了被证明必要的实践、哪些有效、哪些无效,以及我们遇到的失败模式。三个实践最为重要:分两阶段翻译,将重现数值计算(Fortran到干净的C参考实现)与引入并行性(C到Kokkos)分开;要求严格字面翻译,不允许助手“改进”源代码;以及根据适合的验收标准对每个阶段进行验证。C移植版本在五年长期模拟统计水平上重现了原始Fortran结果。Kokkos版本在CPU上与C参考实现逐位一致,在GPU上多年运行统计上接近。在涡旋丰富网格上,高达740万个表面顶点,单个A100 GPU节点比CPU节点快1.6-3.7倍,达到生产集成所需的每天1-2模拟年。结果不仅仅是一个GPU移植:通过遵循清晰的验证程序,LLM在数周内将完整的Fortran海洋模型迁移到另一种语言并移植到加速器上,同时保持了其物理特性。

英文摘要

Large language models (LLMs) can translate and modify source code, and have been shown to do so for codes of different complexity. Whether they can port a complete, production geophysical model to a different language without degrading its physics has not been established. We demonstrate that LLM-assisted code translation can preserve the physics of a complete production ocean model while moving it into a modern performance-portable form. We report our experience using an agentic LLM coding assistant, directed by domain experts, to port the FESOM2 unstructured mesh ocean–sea-ice model (about 74,000 lines of core Fortran) first to C and then to C++/Kokkos for performance portability across CPUs and GPUs. We describe the practices that proved necessary, what worked and what did not, and the failure modes that we encountered. Three practices mattered most: translating in two stages that separate reproducing the numerics (Fortran to a clean C reference) from introducing parallelism (C to Kokkos); requiring a strictly literal translation in which the assistant was not permitted to "improve" the source; and validating each stage against an acceptance criterion suited to it. The C port reproduces the original Fortran at the level of long-term simulation statistics over five years. The Kokkos port is bit-for-bit identical to the C reference on CPU and statistically close on GPU over multi-year runs. On eddy-rich meshes up to 7.4 million surface vertices, a single A100 GPU node runs 1.6 to 3.7 times faster than a CPU node, reaching the 1-2 simulated-years-per-day required for production integrations. The result is more than a single GPU port: by following a clear validation procedure, an LLM moved a full Fortran ocean model into another language and onto accelerators while preserving its physics in a matter of weeks.
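
The two kinds of acceptance criterion mentioned in the abstract (bit-for-bit identity and statistical closeness) differ in strictness. A minimal sketch of both checks on a flat list of field values; the function names, the choice of the mean as the statistic, and the tolerance are assumptions, not the authors' validation procedure:

```python
import math
import struct
from statistics import fmean


def bit_identical(a: list[float], b: list[float]) -> bool:
    """True if both fields have identical IEEE-754 double bit patterns."""
    if len(a) != len(b):
        return False
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)


def statistically_close(a: list[float], b: list[float], rel_tol: float = 1e-3) -> bool:
    """True if the field means agree within rel_tol (assumed metric and tolerance)."""
    return math.isclose(fmean(a), fmean(b), rel_tol=rel_tol)
```

A tiny round-off difference fails the first check but passes the second, which is why a stage that reorders floating-point operations (for example a parallel GPU reduction) needs the statistical criterion rather than the bitwise one.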

2606.05608 2026-06-11 cs.SE cs.AI 版本更新

Agentic Software: How AI Agents Are Restructuring the Software Paradigm

代理式软件:AI代理如何重构软件范式

Zhenfeng Cao

AI总结 本文通过第一性原理分析,论证了以LLM为推理引擎的AI代理系统正在根本性地重构软件范式,从传统软件(代码承载决策逻辑)转向代理系统(代码作为临时工具),并提出了代理工程作为新兴学科。

详情
Comments
15 pages, 2 figures, and 3 tables
AI中文摘要

半个多世纪以来,软件工程一直基于一个基本前提:人类工程师分解问题,将决策逻辑编码为静态代码,并随着需求演变手动调整代码。本文认为,AI代理——即大型语言模型作为主要推理引擎、动态生成和丢弃代码作为工具资源的系统——的出现并非渐进式改进,而是对软件范式的根本性重构。基于复杂性缩放的第一性原理分析,我们形式化了传统软件(代码是决策逻辑的载体)与代理系统(代码是LLM驱动推理循环的临时工具)之间的区别。我们追溯了从许可软件到SaaS再到我们所谓的代理即服务(AaaS)的历史轨迹,表明每次转变都将额外的复杂性从最终用户转移出去。我们引入了代理工程作为一门新兴学科——其核心研究对象、控制模型和人类角色均不同于软件工程。通过分析最近的基准证据,包括SWE-bench Verified、EvoClaw和LangChain的多代理协调研究,我们展示了代理范式的变革潜力及其当前局限性。最后,我们提出了一个迈向自我进化代理生态系统的四阶段路线图,并为应对这一转变的从业者提供了具体建议。

英文摘要

For over half a century, software engineering has operated on a foundational premise: human engineers decompose problems, encode decision logic into static code, and manually adapt that code as requirements evolve. This paper argues that the emergence of AI agents -- systems where large language models serve as the primary reasoning engine, dynamically generating and discarding code as an instrumental resource -- constitutes a fundamental restructuring of what software is, not an incremental tool improvement. We formalize the distinction between traditional deterministic software and agentic software: in the former, code is the carrier of pre-written decision logic; in the latter, the agent itself is the software, and its decision logic is generated at runtime. We trace the historical arc from licensed software to SaaS to Agent-as-a-Service (AaaS), showing that each shift transferred additional complexity away from end-users -- with the agentic shift transferring not just operational complexity but decision-making complexity itself. We introduce Agentic Engineering as an expansion of the software engineering discipline into a new paradigm, distinct in its core object of study (agent systems rather than static source code), its control model (LLM-driven rather than human-predefined), and its human role (intent architect rather than code author). Through analysis of recent benchmark evidence including SWE-bench Verified, EvoClaw, and LangChain's multi-agent coordination studies, we demonstrate both the transformative potential of the agentic paradigm and its current limitations. We conclude with a four-stage roadmap toward self-evolving agent ecosystems and concrete recommendations for practitioners navigating this transition.

2605.14084 2026-06-11 cs.SE cs.AI cs.CL 版本更新

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

CRANE:通过空域编辑实现代码代理的约束推理注入

Mingzhi Zhu, Michele Merler, Raju Pavuluri, Stacy Patterson

AI总结 CRANE通过空域编辑技术,结合推理和工具使用能力,提升代码代理性能,在多个基准测试中取得显著成果。

详情
AI中文摘要

代码代理必须同时对长周期的仓库状态进行推理并遵守严格的工具使用协议。在配对的Instruct/Thinking检查点中,这些能力是互补但不一致的。Instruct模型简洁且工具纪律性强,而Thinking模型提供更强的规划和恢复行为,但往往过度深思并降低代理性能。我们提出CRANE(通过空域编辑实现代码代理的约束推理注入),一种无需训练的参数编辑方法,将Thinking-Instruct的delta视为Instruct骨干的候选推理编辑方向池。CRANE结合幅度阈值去噪delta,保守的泰勒门来保留对推理转移和工具使用保留共同有益的编辑,以及渐进的Sigmoid投影来抑制格式关键的更新方向。通过合并配对的Instruct和Thinking检查点,CRANE在单独模型上取得显著优势的同时保持Instruct级别的效率:在Roo-Eval上,它实现了Qwen3-30B-A3B的pass1为66.2%(+19.5%)和Qwen3-Next-80B-A3B的81.5%(+8.7%);在SWE-bench-Verified上,它在两个规模(122/500和180/500)上解决了多达14个额外的实例;在Terminal-Bench v2上,它提高了pass1/pass5高达2.3%/7.8%,分别达到7.6%/17.9%和14.8%/30.3%,在所有三个基准测试中一致超越了其他合并策略。

英文摘要

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.
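
Of the three CRANE components, magnitude thresholding is the simplest to illustrate. A minimal sketch on flat weight lists, assuming the threshold is chosen by keeping a fixed top fraction of delta entries by absolute value; the abstract does not say how the threshold is set, `threshold_delta` and `keep_fraction` are hypothetical, and the Conservative Taylor Gate and Graduated Sigmoidal Projection are not shown:

```python
def threshold_delta(instruct, thinking, keep_fraction=0.2):
    """Return instruct + denoised (thinking - instruct).

    Only the largest-magnitude entries of the delta are kept; the rest are
    treated as noise and zeroed. Ties at the cutoff are all kept.
    """
    delta = [t - i for i, t in zip(instruct, thinking)]
    if not delta:
        return []
    k = max(1, int(len(delta) * keep_fraction))
    cutoff = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    denoised = [d if abs(d) >= cutoff else 0.0 for d in delta]
    return [i + d for i, d in zip(instruct, denoised)]
```

The merged weights stay identical to the Instruct checkpoint wherever the delta is small, which is consistent with the stated goal of preserving tool-use behavior while injecting only the strongest reasoning edits.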

2605.09059 2026-06-11 cs.SE 版本更新

Evaluating LLM-Generated Code: A Benchmark and Developer Study

评估大语言模型生成的代码:一个基准和开发者研究

Joanna Szych, Anne Schwerk

AI总结 本文提出了一种定制的三重评估方法,用于评估大语言模型生成的代码,弥补了现有基准在代码质量和可用性方面的不足,并通过对比三个通用大语言模型展示了其有效性。

详情
Comments
Accepted for publication at EASE '26/EQUISA workshop
AI中文摘要

代码生成是大型语言模型广泛应用和高度成功的任务之一。鉴于其受欢迎程度,有许多专门针对代码生成的基准可以帮助选择最佳模型。然而,这些基准主要关注解决方案的正确性,忽略了其他方面,如代码质量和可用性。本文旨在描述一种定制的三重评估方法,用于评估由大型语言模型生成的代码,以弥补这一差距。该方法包括基于复杂多级计算机科学项目的专用正确性基准、代码质量验证以及通过结构化代码审查过程收集的开发者对生成代码样本意见的调查。所提出的方法的使用和有效性通过评估和比较三个通用大型语言模型:GPT-4.1、DeepSeek-V3-0324和Claude Opus 4来展示。结果表明,通过开发者收集的审查可以得出许多新的发现,特别是与代码处于生产就绪状态相关的发现,这些在使用标准正确性导向基准方法时是无法获得的。

英文摘要

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model. However, they primarily focus on measuring solution correctness, leaving aside other aspects such as code quality and usability. This paper describes a custom three-fold evaluation methodology for code generated by Large Language Models that bridges this gap. The methodology includes a dedicated correctness benchmark based on a complex multi-level computer science project, code quality verification, and a survey of developers' opinions on generated code samples gathered through a structured code-review process. The proposed methodology's usage and usefulness are demonstrated by evaluating and comparing three general-purpose Large Language Models: GPT-4.1, DeepSeek-V3-0324, and Claude Opus 4. The results show that reviews gathered from developers can yield many new findings, especially those related to the code being in a production-ready state, that would not be possible to obtain using the standard correctness-focused benchmark approach.

2604.25363 2026-06-11 cs.SE 版本更新

Commit-Aware Learning-Based Test Case Prioritization for Continuous Integration

面向持续集成的提交感知的基于学习的测试用例优先级排序

Lorenzo Abbondante, Gerardo Canfora

AI总结 提出一种结合代码变更结构、测试覆盖和历史执行的提交感知学习模型,用于持续集成中的测试优先级排序,显著提升回归缺陷检测效果。

详情
AI中文摘要

持续集成(CI)流水线中的回归测试因测试套件规模和执行频率的增长而成本日益高昂。测试用例优先级排序(TCP)通过重新排序测试以更早暴露故障来缓解此问题。然而,现有技术大多依赖历史执行数据和覆盖度量,忽略了代码变更中包含的丰富结构信息。本文提出一种提交感知的、基于学习的TCP方法,将版本控制差异的结构属性、测试覆盖关系和历史执行行为结合到一个统一的预测模型中。给定一个新提交,该方法估计每个测试套件至少揭示一个失败的概率,并据此对测试执行进行优先级排序。我们在五个Defects4J项目上使用留一项目交叉项目验证设置评估了该方法。结果表明,提交感知的TCP在分类和优先级排序有效性上均显著优于非提交感知的基线。我们的发现表明,包含提交结构语义能显著增强回归故障检测,并在CI环境中实现鲁棒、可泛化的基于学习的TCP。

英文摘要

Regression testing in Continuous Integration (CI) pipelines is increasingly costly due to the growing size and execution frequency of test suites. Test Case Prioritization (TCP) mitigates this problem by reordering tests to expose faults earlier. However, most existing techniques rely primarily on historical execution data and coverage metrics, neglecting the rich structural information contained in code changes. This paper proposes a commit-aware, learning-based TCP method that combines structural properties of version-control diffs, test coverage relations, and historical execution behavior into a unified predictive model. Given a new commit, the method estimates the probability that each test suite will reveal at least one failure and prioritizes test execution accordingly. We evaluate our method on five Defects4J projects using a leave-one-project-out cross-project validation setting. Results show that commit-aware TCP significantly outperforms non-commit-aware baselines in both classification and prioritization effectiveness. Our findings show that including commit structural semantics substantially enhances regression fault detection and enables robust, generalizable learning-based TCP in CI environments.
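
The prioritization step, given per-suite failure probabilities, reduces to a sort. A minimal sketch that takes the probabilities as input; the predictive model itself is out of scope here, and `rank_suites`, `first_failure_rank`, and the suite names are hypothetical:

```python
def rank_suites(failure_prob: dict[str, float]) -> list[str]:
    """Order test suites by predicted failure probability, highest first.

    Ties are broken by suite name so the order is deterministic.
    """
    return sorted(failure_prob, key=lambda s: (-failure_prob[s], s))


def first_failure_rank(order: list[str], failing: set[str]):
    """1-based position of the first failing suite, or None if none fails."""
    for rank, suite in enumerate(order, start=1):
        if suite in failing:
            return rank
    return None


# Example: probabilities as a model might emit them for one commit.
probs = {"ParserTest": 0.12, "IOTest": 0.71, "MathTest": 0.40}
order = rank_suites(probs)
```

The position of the first failing suite is one simple way to compare orderings; the abstract does not state which prioritization metrics the paper reports.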

2603.27249 2026-06-11 cs.SE 版本更新

"An Endless Stream of AI Slop": How Developers Discuss the Burden of AI-Assisted Software Development

无尽的AI垃圾洪流:开发者如何讨论AI辅助软件开发的负担

Sebastian Baltes, Marc Cheong, Christoph Treude

AI总结 通过定性分析Reddit和Hacker News帖子,揭示开发者对AI生成低质量内容(AI slop)的感知与应对,归纳为审查摩擦、质量退化及系统诱因三大主题,指出其公地悲剧性质。

详情
Comments
7 pages, 2 figures, 1 table
AI中文摘要

“AI slop”,即低质量的AI生成内容,正日益影响软件开发,从生成的代码和拉取请求到文档和错误报告。然而,关于开发者如何感知和应对这一现象的经验性研究仍然有限。我们定性分析了开发者在1,154条Reddit和Hacker News帖子中如何讨论AI slop,开发了一个包含15个编码的编码本,这些编码组织成三个主题簇:审查摩擦(AI slop如何加重审查者负担、侵蚀信任并促使对策)、质量退化(对代码库、知识资源和开发者能力的损害)以及诱因与后果(系统性激励、强制采用、手艺侵蚀和劳动力破坏)。我们的发现将AI slop框架化为一种公地悲剧,其中个体生产力提升将成本外部化给审查者、维护者和更广泛的社区。我们报告了开发者提出的担忧以及他们建议的缓解策略,对工具开发者、团队领导和教育者具有启示意义。

英文摘要

"AI slop", that is, low-quality AI-generated content, is increasingly affecting software development, from generated code and pull requests to documentation and bug reports. However, there is limited empirical research on how developers perceive and respond to this phenomenon. We qualitatively analyzed how developers discuss AI slop in 1,154 Reddit and Hacker News posts, developing a codebook of 15 codes organized into three thematic clusters: Review Friction (how AI slop burdens reviewers, erodes trust, and prompts countermeasures), Quality Degradation (damage to codebases, knowledge resources, and developer competence), and Forces and Consequences (systemic incentives, mandated adoption, craft erosion, and workforce disruption). Our findings frame AI slop as a tragedy of the commons, where individual productivity gains externalize costs onto reviewers, maintainers, and the broader community. We report the concerns developers raise and the mitigation strategies they propose, with implications for tool developers, team leads, and educators.

2602.19718 2026-06-11 cs.SE cs.AI 版本更新

Carbon-Aware Governance Gates: An Architecture for Sustainable GenAI Development

碳感知治理门:可持续生成式AI开发的架构

Mateen A. Abbasi, Tommi J. Mikkonen, Petri J. Ihantola, Muhammad Waseem, Pekka Abrahamsson, Niko K. Mäkitalo

AI总结 针对生成式AI在软件开发中增加碳足迹的问题,提出碳感知治理门架构,通过嵌入碳预算、能源溯源和可持续验证编排来降低环境影响。

详情
Comments
5 pages, 1 figure. Preprint version under review
AI中文摘要

生成式AI在软件开发生命周期中的快速普及增加了计算需求,这可能提高开发活动的碳足迹。同时,组织越来越多地将治理机制嵌入到生成式AI辅助开发中,以支持信任、透明度和问责制。然而,这些治理机制引入了额外的计算负载,包括重复推理、再生循环和扩展的验证管道,增加了能源使用和生成式AI辅助开发的碳足迹。本文提出碳感知治理门(CAGG),一种架构扩展,将碳预算、能源溯源和可持续感知验证编排嵌入到人机治理层中。CAGG包含三个组件:(i)能源和碳溯源账本,(ii)碳预算管理器,以及(iii)绿色验证编排器,通过治理策略和可重用设计模式实现。

英文摘要

The rapid adoption of Generative AI (GenAI) in the software development life cycle (SDLC) increases computational demand, which can raise the carbon footprint of development activities. At the same time, organizations are increasingly embedding governance mechanisms into GenAI-assisted development to support trust, transparency, and accountability. However, these governance mechanisms introduce additional computational workloads, including repeated inference, regeneration cycles, and expanded validation pipelines, increasing energy use and the carbon footprint of GenAI-assisted development. This paper proposes Carbon-Aware Governance Gates (CAGG), an architectural extension that embeds carbon budgets, energy provenance, and sustainability-aware validation orchestration into human-AI governance layers. CAGG comprises three components: (i) an Energy and Carbon Provenance Ledger, (ii) a Carbon Budget Manager, and (iii) a Green Validation Orchestrator, operationalized through governance policies and reusable design patterns.
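
The interplay between the provenance ledger and the budget manager can be sketched as a gate that records each approved activity and refuses work that would exceed the budget. This is an illustration under assumptions: `CarbonBudgetGate`, its fields, and the gram CO2e unit are hypothetical and are not the CAGG design from the paper:

```python
from dataclasses import dataclass, field


@dataclass
class CarbonBudgetGate:
    budget_g: float  # carbon budget in grams CO2e (assumed unit)
    ledger: list[tuple[str, float]] = field(default_factory=list)

    @property
    def spent_g(self) -> float:
        """Total emissions recorded in the ledger so far."""
        return sum(grams for _, grams in self.ledger)

    def request(self, activity: str, estimated_g: float) -> bool:
        """Approve and record the activity if it fits the remaining budget."""
        if self.spent_g + estimated_g > self.budget_g:
            return False  # caller may defer, downscale, or escalate to a human
        self.ledger.append((activity, estimated_g))
        return True
```

A rejected request leaves the ledger untouched, so the record reflects only work that actually ran.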

2601.22025 2026-06-11 cs.CL cs.AI cs.IR cs.SE 版本更新

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

当通用提示改进有害:LLM应用的评估驱动迭代

Daniel Commey

AI总结 提出最小可行评估套件(MVES),通过结构化评估框架和本地复现实验,发现通用提示添加并非单调改进,强调评估驱动的提示迭代。

详情
Comments
Technical report. 42 pages, 3 figures. Code, test suites, and result logs: this https URL
AI中文摘要

评估大型语言模型(LLM)应用与传统软件测试不同,因为输出是概率性的、语义可变的,并且对提示和模型变化敏感。本技术报告提出了最小可行评估套件(MVES),一种面向审计的应用级LLM评估结构。MVES将应用类别与失败模式、指标、所需工件和验证证据联系起来,涵盖通用LLM应用、检索增强系统和智能体工作流。我们将该框架与可复现的本地评估工具配对,包括结构化提取、RAG引用/内容合规性和指令遵循检查。使用Ollama与Llama 3 8B Instruct和Qwen 2.5 7B Instruct,我们在扩展的每套30例消融实验中评估了五种提示条件。结果表明,在测试的本地条件下,通用提示添加不会产生单调改进:更强的输出合同提示提高了两种模型的严格提取,而RAG引用/内容合规性在某些通用规则条件下下降。观察到的最显著下降发生在Qwen 2.5上,当通用规则附加到用户提示时,RAG从26/30下降到9/30。这些发现支持评估驱动的提示迭代:提示更改应被视为潜在的回归风险,并在部署前针对特定任务套件进行测试。随附的存储库包含测试套件、提示变体、评估工具、原始结果日志和复现所报告本地消融所需的脚本。

英文摘要

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, while RAG citation/content-compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5 on RAG when generic rules are appended to the user prompt, from 26/30 to 9/30. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and scripts needed to reproduce the reported local ablations.
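
Treating a prompt change as a potential regression amounts to comparing suite pass counts before and after the change. A minimal sketch using the 26/30 to 9/30 figures quoted above; `is_regression` and the 5-point tolerance are assumptions and are not part of the MVES harness:

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of suite cases that passed."""
    return passed / total


def is_regression(baseline_passed: int, candidate_passed: int, total: int,
                  max_drop: float = 0.05) -> bool:
    """Flag a prompt change whose pass rate falls more than max_drop below baseline."""
    drop = pass_rate(baseline_passed, total) - pass_rate(candidate_passed, total)
    return drop > max_drop


# Qwen 2.5 on the RAG suite, per the abstract: 26/30 before, 9/30 after.
blocked = is_regression(26, 9, 30)
```

With only 30 cases per suite, small differences can be noise, so a tolerance below one or two cases' worth of pass rate would flag changes that are not meaningful.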

2601.04203 2026-06-11 cs.CL cs.CV cs.LG cs.SE 版本更新

FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

FronTalk: 以多模态反馈进行对话式代码生成的前端开发基准测试

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

AI总结 提出FronTalk基准,通过多轮对话和多模态反馈(文本与视觉指令)评估前端代码生成,发现模型存在遗忘和视觉反馈理解困难,提出AceCoder方法有效减少遗忘并提升性能。

详情
AI中文摘要

我们提出了FronTalk,一个前端代码生成基准,开创性地研究了一种独特的交互动态:具有多模态反馈的对话式代码生成。在前端开发中,草图、模型和带注释的截图等视觉工件对于传达设计意图至关重要,但它们在多轮代码生成中的作用仍未得到充分探索。为解决这一差距,我们聚焦于前端开发任务,整理了FronTalk,这是一个包含100个多轮对话的数据集,这些对话源自新闻、金融和艺术等不同领域的真实网站。每一轮都包含一个文本指令和一个等效的视觉指令,每个指令代表相同的用户意图。为全面评估模型性能,我们提出了一种新颖的基于智能体的评估框架,利用网络智能体模拟用户并探索网站,从而衡量功能正确性和用户体验。对20个模型的评估揭示了文献中系统性地未充分探索的两个关键挑战:(1)显著的遗忘问题,即模型覆盖先前实现的功能,导致任务失败;(2)解释视觉反馈的持续挑战,尤其是对于开源视觉语言模型(VLM)。我们提出了一个强大的基线来解决遗忘问题,即AceCoder,一种使用自主网络智能体批评每个过去指令实现的方法。这种方法将遗忘几乎减少到零,并将性能提升高达9.3%(从56.0%到65.3%)。总体而言,我们旨在为前端开发和多轮多模态代码生成的通用交互动态的未来研究提供坚实基础。代码和数据已在此https URL发布。

英文摘要

We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that uses a web agent to simulate users and explore the website, thereby measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that remain systematically under-explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL

2509.01459 2026-06-11 eess.SY cs.SE 版本更新

Semantic Technologies in Practical Demand Response: An Informational Requirement-based Roadmap

实际需求响应中的语义技术:基于信息需求的路线图

Ozan Baris Mulayim, Anand Krishnan Prakash, Yuvraj Agarwal, Mario Bergés, Marco Pritoni, Derek Supple, Steve Schaefer, Mitali Shah

AI总结 本文针对商业建筑激励型需求响应,通过形式化本体评估方法定义信息需求,评估现有本体(Brick、DELTA、EFOnt、CIM)的不足,并提出扩展与整合路线图以增强语义互操作性。

详情
Comments
Accepted at ACM eEnergy 2026. Not yet published/
AI中文摘要

向现代高效未来电网的转型依赖于分布式能源资源和需求响应(DR)等应用的无缝协调。虽然这种转变带来了更大的灵活性,但也增加了电网的复杂性和去中心化程度,需要有效协调数百万硬件资产和软件代理。实现这一愿景需要互操作性方面的进步,以确保这些异构系统能够在不产生过高定制成本的情况下进行通信。语义互操作性旨在通过利用本体来保证交换数据的无歧义解释。然而,当前商业建筑和DR领域的本体面临两个关键限制。首先,现有本体通常在没有反映实际DR需求的正式框架下开发。其次,通用本体与DR专用本体的集成方案大多停留在概念层面,缺乏形式化或实证验证。在本文中,我们开始通过应用形式化本体评估/开发方法来定义语义互操作性所需的信息需求(IRs),以美国商业建筑中基于激励的DR项目为起点,来填补这些空白。我们识别了与基于激励的DR每个阶段相关的IRs。利用这些IRs,我们评估了现有本体(特别是Brick、DELTA、EFOnt和CIM)对DR参与操作需求的支持程度。我们的发现揭示了当前本体与实际DR需求之间的显著差距,并提出了这些本体必要扩展和整合的路线图。这项工作最终旨在增强当今和未来智能电网的互操作性,从而促进DR系统可扩展地集成到电网复杂的运行框架中。

英文摘要

The transition to a modern and efficient future grid relies on the seamless coordination of distributed energy resources and applications such as Demand Response (DR). While this transformation enables greater flexibility, it increases grid complexity and decentralization, requiring the effective coordination of millions of hardware assets and software agents. Realizing this vision demands advances in interoperability to ensure these heterogeneous systems can communicate without prohibitive customization costs. Semantic interoperability aims to address this by leveraging ontologies to guarantee the unambiguous interpretation of exchanged data. However, current ontologies in the commercial building and DR domains face two critical limitations. First, existing ontologies are often developed without a formal framework that reflects real-world DR requirements. Second, proposals for integrating general and DR-specific ontologies remain mostly conceptual, lacking formalization or empirical validation. In this paper, we begin to address these gaps by applying a formal ontology evaluation/development approach to define the information requirements (IRs) necessary for semantic interoperability, focusing on incentive-based DR programs for commercial buildings in the United States as a starting point. We identify the IRs associated with each stage of incentive-based DR. Using these IRs, we evaluate how well existing ontologies, specifically Brick, DELTA, EFOnt, and CIM, support the operational needs of DR participation. Our findings reveal substantial gaps between current ontologies and practical DR requirements, and we propose a roadmap of necessary extensions and integrations for these ontologies. This work ultimately aims to enhance the interoperability of today's and future smart grids, thereby facilitating scalable integration of DR systems into the grid's complex operational framework.

2508.18636 2026-06-11 cs.SE cs.AI 版本更新

LaQual: An Automated Framework for LLM App Quality Evaluation

LaQual: 一种用于LLM应用质量评估的自动化框架

Yan Wang, Xinyi Hou, Junjun Si, Yanjie Zhao, Weiguo Lin, Haoyu Wang

AI总结 提出LaQual自动化框架,通过静态指标筛选和动态场景评估,实现LLM应用质量评估,与人类判断高度一致,可减少66.7%-81.3%候选应用。

详情
AI中文摘要

代表软件分发的新范式,LLM应用商店正在迅速兴起,为用户提供内容生成、编程辅助、教育等多样化选择。然而,当前LLM应用商店中的排名和推荐机制主要依赖静态指标(如用户交互和收藏),使用户难以高效识别高质量应用。同时,当前学术研究专注于特定垂直领域,缺乏适用于多样化LLM应用生态的通用自动化评估框架。为应对上述挑战,我们提出LaQual,一种用于LLM应用质量评估的自动化框架。LaQual整合三个关键阶段:(1) LLM应用标注与层次分类,实现精确场景映射;(2) 静态指标评估,使用时间加权用户参与度和功能能力指标过滤低质量应用;(3) 动态场景自适应评估,由LLM生成场景特定评估指标、评分标准和任务,进行全面质量评估。在主流LLM应用商店上的实验证明了LaQual的有效性。其自动化评分与人类判断高度一致。通过有效筛选,LaQual可将候选LLM应用池减少66.7%至81.3%。用户研究进一步验证了其相对于基线系统的显著优势,特别是在比较效率(均值5.45 vs. 3.30)和解释信息价值(4.75 vs. 2.25)方面。这些结果表明,LaQual为现实场景中LLM应用的高质量发现与推荐提供了可扩展、客观且以用户为中心的解决方案。

英文摘要

Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in LLM app stores predominantly rely on static metrics, such as user interactions and favorites, making it challenging for users to efficiently identify high-quality apps. At the same time, current academic research focuses on specific vertical fields and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address the above challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation using time-weighted user engagement and functional capability indicators to filter low-quality apps; and (3) dynamic scenario-adapted evaluation, where an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual. Its automated scores show high consistency with human judgments. Through effective screening, LaQual can reduce the candidate LLM app pool by 66.7% to 81.3%. User studies further validate its significant outperformance over baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for high-quality discovery and recommendation of LLM apps in real-world scenarios.
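
The static stage relies on time-weighted user engagement, which the abstract does not define further. One common way to weight by time is exponential decay with a half-life; the sketch below assumes that scheme, and `time_weighted_engagement`, the day-based event format, and the 30-day half-life are hypothetical rather than LaQual's actual indicator:

```python
import math


def time_weighted_engagement(events, now_day, half_life_days=30.0):
    """Sum interaction counts, halving each count's weight every half_life_days.

    events: iterable of (day, count) pairs, with day <= now_day.
    """
    decay = math.log(2) / half_life_days
    return sum(count * math.exp(-decay * (now_day - day)) for day, count in events)
```

Under this weighting, an app with 10 interactions last week scores higher than one with 10 interactions three months ago, so stale popularity counts for less when filtering low-quality apps.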