AI Agent - arXivDaily Topic

2606.18448 2026-06-18 cs.CL New submission 95%

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

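Affiliations: UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab

Topic match (software agents): a multimodal skill library for computer-use agents

AI summary: Proposes VISUALSKILL, a hierarchical multimodal skill library built by combining documentation with UI exploration; it raises the agent's average score on CUA benchmarks by 15.3 points, and multimodal skills outperform text-only ones.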

Abstract

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain over the matched text-only skill (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.
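
The "central index over per-topic files" layout described in the abstract can be pictured with a minimal sketch. Everything except the name load_topic is an assumption made for illustration: the SkillLibrary and Topic classes, the field names, and the sample topics are hypothetical, and this is not the authors' implementation or the actual MCP tool schema.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One per-topic file: instructions plus paths to UI figures (hypothetical layout)."""
    text: str
    figures: list[str] = field(default_factory=list)


@dataclass
class SkillLibrary:
    """Central index over per-topic files for one target application (illustrative)."""
    app: str
    topics: dict[str, Topic] = field(default_factory=dict)

    def index(self) -> list[str]:
        # The agent sees only topic names up front, not their contents.
        return sorted(self.topics)

    def load_topic(self, name: str) -> dict:
        # Fetch the text and figures of a single topic on demand.
        if name not in self.topics:
            raise KeyError(f"unknown topic {name!r}; available: {self.index()}")
        topic = self.topics[name]
        return {"topic": name, "text": topic.text, "figures": list(topic.figures)}


# Made-up sample content for a made-up application.
library = SkillLibrary(
    app="example-editor",
    topics={
        "export": Topic("Open File > Export, then pick a format.", ["export_dialog.png"]),
        "layers": Topic("Use the Layers panel to reorder items.", ["layers_panel.png"]),
    },
)
result = library.load_topic("export")
```

Keeping the index small and loading figures only for the requested topic is one plausible reading of "on demand"; the paper may organise this differently.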

2606.19319 2026-06-18 cs.MA cs.AI cs.DB New submission 90%

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson

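Affiliations: C3 AI

Topic match (software agents): autonomous coding agents for enterprise data integration

AI summary: Proposes Data Intelligence Agents (DIA), a system of three autonomous coding agents that compresses the data integration workflow by executing, validating, and repairing artifacts; it matches or surpasses the best published results on seven SQL benchmarks.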

Abstract

Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions.

2606.18890 2026-06-18 cs.AI New submission 90%

Skill-Guided Continuation Distillation for GUI Agents

Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

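Affiliations: StepFun; University of Science and Technology Beijing; Tsinghua University; Nanyang Technological University

Topic match (software agents): skill-guided distillation raises GUI agent success rates

AI summary: Proposes Skill-Guided Continuation Distillation (SGCD), in which a skill-guided policy generates successful continuations to supply the supervision that expert trajectories lack for uncovered states; on OSWorld-Verified it lifts the success rate of three base models from the low-30% range to over 50%.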

Abstract

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.
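
The two-phase data collection described in the abstract (a plain-policy prefix, then a skill-guided continuation, keeping only successes) can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' code: the function name collect_continuations, the integer "environment", and both toy policies are hypothetical, and skill extraction and the training step itself are left out.

```python
def collect_continuations(tasks, plain_policy, guided_policy, step, is_success,
                          prefix_steps=2, max_steps=10):
    """Return (state, action) pairs taken from successful skill-guided continuations."""
    samples = []
    for start in tasks:
        state = start
        # Phase 1: run the plain policy (no skill guidance) to reach an off-trajectory state.
        for _ in range(prefix_steps):
            state = step(state, plain_policy(state))
        # Phase 2: the skill-guided policy continues from that state.
        continuation = []
        for _ in range(max_steps):
            if is_success(state):
                break
            action = guided_policy(state)
            continuation.append((state, action))
            state = step(state, action)
        # Only continuations that end in success become supervision.
        if is_success(state):
            samples.extend(continuation)
    return samples


# Toy setup (made up): the state is an integer and the goal is to reach 0.
step = lambda state, action: state + action
is_success = lambda state: state == 0
plain_policy = lambda state: 1                         # drifts away from the goal
guided_policy = lambda state: -1 if state > 0 else 1   # moves toward the goal

expert_data = [(1, -1)]                                # one expert (state, action) pair
continuations = collect_continuations([1, -2], plain_policy, guided_policy,
                                      step, is_success)
training_set = expert_data + continuations             # mixed supervision
```

The states in continuations (3, 2, 1) never appear in expert_data, which is the supervision gap the abstract refers to.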

2606.01139 2026-06-18 cs.AI Version update 90%

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Yuxuan Liu, Zhaochen Su, Lingyun Xie, Yuhao Zhang, Qing Zong, Jiahe Guo, Zhongwei Xie, Yiyan Ji, Yauwai Yim, Hongyu Luo, Xiyu Ren, Ruan Chenyu, Haoran Li, Yangqiu Song

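Affiliations: The Hong Kong University of Science and Technology; Harbin Institute of Technology; Harbin Institute of Technology, Shenzhen; Nanjing University; The University of Hong Kong

Topic match (software agents): iterative refinement of agent skills to raise LLM agent success rates

AI summary: Proposes SkillRevise, a framework that iteratively refines an initial skill through diagnosis from execution evidence, retrieval of repair principles, and execution-anchored edits; it raises the base agent's success rate on SkillsBench from 36.05% to 61.63%, and the revised skills transfer across models.

Comments: 15 pages, 4 figures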

Abstract

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates, it retains the first verifier-passing skill within the revision budget and falls back to empirical utility only when no candidate succeeds. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills transfer across both executors and task environments, suggesting that SkillRevise captures reusable procedural knowledge beyond any single executor.
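
The retention rule stated in the abstract (keep the first verifier-passing skill within the revision budget, otherwise fall back to empirical utility) can be sketched as follows. The function name select_skill, the run callback, and the scalar utility are assumptions for illustration; the abstract does not say how utility is measured, and this is not the authors' implementation.

```python
from typing import Callable, Iterable, Optional, Tuple


def select_skill(candidates: Iterable[str],
                 run: Callable[[str], Tuple[bool, float]],
                 budget: int) -> Optional[str]:
    """Pick a skill version from candidates listed in revision order.

    run(skill) re-executes a candidate and returns (verifier_passed, utility).
    """
    best, best_utility = None, float("-inf")
    for attempt, skill in enumerate(candidates):
        if attempt >= budget:
            break
        passed, utility = run(skill)
        if passed:
            # The first verifier-passing candidate wins immediately.
            return skill
        if utility > best_utility:
            best, best_utility = skill, utility
    # No candidate passed within the budget: fall back to the highest utility seen.
    return best


# Made-up execution results for four hypothetical skill versions.
results = {"v0": (False, 0.2), "v1": (False, 0.5), "v2": (True, 0.4), "v3": (True, 0.9)}
versions = ["v0", "v1", "v2", "v3"]
chosen_full = select_skill(versions, results.__getitem__, budget=4)
chosen_small = select_skill(versions, results.__getitem__, budget=2)
```

With a budget of 4 the rule returns v2 even though v3 has higher utility, because v2 is the first to pass the verifier; with a budget of 2 nothing passes and the fallback picks v1.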

2604.06367 2026-06-18 cs.CR cs.AI cs.LG Version update 90%

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

Guruprasad Viswanathan Ramesh, Asmit Nayak, Basieem Siddique, Kassem Fawaz

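Affiliations: University of Wisconsin-Madison

Topic match (software agents): evaluating web agents on security and privacy tasks

AI summary: Proposes WebSP-Eval, a framework with 200 task instances and an automated evaluator that tests multimodal LLMs on website security and privacy tasks; it finds that stateful UI elements are a primary cause of failure, with toggles causing more than 45% task failure across many models.

Comments: Accepted at PETS 2026. Project Page: https://wiscprivacy.com/webspeval/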

Abstract

Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent's ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system supporting account and initial state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We evaluate a total of 8 web agent instantiations using state-of-the-art multimodal large language models, conducting a fine-grained analysis across websites, task categories, and UI elements. Our evaluation reveals that current models suffer from limited autonomous exploration capabilities to reliably solve website security and privacy tasks, and struggle with specific task categories and websites. Crucially, we identify stateful UI elements as a primary reason for agent failure, with toggles causing more than 45% task failure across many models.

2606.18976 2026-06-18 cs.SE cs.AI New submission 85%

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario

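Affiliations: Department of Information Engineering, University of Florence, Florence, Italy

Topic match (software agents): a multi-agent LLM system that generates software architecture feedback automatically

AI summary: Proposes CAPRA, a multi-agent LLM system that generates personalized LaTeX feedback on software architecture deliverables using multimodal document extraction, deterministic evidence anchoring, and consistency management; it satisfied 88.8% of the evaluated criteria on 10 student reports.

Comments: Accepted for publication at the 38th International Conference on Software Engineering Education and Training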

Abstract

Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.
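
The Evidence Anchoring step is described as fuzzy matching via normalized Levenshtein distance. A minimal sketch of that idea follows, assuming the distance is normalized by the longer string's length and that a quoted snippet is matched against sliding windows of the source text. The function names, the window strategy, and the 0.2 threshold are hypothetical and are not taken from CAPRA.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def anchor_evidence(quote: str, document: str, max_distance: float = 0.2):
    """Return (offset, span) of the closest window, or None if nothing is close enough."""
    width = len(quote)
    best = None
    for start in range(max(1, len(document) - width + 1)):
        span = document[start:start + width]
        dist = normalized_distance(quote.lower(), span.lower())
        if best is None or dist < best[0]:
            best = (dist, start, span)
    if best is None or best[0] > max_distance:
        return None
    return best[1], best[2]


report = "In our design the controller layer calls the repository."
anchored = anchor_evidence("the controler layer", report)      # misspelled quote
unanchored = anchor_evidence("microservice gateway", report)   # not in the report
```

A finding whose quote cannot be anchored (unanchored is None here) could then be dropped or flagged rather than shown to the student, which is one way a deterministic check can limit hallucinated evidence.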

2606.18728 2026-06-18 cs.CL New submission 85%

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei

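Affiliations: Fudan University; Shanghai Innovation Institute; Northwest University of Political Science and Law

Topic match (software agents): a life-cycle interactive environment for legal agents

AI summary: Proposes LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causal chain of five stages, grounded in 75,309 paired judgments, and evaluates how capabilities differ across models over the connected litigation stages.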

Abstract

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

2606.18671 2026-06-18 cs.HC New submission 85%

HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification

Yujin Zhang, Daye Nam

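Topic match (software agents): extracting evidence from web agent trajectories for verification

AI summary: Proposes HANSEL, a system that extracts interactively verifiable evidence from AI agent trajectories to reduce the user's review burden; it reaches 83.7% precision and 88.8% recall on benchmark tasks, and a user study shows significantly lower task completion time and perceived effort.

Comments: 13 pages, 6 figures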

Abstract

AI web agents can perform complex, multi-step tasks such as searching for products, comparing options, and making purchases on behalf of users. However, verifying the correctness of an agent's output remains difficult. Existing transparency mechanisms, including full trajectory logs, source links, screenshots, and LLM-generated summaries, treat verification as a passive reading task, leaving users to sift through overwhelming logs or trust potentially unfaithful explanations. We present HANSEL (Highlighting Agent Navigation Steps as Evidence Links), a system that extracts interactive, verifiable evidence from web-agent trajectories. Given an agent trajectory, HANSEL extracts evidence pages and snippets and presents them as navigable, interactive views with relevant page state preserved (e.g., applied filters, search queries, and scroll positions), enabling users to verify how the agent arrived at its answer. When the agent's answer cannot be traced to any visited page, HANSEL explicitly flags this gap. A technical evaluation on 45 tasks from AssistantBench and Online-Mind2Web shows that HANSEL achieves 83.7% precision and 88.8% recall in identifying evidence pages, while reducing trajectory volume by 61.6%. In a controlled user study with 14 participants, HANSEL significantly reduced task completion time and perceived effort compared to a standard agent interface, while participants rated it significantly higher on usability, verification ease, and error identification. Our results demonstrate that reframing verification as an interactive activity, rather than passive consumption of agent explanations, leads to more efficient human oversight of AI agents.

2606.16000 2026-06-18 cs.CL cs.LG New submission 85%

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasiya Palienko

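Affiliations: ITMO University; HSE University

Topic match (software agents): an environment for evaluating LLM-powered AutoML agents

AI summary: Proposes GRACE-DS, an isolated environment for pre-deployment evaluation of LLM-powered AutoML agents, in which hidden executable validators measure predictive performance, leakage avoidance, reproducibility, and related criteria; experiments show its flexible iterative interaction regime outperforms the baselines.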

Abstract

We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

2606.13681 2026-06-18 cs.CL New submission 85%

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

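Affiliations: National University of Singapore; Singapore Management University; University of Washington; University College London; University of Pennsylvania; Nanyang Technological University; Recursive; Massachusetts Institute of Technology

Topic match (software agents): a memory-evolution benchmark for LLM agents in dynamic environments

AI summary: Proposes EvoArena, a benchmark suite that simulates progressive environment changes across terminal, software, and social domains, and EvoMem, a patch-based memory paradigm that records structured update histories so agents can reason about environmental evolution through memory changes; experiments show current agents struggle in dynamic environments and EvoMem consistently improves performance.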

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
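
A patch-based memory that keeps a structured update history, as the abstract describes for EvoMem, might look like the sketch below. The Patch and PatchMemory classes, their fields, and the sample keys are hypothetical illustrations of the general idea; the paper's actual data model is not given in the abstract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Patch:
    """One recorded change to a memory entry (hypothetical schema)."""
    step: int
    key: str
    old: Optional[str]
    new: Optional[str]


@dataclass
class PatchMemory:
    """Current state plus the ordered history of patches that produced it."""
    state: dict[str, str] = field(default_factory=dict)
    history: list[Patch] = field(default_factory=list)

    def update(self, step: int, key: str, new: Optional[str]) -> None:
        old = self.state.get(key)
        if old == new:
            return  # nothing changed, so nothing is recorded
        self.history.append(Patch(step, key, old, new))
        if new is None:
            self.state.pop(key, None)  # None means the entry was removed
        else:
            self.state[key] = new

    def changes(self, key: str) -> list[Patch]:
        # Lets the agent inspect how one fact evolved, not just its latest value.
        return [patch for patch in self.history if patch.key == key]


memory = PatchMemory()
memory.update(1, "deploy_cmd", "make deploy")
memory.update(2, "deploy_cmd", "make release")  # the environment changed
memory.update(3, "deploy_cmd", "make release")  # repeated observation, no patch
deploy_history = memory.changes("deploy_cmd")
```

Storing old and new values side by side is what would let an agent notice that a command it learned earlier has been superseded, rather than silently overwriting it.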

2606.18294 2026-06-18 physics.ins-det nucl-ex physics.app-ph New submission 80%

Vision AI Agent for Continuous Material Monitoring of LEGEND-1000 LoFi Reentrant Tube

Sonata Simonaitis-Boyd, Soonhong Lee, Lauren N. O'Brien, Brandon T. Turner, Ralph Massarczyk, Steven R. Elliott, Aobo Li, Alexander F. Leder

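Topic match (software agents): a LangChain agent pipeline for automated material monitoring

AI summary: Proposes a vision AI agent pipeline built on LangChain and Claude Haiku 4.5 that uses SAM2 segmentation and hybrid OCR validation to extract the diameter and strain of OFHC copper cylinders from hydrostatic test video, computes yield strength, and compares the results with simulations.

Comments: 27 pages, 8 figures, 5 tables, submitted to PRX Intelligence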

Abstract

We report on a vision AI agent pipeline for non-contact material strain and property extraction from video data, demonstrated on video taken during hydrostatic testing of four OFHC copper cylinders conducted as part of the LEGEND-1000 hardware validation campaign. Traditional strain gauge measurements proved unreliable, motivating a fully-automated agentic alternative. The agent was built on the LangChain framework with Claude Haiku 4.5 as its central reasoning engine, integrating a specialized suite of computer vision tools: FFmpeg for video preprocessing and rotation correction via Hough Line Transform, the Segment Anything Model 2 (SAM2) for spatiotemporal segmentation with automated memory-informed dynamic chunking, and a hybrid EasyOCR and LLM-based timestamp validation pipeline. Three specialized sub-agents were developed to process the video data and obtain cylinder diameters and timestamps while autonomously handling obstacles such as corrupted frames and memory limits. From the diameter profiles synchronized to pressure data, hoop stress-strain curves were reconstructed and yield strengths were calculated using the 0.2% offset, 0.5% EUL, and Johnson-Cook methods across two independent tests. Cross-validation against a non-agentic pipeline confirmed agreement for the diameter extraction at the ±5 pixel level. The material properties and testing results were further compared to Ansys mechanical simulations performed as part of the LEGEND-1000 reentrant tube design campaign. This work showcases the power of agentic pipelines to extract materials data from video alone.
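
To illustrate how a diameter profile synchronized to pressure can yield a hoop stress-strain curve, here is a minimal sketch. It assumes the textbook relations: hoop strain as the relative change in diameter, and the thin-walled approximation sigma = p * D / (2 * t) for hoop stress. The abstract does not state which stress formula, geometry, or units the authors used, so the functions and all numbers below are hypothetical.

```python
def hoop_strain(diameter: float, initial_diameter: float) -> float:
    """Circumferential strain; a ratio, so pixel units cancel out."""
    return (diameter - initial_diameter) / initial_diameter


def thin_wall_hoop_stress(pressure: float, diameter: float, wall_thickness: float) -> float:
    """Thin-walled cylinder approximation (an assumption, not the paper's stated model)."""
    return pressure * diameter / (2.0 * wall_thickness)


# Made-up frames: (pressure in MPa, measured diameter in pixels).
frames = [(0.0, 1000.0), (1.0, 1001.0), (2.0, 1003.0)]
initial_px = frames[0][1]

# Hypothetical physical geometry in mm, used only for the stress calculation.
nominal_diameter_mm, wall_mm = 100.0, 5.0

curve = [
    (hoop_strain(diameter_px, initial_px),
     thin_wall_hoop_stress(pressure, nominal_diameter_mm, wall_mm))
    for pressure, diameter_px in frames
]
```

Because strain is a ratio of diameters, it can be computed directly in pixels; only the stress needs physical dimensions.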

2606.15828 2026-06-18 cs.SE New submission 80%

Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents

Helio Victor F. dos Santos, Vitor Costa, Joao Eduardo Montandon, Luciana Lourdes Silva, Marco Tulio Valente

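Topic match (software agents): configuration problems in coding agents, within the AI Agent topic

AI summary: Presents the first catalog of smells in coding-agent configuration files (AGENTS.md/CLAUDE.md), identifying six smells through a grey literature review and repository mining and assessing their prevalence across 100 open-source repositories, where Lint Leakage is the most common (62%).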

Abstract

Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS.md or CLAUDE.md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.

2606.18619 2026-06-18 cs.CR cs.AI cs.SE New submission 70%

Code-Augur: Agentic Vulnerability Detection via Specification Inference

Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

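Affiliations: National University of Singapore

Topic match (software agents): vulnerability auditing by autonomous LLM agents

AI summary: Proposes a security-specification-first paradigm that makes the agent's assumptions explicit and falsifies them at runtime with a guided fuzzer, detecting more vulnerabilities on real-world projects than existing agents.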

Abstract

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
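
The loop of committing an assumed invariant as an in-source assertion and then trying to falsify it can be shown on a toy example. This is a sketch under stated assumptions: the parser, the record format, and the function names are made up, and the fuzzer below is plain random input generation with a fixed seed, standing in for the guided fuzzer the abstract mentions. It does not reproduce Code-Augur.

```python
import random


def read_field(buffer: bytes, offset: int, length: int) -> bytes:
    # In-source assertion: the invariant assumed when this function was judged safe.
    assert 0 <= offset and offset + length <= len(buffer), "spec: read stays in bounds"
    return buffer[offset:offset + length]


def parse_record(packet: bytes) -> bytes:
    """Toy parser: the first byte declares the payload length (hypothetical format)."""
    declared = packet[0]
    return read_field(packet, 1, declared)  # trusts the declared length


def fuzz(target, trials: int = 1000, seed: int = 0):
    """Feed random inputs to target; return the first input that violates an assertion."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, 8)))
        try:
            target(data)
        except AssertionError as violation:
            return data, str(violation)
    return None


finding = fuzz(parse_record)
```

A triggered assertion then needs triage, as the abstract notes: here the caller really does pass an unchecked length, so the violation points at parse_record rather than at an overly strict specification. Note that Python drops assert statements when run with the -O flag, so a check like this only fires in a non-optimized run.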
