Measuring Biological Capabilities and Risks of AI Agents

Patricia Paskov, Jeffrey Lee, Kyle Brady, Alyssa Worland

AI summary: For agentic systems such as AI scientists that autonomously perform multi-step scientific tasks, this paper introduces biological agentic evaluations as an interpretive tool and offers experience-grounded considerations for defining, designing, running, scoring, and documenting evaluations, to help decision-makers interpret results cautiously and to guide investment.

Abstract

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.

2606.19887 2026-06-19 cs.CR cs.AI New submission

FFinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

Chaeyun Kim, Daeyoung Park, Junghwan Kim, Jinyoung Jeong, Eunji Song, Yongtaek Lim, Minwoo Kim

Affiliations: DATUMO INC.; Korea Advanced Institute of Science and Technology (KAIST); Financial Security Institute (FSI)

AI summary: Proposes the FinRED framework, which uses an expert-guided two-level taxonomy to map global financial standards to threats, generates context-rich red-teaming Behavioral Prompts from real financial documents, and pairs them with an expert-validated rubric, effectively reducing critical false negatives.

Abstract

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea's Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at https://github.com/selectstar-ai/FinRED-paper and https://huggingface.co/datasets/datumo/FinRED.

2606.19830 2026-06-19 cs.SE cs.CL New submission

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

Affiliations: Nankai University; Shanghai Innovation Institute; Shanghai AI Laboratory

AI summary: Presents JamSet and JamBench, the first project-level code framework dataset and benchmark built on a professional game engine. A deterministic verification pipeline distills 8,133 verified projects from over 240,000 repositories, and an evaluation of 9 frontier models finds that capability drops sharply as project scale grows.

Abstract

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

2606.19803 2026-06-19 cs.DB cs.AI cs.LG New submission

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Lakshmi Sahithi Yalamarthi, Primal Pappachan

Affiliations: Portland State University

AI summary: Presents a vision for policy-aware vector search, formalizing the fine-grained access control (FGAC) policy model and the enforcement problem in vector databases, comparing enforcement strategies, and identifying future challenges.

Comments: Accepted at SeQureDB 26, Sigmod 2026

Abstract

Vector databases are increasingly used in security-sensitive contexts such as Retrieval Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC), which is required to ensure that data access adheres to user-specific policies, is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension between enforcing FGAC policies correctly, achieving high ANN search recall, and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.
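The tension between correct enforcement and search recall can be illustrated with two commonly discussed enforcement strategies, pre-filtering and post-filtering. The sketch below is a toy, brute-force example in plain Python. The strategy names, the per-document sets of allowed users, and every function name are illustrative assumptions and are not taken from the paper.

```python
import math

# Toy corpus: (doc_id, embedding, users allowed to read the document).
# All values are made up for illustration.
DOCS = [
    ("d1", (1.0, 0.0), {"alice"}),
    ("d2", (0.9, 0.1), {"bob"}),
    ("d3", (0.8, 0.2), {"bob"}),
    ("d4", (0.0, 1.0), {"alice", "bob"}),
]

def top_k(query, docs, k):
    """Exact nearest neighbours by Euclidean distance (stand-in for ANN)."""
    return sorted(docs, key=lambda d: math.dist(query, d[1]))[:k]

def pre_filter_search(query, user, k):
    """Apply the policy first, then search only the permitted documents."""
    permitted = [d for d in DOCS if user in d[2]]
    return [d[0] for d in top_k(query, permitted, k)]

def post_filter_search(query, user, k):
    """Search everything first, then drop results the user may not see."""
    candidates = top_k(query, DOCS, k)
    return [d[0] for d in candidates if user in d[2]]

query = (1.0, 0.0)
print(pre_filter_search(query, "alice", 2))   # ['d1', 'd4']
print(post_filter_search(query, "alice", 2))  # ['d1']
```

Both variants return only documents the user may read, but post-filtering returns fewer than k results here because the unfiltered top-k is dominated by documents the user cannot see. In a real system, pre-filtering may in turn interact poorly with an ANN index, which is one plausible source of the recall and latency trade-off.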

2606.19795 2026-06-19 cs.SE cs.AI New submission

Agentic Electronic Design Automation: A Handoff Perspective

Jiawei Liu, Peiyi Han, Yuntao Lu, Su Zheng, Fengyu Yan, Bei Yu

Affiliations: The Chinese University of Hong Kong; Primarius Technologies

AI summary: Taking handoff validity as its lens, this survey classifies agentic systems in EDA flows into three classes and proposes a five-layer agent communication protocol to address state transfer and validation across stages and tools.

Abstract

Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.

2606.19755 2026-06-19 cs.CR cs.AI New submission

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d'Andrea, Ellie Langhans

Affiliations: Zalando Switzerland AG; TU Wien; Zalando SE

AI summary: To address extreme cold-start and bias problems in e-commerce video feeds, proposes VCG, a scalable multimodal retrieval system built on a domain-adapted vision-language model (CLIP) that enables zero-shot retrieval. Online testing shows a 50% uplift in deep video completion.

Abstract

The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.

2606.19616 2026-06-19 cs.SE cs.AI cs.MA New submission

Before the Pull Request: Mining Multi-Agent Coordination

Dipankar Sarkar

Affiliations: Arizona State University

AI summary: To address insufficient coordination among autonomous coding agents around pull requests, proposes grite, a git-based coordination substrate whose event log reduces duplicate and conflicting work, raises throughput, and allows multiple failure modes to be recovered automatically.

Comments: 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite

Abstract

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
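Claim (ii) contrasts a log whose replicas converge with a tracker file that loses concurrent writes. The sketch below illustrates that general idea with a set-union merge of uniquely identified events. It is not grite's actual data model or merge procedure, which the abstract does not describe; the `Event` fields, `merge`, and `view` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: str   # assumed unique per event, e.g. "<agent>-<counter>"
    agent: str
    action: str

def merge(log_a, log_b):
    """Union two append-only logs; the result does not depend on merge order."""
    return frozenset(log_a) | frozenset(log_b)

def view(log):
    """Deterministic ordering so every replica renders the same state."""
    return sorted(log, key=lambda e: e.event_id)

a = {Event("agent1-1", "agent1", "claim task 7")}
b = {Event("agent2-1", "agent2", "claim task 9")}

# Both replicas converge, whichever side merges first, and keep both writes.
assert view(merge(a, b)) == view(merge(b, a))
print(len(merge(a, b)))  # 2

# Contrast: a tracker file that each agent overwrites in full.
tracker = "agent1: claim task 7"
tracker = "agent2: claim task 9"  # agent1's write is silently lost
print(tracker)
```

Because events are only ever added and each carries a unique identifier, merging is commutative and idempotent, which is why no write disappears in this toy model.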

2606.19613 2026-06-19 cs.SE cs.AI New submission

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

Affiliations: AWS Agentic AI

AI summary: Proposes the StaminaBench benchmark, which tests the stamina of coding agents over 100 consecutive change requests. All tested models fail within 5-6 turns, while test feedback and retries improve the passed turn count by up to 12x.

Abstract

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
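The difference between a stamina-style metric and the fraction-of-tasks-solved metric can be shown in a few lines. The sketch below is one plausible reading of "consecutive turns handled before failing"; the benchmark's exact scoring rules are not given in the abstract, and the function names and the sample session are assumptions.

```python
def stamina(turn_results):
    """Number of consecutive passed turns before the first failure."""
    passed = 0
    for ok in turn_results:
        if not ok:
            break
        passed += 1
    return passed

def fraction_solved(turn_results):
    """Conventional metric: share of turns passed, ignoring their order."""
    return sum(turn_results) / len(turn_results)

# Hypothetical 10-turn session with a single failure at turn 4.
session = [True, True, True, False, True, True, True, True, True, True]
print(stamina(session))          # 3
print(fraction_solved(session))  # 0.9
```

The same session looks strong under the fraction metric and weak under the stamina metric, since one early failure ends the run.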

2606.19605 2026-06-19 cs.SE cs.AI New submission

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

Affiliations: Foundation AI–Cisco Systems Inc.; Yale University

AI summary: Proposes the FAPO framework, which automatically diagnoses pipeline bottlenecks and iteratively optimizes prompts or chain structure. It beats the GEPA baseline in 15 of 18 model-benchmark comparisons, with a mean gain of 14.1 percentage points.

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
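The "non-overlapping mean ± trial-standard-deviation ranges" criterion can be checked mechanically. The sketch below is a minimal illustration, assuming the sample standard deviation over trials; the abstract does not say whether sample or population deviation is used, and the function names and scores are made up.

```python
from statistics import mean, stdev

def interval(scores):
    """Return (mean - std, mean + std) over per-trial scores."""
    m, s = mean(scores), stdev(scores)  # stdev = sample standard deviation
    return m - s, m + s

def clearly_better(a_scores, b_scores):
    """True if a's mean ± std range lies entirely above b's range."""
    a_low, _ = interval(a_scores)
    _, b_high = interval(b_scores)
    return a_low > b_high

method_a = [70.0, 72.0, 74.0]   # range (70.0, 74.0)
method_b = [55.0, 60.0, 65.0]   # range (55.0, 65.0)
method_c = [64.0, 68.0, 72.0]   # range (64.0, 72.0)

print(clearly_better(method_a, method_b))  # True
print(clearly_better(method_a, method_c))  # False: ranges overlap
```

A win on means alone (as in the second comparison) would not count under this stricter criterion, which is why the abstract reports 11 such wins separately from the 15 of 18 overall.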

2606.19535 2026-06-19 cs.CR cs.LG New submission

FloatDoor: Platform-Triggered Backdoors in LLMs

Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth

Affiliations: University of Luebeck

AI summary: Introduces FloatDoor, the first input-independent, platform-triggered backdoor attack. It exploits cross-platform differences in floating-point arithmetic and uses two lightweight LoRA adapters to trigger malicious behavior on a target platform while preserving the model's normal utility.

Abstract

Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.
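The non-associativity of floating-point arithmetic that the abstract names as a root cause is easy to reproduce. The sketch below shows only that arithmetic property with IEEE 754 double precision in Python; it does not reproduce the attack, any platform kernel, or the LoRA adapters, and `sum_in_order` is a hypothetical helper.

```python
def sum_in_order(values):
    """Accumulate left to right, as a sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, grouped differently, give different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Reordering a reduction can change the result by a whole unit.
print(sum_in_order([1e16, 1.0, -1e16]))  # 0.0 (the 1.0 is absorbed)
print(sum_in_order([1e16, -1e16, 1.0]))  # 1.0
```

Kernels on different hardware may reduce the same values in different orders, so small discrepancies of this kind are one way identical weights can yield measurably different outputs.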

2606.19474 2026-06-19 cs.CR cs.AI cs.SE New submission

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

Affiliations: University of Moratuwa; University of Ruhuna; RMIT University

AI summary: Introduces a Secure Coding Drift model for LLM-assisted PQC development and a gamified framework that turns LLMs into active security co-pilots, to mitigate the security degradation caused by sustained reliance on LLMs.

Comments: Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

Abstract

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.

2606.19407 2026-06-19 cs.SE cs.AI New submission

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

Affiliations: Peking University; University of Edinburgh; Beijing University of Posts and Telecommunications

AI summary: Presents JustDiag, a diagnostic justification engine that supports accountable root cause analysis by maintaining an explicit process state (evidence, findings, competing hypotheses, conflicts, and next checks). An evaluation on 66 real-world incidents shows it outperforms an approach that provides only fluent final answers.

Abstract

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.

2606.19390 2026-06-19 cs.SE cs.AI New submission

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

Affiliations: University of Oxford; Cisco Systems; The Alan Turing Institute; University of Warwick – WMG; University of Hull

AI summary: Proposes a protocol-driven framework that binds SBOM and AIBOM artifacts to deterministic environment capture and structured runtime telemetry, combines static and runtime evidence to generate CSAF VEX advisories, verifies them through cryptographic signing and deterministic replay, and is evaluated on synthetic agentic AI workloads.

Journal ref: Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9, (May 2026), 1826384

2606.19388 2026-06-19 cs.SE cs.CL cs.HC New submission

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

Affiliations: Mila – Québec AI Institute; Concordia University; University of Toronto; McMaster University

AI summary: Challenges the GUI-dominant paradigm for mobile agents and argues that the CLI deserves equal standing. Experiments show CLI agents outperform GUI baselines on AndroidWorld and MobileWorld, and a CLI-Advantage task suite is introduced to demonstrate the advantage.

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

Kenneth Ge, Andre Assis

Affiliations: Anthropic Fellows Program; Constellation

AI summary: Proposes AgentArmor, a framework that uses mechanisms such as an extended system prompt, a command classifier, and a "3 strikes" policy to mitigate coding agent failures caused by underspecification, capability errors, and harness errors, significantly improving safety.

Abstract

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
