arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20280 2026-06-19 cs.IR cs.AI New submission

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, Chao Jiang, Jingwen Fu, Zhen Liu, Bin Qin, Zhenbo Luo, Jian Luan, Jingmin Xin

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Institute of Artificial Intelligence and Robotics; MiLM Plus Xiaomi Inc; Zhongguancun Academy; Beijing, China

AI summary: Proposes the ELVA framework, which uses rule-based reinforcement learning to mitigate grain blindness in contrastive learning, optimizes ranking in universal multimodal retrieval, and improves results on the new MRBench benchmark by 13.1%.

Comments Accepted by ECCV 2026

English abstract

Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous work has overlooked grain blindness when adapting the contrastive paradigm to retrieval tasks. Grain blindness refers to the model's tendency to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. It stems from contrastive learning treating samples as a binary classification (positive/negative) while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce ELVA, a simple but effective rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positive and negative samples. To measure grain blindness more precisely, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.
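The binary treatment of negatives described in this abstract can be illustrated with a minimal sketch. `info_nce` is the standard InfoNCE loss for a single query; its value does not change when the negatives are permuted, so it carries no signal about how the negatives rank among themselves. `ranking_reward` is a hypothetical rule-based reward, written only for illustration and not taken from the paper: it scores a predicted ordering of negatives by pairwise agreement with their similarity to the positive. The names, the temperature value, and the reward rule are all assumptions.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.05):
    """Standard InfoNCE loss for one query: -log softmax of the positive logit."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


def ranking_reward(predicted_order, sim_to_positive):
    """Hypothetical rule-based reward (an assumption, not the paper's rule).

    Returns the fraction of negative pairs that the predicted order places
    with the negative more similar to the positive first.
    """
    pairs = agree = 0
    for i in range(len(predicted_order)):
        for j in range(i + 1, len(predicted_order)):
            pairs += 1
            first, second = predicted_order[i], predicted_order[j]
            if sim_to_positive[first] > sim_to_positive[second]:
                agree += 1
    return agree / pairs if pairs else 1.0
```

Swapping two negatives leaves `info_nce` unchanged, while `ranking_reward` moves between 0.0 and 1.0 depending on how the negatives are ordered.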

2606.20258 2026-06-19 cs.HC cs.AI New submission

Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination

Simon Aagaard Enni, Malthe Stavning Erslev, Karl-Emil Kjær Bilstrup, Kristoffer Laigaard Nielbo

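Affiliations: Aarhus University; University of Copenhagen

AI summary: Proposes "editorial alignment" as a participatory AI design practice in which editors take part in design workshops to re-align LLM interfaces with editorial standards, preserving the editorial function of public knowledge institutions.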

Comments 14 pages

English abstract

The emergence of LLM-driven information services is reshaping the conditions under which public knowledge institutions operate, threatening to absorb the editorial function these institutions exist to exercise. While LLMs offer powerful new affordances for knowledge dissemination, editorial authority is challenged by pretrained LLMs that arrive already aligned with the values and dissemination strategies of their commercial developers. This paper investigates editor participation in re-aligning LLM interfaces to editorial standards through design workshops, in a case study where we design and implement an LLM-enabled encyclopedia interface with a Nordic public knowledge institution. We introduce editorial alignment as a design practice within Participatory AI, framing AI alignment as a design process and positioning the editorial standard as a design artefact that translates editorial practice and values into alignment objectives for technical implementation. Last, we discuss how editorial alignment can create space for ongoing participation and give editors agency in LLM-mediated knowledge dissemination.

2606.20235 2026-06-19 cs.IR cs.AI New submission

ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

Tingyue Pan, Mingyue Cheng, Daoyu Wang, Yitong Zhou, Jie Ouyang, Qi Liu, Enhong Chen

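Affiliations: State Key Lab of Cognitive Intelligence, University of Science and Technology of China

AI summary: Proposes the ScholarQuest benchmark, built from over 1,000 computer science topics and four research intents, with scalable answer construction and a shared retrieval backend, to evaluate how well LLM agents search for academic papers in open literature environments.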

English abstract

Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.
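The Recall@100 and Recall@All figures quoted above are set-overlap metrics. A minimal sketch of the usual definition follows; reading Recall@All as recall over everything the agent returned, with no cutoff, is an assumption here, and the function name is hypothetical.

```python
def recall_at_k(retrieved, relevant, k=None):
    """Fraction of relevant papers found in the top-k retrieved results.

    k=None applies no cutoff (assumed here to correspond to Recall@All).
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top = retrieved if k is None else retrieved[:k]
    return len(set(top) & relevant) / len(relevant)
```

For example, with four relevant papers and a ranked list that finds one of them in its top two results and a second one later, `recall_at_k(..., k=2)` is 0.25 and the uncut recall is 0.5.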

2606.20128 2026-06-19 cs.SE cs.DC cs.LG New submission

The Correctness Illusion in LLM-Generated GPU Kernels

Dipankar Sarkar

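Affiliations: Arizona State University, USA

AI summary: Using a high-precision CPU reference and op-schema-aware fuzzing, shows that the fixed-shape allclose checks in existing benchmarks fail to detect LLM-style transcription bugs, and proposes and validates a new protocol.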

Comments 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace

English abstract

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
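The gap between a one-shape check and seeded fuzzing can be shown with a toy stand-in written in pure Python. Nothing below comes from the paper's corpus: `buggy_blocked_sum` is a hypothetical reduction that drops the tail when the input length is not a multiple of the tile width, `math.fsum` plays the role of the high-precision reference, and the tile width, shape range, and tolerance are assumptions.

```python
import math
import random

BLOCK = 4  # hypothetical tile width of the kernel


def reference_sum(xs):
    """High-precision reference (stand-in for an fp64 CPU oracle)."""
    return math.fsum(xs)


def buggy_blocked_sum(xs):
    """Sums full blocks only, silently dropping a partial final block."""
    full = len(xs) // BLOCK * BLOCK
    total = 0.0
    for i in range(0, full, BLOCK):
        total += sum(xs[i:i + BLOCK])
    return total


def fixed_shape_check(kernel, n=16, seed=0, atol=1e-9):
    """Benchmark-style oracle: one shape, one tolerance."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return abs(kernel(xs) - reference_sum(xs)) <= atol


def seeded_fuzz(kernel, trials=50, base_seed=0, atol=1e-9):
    """Vary the shape per trial; return (seed, n) for every failing trial."""
    failures = []
    for t in range(trials):
        trial_seed = base_seed + t
        rng = random.Random(trial_seed)  # the stored seed replays the case
        n = rng.randint(1, 64)
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if abs(kernel(xs) - reference_sum(xs)) > atol:
            failures.append((trial_seed, n))
    return failures
```

The fixed shape of 16 is a multiple of the tile width, so the buggy reduction passes `fixed_shape_check`, whereas `seeded_fuzz` reports failing seeds that can be replayed by reconstructing `random.Random(seed)`.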

2606.20065 2026-06-19 cs.IR cs.CL cs.CY New submission

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

Pratyush Kumar

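Affiliations: Ranqo

AI summary: Analyzes 100K+ prompt responses to propose a way of measuring brand visibility in AI search engines, finds that brand maturity forms a three-tier ladder, and identifies the most-cited content formats and the instability of sentiment.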

Comments 14 pages, 4 tables; v1.0 preprint

English abstract

People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.

2606.20023 2026-06-19 cs.SE cs.AI cs.CL New submission

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; The Chinese University of Hong Kong; Institute for Artificial Intelligence, Peking University; School of Cyber Security, University of Chinese Academy of Sciences

AI summary: Addresses the safety problem of LLM agents preferring high-privilege tools: proposes the ToolPrivBench evaluation framework, finds that over-privileged selection is common among mainstream agents and amplified by transient failures, and designs a privilege-aware post-training defense that reduces unnecessary high-privilege tool use.

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

English abstract

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
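The least-privilege behavior this abstract argues for (prefer a sufficient lower-privilege tool, and do not escalate just because of a transient failure) can be sketched as a plain selection policy. This is an illustration of the principle only, not the paper's post-training defense; the `Tool` fields, the integer privilege scale, and the `TransientError` type are hypothetical.

```python
from dataclasses import dataclass


class TransientError(Exception):
    """A failure expected to clear on retry (timeout, rate limit)."""


@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int           # hypothetical scale: higher means more privileged
    capabilities: frozenset  # what the tool is able to do


def sufficient_tools(tools, needed):
    """Tools that cover the task, ordered from lowest to highest privilege."""
    return sorted((t for t in tools if needed <= t.capabilities),
                  key=lambda t: t.privilege)


def run_least_privilege(tools, needed, call, max_retries=2):
    """Run the task with the lowest-privilege sufficient tool."""
    for tool in sufficient_tools(tools, needed):
        for _ in range(max_retries + 1):
            try:
                return tool.name, call(tool)
            except TransientError:
                continue  # retry the same tool; a timeout is no reason to escalate
            except PermissionError:
                break     # the tool truly lacks access: try the next level
        else:
            # Every attempt failed transiently: stop rather than escalate.
            raise RuntimeError(f"{tool.name} kept failing transiently; not escalating")
    raise RuntimeError("no sufficient tool succeeded")
```

The design choice is that only a `PermissionError` moves the loop to a more privileged tool; exhausted retries surface as an error instead of a silent escalation.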

2606.19992 2026-06-19 cs.SE cs.AI New submission

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

Mugeng Liu, Shuoqi Li, Yixuan Zhang, Yun Ma

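Affiliations: School of Computer Science, Peking University, Beijing, China; School of Software & Microelectronics, Peking University, Beijing, China; Institute for Artificial Intelligence, Peking University, Beijing, China

AI summary: Proposes ToolPro, which represents tool intent as an executable program and, through constraint-guided construction, effect-aware replay, and policy-based decisions, achieves up to 53.4% lower latency and up to 96.1% less traffic on MCP services.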

Comments Accepted by ICML 2026

English abstract

In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4% and client-side traffic by up to 96.1%, with larger gains under higher network latency and workflow complexity.
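The idea of effect-aware replay can be sketched with a small in-memory log: when a multi-step program is re-run after a failure, steps tagged as state-modifying return their recorded result instead of executing again, while read-only steps simply re-execute. This is a generic illustration, not ToolPro's implementation; the `"read"`/`"write"` effect labels, the class, and the function names are hypothetical, and a real system would need a durable log that is updated atomically with the call.

```python
class ReplayLog:
    """Remembers results of state-modifying steps across re-runs."""

    def __init__(self):
        self._done = {}

    def run(self, step_id, effect, fn):
        # A completed write is replayed from the log, never executed twice.
        if effect == "write" and step_id in self._done:
            return self._done[step_id]
        result = fn()
        if effect == "write":
            self._done[step_id] = result
        return result


def run_program(steps, log):
    """Execute (step_id, effect, fn) triples in order; exceptions propagate."""
    return [log.run(step_id, effect, fn) for step_id, effect, fn in steps]
```

If a read step fails after a write step has succeeded, calling `run_program` again with the same log repeats the read but leaves the write's side effect applied exactly once.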

2606.19989 2026-06-19 cs.DC cs.LG New submission

Online Dynamic Batching with Formal Guarantees for LLM Training

Dian Li, Zekun Wang, Yaoru Wang, Jiahong Yan

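Affiliations: Tencent

AI summary: Proposes Online Dynamic Batching (ODB), a DataLoader-side system that defers batch construction until each sample's true cost is observable, addressing the invisibility of preprocessing costs to offline batch samplers; it achieves 1.58-4.43x throughput gains and provides formal guarantees of deadlock-free bounded termination.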

Comments 29 pages, 3 figures, 21 tables

English abstract

Modern LLM training breaks a core assumption behind offline batch samplers: the true training cost of a sample is only observable after preprocessing, augmentation, templating, tokenization, and multimodal visual-token expansion. Unless one pays for a preprocessing- and augmentation-dependent length cache, batch construction is therefore blind to the quantity that determines padding, memory use, and GPU saturation. We introduce Online Dynamic Batching (ODB), a DataLoader-side drop-in system that moves batch formation to this point of accurate observability while preserving DDP step alignment. We formalize this synchronization requirement as the Distributed Group Alignment Problem and prove deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure. ODB requires no model, optimizer, or attention-kernel changes and is released as online-dynamic-batching with lightweight trainer adapters. Across public 2B/8B Qwen3-VL runs on UltraChat/LLaVA/ShareGPT4o, ODB improves literal emitted-sample throughput vs. fixed-batch Standard by 1.58-2.51x on single-node Full FT/LoRA and 1.71-3.78x on two-node Full FT, with Standard-comparable quality; production MM-Mix reaches 4.43x. Against GMT/BMT offline token-budget oracles, ODB is within 15% on UltraChat/LLaVA and faster on high-CV ShareGPT4o: 2.24-2.39x single-node Full FT/LoRA and 3.06-3.69x two-node Full FT. Together, ODB occupies the online/drop-in regime for high-heterogeneity LLM fine-tuning: large throughput gains at Standard-comparable quality, formal DGAP guarantees, and no length-cache precompute or kernel rewrites.
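The core observation, that a sample's cost is only known after preprocessing, can be illustrated with a single-process sketch that forms batches after tokenization under a padded-token budget. This is not the ODB system: it ignores DDP step alignment entirely, and the function name, the budget rule (batch size times longest sequence), and the greedy policy are assumptions.

```python
def online_batches(samples, preprocess, token_budget):
    """Yield batches whose padded cost (size * longest sequence) fits the budget.

    Lengths are measured on the output of preprocess, i.e. after
    tokenization, so the batcher sees the true per-sample cost.
    A single sample longer than the budget is emitted as its own batch.
    """
    batch, max_len = [], 0
    for sample in samples:
        tokens = preprocess(sample)
        new_max = max(max_len, len(tokens))
        if batch and new_max * (len(batch) + 1) > token_budget:
            yield batch  # adding this sample would exceed the budget
            batch, new_max = [], len(tokens)
        batch.append(tokens)
        max_len = new_max
    if batch:
        yield batch
```

With whitespace tokenization and a budget of 6, two short samples share a batch while a six-token sample is placed alone, which is the padding waste a fixed batch size cannot avoid.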

2606.19899 2026-06-19 cs.CY cs.AI New submission

Measuring Biological Capabilities and Risks of AI Agents

Patricia Paskov, Jeffrey Lee, Kyle Brady, Alyssa Worland

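AI summary: For AI scientists, agentic systems that autonomously carry out multi-step scientific tasks, the paper presents biological agentic evaluations as an interpretation-sensitive tool and offers experience-grounded considerations for defining, designing, running, scoring, and documenting evaluations, helping decision-makers interpret results cautiously and guiding investment.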

English abstract

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.

2606.19887 2026-06-19 cs.CR cs.AI New submission

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

Chaeyun Kim, Daeyoung Park, Junghwan Kim, Jinyoung Jeong, Eunji Song, Yongtaek Lim, Minwoo Kim

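Affiliations: DATUMO INC.; Korea Advanced Institute of Science and Technology (KAIST); Financial Security Institute (FSI)

AI summary: Proposes the FinRED framework, which uses an expert-guided two-level taxonomy to map global financial standards to threats, generates context-rich red-teaming behavioral prompts from real financial documents, and pairs them with an expert-validated rubric that reduces critical false negatives.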

English abstract

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea's Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at https://github.com/selectstar-ai/FinRED-paper and https://huggingface.co/datasets/datumo/FinRED.

2606.19830 2026-06-19 cs.SE cs.CL New submission

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

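Affiliations: Nankai University; Shanghai Innovation Institute; Shanghai AI Laboratory

AI summary: Presents JamSet and JamBench, the first project-level code framework dataset and benchmark built on a professional game engine; a deterministic verification pipeline distills 8,133 verified projects from over 240,000 repositories, and an evaluation of 9 frontier models shows capability dropping sharply as project scale grows.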

English abstract

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

2606.19803 2026-06-19 cs.DB cs.AI cs.LG New submission

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Lakshmi Sahithi Yalamarthi, Primal Pappachan

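Affiliations: Portland State University

AI summary: Presents a vision for policy-aware vector search: formalizes the fine-grained access control (FGAC) policy model and enforcement problem in vector databases, compares enforcement strategies, and identifies open challenges.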

Comments Accepted at SeQureDB 26, SIGMOD 2026

English abstract

Vector databases are increasingly used in security-sensitive contexts such as Retrieval-Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC), which is required to ensure that data access adheres to user-specific policies, is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension between enforcing FGAC policies correctly, achieving high ANN search recall, and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.
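The tension between policy enforcement and recall can be seen in a toy comparison of two generic enforcement strategies, filtering after the search versus filtering before it. Brute-force cosine ranking stands in for an ANN index here; the item layout (`id`, `vec`, `acl`) and function names are hypothetical, and the paper's own strategies may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, items, k):
    """Exact nearest neighbors by cosine similarity (stand-in for ANN)."""
    return sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k]


def post_filter(query, items, k, allowed):
    """Search first, then drop results the user may not see."""
    return [it for it in top_k(query, items, k) if allowed(it)]


def pre_filter(query, items, k, allowed):
    """Restrict to permitted items first, then search."""
    return top_k(query, [it for it in items if allowed(it)], k)
```

Both strategies respect the policy, but when the nearest neighbors all belong to another user, `post_filter` can return fewer than `k` results (possibly none), while `pre_filter` still returns `k` permitted items at the cost of filtering the whole collection up front.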

2606.19795 2026-06-19 cs.SE cs.AI New submission

Agentic Electronic Design Automation: A Handoff Perspective

Jiawei Liu, Peiyi Han, Yuntao Lu, Su Zheng, Fengyu Yan, Bei Yu

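Affiliations: The Chinese University of Hong Kong; Primarius Technologies

AI summary: Taking handoff validity as its lens, classifies agentic systems in EDA flows into three classes and proposes a five-layer agent communication protocol to address state transfer and validation across stages and tools.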

English abstract

Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.

2606.19755 2026-06-19 cs.CR cs.AI New submission

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Haotian Xu, Zeyang Zhang, Linbao Li, Huadi Zheng, Yu Li, Cheng Zhuo

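Affiliations: Zhejiang University, Hangzhou, China; Huawei; Harbin Institute of Technology, Shenzhen, China

AI summary: Proposes the SafeSpec framework, which integrates a lightweight safety head into the verification step of speculative decoding and recovers safe generations through risk estimation and reflective sampling, substantially lowering attack success rates while preserving the speedup.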

English abstract

Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.

2606.19725 2026-06-19 cs.SE cs.AI cs.MA 新提交

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

面向OpenSIL固件中大语言模型生成的单元测试的库感知测试替身与迭代修复

Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

发表机构 * School of Software Design and Data Science(软件设计与数据科学学院) Seneca Polytechnic(圣力嘉理工学院) Advanced Micro Devices Canada(超威半导体加拿大公司)

AI总结 针对OpenSIL固件单元测试因构建约束易失败的问题,提出LLM引导的多智能体自动化测试生成与迭代修复流程,在76个函数中73个生成可编译测试,行覆盖率达98.8%。

Comments 20 pages, 10 figures

详情
AI中文摘要

验证底层C固件中的变更成本高昂,因为单元测试(UT)在严格的构建约束下非常脆弱,缺失的头文件、未解析的符号和依赖不匹配经常阻止编译和链接。本研究为AMD维护的开源硅初始化库(openSIL)固件代码库引入了一种自动化的UT编写工作流程,通过大语言模型(LLM)引导的多智能体管道减少手动工作。该工作流程结合了测试框架的自动生成、库感知的桩、模拟和伪造的创建或重用,以及由构建日志和行覆盖率反馈驱动的迭代编译-分派修复循环。我们使用编译成功率、修复迭代次数、分派成功率和行覆盖率评估该方法,并以时间、成本和令牌使用量作为次要指标。在76个被测函数中,该工作流程为73个函数生成了可编译的UT。在没有行覆盖率指导或检索增强的配置下,平均行覆盖率达到73.9%。在两种配置下评估的48个函数子集中,仅使用行覆盖率指导时平均行覆盖率达到98.8%,与向量数据库检索结合时达到94.7%。结果表明,自动生成和修复管道可以显著提高受限固件环境中UT创建的效率和覆盖率,同时减少手动调试工作量。

英文摘要

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG 新提交

Closing the Calibration Gap in Semantic Caching

缩小语义缓存中的校准差距

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

发表机构 * New York University(纽约大学) Redis(Redis公司)

AI总结 针对语义缓存系统中离线指标与部署性能的差距,提出P-CHR AUC和CRR指标,发现校准差距由训练目标主导,模型选择本质是校准问题。

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

详情
AI中文摘要

语义缓存通过为语义相似的查询提供缓存响应来降低LLM推理成本。标准实践使用PR-AUC评估这些系统,该指标仅衡量分数排序的好坏,而忽略它们在固定阈值下是否可用。我们表明这种不匹配会导致系统性的部署选择不佳,因为具有最高PR-AUC的模型通常在操作中最差。我们引入精度-缓存命中率(P-CHR)AUC,一种衡量缓存利用率水平上精度的缓存感知指标,以及校准保留率(CRR),它捕捉离线排序质量在部署中保留多少。我们将离线质量与部署质量之间的操作差距分解为可恢复的校准组件和由数据集正例率固定的不可约结构组件。我们的实验表明,校准差距由训练目标而非数据规模主导,事后校准只能部分缩小它。最终,语义缓存的模型选择是一个校准问题,而非排序问题,而测量它是缩小差距的第一步。

英文摘要

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
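The gap between ranking quality and usability at a fixed threshold can be made concrete with a small sketch. The abstract does not give the exact definition of P-CHR AUC, so the code below assumes one plausible reading: for each similarity threshold, the cache hit ratio is the fraction of queries served from cache, precision is the fraction of those hits that are correct, and the metric is the trapezoidal area under that curve. The function names `p_chr_curve` and `trapezoid_auc` are hypothetical, not taken from the paper's repository.

```python
from typing import List, Tuple


def p_chr_curve(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (cache_hit_ratio, precision) points, one per distinct threshold.

    Assumption: a query is a cache hit when its similarity score is >= threshold,
    and a hit is correct when its label is 1.
    """
    n = len(scores)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        hits = [lab for s, lab in zip(scores, labels) if s >= threshold]
        cache_hit_ratio = len(hits) / n
        precision = sum(hits) / len(hits)  # hits is never empty here
        points.append((cache_hit_ratio, precision))
    return points


def trapezoid_auc(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as (x, y) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]   # similarity of each query to its nearest cached entry
    labels = [1, 1, 0, 1]           # 1 = the cached response would have been correct
    curve = p_chr_curve(scores, labels)
    print(curve)
    print(trapezoid_auc(curve))
```

Reading precision directly off the curve at the hit ratio a deployment actually runs at is what a threshold-free ranking metric such as PR-AUC cannot provide.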

2606.19646 2026-06-19 cs.IR cs.CV 新提交

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

SAFE-Cascade: 面向图表问答的成本自适应视觉语言路由

Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu

发表机构 * University of Arkansas(阿肯色大学)

AI总结 提出SAFE-Cascade系统,通过OCR和轻量语言模型先给出答案,再由学习路由器决定是否调用VLM,在ChartQA上以73.1%的VLM调用率达到69.1%准确率,减少26.9%的VLM调用和9.3%的成本。

Comments Demo paper submitted at CIKM 2026. 4 pages, 2 figures

详情
AI中文摘要

视觉语言模型(VLM)在图表问答中表现出色,但若每个查询都调用VLM,当许多问题可通过OCR文本和轻量语言推理回答时,成本会不必要地高昂。我们展示了SAFE-Cascade,一个用于成本自适应图表问答的交互系统。给定图表图像和自然语言问题,SAFE-Cascade首先通过OCR提取图表文本,从纯文本语言模型获得临时答案,然后使用学习路由器决定接受文本答案还是升级到VLM。该演示向用户展示这一决策过程:OCR证据、纯文本答案、路由概率、升级决策、最终答案、估计成本和估计延迟并排显示。SAFE-Cascade被设计为一个透明界面,用于理解何时实际需要视觉基础。用户可以上传或选择图表、提问、检查每条路径使用的证据、比较纯文本和VLM答案,并调整升级阈值以探索准确率-成本边界。该系统使用Azure Document Intelligence进行OCR,gpt-5-mini作为纯文本模型,gemini-2.5-flash-image作为VLM,以及基于推理时特征训练的随机森林路由器。在从2500个样本实验中留出的375个ChartQA测试集上,SAFE-Cascade实现了69.1%的统一准确率和73.1%的VLM调用率,而全VLM基线为67.7%准确率和100% VLM调用率。观察到的+1.4个百分点差异在统计上不确定,因此我们将SAFE-Cascade解释为匹配全VLM性能,同时减少26.9%的VLM调用和9.3%的估计成本。该演示展示了选择性模态路由如何使多模态知识系统更加透明、可调优和成本感知。

英文摘要

Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
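The routing flow described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the OCR step, text-only model, router, and VLM are passed in as plain callables, the router output is treated as the probability that escalation is needed, and every name (`answer_chart_question`, `CascadeResult`, `vlm_calls_avoided_pct`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CascadeResult:
    answer: str
    used_vlm: bool
    escalation_probability: float


def answer_chart_question(
    image: Any,
    question: str,
    ocr: Callable[[Any], str],
    text_model: Callable[[str, str], str],
    router: Callable[[str, str, str], float],
    vlm: Callable[[Any, str], str],
    threshold: float = 0.5,
) -> CascadeResult:
    """Try the cheap text-only path first; escalate to the VLM only if the router asks."""
    ocr_text = ocr(image)
    provisional = text_model(question, ocr_text)
    # Assumption: the router returns the probability that the text answer is NOT good enough.
    p_escalate = router(question, ocr_text, provisional)
    if p_escalate >= threshold:
        return CascadeResult(vlm(image, question), True, p_escalate)
    return CascadeResult(provisional, False, p_escalate)


def vlm_calls_avoided_pct(vlm_invocation_pct: float) -> float:
    """Share of queries answered without the VLM, in percent."""
    return round(100.0 - vlm_invocation_pct, 1)


if __name__ == "__main__":
    result = answer_chart_question(
        image=None,
        question="What was the 2020 value?",
        ocr=lambda img: "Sales 2020: 5",
        text_model=lambda q, text: "5",
        router=lambda q, text, ans: 0.2,
        vlm=lambda img, q: "7",
    )
    print(result)
    print(vlm_calls_avoided_pct(73.1))
```

Raising `threshold` sends fewer queries to the VLM, which is how a user would walk along the accuracy-cost frontier. The reported 26.9% reduction in VLM calls is the complement of the 73.1% invocation rate.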

2606.19627 2026-06-19 cs.IR cs.AI cs.LG 新提交

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

VCG:极端冷启动条件下电商视频流的多模态检索框架

Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d'Andrea, Ellie Langhans

发表机构 * Zalando Switzerland AG(Zalando瑞士有限公司) TU Wien(维也纳技术大学) Zalando SE(Zalando欧洲股份公司)

AI总结 针对电商视频流中的极端冷启动和偏差问题,提出基于领域自适应视觉-语言模型(CLIP)的可扩展多模态检索系统VCG,实现零样本检索,在线测试显示深度视频完成率提升50%。

详情
AI中文摘要

数字商业格局正从静态的搜索驱动型目录转向动态的沉浸式视频流。这一转变引入了“极端冷启动”问题:与传统商品不同,新的短视频缺乏协同过滤所需的密集交互历史。此外,沉浸式视频流引入了强烈的位置和时长偏差,扭曲了标准参与信号。在本文中,我们展示了视频候选生成(VCG)系统,这是一个可扩展的多模态检索引擎,旨在解决大规模电商环境中的这些挑战。通过利用领域自适应的视觉-语言模型(基于CLIP),我们将用户和视频映射到共享语义空间,实现基于视觉内容而非行为历史的零样本检索。我们详细介绍了系统的架构,并进行了严格的评估,比较了生成式(LLM)和判别式(CLIP)嵌入。结果表明,虽然生成式模型在属性预测方面表现出色,但在检索任务中会出现嵌入空间坍塌。在线A/B测试表明,VCG有效缓解了参与偏差,使深度视频完成率提升了50%。为了展示系统的能力,我们提供了一个交互式演示,包含三种双向检索场景:产品到视频、视频到产品和零样本语义搜索。

英文摘要

The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.

2606.19616 2026-06-19 cs.SE cs.AI cs.MA 新提交

Before the Pull Request: Mining Multi-Agent Coordination

在拉取请求之前:挖掘多智能体协调

Dipankar Sarkar

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 针对自主编码智能体在拉取请求中协调不足的问题,提出基于git的协调基板grite,通过事件日志减少重复和冲突工作,提升吞吐量,并自动恢复多种故障模式。

Comments 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite

详情
AI中文摘要

自主编码智能体现在可以开启数百万个拉取请求,然而大规模研究发现,它们的拉取请求虽然生成更快,但被接受的频率却更低——这是一个拉取请求级别的遥测无法解释的协调与信任差距。我们认为缺失的信号存在于拉取请求之前,即并发智能体如何声明、划分和碰撞共享工作。我们通过grite(我们的开源协调基板)来研究这一过程,它不需要中央服务器,并将其记录存储在git本身内部,因此其仅追加的、签名的事件日志直接捕获了协调过程。我们证明:(i) 这种共享基板以有限的开销减少了重复和冲突工作——仅重复队友任务的工作份额从78%降至0%,而有效吞吐量增加了三倍以上;(ii) 每个智能体的日志副本收敛到相同状态,没有写入被静默丢弃,而基于文件的跟踪器会丢失并发写入;(iii) 该日志是一个可挖掘的工件,从中可以自动恢复具体的故障模式——冲突编辑、锁饥饿、冗余发现、竞态关闭——并带有来源信息,其中一些在拉取请求历史中是不可见的。我们发布了数据集、测试平台和挖掘工具包。

英文摘要

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
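The convergence claim in (ii) is the kind of property an append-only log gives almost for free, which a toy model can show. The sketch below is not grite's data format or API: it assumes events are immutable tuples, that merging two replicas is a set union, and that task ownership is decided by replaying events in a deterministic order. Signatures and git storage are omitted, and the names `Event`, `merge`, and `owners` are hypothetical.

```python
from typing import Dict, FrozenSet, NamedTuple


class Event(NamedTuple):
    timestamp: int
    agent: str
    kind: str   # only "claim" is modelled in this sketch
    task: str


def merge(a: FrozenSet[Event], b: FrozenSet[Event]) -> FrozenSet[Event]:
    """Merging append-only logs is a set union: commutative, associative, idempotent."""
    return a | b


def owners(log: FrozenSet[Event]) -> Dict[str, str]:
    """Replay events in a deterministic order; the earliest claim on a task wins."""
    result: Dict[str, str] = {}
    for event in sorted(log):  # sorts by (timestamp, agent, kind, task)
        if event.kind == "claim" and event.task not in result:
            result[event.task] = event.agent
    return result


if __name__ == "__main__":
    replica_a = frozenset({Event(1, "agent-a", "claim", "fix-login")})
    replica_b = frozenset({
        Event(2, "agent-b", "claim", "fix-login"),   # duplicate claim, arrives later
        Event(1, "agent-b", "claim", "add-tests"),
    })
    merged = merge(replica_a, replica_b)
    print(owners(merged))
```

Because union does not depend on merge order, every replica that has seen the same events computes the same owners, and a late duplicate claim stays visible in the log instead of being silently overwritten.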

2606.19613 2026-06-19 cs.SE cs.AI 新提交

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

StaminaBench: 对编码智能体进行100轮交互的压力测试

Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

发表机构 * AWS Agentic AI(AWS 代理人工智能)

AI总结 提出StaminaBench基准,通过100轮连续变更请求测试编码智能体的耐力,发现所有模型在5-6轮内失败,而测试反馈和重试机制可将通过轮数最多提升12倍。

详情
AI中文摘要

我们引入了StaminaBench,一个衡量编码智能体耐力的基准:它们在失败前能处理多少连续交互轮次(变更请求)。与流行的任务解决率指标不同,这更贴近真实的氛围编程(vibe-coding)场景,其中会话会持续数十乃至数百轮。在StaminaBench中,智能体实现一个REST API服务器,并在可调数量的程序生成的后续变更请求(实验中为100个)上进行修改,导致代码库最多达6000行。测试完全以编程方式生成,无需LLM参与,确保可重复性和可靠性;变更序列来自硬编码或LLM驱动的采样器,两者都受限于结构化动作空间以确保变更有效。智能体和服务器在隔离环境中运行,并通过HTTP与基准通信,使测试完全黑盒且与语言无关。我们评估了六个智能体框架与七个开源LLM在20个场景(每个100轮)上的表现,发现:(1)所有测试模型在5-6轮内失败,证实了缺乏充分测试的氛围编程式开发会产生缺陷;(2)将测试反馈传递给智能体并允许重试,可将通过轮数提升最多12倍;(3)良好的框架是强性能所必需的:更强的模型在其最佳和最差框架之间表现出高达6倍的差距,而较弱的模型在任何框架下都失败。我们发布了基准和生成的任务,以促进对多轮编码智能体行为的进一步研究。基准代码和数据:github.com/amazon-science/StaminaBench。

英文摘要

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
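A stamina-style score is easy to state in code. The abstract does not spell out the exact scoring rule, so the sketch below assumes the simplest reading: count consecutive turns passed before the first turn that still fails after the allowed retries. The function name `stamina` and the input layout are hypothetical.

```python
from typing import Sequence


def stamina(turn_attempts: Sequence[Sequence[bool]], max_retries: int = 0) -> int:
    """Count consecutive turns passed before the first unrecovered failure.

    turn_attempts[i] lists the pass/fail outcome of each successive attempt at turn i.
    Only the first (max_retries + 1) attempts of a turn are allowed to count.
    """
    passed = 0
    for attempts in turn_attempts:
        allowed = attempts[: max_retries + 1]
        if any(allowed):
            passed += 1
        else:
            break  # the session ends at the first turn that cannot be recovered
    return passed


if __name__ == "__main__":
    # Four turns; the second needs one retry, the third needs two.
    session = [[True], [False, True], [False, False, True], [True]]
    for retries in (0, 1, 2):
        print(retries, stamina(session, max_retries=retries))
```

On this toy session the score grows as more retries are allowed, which mirrors the direction of finding (2) without reproducing any of the paper's numbers.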

2606.19605 2026-06-19 cs.SE cs.AI 新提交

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

FAPO:多步骤LLM流水线的全自动提示优化

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

发表机构 * Foundation AI–Cisco Systems Inc.(基础AI–思科系统公司) Yale University(耶鲁大学)

AI总结 提出FAPO框架,通过自动诊断流水线瓶颈并迭代优化提示或链结构,在18个模型-基准比较中15次优于基线GEPA,平均提升14.1个百分点。

详情
AI中文摘要

多步骤LLM流水线因检索、推理和格式化步骤间的交互而失败,因此仅提示优化可能遗漏链中的瓶颈。我们提出FAPO(全自动提示优化),一个让Claude Code在标准化代码库内优化LLM流水线的框架。FAPO评估流水线、检查中间步骤、诊断失败、提出范围变更,并重复验证变体以针对评分函数进行优化。它首先尝试提示编辑,仅当提示优化似乎不足时,在归因识别出结构瓶颈的情况下,在允许范围内更改链结构。在六个基准和三个任务模型上,FAPO在18个模型-基准比较中的15个中击败了基线GEPA。在11个模型-基准比较中,FAPO以不重叠的均值±试验标准差范围获胜,平均FAPO-GEPA增益为+14.1个百分点。在六个HoVer和IFBench比较中,当提示优先搜索升级为结构变更时,FAPO在所有六个中获胜,平均增益为+33.8个百分点。FAPO还提高了安全任务的性能:在CTIBench-RCM(一个安全CVE到CWE任务)上,仅提示的FAPO在GPT-5上提升了+4.0个百分点的测试准确率,在Foundation-Sec-8B-Instruct上提升了+7.1个百分点,在Foundation-Sec-8B-Reasoning上提升了+2.0个百分点。这些结果使FAPO成为通用和安全任务的最先进流水线优化技术。

英文摘要

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

2606.19535 2026-06-19 cs.CR cs.LG 新提交

FloatDoor: Platform-Triggered Backdoors in LLMs

FloatDoor: 大语言模型中的平台触发后门

Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth

发表机构 * University of Luebeck(吕贝克大学)

AI总结 提出FloatDoor,首个输入无关、平台触发的后门攻击,利用浮点运算平台差异,通过两个轻量LoRA适配器在目标平台触发恶意行为,同时保持模型正常效用。

详情
AI中文摘要

大型语言模型(LLM)越来越多地部署在软件工程等敏感环境中,其输出直接影响下游工件。最近的研究表明,由于非结合浮点运算和不同的内核实现,同一模型在不同部署平台上可能产生可测量的不同输出。我们研究了这种平台依赖可变性的安全影响,并揭示了LLM部署中一种新的攻击面。我们提出了FloatDoor,这是首个针对生成式LLM的输入无关、平台触发的后门攻击。被攻陷的模型在目标平台上表现出对手选择的行为,而在其他平台上则表现正常。FloatDoor通过两个轻量级LoRA适配器实现:一个放大平台间数值差异,另一个将由此产生的平台签名绑定到恶意下游任务,同时保持模型整体效用基本不变。FloatDoor利用了模型审计和部署之间的显著检查时间与使用时间差距。我们在Qwen3-4B上展示了FloatDoor,涵盖了广泛的部署目标,包括NVIDIA GPU、Google TPU、AWS Graviton和阿里巴巴Yitian-710。作为最终案例研究,我们展示了FloatDoor能够在选定的目标平台上可靠地诱导可利用的代码漏洞。我们的结果建立了一类新的LLM部署攻击,并强调了在敏感的LLM驱动应用中建立可信模型供应链的迫切需求。

英文摘要

Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.

2606.19474 2026-06-19 cs.CR cs.AI cs.SE 新提交

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

LLM辅助后量子密码开发中的安全编码漂移:一种游戏化修复方案

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

发表机构 * University of Moratuwa(莫拉图瓦大学) University of Ruhuna(鲁胡纳大学) RMIT University(皇家墨尔本理工大学)

AI总结 提出LLM辅助PQC开发中的安全编码漂移模型,通过游戏化框架将LLM转变为主动安全协作者,以缓解长期依赖LLM导致的安全退化。

Comments Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

详情
AI中文摘要

向后量子密码学(PQC)的过渡引入了相当大的实现复杂性,要求严格遵守恒定时间执行、侧信道抵抗和精确参数化。同时,大型语言模型(LLM)已深度嵌入软件开发工作流程,包括密码工程。虽然LLM提高了生产力,但证据表明它们经常生成不安全或次优的代码,特别是在安全关键领域。本文引入了PQC中的安全编码漂移,这是一种新颖的社会技术漏洞模型,捕捉了由于持续依赖LLM生成的代码而导致的安全编码实践逐渐退化。与先前关注静态漏洞的工作不同,我们将安全风险概念化为一种源于人机交互的纵向行为现象。为了缓解这一问题,我们提出了一种游戏化的、LLM增强的安全编码框架,将对抗性评估、行为反馈和安全评分嵌入开发工作流程。我们的方法将LLM从被动助手重新定义为主动安全协作者,为AI中介环境中的更安全PQC实现做出贡献。

英文摘要

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.

2606.19407 2026-06-19 cs.SE cs.AI 新提交

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

JustDiag!:用于可问责根本原因分析的诊断论证引擎

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

发表机构 * Peking University(北京大学) University of Edinburgh(爱丁堡大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出JustDiag诊断论证引擎,通过维护显式的过程状态(证据、发现、竞争假设、冲突和下一步检查)来支持可问责的根本原因分析,在66个真实事件上评估显示其优于仅提供流畅最终答案的方法。

详情
AI中文摘要

大型语言模型可以生成流畅的根本原因分析,但仅凭流畅的最终答案不足以证明高风险操作中的可问责性。在实际事件响应中,工程师需要知道哪些证据支持诊断,考虑了哪些替代方案,哪里存在矛盾,以及系统是解决了问题还是保留了不确定性。我们通过JustDiag填补了这一空白,这是一个用于RCA的诊断论证引擎,它维护了关于证据、发现、竞争假设、冲突和下一步检查的显式过程状态。我们使用两层协议在66个真实事件上评估了该系统,该协议分别对最终答案质量和过程质量进行评分。与没有诊断论证的匹配对照组相比,JustDiag获得了更强的结果和过程分数,同时由于更校准的非闭合性而接受了略低的终端完成率。这些结果表明,可问责的RCA需要显式的诊断论证工件和过程感知评估,而不仅仅是流畅的最终答案。

英文摘要

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.

2606.19390 2026-06-19 cs.SE cs.AI 新提交

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

面向智能体AI的执行绑定公告自动化:一种可复现的AIBOM驱动的CSAF-VEX框架

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

发表机构 * University of Oxford(牛津大学) Cisco Systems(思科系统) The Alan Turing Institute(艾伦·图灵研究所) University of Warwick – WMG(沃里克大学 – WMG) University of Hull(赫尔大学)

AI总结 提出一种协议驱动框架,通过绑定SBOM和AIBOM工件与确定性环境捕获及结构化运行时遥测,结合静态与运行时证据生成CSAF VEX公告,经密码签名和确定性重放验证,在合成自主AI工作负载上评估。

Journal ref Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9, (May 2026), 1826384

详情
AI中文摘要

提出一种协议驱动框架,将SBOM和AIBOM工件绑定到确定性环境捕获和结构化运行时遥测。利用声明的工件、观察到的激活条件和强制执行的策略计算可利用性。从静态和运行时证据生成CSAF VEX公告,经密码签名并通过确定性重放验证。评估使用约10000个组件条目,涵盖50到5000个组件的合成自主AI工作负载,并整合OSV、GitHub Advisory、KEV和EPSS数据集。

英文摘要

A protocol-driven framework is presented that binds SBOM and AIBOM artefacts to deterministic environment capture and structured runtime telemetry. Exploitability is computed from declared artefacts, observed activation conditions, and enforced execution policies. CSAF VEX advisories are generated from combined static and runtime evidence, cryptographically signed, and validated through deterministic replay. Evaluation uses approximately 10,000 component entries across synthetic agentic AI workloads of 50 to 5,000 components, incorporating OSV, GitHub Advisory, KEV, and EPSS datasets.

2606.19388 2026-06-19 cs.SE cs.CL cs.HC 新提交

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

超越GUI范式:移动代理是否需要手机屏幕?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

发表机构 * Mila – Québec AI Institute(魁北克人工智能研究所) Concordia University(康科迪亚大学) University of Toronto(多伦多大学) McMaster University(麦克马斯特大学)

AI总结 本文挑战移动代理的GUI主导范式,提出CLI应同等重要,通过实验证明CLI代理在AndroidWorld和MobileWorld上超越GUI基线,并引入CLI-Advantage任务套件展示其优势。

详情
AI中文摘要

近期移动代理的进展主要由GUI范式主导,其中代理感知UI信息并发出屏幕交互。然而,移动平台也提供了命令行接口(CLI),可直接访问设备服务和数据。我们认为CLI应与GUI同等重要。我们在AndroidWorld和MobileWorld上,使用四种模型API评估了三个编码代理(Claude Code、Terminus-2、mini-swe-agent),未进行任何移动特定后训练,并与三个可复现的GUI基线(GUI-Owl-1.5-32B、MAI-UI、Qwen3-VL-32B)进行比较。Claude Code(Opus 4.7)达到71.8%和51.9%,优于所有可复现的GUI基线(AndroidWorld上69.3/68.1/57.8%;MobileWorld上43.2/26.3/13.3%),而其他CLI配置也保持竞争力。为确立该范式的上限,我们提供了oracle CLI解决方案,在AndroidWorld上达到88.8%(103/116个任务可CLI解决),在MobileWorld上达到86.3%(101/117个任务可CLI解决),表明未来有大量改进空间。为覆盖GUI范围之外的日常用户意图,我们引入了CLI-Advantage任务套件,包含五个类别的45个模板:批量操作、多条件过滤、聚合、跨应用工作流和隐藏设备状态。所有CLI代理在所有五个类别中均优于所有GUI基线,且每个任务步骤显著更少(10.7步 vs. 18.6步)。为支持未来移动CLI代理的研究,我们将开源代理实现、oracle解决方案、CLI-Advantage套件和评估基础设施。

英文摘要

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

2606.19387 2026-06-19 cs.SE cs.AI 新提交

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

可解释且可验证的硬件生成:基于LLM驱动的逐步细化

You Li, Samuel Mandell, David Z. Pan

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Fudan University(复旦大学) USA(美国)

AI总结 提出结合LLM创造力与形式化方法可解释性的硬件生成框架,通过迭代应用变换规则将设计规范转换为正确性有保证的RTL程序。

详情
AI中文摘要

大型语言模型(LLM)在软件开发中取得了显著成功。然而,它们容易产生幻觉,即可能引入微妙的语义和逻辑错误。由于芯片设计和制造的高风险,硬件工程师仍不愿依赖LLM进行寄存器传输级(RTL)生成。本文提出一种硬件生成框架,结合了LLM的创造力和广泛知识与形式化方法的可解释性和数学严谨性。具体而言,我们设计了一组覆盖各种设计决策和硬件特征的变换规则。通过迭代应用这些规则,LLM代理可以将设计规范转换为正确性有保证的RTL程序。实验结果证明了该框架的有效性和效率。

英文摘要

Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.

2606.19386 2026-06-19 cs.SE cs.AI cs.LG 新提交

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

通过构造实现双稳态:挂钟校准的状态监视器在代理节奏下没有瞬间检测机制

Manvendra Modgil

发表机构 * Modint Intelligence(Modint智能科技)

AI总结 本文发现挂钟校准的泄漏积分器监视器在代理流中无法作为瞬间检测器工作,揭示了校准类别的关键影响,并提出了上升沿触发作为替代方案。

Comments 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20). Repo: https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

详情
AI中文摘要

自主代理的运行时监视器通常对累积的内部状态(行为基线、漂移统计量,或在我们之前工作中的建模情感状态)设置阈值。我们之前报告了一个状态饱和陷阱:在连续情感引擎上基于阈值的状态触发在SWE-bench调试代理(Modgil 2026)上变成了近乎恒定的警报。发布后审计发现引擎在动作之间接收到的dt=0,因此其指数衰减从未运作:已发布的陷阱是一个纯累加器的结果。我们更正了记录(勘误,v2)并将该缺陷视为一个实验。它揭示的关键变量是监视器的动态是在样本时间(每次观测,如CUSUM)还是挂钟时间(半衰期以秒计,如情感模型和EMA基线)校准的。在固定速率流上两者一致;在代理流上,动作间时间变化几个数量级,它们不一致。在20条轨迹上对均匀间隔(dt在{0..600}秒内)的预注册扫描显示,挂钟水平触发器有两个机制:在dt<=1秒时恒定警报(20/20;中位数18次触发);在dt>=60秒时静默。每个关键dt位于(1,30]秒内。真实代理运行测量延迟中位数为1.53秒(p90 2.33秒);真实编码节奏位于陷阱机制内,在修正机制下证实了经验发现。该结构是校准类别的属性,而非引擎:在原始误差流上的最小挂钟累加器重现了相同的悬崖,而相同流上的样本时间CUSUM恰好是dt不变的(20/20)。带有滞后的上升沿触发器在每个条件下每条轨迹触发0-3次。我们得出结论,挂钟校准的泄漏积分器监视器在代理流上不存在作为瞬间检测器的机制;转换检测在每个节奏下都逃脱了陷阱,但无法恢复人工干预时机。

英文摘要

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.
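The two calibration classes contrasted in the abstract can be illustrated with a toy pair of monitors. This is a minimal sketch, not the paper's engine: the half-life, threshold, and slack values are arbitrary, the error stream is synthetic, and the function names `wall_clock_alarms` and `cusum_alarms` are hypothetical. The wall-clock monitor decays its state by elapsed seconds between actions; the sample-time CUSUM never sees `dt` at all.

```python
from typing import Sequence


def wall_clock_alarms(errors: Sequence[float], dt: float,
                      half_life: float = 10.0, threshold: float = 3.0) -> int:
    """Leaky integrator whose decay is calibrated in seconds (wall-clock time).

    Counts the steps at which the accumulated state is at or above the threshold
    (a level trigger). With dt = 0 the decay factor is 1, i.e. a pure accumulator.
    """
    decay = 0.5 ** (dt / half_life)
    state, alarms = 0.0, 0
    for error in errors:
        state = state * decay + error
        if state >= threshold:
            alarms += 1
    return alarms


def cusum_alarms(errors: Sequence[float], slack: float = 0.5,
                 threshold: float = 3.0) -> int:
    """One-sided CUSUM calibrated per observation; it takes no dt argument."""
    s, alarms = 0.0, 0
    for error in errors:
        s = max(0.0, s + error - slack)
        if s >= threshold:
            alarms += 1
            s = 0.0  # restart after signalling
    return alarms


if __name__ == "__main__":
    stream = [1.0] * 10  # synthetic error stream, one unit of error per action
    for dt in (0.0, 1.0, 60.0, 600.0):
        print(dt, wall_clock_alarms(stream, dt))
    print("cusum", cusum_alarms(stream))
```

On the same stream, the wall-clock monitor alarms almost constantly at small `dt` and goes silent at large `dt`, while the CUSUM count cannot change with cadence because cadence is not one of its inputs.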

2606.19382 2026-06-19 cs.SE cs.AI New submission

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

DynAMO:基于拓扑多智能体调度的动态资产管理编排

Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

Affiliations * Gati Shakti Vishwavidyalaya; IBM Research

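AI summary Proposes the DynAMO engine, which uses a Plan-then-Execute architecture to generate verifiable workflow graphs and supports both sequential and parallel execution. By dynamically identifying independent tasks it improves efficiency, achieving a 1.6x latency reduction on an industrial benchmark while preserving correctness and safety.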

Comments 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: https://github.com/kushwaha001/DynAMO

Details
AI Chinese abstract

虽然基于LLM的智能体为工业资产生命周期提供了端到端自动化,但现实世界中的工业4.0部署受到延迟、并发不稳定性和安全风险的阻碍。我们提出了DynAMO(动态资产管理编排),一个部署就绪的引擎,采用先规划后执行架构来生成可验证的工作流图。DynAMO支持顺序工作流(拓扑执行)和并行工作流(依赖感知并发)。通过动态识别独立任务,DynAMO在保持结构正确性和安全性的同时,通过受控推理重叠显著提高效率。在AssetOpsBench工业基准上的六项受控实验中,DynAMO展示了显著的性能和鲁棒性提升。并行执行相比顺序编排将端到端延迟中位数降低了1.6倍,在高度可并行化的工作流上达到1.8倍。在外部工具调用中加入实际延迟后,延迟分解显示LLM推理和编排仍占执行时间的90%以上,表明模型推理是主要系统瓶颈。结构化上下文剪枝将推理延迟降低约30%,并且DynAMO在受控故障注入下保持正确的功能行为(任务完成、智能体排序和输出质量),同时表现出优雅降级。可重复性分析进一步证实了重复运行下的稳定执行,并行调度降低了延迟方差。这些发现确立了DynAMO作为工业4.0自动化流水线中可扩展、安全且延迟感知的智能体部署的实用蓝图。代码可在以下网址获取:https://github.com/kushwaha001/DynAMO

English abstract

While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO
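The idea of dependency-aware concurrency over a workflow graph can be sketched with Python's standard `graphlib` module. This is an illustration only, not DynAMO's scheduler: the `parallel_batches` helper and the task names in `workflow` are hypothetical, and the sketch only groups tasks into batches without actually running agents.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group tasks into batches whose members have all prerequisites completed.

    `dependencies` maps each task to the set of tasks it depends on.
    Tasks inside one batch are mutually independent and could run concurrently.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for reproducible output
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch finished to unlock successors
    return batches


# Hypothetical asset-management workflow (task names are made up for illustration).
workflow = {
    "fetch_sensor_data": set(),
    "fetch_work_orders": set(),
    "detect_anomaly": {"fetch_sensor_data"},
    "draft_report": {"detect_anomaly", "fetch_work_orders"},
}

batches = parallel_batches(workflow)
sequential_order = [task for batch in batches for task in batch]
print(batches)
print(sequential_order)
```

Flattening the batches gives one valid topological order for sequential execution, while running each batch concurrently corresponds to the parallel mode. How much latency this saves depends on how many tasks share a batch, which is consistent with the abstract's observation that gains are larger on highly parallelizable workflows.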

2606.19380 2026-06-19 cs.SE cs.LG New submission

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

AgentArmor:编码代理失败的框架、评估与缓解

Kenneth Ge, Andre Assis

Affiliations * Anthropic Fellows Program; Constellation

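AI summary Proposes the AgentArmor framework, which mitigates coding-agent failures caused by underspecification, capability errors, and harness errors through an extended system prompt, a command classifier, a "3 strikes" policy, and related mechanisms, improving safety across a statistically significant number of samples.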

Details
AI Chinese abstract

软件工程和部署正越来越多地委托给AI编码代理。它们的广泛采用暴露了罕见但极具破坏性的失败模式。在本文中,我们研究这些失败模式源于三种不同的机制:规范不足,即默认模型行为不安全;能力错误,即安全动作可用但模型因偏见或能力限制而未遵循;以及代理工具错误,即模型未能通过工具执行安全动作。我们在8个不同的评估中评估这些机制,每个评估都受实际部署失败的启发,总计20个编码环境和59个合成转录模板。基于此评估,我们提出AgentArmor,一种代理工具修改,以缓解这些错误。通过添加扩展的系统提示、单独的命令分类器、“三振”策略、确定性护栏以及代理编辑自身上下文的工具,我们证明AgentArmor在统计显著数量的样本上更安全。因此,我们为当前编码代理提出具体缓解措施,并为未来代理工具功能提出设计理念。

English abstract

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
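To make the combination of a deterministic guardrail and a strike counter concrete, here is a minimal sketch. The abstract does not specify how AgentArmor's "3 strikes" policy or its guardrails work, so everything below is an assumption: the `ThreeStrikesGuard` class, the deny-list in `DESTRUCTIVE_PATTERNS`, and the rule that the third blocked command halts the session are all hypothetical.

```python
import re

# Hypothetical deny-list; a real harness would need a far more careful policy.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-(rf|fr)\b"),
    re.compile(r"\bgit\s+push\b.*--force"),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
]


class ThreeStrikesGuard:
    """Block commands matching the deny-list and halt after repeated violations."""

    def __init__(self, max_strikes=3):
        self.max_strikes = max_strikes
        self.strikes = 0

    def check(self, command):
        """Return 'allow', 'block', or 'halt' for a proposed shell command."""
        if self.strikes >= self.max_strikes:
            return "halt"  # session already halted; nothing further is allowed
        if any(p.search(command) for p in DESTRUCTIVE_PATTERNS):
            self.strikes += 1
            # Assumed policy: the final strike escalates from blocking to halting.
            return "halt" if self.strikes >= self.max_strikes else "block"
        return "allow"


if __name__ == "__main__":
    guard = ThreeStrikesGuard()
    for cmd in ["ls -la", "rm -rf build", "git push origin main --force",
                "DROP TABLE users", "ls"]:
        print(cmd, "->", guard.check(cmd))
```

The check is deterministic by construction, which is the property that distinguishes a guardrail of this kind from a learned command classifier; the two could plausibly be layered, with the classifier handling cases a fixed pattern list cannot express.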