arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

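AI Large Models

Large Model Alignment and Safety

Large model alignment, safety, jailbreaking, red teaming, prompt injection, and trustworthy evaluation.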

Indexed today (current date): 4. Signal sources: cs.CL, cs.AI, cs.CY, cs.LG
2605.17986 2026-06-18 cs.CR cs.AI Version update 95%

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

Lei Zhao, Abhay Bhaskar, Edgar Dobriban

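Affiliation: University of Pennsylvania

Topic match: Prompt injection. Benchmarks AI agents against indirect prompt injection; the core concern is security.

AI summary: Proposes the LivePI benchmark, covering 7 input surfaces, 12 attack/rendering families, and 5 malicious goals. Evaluates multiple AI agents in a real virtual-machine environment, finds total attack success rates of 10.7% to 29.6%, and validates the effectiveness of a two-layer defense.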

Abstract

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email, downloaded files, webpages, repositories, or group-chat messages. Existing evaluations are often small, purely simulated, or focused on a narrow set of channels. We introduce LivePI (Live Prompt Injection), a structured benchmark for IPI risk in a production-like but test-controlled environment. LivePI covers seven input surfaces, twelve attack/rendering families, and five malicious goals, including protected-information exfiltration, unauthorized security-control changes, unsafe code retrieval or execution, inbox-summary exfiltration, and cryptocurrency transfer. We run LivePI on a real virtual machine with live but test-controlled email, chat, web, local-file, repository, and wallet interfaces. Across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5, total attack success rates range from 10.7% to 29.6%. Group-chat injection is uniformly successful across the evaluated backbones in our deployment, and repository-link attacks produce high-severity failures despite a small denominator. We also evaluate a two-layer defense consisting of prompt-level filtering and pre-execution tool-call authorization. In the GPT-5.3-Codex setting, the defense intercepts all tested malicious-goal completions in LivePI before execution while preserving benign utility on PinchBench-derived workloads.
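The two-layer defense named in the abstract (prompt-level filtering plus pre-execution tool-call authorization) can be pictured with a minimal Python sketch. The abstract does not describe the implementation, so everything here is an assumption made for illustration: the regex patterns, the `SENSITIVE_TOOLS` set, the `ToolCall` type, and the rule that a flagged input blocks all sensitive calls.

```python
import re
from dataclasses import dataclass, field

# Layer 1 (hypothetical patterns): crude textual markers of injected instructions.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE),
    re.compile(r"do not (tell|inform) the user", re.IGNORECASE),
]

# Layer 2 (hypothetical policy): tools that can cause hard-to-reverse harm.
SENSITIVE_TOOLS = {"wallet_transfer", "send_email", "run_shell"}


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def is_suspicious(untrusted_text: str) -> bool:
    """Layer 1: flag untrusted input that matches any injection pattern."""
    return any(p.search(untrusted_text) for p in INJECTION_PATTERNS)


def authorize(call: ToolCall, user_approved: set, input_flagged: bool) -> bool:
    """Layer 2: decide, before execution, whether a proposed tool call may run."""
    if call.name not in SENSITIVE_TOOLS:
        return True   # non-sensitive tools always pass in this sketch
    if input_flagged:
        return False  # sensitive call proposed while processing flagged input
    return call.name in user_approved  # otherwise require explicit approval
```

In this sketch the second layer still blocks an unapproved sensitive call even when the first layer misses the injection, which is the usual motivation for layering the two checks.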

2606.18550 2026-06-18 cs.CR New submission 85%

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

Laxmipriya Ganesh Iyer, Rahul Suresh Babu

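Topic match: Prompt injection. Defends against indirect prompt-injection attacks.

AI summary: Targets indirect prompt injection against tool-augmented LLM agents. Proposes ContractGuard, which defends by verifying contract integrity (rather than risk labels) and achieves zero injection success on a controlled benchmark.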

Abstract

Risk-Aware Causal Gating (RACG) defends tool-augmented LLM agents against indirect prompt injection by removing dangerous tools from the agent's visible action space, so that even a fully injection-compliant agent cannot call a tool it cannot see. We make three points. First, this structural guarantee does not eliminate the trust assumption behind safe tool use; it relocates it into the integrity of the tool contracts -- declared preconditions, effects, risk, and authorization -- that the gate reads, so an attacker who corrupts a contract can make the gate mis-decide without ever persuading the agent. Second, forging a tool's effects is strictly more dangerous than tampering with its risk label, because RACG applies a causal gate before its admissibility gate: an off-path tool is never exposed, so risk-relabeling alone fails, whereas effect forgery routes the dangerous tool onto the causal path and succeeds. Effect integrity, not the risk label, is the load-bearing assumption. Third, we introduce ContractGuard, a verifier between the registry and the gate that layers signed provenance, typed contract attestation, and runtime effect verification; on a controlled benchmark it restores injection success to zero against every modeled attack -- including an exhaustive white-box adaptive attacker -- without over-rejecting honest contracts, and the structural prediction is confirmed on six current-generation hosted models (Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B).
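The abstract's second point can be illustrated with a toy model: relabeling risk leaves an off-path tool off-path, while forging effects moves it onto the causal path, and an integrity check over the whole contract detects both edits. This is a sketch under stated assumptions, not ContractGuard itself. The causal gate is modeled as a simple overlap between declared effects and the task's goal effects, HMAC-SHA256 stands in for the signed-provenance layer, and the contract fields, `REGISTRY_KEY`, and tool names are hypothetical. The admissibility gate is deliberately not modeled.

```python
import hashlib
import hmac
import json

REGISTRY_KEY = b"demo-key"  # hypothetical secret held by the tool registry


def canonical(contract: dict) -> bytes:
    """Serialize a contract deterministically so the tag covers every field."""
    return json.dumps(contract, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(contract: dict, key: bytes = REGISTRY_KEY) -> str:
    """Produce an integrity tag over the full contract (HMAC as a stand-in)."""
    return hmac.new(key, canonical(contract), hashlib.sha256).hexdigest()


def verify(contract: dict, tag: str, key: bytes = REGISTRY_KEY) -> bool:
    """Check the tag in constant time before the gate reads the contract."""
    return hmac.compare_digest(sign(contract, key), tag)


def on_causal_path(contract: dict, goal_effects: set) -> bool:
    """Toy causal gate: expose a tool only if a declared effect serves the goal."""
    return bool(set(contract["effects"]) & goal_effects)


honest = {"name": "transfer_funds", "effects": ["move_money"], "risk": "high"}
tag = sign(honest)

relabeled = dict(honest, risk="low")           # risk-label tampering
forged = dict(honest, effects=["read_calendar"])  # effect forgery
goal = {"read_calendar"}
```

Here `on_causal_path(relabeled, goal)` stays false while `on_causal_path(forged, goal)` becomes true, and `verify` rejects both tampered contracts because the tag was computed over the honest one.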

2606.18530 2026-06-18 cs.CR cs.CL cs.LG New submission 85%

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

Aaditya Pai

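Affiliation: Data Science Institute

Topic match: Prompt injection. Evaluates defenses against domain-camouflaged injection attacks.

AI summary: Evaluates five prompting-based defenses (such as paraphrasing and spotlighting) against domain-camouflaged injection attacks across three model families and three deployment domains. Finds paraphrasing the most effective, reducing camouflage attack success rate by 55-84%.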

Comments 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026

Abstract

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.
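Two of the prompting-based defenses named in the abstract, spotlighting and prompt sandwiching, can be sketched as plain prompt-construction helpers. This is only an illustration of the general techniques. The abstract does not give the paper's prompts, so the marker character, the `<document>` delimiters, and the reminder wording are assumptions. Paraphrasing is omitted because it needs a model call.

```python
def spotlight(doc: str, marker: str = "^") -> str:
    """Datamarking form of spotlighting: join the words of untrusted text with
    a marker so the model can tell retrieved data apart from instructions."""
    return marker.join(doc.split())


def sandwich(task: str, doc: str) -> str:
    """Prompt sandwiching: restate the trusted task after the untrusted
    document so the last instruction the model reads is the user's."""
    return f"{task}\n\n<document>\n{doc}\n</document>\n\nReminder: {task}"


def sandwich_with_spotlight(task: str, doc: str) -> str:
    """One possible combination of the two defenses (hypothetical pairing)."""
    return sandwich(task, spotlight(doc))
```

For example, `spotlight("wire funds now")` yields `wire^funds^now`. In practice the system prompt would also need to tell the model what the marker means.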

2606.19235 2026-06-18 cs.CR New submission 80%

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

Po-Han Cheng, Chia-Mu Yu, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

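Topic match: Prompt injection. Prompt-injection defense for code contexts.

AI summary: Addresses indirect prompt-injection attacks that code LLMs face when retrieving external code. Proposes CodeSentinel, a three-layer inference-time sanitizer that combines syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis. It achieves 0.80 average node-level F1, outperforming existing methods.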

Abstract

Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-injection surface where attackers hide instructions in comments, strings, identifiers, or decoy code. We propose CodeSentinel, a three-layer inference-time sanitizer. It uses Tree-sitter to extract high-risk model-facing CST nodes, then combines syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node perturbation analysis to detect adversarial and natural-looking semantic triggers. Detected nodes are removed or neutralized before reaching the downstream Code LLM. Across six recent attack families, CodeSentinel achieves 0.80 average node-level F1, outperforming CodeGarrison, DePA, and KillBadCode.
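The abstract names a "CST-guided Dynamic Min-K%" score without defining it. As background only, the plain Min-K% statistic it appears to build on averages the lowest k fraction of per-token log-probabilities in a span, so spans containing tokens the model finds very unlikely receive a low score. The sketch below shows that base statistic; it is not the paper's dynamic, CST-guided variant, and the function name, the default `k`, and the assumption that token log-probabilities are already available are all choices made here.

```python
import math


def min_k_percent_score(token_logprobs: list, k: float = 0.2) -> float:
    """Average the lowest k fraction of token log-probabilities in a span.

    token_logprobs: per-token log-probabilities from some language model,
    assumed to be computed elsewhere (for example, for one comment or string).
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Always keep at least one token so short spans still get a score.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

A sanitizer in this style would compute the score per extracted node and compare it against a threshold; how CodeSentinel sets or adapts that threshold is not stated in the abstract.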