大模型对齐与安全 - arXivDaily 专题

2606.19660 2026-06-19 cs.CR cs.CL 新提交 90%

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

基于RAG的聊天机器人中针对提示注入的分层安全框架

Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan

专题命中提示注入：三层防御框架对抗RAG聊天机器人中的提示注入

AI总结提出三层防御框架，通过输入过滤、上下文指令层级和输出审计，将提示注入攻击成功率从71.4%降至11.3%，误报率4.8%，延迟开销61.2毫秒。

Comments Submitted in ICCK Transactions on Information Security and Cryptography

详情

AI中文摘要

提示注入被OWASP Top 10 for LLM Applications列为大语言模型（LLM）部署中最关键的漏洞，然而现有防御措施仅在孤立的流水线阶段运行且不完整。输入过滤器无法检查检索到的文档，而输出监控器无法阻止恶意载荷到达模型。因此，检索增强生成（RAG）聊天机器人仍然容易受到间接注入攻击，其中被污染的知识库文档会损害每个检索到它的用户。我们提出了一个三层框架，在推理流水线中拦截直接和间接的提示注入。第一层使用基于规则的模式库和微调后的语义异常分类器筛选用户输入。第二层在上下文组装期间强制执行基于来源的指令层级，防止检索到的内容覆盖操作员策略。第三层在交付前使用策略规则引擎和语义漂移检测器审计模型输出。一个持续审计循环聚合结构化日志，并支持重新训练以适应新兴攻击模式。该框架与模型无关，作为中间件部署，无需修改底层LLM。在GPT-4o、Llama 3和Mistral 7B上对5,080个样本的评估显示，该框架将攻击成功率（ASR）从71.4%降至11.3%，比最佳单层基线高出27.3个百分点，比已发布的护栏系统高出23.8个百分点，同时保持4.8%的误报率和61.2毫秒的中位延迟开销。消融研究证实，所有三层提供互补保护，且其组合效果超过单个贡献的总和。

英文摘要

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4\% to 11.3\%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8\% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.

URL PDF HTML ☆

赞 0 踩 0