Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction
Hongtao Wang, Se Yang, Yu Chen, Puzhuo Liu
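AI summary: Proposes MemPoison, an attack that uses a semantic relational bridge, entity masquerading, and joint embedding optimization to bypass selective memory mechanisms and inject triggerable backdoors into an LLM agent's long-term memory, achieving attack success rates up to 0.95.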
Details
- Comments: 19 pages, 12 figures
Large language model (LLM) agents increasingly rely on long-term memory to support persistent, autonomous task execution. However, this capability also introduces a new attack surface: memory poisoning, in which adversaries inject malicious information to influence future behavior. Existing memory poisoning attacks often assume that injected content can be stored directly in memory, overlooking the selective extraction and rewriting stages in modern memory pipelines. This assumption makes prior methods ineffective in realistic settings. In this paper, we propose MemPoison, a novel memory poisoning attack that bypasses selective memory mechanisms in LLM agents. Through dialogue interactions alone, an attacker can inject triggerable backdoors into the agent's long-term memory and thereby mislead its subsequent responses. MemPoison introduces three key components: (i) a semantic relational bridge that binds the trigger and payload into a coherent statement so that they are extracted into memory together; (ii) entity masquerading, which optimizes triggers to mimic named entities and thus resist rewriting; and (iii) joint embedding optimization, which shapes trigger-injected texts into a tight cluster in the embedding space while keeping them isolated from benign embeddings for stealth. Evaluations across different agent domains and memory mechanisms show that MemPoison achieves attack success rates of up to 0.95, outperforming existing baselines. Mechanistic analysis indicates that the attack exploits embedding-space anisotropy and shifts attention patterns, highlighting core vulnerabilities in selective memory systems. We evaluate multiple defense strategies and demonstrate their fundamental limitations in mitigating the attack.
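The "tight cluster, isolated from benign embeddings" property in component (iii) can be made concrete with a small diagnostic. The sketch below is not the paper's optimization objective or evaluation code; it is one plausible way an auditor might measure the two quantities over a memory store. The function names (`tightness`, `isolation`) and the toy 3-dimensional vectors are assumptions made purely for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tightness(cluster):
    """Mean pairwise cosine similarity inside a group of embeddings.

    Values close to 1 mean the group forms a tight cluster.
    """
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def isolation(cluster, benign):
    """One minus the highest cosine similarity to any benign embedding.

    Values close to 1 mean the group sits far from benign entries.
    """
    return 1.0 - max(cosine(u, v) for u in cluster for v in benign)


# Toy embeddings (hypothetical, hand-written; real ones come from an encoder).
suspect = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]
benign = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.4], [0.1, 0.2, 1.0]]

print(f"tightness = {tightness(suspect):.3f}")
print(f"isolation = {isolation(suspect, benign):.3f}")
```

A group of memory entries that scores high on both measures matches the geometric signature the abstract describes. The abstract also reports that the evaluated defenses have fundamental limitations, so a check of this kind should be read as an illustration of the geometry, not as a proven countermeasure.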