Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems
Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei, Yifan Wu
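AI summary: The paper investigates whether latent representations can carry attack information and proposes a framework that reactivates attack effects through latent interventions. Experiments show that the resulting latent attacks substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs.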
Details
- Comments: 27 pages, 7 figures, 3 tables. Preprint
Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.