ShellGames: Speculative LLM-Driven SSH Deception
Umberto Salviati, Fabio De Gaspari, Mauro Conti, Luigi Vincenzo Mancini
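AI summary: To address the limitations of LLMs in deception systems, such as the lack of persistent state and inconsistent outputs, the authors propose ShellGames, which combines five techniques, including chain-of-thought prompting, memory management, and speculative execution, and significantly outperforms baselines on correctness, consistency, state tracking, and robustness.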
Cyber deception and Moving Target Defense are promising strategies that aim to disrupt adversaries by increasing uncertainty. However, sustaining long-lived, credible interactive sessions with adversaries remains an open challenge. Large Language Models (LLMs) offer a promising path toward more dynamic deception systems, but suffer from key limitations that fundamentally restrict their applicability: lack of persistent state, output inconsistencies, hallucinations, latency, and susceptibility to behavioral subversion that may reveal the deception. We propose ShellGames, an LLM-based SSH shell simulator designed to address these limitations. ShellGames combines five complementary techniques: (i) Automatic Chain-of-Thought and few-shot learning to improve correctness; (ii) memory management to maintain system state coherency; (iii) speculative command execution to reduce response latency; (iv) smart routing of complex interactive commands to a sandboxed environment; and (v) subversion detection leveraging the constrained input-output domain of shell environments. To enable systematic evaluation, we introduce a standardized benchmarking protocol and dataset spanning correctness, consistency, state tracking, and robustness tasks. ShellGames achieves $0.898$ command accuracy on correctness ($+5.3$ pp over baselines), $0.918$ sequence-level accuracy on consistency ($+36$ pp), $0.98$ state tracking accuracy ($+18.3$ pp), and $0.95$ accuracy on robustness ($+37$ pp). A user study with $n=20$ participants confirms that ShellGames achieves realism comparable to a real shell under free exploration and outperforms traditional honeypots on perceived command coverage.
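To make technique (iii) more concrete, the following is a minimal sketch of what speculative command execution could look like. It is not the ShellGames implementation, which the abstract does not detail. The class `SpeculativeShell`, the callbacks `generate` and `predict_next`, and the `MUTATING` command list are all hypothetical names introduced for illustration. The sketch assumes that `generate` stands in for a slow LLM call and that only read-only commands are safe to pre-compute.

```python
from typing import Callable, Dict, Iterable

# Assumption: a small, hand-picked set of commands that change shell state.
MUTATING = {"cd", "rm", "mv", "cp", "mkdir", "touch", "export"}


def is_mutating(command: str) -> bool:
    """Return True if the first word of the command is in MUTATING."""
    words = command.split()
    return bool(words) and words[0] in MUTATING


class SpeculativeShell:
    """Toy illustration of speculative execution (hypothetical design)."""

    def __init__(
        self,
        generate: Callable[[str], str],
        predict_next: Callable[[str], Iterable[str]],
    ) -> None:
        self.generate = generate          # slow backend, stands in for an LLM call
        self.predict_next = predict_next  # guesses likely follow-up commands
        self.cache: Dict[str, str] = {}   # pre-computed outputs keyed by command
        self.hits = 0

    def handle(self, command: str) -> str:
        # Serve a pre-computed output when the guess was right.
        if command in self.cache:
            output = self.cache.pop(command)
            self.hits += 1
        else:
            output = self.generate(command)
        # A state-changing command makes earlier speculative outputs stale.
        if is_mutating(command):
            self.cache.clear()
        self._speculate(command)
        return output

    def _speculate(self, last_command: str) -> None:
        # Run synchronously here for simplicity; a real system would
        # presumably do this in the background while the user is typing.
        for guess in self.predict_next(last_command):
            if is_mutating(guess):
                continue  # never pre-run commands that would change state
            if guess not in self.cache:
                self.cache[guess] = self.generate(guess)
```

In this sketch, a correct guess turns the next request into a dictionary lookup, while a wrong guess costs only the wasted backend call. Clearing the cache after any state-changing command is a deliberately conservative choice to avoid serving output that contradicts the simulated state.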