Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models?
Pawel Batorski, Abtin Pourhadi, Jerzy Sarosiek, Przemyslaw Spurek, Paul Swoboda
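AI summary: Studies how semantically unrelated prompts (spurious prompts) affect the behavior of large language models, proposes a black-box search method for discovering such prompts, and shows that they can significantly influence model outputs across multiple benchmarks and models.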
Details
Large language models are highly sensitive to prompts, but this sensitivity is usually studied through task-relevant instructions, demonstrations, or reasoning cues. In this paper, we study a different form of prompt sensitivity: whether prompts that are semantically unrelated to the task can nevertheless steer model behavior. We call such prompts spurious prompts and show that they are surprisingly effective. We also propose a simple black-box search procedure for discovering them. Across reasoning and question-answering benchmarks, using models ranging from 0.8B to 27B parameters and spanning three model families, we show that spurious prompts can improve performance, often matching or outperforming standard prompting baselines and task-aware prompt optimization. We further show that they can steer models toward unintended behaviors, such as repeatedly selecting the first answer option, producing incorrect answers, or returning an even, prime, or small number, without the model being explicitly instructed to do so. These findings reveal a new kind of prompt sensitivity: LLMs can be systematically steered by prompts that are unrelated to the task they are asked to solve. Our code is available at https://github.com/Batorskq/spurious
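
The abstract does not describe the search procedure in detail, so the following is only a minimal sketch of what a black-box prompt search might look like, not the authors' method. It assumes a simple hill-climbing loop that mutates one word at a time and keeps a candidate only when a score improves. The names `black_box_prompt_search`, `toy_score`, and `VOCAB` are hypothetical and are not taken from the authors' repository. In a real setting the score function would query the model with the candidate prompt prepended to each development question and return the measured accuracy (or the rate of a target behavior); here a toy stand-in is used so the example runs without a model.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical vocabulary of task-unrelated words (an assumption for illustration).
VOCAB = ["violet", "harbor", "seven", "lantern", "quiet", "maple"]


def toy_score(prompt: str) -> float:
    """Stand-in for a black-box evaluation.

    A real score would run the LLM on a dev set with `prompt` prepended and
    return accuracy. This toy version just rewards the word "violet".
    """
    words = prompt.split()
    return words.count("violet") / len(words)


def black_box_prompt_search(
    vocab: Sequence[str],
    score: Callable[[str], float],
    n_iters: int = 300,
    prompt_len: int = 4,
    seed: int = 0,
) -> Tuple[str, float]:
    """Hill-climb over word sequences using only score queries (no gradients)."""
    rng = random.Random(seed)
    # Start from a random sequence of vocabulary words.
    best = [rng.choice(vocab) for _ in range(prompt_len)]
    best_score = score(" ".join(best))
    for _ in range(n_iters):
        # Mutate one randomly chosen position.
        candidate = list(best)
        candidate[rng.randrange(prompt_len)] = rng.choice(vocab)
        candidate_score = score(" ".join(candidate))
        # Keep the candidate only if it strictly improves the score.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return " ".join(best), best_score


if __name__ == "__main__":
    prompt, value = black_box_prompt_search(VOCAB, toy_score)
    print(prompt, value)
```

Because the loop only calls `score`, it needs no access to model weights or gradients, which is what makes a search of this kind black-box. The acceptance rule and mutation scheme are design choices of this sketch; other strategies (random search, evolutionary search) would fit the same interface.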