One Token to Fool LLM-as-a-Judge
一个令牌就能欺骗LLM裁判
Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu
AI总结 发现基于参考的生成式奖励模型易受奖励黑客攻击,表面输入(如非词符号或通用推理开头)能持续引发假阳性奖励,提出使用截断模型输出作为对抗性负例的数据增强策略,构建鲁棒的Master奖励模型。
详情
大型语言模型(LLM)越来越被信任作为自动裁判,协助评估并为训练其他模型提供奖励信号,特别是在基于参考的设置中,如带可验证奖励的强化学习(RLVR)。然而,我们揭示了即使在这种基于参考的范式中也存在一个关键漏洞:生成式奖励模型系统性地容易受到奖励黑客攻击。我们发现,表面输入——我们称之为“万能钥匙”,例如非词符号(如“:”或“.”)或通用推理开头(如“思考过程:”或“让我们逐步解决这个问题。”)——可以在没有任何实质性推理的情况下持续引发假阳性奖励。我们的系统评估表明,这是一个广泛存在的失败,影响多种模型,包括领先的专有系统如GPT-o1和Claude-4。这些结果挑战了LLM裁判假定的鲁棒性,并对其可靠性构成重大威胁。为了解决这个问题,我们提出了一种简单而有效的数据增强策略,使用截断的模型输出作为对抗性负例。由此产生的Master奖励模型(Master-RMs)在对这些“万能钥匙”攻击方面表现出最先进的鲁棒性,同时在标准评估设置中保持高性能。我们通过跨模型规模、提示变化和常见推理时策略的漏洞全面分析来补充这些发现,为未来关于鲁棒LLM评估的研究提供见解。我们在https://this.url 和 https://this.url 发布我们的鲁棒通用领域奖励模型和合成训练数据。
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term ''master keys'' such as non-word symbols (e.g., '':'' or ''.'') or generic reasoning openers (e.g., ''Thought process:'' or ''Let's solve this problem step by step.''), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these ''master key'' attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.