Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
Dylan Feng, Pragya Srivastava, Anca Dragan, Cassidy Laidlaw
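AI summary: To address safety and alignment failures of large language models in out-of-distribution situations, the authors propose the MOOD benchmark and show that combining guard models with OOD detectors improves monitoring recall.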
Details
Many safety and alignment failures of large language models (LLMs) occur in out-of-distribution (OOD) situations: unusual prompt or response patterns that model developers did not foresee. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including in MOOD a restricted training set, which we use to train our own monitors, as well as seven test sets with diverse alignment failures that lie outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To address this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that combining a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model sizes for monitors that combine a guard model with an OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem. We release the code and data for our experiments publicly at https://github.com/Dylan102938/mood-bench.
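To make the idea of pairing a guard model with OOD detectors concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each input has already been reduced to a small feature vector, a guard-model score in [0, 1], and a list of per-token log-probabilities. The function names, the thresholds, the ridge term, and the OR rule used to combine the signals are all hypothetical choices made for illustration; the abstract does not specify how the paper combines them.

```python
import math

def fit_gaussian(features):
    """Estimate mean and covariance from in-distribution feature vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in features) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Assumption of this sketch: a small ridge term keeps the covariance invertible.
    for i in range(d):
        cov[i][i] += 1e-6
    return mean, cov

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def mahalanobis(x, mean, cov):
    """Mahalanobis distance sqrt((x - mean)^T cov^{-1} (x - mean))."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    y = solve(cov, diff)  # y = cov^{-1} diff, without forming the inverse
    return math.sqrt(max(0.0, sum(di * yi for di, yi in zip(diff, y))))

def perplexity(token_logprobs):
    """Perplexity as exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def monitor_flags(guard_score, maha, ppl, guard_thr=0.5, maha_thr=3.0, ppl_thr=50.0):
    """Hypothetical OR rule: flag if the guard model or either OOD detector fires."""
    return guard_score >= guard_thr or maha > maha_thr or ppl > ppl_thr

# Toy usage with made-up in-distribution features.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mean, cov = fit_gaussian(train)
far = mahalanobis([10.0, 10.0], mean, cov)
flagged = monitor_flags(guard_score=0.1, maha=far, ppl=5.0)
```

In this toy setup the guard score alone (0.1) would not trigger a flag, but the large Mahalanobis distance of the far-away point does, which illustrates how an OOD detector can recover failures that a guard model misses. In practice the thresholds would need to be calibrated on held-out data.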