Provable Joint Decontamination for Benchmarking Multiple Large Language Models
Zhenlong Liu, Hao Zeng, Hongxin Wei
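AI summary: This paper proposes a provable decontamination method for benchmarking multiple large language models; a joint selection procedure controls the global contamination rate and makes cross-model comparisons more reliable.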
Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated assumptions. Specifically, JECS computes per-model conformal p-values, aggregates them by the per-item maximum, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. By applying the adaptive Benjamini-Hochberg (BH) procedure to the envelope-rescaled values, we select a benchmark with provable GCR control. Extensive experiments across various models and benchmarks demonstrate that JECS achieves higher power than the max-p baseline while consistently maintaining the target GCR control.
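To make the pipeline in the abstract concrete, the following is a minimal sketch of the simpler max-p baseline only: per-model conformal p-values, aggregation by the per-item maximum, and a plain (non-adaptive) Benjamini-Hochberg step. It is not JECS itself; the envelope reconstruction and the adaptive BH variant are not specified in the abstract and are deliberately left out. The function names, the use of a calibration set of known-contaminated examples, and the convention that a higher score means stronger memorization are all assumptions made for illustration.

```python
import numpy as np


def conformal_pvalues(calib_scores, test_scores):
    """Conformal p-values p_j = (1 + #{i : calib_i <= test_j}) / (n + 1).

    Assumption: calib_scores come from examples known to be contaminated
    for this model, and higher scores mean stronger memorization, so a
    small p-value is evidence that a test item is clean.
    """
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    # side="right" counts the calibration scores <= each test score.
    counts = np.searchsorted(calib, test, side="right")
    return (1.0 + counts) / (calib.size + 1.0)


def max_p(pvals_per_model):
    """Aggregate a (num_models, num_items) array by the per-item maximum."""
    return np.max(np.asarray(pvals_per_model, dtype=float), axis=0)


def benjamini_hochberg(pvals, alpha):
    """Plain BH step-up: return the sorted indices of selected items."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value with alpha * k / m.
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.nonzero(below)[0].max()
    return np.sort(order[: k + 1])


def select_clean_items(calib_scores_per_model, test_scores_per_model, alpha):
    """Max-p baseline: an item is kept only if it looks clean for every model."""
    pvals = [
        conformal_pvalues(calib, test)
        for calib, test in zip(calib_scores_per_model, test_scores_per_model)
    ]
    return benjamini_hochberg(max_p(pvals), alpha)
```

Taking the maximum means an item is selected only when every model's p-value is small, which is why this baseline tends to be conservative; according to the abstract, JECS recovers power by rescaling the max-p values with a conservative envelope before the BH step.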