How Should We Measure Empirical Risk when Synthesizing Population Data?
Joshua Snoke
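AI summary: This commentary examines the shortcomings of empirical risk assessment frameworks when synthesizing full population data. It argues that traditional metrics such as membership inference attacks and attribute inference attacks need to be re-examined, and stresses that evaluation methods must be adapted to the specific context.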
Details
Synthetic data has become a prominent solution for preserving privacy while sharing data, but current empirical risk assessment frameworks fundamentally assume a sample-based context, an assumption that does not carry over to the evaluation of synthetic population-level datasets. This commentary explores the implications of synthesizing entire populations in order to do population-level data science, arguing that traditional metrics, such as Membership Inference Attacks (MIA) and Attribute Inference Attacks (AIA), require re-examination. First, MIA may be rendered irrelevant in contexts where population membership is public knowledge or is not considered sensitive information. Second, the risk of singling out is heightened because the confidential data contain full population information. Third, the absence of an "out-of-sample" comparison group for attribute inference means other policies are needed to define which inferences are acceptable. Finally, if the use case is truly to enable population-level data science, we cannot simply fall back on subsampling before generating synthetic data. This commentary highlights the need to consider context when generating and evaluating synthetic population data.