Local Coverage Governs Memorization in Diffusion Models
Claudia Merger, Sebastian Goldt
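AI summary: Using the connection between diffusion models and kernel density estimation, the paper finds that memorization is dominated by local data coverage: isolated samples in low-coverage regions are memorized, while high-coverage regions support interpolation and generalization.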
Memorization in diffusion models is often treated as a global property of the model or dataset. In practice, however, a single diffusion model can simultaneously generate both memorized and novel samples. Which training samples are most likely to be memorized? In this work, we show that memorization is governed by local data coverage. Leveraging the connection between diffusion models and kernel density estimation (KDE), we derive a theoretical criterion that predicts whether a point is memorized based on the density of training data in its neighborhood and the size of the training dataset. In the high-dimensional limit, this leads to a sharp, local transition: regions of low coverage are dominated by isolated training samples, which are memorized, while dense regions support interpolation and generalization. We validate these predictions empirically, showing that memorization increases with local sparsity and that diffusion models exhibit a coexistence of memorized and novel samples within the same model. Extending this framework to multi-class settings, we further show that classes with higher intra-class sparsity (and thus lower local coverage) are more strongly memorized. Our results provide a local view of memorization in diffusion models, explaining when and where memorization occurs in terms of data geometry.
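The abstract does not state the criterion itself, so the following is only a toy illustration of the diffusion-KDE connection it builds on, not the authors' method. It relies on a standard fact: for Gaussian noise of scale sigma, the ideal denoiser for the empirical training distribution is a softmax-weighted average of the training points, with the same weights as a Gaussian KDE of bandwidth sigma. All names (`posterior_weights`, `kde_denoiser`) and all parameter values (dimension, cluster spread, sigma) are assumptions chosen for illustration.

```python
import numpy as np


def posterior_weights(x, train, sigma):
    """Softmax weights of each training point given a noisy query x.

    These are the responsibilities of a Gaussian KDE with bandwidth sigma.
    """
    sq_dist = np.sum((train - x) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma**2)
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def kde_denoiser(x, train, sigma):
    """Posterior mean of the clean sample: weighted average of training points."""
    return posterior_weights(x, train, sigma) @ train


rng = np.random.default_rng(0)
d = 20          # data dimension (illustrative assumption)
sigma = 0.5     # noise scale / KDE bandwidth (illustrative assumption)

# A densely covered region: 200 points in a tight cluster around the origin.
dense = rng.normal(0.0, 0.05, size=(200, d))
# A poorly covered region: one isolated training point far from the cluster.
isolated = np.full((1, d), 3.0)
train = np.vstack([dense, isolated])

# Noisy queries near the isolated point and near a point in the dense cluster.
x_iso = isolated[0] + sigma * rng.normal(size=d)
x_dense = dense[0] + sigma * rng.normal(size=d)

w_iso = posterior_weights(x_iso, train, sigma)
w_dense = posterior_weights(x_dense, train, sigma)

# Low coverage: one training point carries essentially all the weight,
# so the denoiser returns that training point (memorization).
print("max weight near isolated point:", w_iso.max())
# High coverage: the weight is shared among many neighbours,
# so the denoiser returns a blend of them (interpolation).
print("max weight in dense cluster:  ", w_dense.max())

denoised_iso = kde_denoiser(x_iso, train, sigma)
print("distance from denoised output to isolated point:",
      np.linalg.norm(denoised_iso - isolated[0]))
```

In this toy setup, the largest posterior weight serves as a rough proxy for local coverage: a value close to 1 means a single training sample dominates the denoiser output, while a small value means many neighbours contribute. The paper's criterion, which depends on neighborhood density and dataset size, may be defined differently.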