Data filtering methods for training language models
Egor Shevchenko, Elena Bruches
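AI summary: This paper compares two automatic label error detection methods, Confident Learning and Dataset Cartography, on Russian text classification tasks. It finds that their effectiveness depends on dataset characteristics, and that Confident Learning significantly improves F1-macro on small, high-noise datasets.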
- Comments: AINL-2026
Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods, Confident Learning and Dataset Cartography, on three Russian text classification corpora that differ in size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We fine-tune the pre-trained rubert-base-cased model on each corpus. To verify that filtering is meaningful, we run control experiments that randomly remove an equal number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography behaves more conservatively, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming that the filtering is meaningful.
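The abstract does not describe implementation details, so the following is only a minimal sketch of the core ideas behind the two methods and the random-removal control, written against their general published descriptions rather than the authors' code. All function names (`confident_learning_flags`, `cartography_stats`, `random_removal_mask`) and the toy arrays are hypothetical. The sketch assumes out-of-sample predicted probabilities are already available, that every class has at least one labeled example, and it stops short of a removal rule for Dataset Cartography because the abstract does not state one.

```python
import numpy as np


def confident_learning_flags(pred_probs, labels):
    """Flag likely label errors with per-class confidence thresholds.

    pred_probs: (n, k) out-of-sample predicted probabilities.
    labels: (n,) given integer labels in [0, k); every class must occur.
    """
    n, k = pred_probs.shape
    # Threshold for class j: mean predicted P(j) over examples labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(k)]
    )
    # Only probabilities that clear their class threshold count as confident.
    above = pred_probs >= thresholds
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)
    has_confident_class = above.any(axis=1)
    # Suspect: confidently assigned to a class other than the given label.
    return has_confident_class & (confident_class != labels)


def cartography_stats(gold_probs_per_epoch):
    """Per-example training dynamics.

    gold_probs_per_epoch: (epochs, n) probability assigned to the given
    label after each training epoch.
    Returns (confidence, variability): mean and std across epochs.
    """
    confidence = gold_probs_per_epoch.mean(axis=0)
    variability = gold_probs_per_epoch.std(axis=0)
    return confidence, variability


def random_removal_mask(n, n_remove, seed=0):
    """Control condition: mark n_remove examples uniformly at random."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_remove, replace=False)] = True
    return mask


# Toy data (made up for illustration): the last example is labeled 1
# although the model is confident it belongs to class 0.
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.95, 0.05],
])
labels = np.array([0, 0, 1, 1, 1])

flags = confident_learning_flags(pred_probs, labels)

# Control: remove the same number of examples, chosen at random.
control = random_removal_mask(len(labels), int(flags.sum()))

# Toy training dynamics for two examples over three epochs.
gold_probs = np.array([
    [0.6, 0.2],
    [0.8, 0.2],
    [1.0, 0.2],
])
confidence, variability = cartography_stats(gold_probs)
```

In a comparison like the one described, the model would be retrained once on the data with flagged examples removed and once with the `control` mask applied, so that any difference in F1-macro can be attributed to which examples were removed rather than how many.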