Abstract
This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as the gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Moreover, the gold labels are the product of annotation decisions, clustering procedures, and threshold settings, which may limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of benchmark design, we argue that the small, curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document preprocessing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.