Chinese Word Boundary Recovery through Character Alignment Projection
Lusha Wang, Yuchen Li, Su Yuan, Jungyeul Park
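AI summary: Proposes a two-step, alignment-projection method for recovering word boundaries from noisy sentences and builds two evaluation benchmarks; experiments show that the method corrects many over-segmentation errors.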
Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper formulates Chinese word boundary recovery as an alignment-based projection task. Given a noisy source sentence and a cleaner target counterpart, we first align the two strings at the character level and then project target-side word boundaries back onto the source. Beyond the recovery method itself, we introduce two evaluation resources: a manually checked learner Chinese benchmark based on MuCGEC and a controlled synthetic benchmark derived from the Chinese Penn Treebank. Experiments show that direct segmentation remains vulnerable to compound fragmentation in learner input, whereas the proposed two-step projection method corrects many over-segmentation errors by using the corrected target to recover source-side word spans. The results show that word boundary recovery is distinct from ordinary segmentation and that alignment projection provides a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.
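The two steps described above (character alignment, then boundary projection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the aligner, so the standard-library `difflib.SequenceMatcher` stands in for it here, and the function name `project_boundaries`, the proportional mapping for substituted characters, and the rule of attaching unaligned source characters to the preceding word are all assumptions made for illustration. The example sentence is likewise invented, not taken from the paper's benchmarks.

```python
from difflib import SequenceMatcher
from itertools import groupby


def project_boundaries(source, target_words):
    """Project target-side word boundaries onto a noisy source string.

    Hypothetical helper: `source` is the unsegmented noisy sentence and
    `target_words` is the segmented, cleaner counterpart.
    """
    target = "".join(target_words)
    # Word index of every target character.
    char_to_word = [w for w, word in enumerate(target_words) for _ in word]

    # Step 1: character-level alignment (difflib is an assumed stand-in).
    word_ids = [None] * len(source)
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Map each source character to a target character; for
            # unequal-length substitutions the mapping is proportional.
            for k in range(i2 - i1):
                j = j1 + k * (j2 - j1) // (i2 - i1)
                word_ids[i1 + k] = char_to_word[j]
        # "delete": source characters with no target counterpart stay None.
        # "insert": target-only characters, nothing to label on the source.

    # Assumption: unaligned source characters join the preceding word
    # (or the first aligned word if they occur at the sentence start).
    last = next((w for w in word_ids if w is not None), 0)
    for i, w in enumerate(word_ids):
        if w is None:
            word_ids[i] = last
        else:
            last = w

    # Step 2: projection. Consecutive source characters that share a
    # target word index form one source-side word span.
    spans = []
    for _, group in groupby(enumerate(word_ids), key=lambda pair: pair[1]):
        indices = [i for i, _ in group]
        spans.append(source[indices[0]:indices[-1] + 1])
    return spans


# Learner sentence with a wrong character (平 written for 苹).
print(project_boundaries("我喜欢吃平果", ["我", "喜欢", "吃", "苹果"]))
```

In this example the misspelled compound 平果 is kept as a single source-side span because it aligns with the target word 苹果, which is the kind of over-segmentation the abstract says direct segmentation is prone to. A real system would likely need a more careful aligner and a principled policy for unaligned characters.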