Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems
Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng
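AI summary: This paper proposes a cause-aware error recovery paradigm in which fine-grained detectors disentangle ASR errors into perception, comprehension, and deletion failures. This lets the LLM carry out targeted, multi-turn clarification strategies, markedly reducing word error rate and improving downstream task performance.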
Cascaded Automatic Speech Recognition–Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.
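To make the idea of cause-aware clarification concrete, here is a minimal Python sketch of how a token-level diagnosis might be mapped to a clarification prompt. It is an illustration only. The abstract does not describe the detector outputs, the prompt wording, or how the LLM is invoked, so the names `ErrorCause`, `TokenDiagnosis`, and `build_clarification`, and the template strings, are all hypothetical. The sketch also replaces the LLM's orchestration with fixed templates for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ErrorCause(Enum):
    """The three failure types named in the abstract (enum itself is hypothetical)."""
    PERCEPTION = "perception"        # acoustic: could not hear clearly
    COMPREHENSION = "comprehension"  # linguistic: heard, but did not understand
    DELETION = "deletion"            # speech present, but missing from the transcript


@dataclass
class TokenDiagnosis:
    """Hypothetical detector output for one suspicious position."""
    position: int          # index into the ASR hypothesis
    token: Optional[str]   # None when a deletion is suspected at this position
    cause: ErrorCause


def build_clarification(hypothesis: List[str],
                        diagnoses: List[TokenDiagnosis]) -> Optional[str]:
    """Return a clarification question for the earliest flagged position,
    or None when nothing was flagged (assumed policy: one question per turn)."""
    if not diagnoses:
        return None
    first = min(diagnoses, key=lambda d: d.position)

    if first.cause is ErrorCause.PERCEPTION:
        # Acoustic problem: ask the user to confirm what was heard.
        return f'Sorry, I did not catch that clearly. Did you say "{first.token}"?'

    if first.cause is ErrorCause.COMPREHENSION:
        # Linguistic problem: ask for a spelling or a description.
        return (f'I heard "{first.token}" but I am not familiar with it. '
                'Could you spell it or describe it?')

    # Deletion: point at the context just before the suspected gap.
    context = " ".join(hypothesis[:first.position][-3:])
    if not context:
        return "I think I missed the start of that. Could you repeat it?"
    return f'I think I missed something after "{context}". Could you repeat that part?'
```

For example, a suspected deletion after "book a flight to" would yield a question that quotes the last few recognized words, whereas a perception error on a single token would yield a confirmation question for that token. In the paradigm the abstract describes, an LLM would presumably generate and sequence such questions over multiple turns rather than rely on fixed templates.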