Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu
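AI summary: This paper diagnoses entity binding failures in speech LLMs during logical reasoning and proposes an Entity-Aware Chain-of-Thought method that significantly improves reasoning accuracy.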
Details
- Comments: INTERSPEECH 2026
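Speech large language models underperform text models on complex reasoning tasks. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three different speech LLMs, we find that speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. On logical tasks that require entity tracking, however, S2T accuracy falls to chance level. We diagnose this localized degradation as an entity binding failure: continuous speech features cause the model to lose precise entity-attribute associations during implicit reasoning. To address this, we propose Entity-Aware Chain-of-Thought (EA-CoT), which forces speech LLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap even when spoken names are misrecognized, yielding absolute accuracy gains of up to 24.4%. Ablations confirm that these gains stem entirely from explicit semantic binding, reframing the modality gap as a solvable bottleneck.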
Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.
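The abstract describes EA-CoT only at a high level: enumerate entities, bind them to claims, then reason. A minimal sketch of what such an inference-time prompt could look like follows. The step wording, the function names `build_ea_cot_prompt` and `build_plain_prompt`, and the example question are all assumptions made for illustration; the authors' actual template is not given here and may differ.

```python
# Illustrative sketch only: the step wording below is hypothetical and is
# not the authors' EA-CoT template, which the abstract does not reproduce.

# Ordered instructions: entity enumeration and binding come before reasoning.
EA_COT_STEPS = (
    "Step 1: List every entity mentioned in the spoken input.",
    "Step 2: For each entity, write down the claims attached to it.",
    "Step 3: Using only the bindings from Step 2, reason step by step.",
    "Step 4: State the final answer.",
)


def build_plain_prompt(question: str) -> str:
    """Baseline text instruction with no explicit binding step."""
    return f"Answer the question about the audio.\nQuestion: {question}"


def build_ea_cot_prompt(question: str) -> str:
    """Text instruction that asks for entity-claim binding before reasoning."""
    lines = [build_plain_prompt(question), ""]
    lines.extend(EA_COT_STEPS)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical logical question that requires tracking who said what.
    print(build_ea_cot_prompt("Which speaker is telling the truth?"))
```

The text instruction would accompany the audio input to the SLLM; how the audio is attached depends on the model's interface and is left out of this sketch. Keeping the baseline and EA-CoT builders side by side makes it easy to compare the two conditions on the same question.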