Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition
Naijun Zheng, Yuke Lin, Sanli Tian, Mengtian Li, Zhiwei Lin, Longshuai Xiao, Dandan Tu
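AI summary: The paper proposes a dual-encoder architecture, a feature interleaving format, a length-aware speaker ID loss, and an adaptive-threshold ASR loss strategy to train an LLM-based system efficiently with limited real-recorded data while balancing the ASR and speaker diarization tasks, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.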
Details
- Comments
- Accepted in Interspeech 2026
Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.
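The abstract does not specify how the feature interleaving format in strategy (2) is laid out, so the following is only a minimal sketch of one plausible reading: alternating fixed-size chunks of time-aligned semantic and speaker frames into a single sequence for the LLM. The function name `interleave_features`, the `chunk` parameter, and the assumptions that both streams have the same number of frames and the same embedding width are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def interleave_features(
    semantic: Sequence[Vector],
    speaker: Sequence[Vector],
    chunk: int = 1,
) -> List[Vector]:
    """Alternate chunks of semantic and speaker frames into one sequence.

    Hypothetical layout (an assumption, not the paper's specification):
    [sem chunk 0, spk chunk 0, sem chunk 1, spk chunk 1, ...].
    Both streams are assumed to be time-aligned and equal in length.
    """
    if len(semantic) != len(speaker):
        raise ValueError("streams must have the same number of frames")
    if chunk < 1:
        raise ValueError("chunk must be a positive integer")

    merged: List[Vector] = []
    for start in range(0, len(semantic), chunk):
        # Semantic frames for this time span come first ...
        merged.extend(semantic[start:start + chunk])
        # ... followed by the speaker frames covering the same span.
        merged.extend(speaker[start:start + chunk])
    return merged


# Toy one-dimensional "embeddings" so the ordering is easy to read.
semantic_frames = [[1.0], [2.0], [3.0]]
speaker_frames = [[10.0], [20.0], [30.0]]

per_frame = interleave_features(semantic_frames, speaker_frames, chunk=1)
per_pair = interleave_features(semantic_frames, speaker_frames, chunk=2)
```

With `chunk=1` the two streams alternate frame by frame; a larger `chunk` keeps short runs of each stream contiguous. Which granularity, if either, matches the authors' format is not stated in the abstract.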