arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Speech Recognition and Keyword Spotting (3 papers)

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP cross-listed

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

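AI summary: This paper systematically studies combinations of spectral features and acoustic models. By adding pitch features and tuning the number of overlapping frames between training chunks, it achieves relative improvements of 4.65% in isolated word recognition and 4.63% in sentence recognition with an F-TDNN model.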

Abstract

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.
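The abstract credits part of the gain to the number of overlapping frames between consecutive training example chunks. The following is a minimal sketch of what such chunking could look like; the function name `chunk_frames`, its parameters, and the choice to drop trailing frames are illustrative assumptions, not the authors' implementation or any toolkit's API.

```python
def chunk_frames(num_frames, chunk_size, overlap):
    """Return (start, end) frame index pairs for fixed-size training chunks.

    Consecutive chunks share `overlap` frames. Trailing frames that do not
    fill a whole chunk are dropped (an assumption made for simplicity).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start + chunk_size <= num_frames:
        chunks.append((start, start + chunk_size))
        start += step
    return chunks


if __name__ == "__main__":
    # A larger overlap yields more (and more redundant) chunks per utterance.
    for overlap in (0, 2):
        print(overlap, chunk_frames(num_frames=10, chunk_size=4, overlap=overlap))
```

Increasing the overlap produces more training examples from the same utterance, which is one plausible reason the setting matters when data are scarce.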

2606.19910 2026-06-19 cs.CL cs.SD eess.AS cross-listed

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

Affiliation: Qatar Computing Research Institute, Doha, Qatar

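AI summary: Proposes a lightweight pronunciation assessment framework trained only on native speech resources. It discretizes speech into tokens, computes surprisal with a language model, and adds text-guided alignment features, approaching supervised performance when run unsupervised or with light calibration.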

Comments Accepted to Interspeech 2026

Abstract

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit--DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
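The surprisal idea in this abstract can be illustrated with a toy example. The sketch below swaps in an add-one-smoothed bigram model over integer token IDs in place of the paper's token language model, and it skips the SSL encoder and K-means steps entirely; all names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab_size):
    """Fit an add-one-smoothed bigram model on native token sequences."""
    context_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1

    def prob(prev, cur):
        # Laplace smoothing keeps unseen transitions at non-zero probability.
        return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)

    return prob


def mean_surprisal(seq, prob):
    """Average surprisal in bits per transition; higher means less native-like."""
    if len(seq) < 2:
        raise ValueError("need at least two tokens")
    values = [-math.log2(prob(prev, cur)) for prev, cur in zip(seq, seq[1:])]
    return sum(values) / len(values)


if __name__ == "__main__":
    native = [[0, 1, 2, 3]] * 3          # toy "native" token sequences
    prob = train_bigram(native, vocab_size=4)
    print(mean_surprisal([0, 1, 2, 3], prob))  # follows native transitions
    print(mean_surprisal([0, 3, 1, 2], prob))  # contains unseen transitions
```

A sequence with transitions never observed in native data receives higher average surprisal, which is the signal the abstract describes as indicating phonotactic deviation.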

2606.20106 2026-06-19 eess.AS cs.SD cross-listed

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

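AI summary: Proposes ZP-KWS, a lightweight framework that combines a phoneme-supervised audio encoder with a compact speaker encoder. Multiplicative late fusion enables zero-shot keyword detection with speaker verification, reducing the target false rejection rate by up to 60% across several datasets.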

Comments Accepted to Interspeech 2026

Abstract

User-defined keyword spotting (UD-KWS) enables zero-shot wake-word detection from text, but existing systems learn speaker-invariant representations that cannot reject impostors uttering the correct keyword. We address this dual zero-shot setting -- unseen keywords and unseen speakers -- with ZP-KWS, a lightweight framework combining a phoneme-supervised audio encoder with a GE2E-pretrained compact speaker encoder (about 0.9M parameters). Multiplicative late fusion at inference grants each branch independent veto power, supporting modes from conventional detection to strict speaker-gated activation without retraining. On LibriPhrase, Google Speech Commands, and Qualcomm datasets, ZP-KWS reduces target-only FRR at 1% FAR by up to 60% relative to the strongest baseline while maintaining competitive keyword detection, all within a 1.55M parameter budget for edge deployment.
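Multiplicative late fusion with per-branch veto power can be sketched as a product of two scores. The abstract does not say how the operating modes are selected, so the exponent `spk_weight` below is an assumed mechanism for illustration, and `fused_score` is a hypothetical name rather than part of ZP-KWS.

```python
def fused_score(kws_score, spk_score, spk_weight=1.0):
    """Combine keyword and speaker scores by multiplication.

    Both scores are assumed to lie in [0, 1]. `spk_weight` is a hypothetical
    knob: 0 ignores the speaker branch (conventional detection) and 1 applies
    full speaker gating. Either branch can veto by scoring near zero.
    """
    for name, value in (("kws_score", kws_score), ("spk_score", spk_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    if spk_weight < 0.0:
        raise ValueError("spk_weight must be non-negative")
    return kws_score * (spk_score ** spk_weight)


if __name__ == "__main__":
    # Correct keyword spoken by an impostor: high keyword score, low speaker score.
    print(fused_score(0.9, 0.1, spk_weight=0.0))  # speaker branch ignored
    print(fused_score(0.9, 0.1, spk_weight=1.0))  # speaker branch vetoes
```

Because the mode is chosen at inference through a single parameter, no retraining is needed to move between behaviours, which matches the property claimed in the abstract.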

2. Low-Resource, Multilingual, and Dialectal Speech (2 papers)

2606.19791 2026-06-19 eess.AS cs.AI cs.SD cross-listed

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

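AI summary: For low-resource children's speech recognition, systematically analyzes how different fine-tuning strategies generalize across datasets, ages, and genders, and finds that specific strategies significantly improve generalization.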

2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP cross-listed

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

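AI summary: To address data scarcity and varying severity in dysarthric speech recognition, this paper explores four data augmentation methods (SRM, PM, FM, VTLP) for fine-tuning a pre-trained Wav2Vec2 model, achieving notable word error rate reductions across severity levels.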

Abstract

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the End-to-End pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM (s = 0.8) for low (9.02%) and medium (38.11%) severities, and with PM (τ = 0.8) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.
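The relative improvements quoted above follow the usual definition, (baseline WER - new WER) / baseline WER. The sketch below inverts that formula to show what baseline each reported pair would imply; the baseline values it prints are derived by arithmetic under that assumed definition and are not figures reported in the abstract.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction: fraction of the baseline error that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, rel_improvement):
    """Invert the formula: the baseline WER consistent with a reported pair."""
    if not 0.0 <= rel_improvement < 1.0:
        raise ValueError("rel_improvement must be in [0, 1)")
    return new_wer / (1.0 - rel_improvement)


if __name__ == "__main__":
    # (best WER in %, relative improvement as a fraction), taken from the abstract.
    reported = {
        "low": (9.02, 0.3002),
        "medium": (38.11, 0.1664),
        "high": (55.15, 0.1547),
    }
    for severity, (wer, rel) in reported.items():
        base = implied_baseline(wer, rel)
        print(f"{severity}: best WER {wer:.2f}%, implied baseline about {base:.2f}%")
```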

3. Datasets, Benchmarks, and Evaluation (2 papers)

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD cross-listed

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

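AI summary: Using acoustic degradation, prosodic errors, and speaker-characteristic perturbations, finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, and that they are biased by mean F0 yet insensitive to changes in speaking rate and F0 variability.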

Comments Accepted to INTERSPEECH 2026

Abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.
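One simple way to quantify the human-model discrepancy described here is to compare how much each rater's mean score drops under a perturbation. The sketch below is an assumed analysis, not the paper's procedure; the function names are hypothetical and the scores are made-up toy values used only to exercise the code.

```python
from statistics import mean


def score_drop(clean_scores, perturbed_scores):
    """Mean score on clean samples minus mean score on perturbed samples."""
    return mean(clean_scores) - mean(perturbed_scores)


def sensitivity_gap(human_clean, human_perturbed, model_clean, model_perturbed):
    """Positive values mean humans penalise the perturbation more than the model."""
    human = score_drop(human_clean, human_perturbed)
    model = score_drop(model_clean, model_perturbed)
    return human - model


if __name__ == "__main__":
    # Toy values only: listeners react to a prosodic error, the predictor barely moves.
    gap = sensitivity_gap(
        human_clean=[4.5, 4.0], human_perturbed=[2.5, 2.0],
        model_clean=[4.2, 4.0], model_perturbed=[4.1, 3.9],
    )
    print(f"sensitivity gap: {gap:.2f} MOS points")
```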

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD cross-listed

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

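AI summary: Proposes PASQA, a model trained on an accent-controllable synthetic dataset with pseudo accent-quality scores. Combining self-supervised representations, mora-conditioned fusion, and other training strategies, it assesses pitch-accent correctness effectively and outperforms conventional MOS models.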

Comments Accepted to INTERSPEECH 2026

Abstract

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.
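Two quantities in this abstract lend themselves to a small illustration: a pseudo accent-quality score derived from the accent-error rate, and the accuracy with which predictions preserve the ordering by error severity. The abstract does not give either formula, so the mapping `1 - error rate` and the pairwise definition of ordering accuracy below are assumptions, and none of this is taken from the PASQA code.

```python
from itertools import combinations


def pseudo_accent_score(num_errors, num_accent_phrases):
    """Assumed mapping: 1 minus the accent-error rate, so 1.0 means no errors."""
    if num_accent_phrases <= 0:
        raise ValueError("num_accent_phrases must be positive")
    return 1.0 - num_errors / num_accent_phrases


def ordering_accuracy(targets, predictions):
    """Fraction of sample pairs whose predicted order matches the target order.

    Pairs with equal targets are skipped because they define no order.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(targets)), 2):
        if targets[i] == targets[j]:
            continue
        total += 1
        if (targets[i] - targets[j]) * (predictions[i] - predictions[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")


if __name__ == "__main__":
    # Three toy utterances with 0, 2, and 4 accent errors out of 4 accent phrases.
    targets = [pseudo_accent_score(n, 4) for n in (0, 2, 4)]
    predictions = [4.0, 3.0, 3.5]  # hypothetical model outputs
    print(ordering_accuracy(targets, predictions))
```

A model that is insensitive to accent errors would score near chance on this metric even if its absolute predictions look reasonable, which is the failure mode the abstract attributes to conventional MOS predictors.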