arXivDaily: arXiv daily digest, updated Monday to Friday
2606.20478 2026-06-19 eess.AS New submission

Beyond Speaker Independence: Evaluating Cross-Lingual Acoustic-to-Articulatory Inversion Across Finnish and Russian

Ruchi Pandey, Tomi Kinnunen

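AI summary: This study systematically evaluates acoustic-to-articulatory inversion (AAI) under cross-speaker and cross-language domain shift. Using FROST-EMA, a newly built Finnish-Russian bilingual EMA corpus, it compares articulatory targets, acoustic front-ends, and inversion back-ends, and finds a moderate Pearson correlation drop across genders (about 0.05-0.10) and a larger drop across languages (about 0.10-0.20).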

Abstract

Acoustic-to-articulatory inversion (AAI) remains challenging under domain shifts where changes in speaker attributes and cross-language conditions often degrade performance. We conduct a systematic evaluation under such shifts and establish baseline benchmarks on FROST-EMA, a Finnish-Russian bilingual EMA corpus. FROST-EMA addresses the English bias and limited speaker diversity of existing resources. We benchmark (i) articulatory targets (raw EMA coordinates vs tract variables), (ii) acoustic front-ends (MFCC vs SSL features), and (iii) inversion back-ends (BiLSTM vs a lightweight attention-based sequence model). We further define evaluation protocols for cross-gender transfer (within language) and cross-language transfer (within gender). The results indicate that cross-gender mismatch introduces moderate Pearson correlation declines (approximately 0.05 to 0.10) relative to the in-domain baseline, whereas cross-language mismatch causes larger drops (approximately 0.10 to 0.20).
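
The drops quoted above are differences in Pearson correlation between predicted and measured articulator trajectories. A minimal sketch of that metric, assuming trajectories stored as NumPy arrays of shape (frames, channels); the function names are illustrative and not taken from the paper:

```python
import numpy as np

def pearson_per_channel(pred: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Pearson correlation per articulatory channel.

    pred and ref have shape (frames, channels); returns one value per channel.
    """
    pred_c = pred - pred.mean(axis=0)
    ref_c = ref - ref.mean(axis=0)
    num = (pred_c * ref_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (ref_c ** 2).sum(axis=0))
    return num / den

def correlation_drop(in_domain: np.ndarray, shifted: np.ndarray) -> float:
    """Mean in-domain correlation minus mean correlation under domain shift."""
    return float(in_domain.mean() - shifted.mean())
```

Under this reading, a cross-language drop of 0.10 to 0.20 means that the channel-averaged correlation falls by that amount relative to the matched-condition model.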

2606.20338 2026-06-19 eess.AS New submission

Stuttering Classification and Segmentation with Attention-Based Multiple Instance Learning

Petar Sušac, Sebastian P. Bayerl, Hrvoje Džapo

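AI summary: Proposes a multiple-instance neural network built on fine-tuned wav2vec 2.0, WavLM, and Whisper encoders that uses clip-level data for frame-level stuttering classification and segmentation, improving frame-level F1 by 23%.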

Comments Accepted at Interspeech 2026

Abstract

Stuttering detection and classification using deep learning methods has the potential to improve the process of stuttering severity assessment. Most stuttering classification datasets provide clip-level labels, making them unsuitable for fine-grained frame-level classification needed to determine the duration of individual stuttering dysfluencies. To overcome this challenge, we present a multiple instance neural network architecture based on fine-tuned wav2vec 2.0, WavLM and Whisper encoders. We apply instance- and embedding-based multiple instance learning approaches to train models on a clip-level dataset for both clip-level and frame-level stuttering classification tasks. Our results show a 23% improvement in frame-level F1 score and between 2% and 9% in clip-level F1 score, demonstrating the ability of our models to utilize clip-level data for frame-level segmentation.
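
The difference between instance-based and embedding-based multiple instance learning can be sketched with attention pooling over frame embeddings. This is a toy NumPy illustration under assumed shapes (frames is (T, D), both weight vectors are (D,)); the weights stand in for trained layers and nothing here reproduces the authors' architecture:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def instance_mil(frames, w_att, w_cls):
    """Instance-based MIL: classify every frame, then pool the frame scores."""
    attn = softmax(frames @ w_att)          # (T,) attention weights, sum to 1
    frame_probs = sigmoid(frames @ w_cls)   # (T,) per-frame probabilities
    return float(attn @ frame_probs), frame_probs

def embedding_mil(frames, w_att, w_cls):
    """Embedding-based MIL: pool frame embeddings, then classify the result."""
    attn = softmax(frames @ w_att)          # (T,) attention weights
    clip_embedding = attn @ frames          # (D,) pooled clip embedding
    return float(sigmoid(clip_embedding @ w_cls)), attn
```

Only the clip-level output needs a label during training. The per-frame probabilities (instance variant) or the attention weights (embedding variant) are the quantities one would inspect for frame-level segmentation.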

2606.20266 2026-06-19 eess.AS New submission

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

SooHwan Eom, Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Chang D. Yoo

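AI summary: Proposes RTFree-F5, which replaces the reference transcript with self-supervised speech representations mapped into the F5-TTS text-conditioning space by a lightweight adapter, removing the dependence on an external ASR system; on dysarthric speech, WER falls from 24.6% to 10.4%.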

Comments Accepted to Interspeech 2026

Abstract

Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most needed. Moreover, we find that text-based reference conditioning can propagate atypical acoustic patterns from atypical speech into synthesis, even when ground-truth transcripts are available. To address this, we propose RTFree-F5, which replaces the reference transcript with continuous self-supervised speech representations mapped into F5-TTS's text-conditioning space via a lightweight adapter, while reusing the pretrained checkpoint. On dysarthric speech, RTFree-F5 reduces WER from 24.6% to 10.4%, surpassing even the ground-truth reference transcript baselines, while improving naturalness and remaining competitive on standard benchmarks without requiring any reference transcript.

2606.20001 2026-06-19 eess.AS New submission

Time-Unconditional Generative Speech Enhancement via Autonomous Rectified Flow

Wen Zhang, Wenbin Jiang, Yang Zhang, Xiaofei Zhou

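AI summary: Proposes the Autonomous Rectified Flow framework, which uses a linear interpolation path to show that the target vector field is time-invariant and designs a time-unconditional network that infers the denoising direction from spatial relationships alone, significantly improving generation quality, robustness, and inference efficiency.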

Abstract

Most generative speech enhancement methods rely on explicit time-step embeddings for temporal conditioning. In this paper, we propose the Autonomous Rectified Flow framework, which challenges the necessity of such conditioning. Using a linear interpolation path, we show that the target vector field is inherently time-invariant. We further introduce a time-unconditional network that eliminates explicit time-step information and infers the denoising direction solely from the spatial relationship between the current state and the noisy observation. Predicting this target vector field is equivalent to modeling the noise distribution. By avoiding overfitting to temporal trajectories, the proposed autonomous design significantly improves generation quality, robustness, and inference efficiency.
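
The time-invariance claim follows directly from the linear path: if x_t = (1 - t) x0 + t x1, then dx_t/dt = x1 - x0 for every t. A small sketch, assuming x0 is the starting sample and x1 the target (which endpoint is the noisy signal is not stated in the abstract); velocity_fn stands in for a trained network that takes no time input:

```python
import numpy as np

def interpolate(x0: np.ndarray, x1: np.ndarray, t: float) -> np.ndarray:
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0: np.ndarray, x1: np.ndarray) -> np.ndarray:
    """Derivative of the path with respect to t; it does not depend on t."""
    return x1 - x0

def euler_sample(x0: np.ndarray, velocity_fn, steps: int = 4) -> np.ndarray:
    """Integrate dx/dt = velocity_fn(x) from t=0 to t=1 without passing t."""
    x = x0.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity_fn(x)
    return x
```

With an oracle velocity_fn that returns x1 - x0, the Euler loop reaches x1 regardless of the number of steps, which is the property that makes few-step inference attractive for straight paths.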

2606.19974 2026-06-19 eess.AS New submission

Interpreting Content and Speaker Characteristics in Factorised Self-Supervised Subspaces

Kyle Janse van Rensburg, Herman Kamper

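AI summary: Using an SVD-based factorisation of WavLM features into a content matrix and speaker transformations, the paper finds that the content space mainly encodes intensity, formants, and voicing, while the speaker space is strongly associated with pitch and gender; both can be used for fine-grained control in speech synthesis.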

Comments 7 pages, 4 figures

Abstract

Self-supervised speech features encode both content and speaker information. Recent work introduced an SVD-based factorisation that decomposes these features into a shared content matrix capturing temporal variation and speaker-specific transformations capturing static speaker characteristics. However, how information is organised within these components remains unclear. In this paper, we investigate how the dimensions of WavLM-factorised content and speaker subspaces correlate with speech characteristics such as pitch, intensity, and voicing. We find that leading dimensions in the content space primarily capture intensity, higher-order formants, and voicing, while pitch is encoded in a later dimension. In contrast, the highest-variance speaker dimension is strongly associated with pitch and gender, with later dimensions capturing high-frequency variation. Intervention experiments show that manipulating these dimensions enables targeted control of speech characteristics for speech synthesis. Furthermore, modifying the content and speaker representations jointly provides fine-grained control over characteristics such as pitch and intensity.

2606.19940 2026-06-19 eess.AS New submission

Analyzing Language and Geographical Variation in Speech Representations Across 60 Indic Languages

Pavan Kumar J, Agneedh Basu, Pranav Bhat, Sujith Pulikodan, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh

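AI summary: Fine-tunes Whisper-base and Wav2Vec2.0 with joint language-district supervision and finds that it improves district discrimination in the embedding space while preserving language classification; the embedding structure is analyzed with Normalized Conditional Mutual Information.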

Abstract

Self-supervised speech encoders are often fine-tuned with language supervision, which can overlook geographical variation. To understand how representations learned under joint language and district supervision differ from those learned under language-only supervision, we fine-tune Whisper-base and Wav2Vec2.0-base on two classification tasks: joint language-district classification (386 classes) and language-only classification (60 languages). Language-district supervision improves district discrimination conditioned on language in the embedding space while maintaining strong marginal language classification. We analyze the structure of the learned embeddings using Normalized Conditional Mutual Information (NCMI), showing that language-district supervision produces global language clusters with structured within-language subclusters aligned to district variation, enhancing geographical separability without degrading language-level organization.

2606.19453 2026-06-19 eess.AS New submission

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

Jingyu Lu, Yuhan Wang, Jianming Luo, Yifu Chen, Tianle Liang, Shengpeng Ji, Ziyue Jiang, Xiaoda Yang, Yu Zhang, Xize Cheng, Chenyuhao Wen, Changhao Pan, Haoxiao Wang, Chen Ye, Jian Wu, Xiaoxi Jiang, Guanjun Jiang, Zhou Zhao

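AI summary: To resolve the ambiguity of the term "full-duplex", proposes three frameworks (an L0-L3 architectural hierarchy, a T×I×R interaction ontology, and an IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine) and reveals a realization gap in how existing systems are trained and evaluated.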

Comments 34 pages, 5 figures, 7 tables. Project page and interactive demo: https://github.com/DuplexLM/DuplexSurvey

Abstract

More than a dozen spoken dialogue systems have recently claimed to be "full-duplex," yet the term has been used to describe substantially different capabilities. Existing surveys collapse them onto a single axis (cascaded/end-to-end, or engineered/learned) and miss the distinctions that matter most for builders. We argue that much of this ambiguity is taxonomical: current terminology does not specify where duplex decisions are made, which interaction types are supported, or how a system behaves moment by moment. This paper introduces three complementary frameworks: (i) an L0-L3 Architectural Hierarchy that locates where duplex decisions are made; (ii) a T×I×R Interaction Ontology that specifies the temporal relation, user intent, and required system response for each interaction; and (iii) a Decision State Machine (IDLE/LISTEN/SPEAK/WAIT/DUAL) that describes how systems move between states. Across published systems and benchmarks, our audit documents a realization gap: although many architectures can in principle operate in full-duplex states, their observed behavior remains constrained by the interaction patterns represented in training and evaluation. We point to the limited public training-data coverage relative to the (largely undisclosed) industrial corpora, together with the still-unrealized goal of L3 representation-level modeling, as the key frontiers for future research on full-duplex dialogue. The related material is available at https://github.com/DuplexLM/DuplexSurvey.

2606.20457 2026-06-19 eess.AS cs.AI cs.LG New submission

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

Rostislav Makarov, Timo Gerkmann

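AI summary: Proposes using a pretrained speech classifier as the backbone for diffusion generation: a lightweight subnetwork is attached and is the only part trained, yielding high-quality conditional speech generation in a single-backbone model with lower memory and compute cost.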

Comments Accepted for publication in the Proceedings of Interspeech 2026

Abstract

Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis, resulting in high speech quality within a single-backbone model with reduced memory footprint and computational cost.
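
Standard classifier guidance adds the gradient of the classifier's log-probability to the unconditional score. The toy below checks this on a one-dimensional two-class Gaussian mixture where everything is available in closed form. It is a noise-free illustration of the general identity, not the paper's log-Mel model, and all function names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unconditional_score(x):
    """d/dx log p(x) for p(x) = 0.5 N(-2, 1) + 0.5 N(2, 1)."""
    return -x + 2.0 * np.tanh(2.0 * x)

def classifier_grad(x):
    """d/dx log p(y=1 | x), where the Bayes classifier is sigmoid(4x)."""
    return 4.0 * (1.0 - sigmoid(4.0 * x))

def guided_score(x, scale: float = 1.0):
    """Unconditional score steered toward class y=1 by the classifier gradient."""
    return unconditional_score(x) + scale * classifier_grad(x)
```

With scale = 1 the guided score equals -(x - 2), the score of the class-conditional density N(2, 1); larger scales push samples further toward the target class.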

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD New submission

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

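AI summary: Proposes PASQA, which uses an accent-controllable synthetic dataset and pseudo accent-quality scores, together with self-supervised representations, mora-conditioned fusion, and other training strategies, to assess pitch-accent correctness more effectively than conventional MOS models.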

Comments Accepted to INTERSPEECH 2026

Abstract

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.
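
A ranking loss rewards a model for ordering utterances by accent-error severity rather than for matching absolute scores. A minimal sketch, assuming a pseudo score of 1 minus the accent-error rate and a pairwise hinge loss; both choices are illustrative guesses, since the abstract does not give the exact formulas:

```python
import numpy as np

def pseudo_quality(num_accent_errors: int, num_accent_phrases: int) -> float:
    """Hypothetical pseudo accent-quality score: 1 minus the accent-error rate."""
    return 1.0 - num_accent_errors / num_accent_phrases

def pairwise_ranking_loss(pred, target, margin: float = 0.1) -> float:
    """Hinge loss over all pairs whose target scores differ."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    i, j = np.triu_indices(len(pred), k=1)
    sign = np.sign(target[i] - target[j])   # +1 if item i should score higher
    keep = sign != 0                        # ties carry no ordering information
    if not keep.any():
        return 0.0
    gaps = sign[keep] * (pred[i][keep] - pred[j][keep])
    return float(np.maximum(0.0, margin - gaps).mean())
```

The loss is zero when every pair is ordered correctly with at least the margin between predictions, and grows as the predicted ordering departs from the severity ordering.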

2606.20106 2026-06-19 eess.AS cs.SD New submission

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

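AI summary: Proposes ZP-KWS, a lightweight framework that combines a phoneme-supervised audio encoder with a compact speaker encoder and uses multiplicative late fusion for zero-shot keyword detection with speaker verification, reducing the target-only false rejection rate by up to 60% across several datasets.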

Comments Accepted to Interspeech 2026

Abstract

User-defined keyword spotting (UD-KWS) enables zero-shot wake-word detection from text, but existing systems learn speaker-invariant representations that cannot reject impostors uttering the correct keyword. We address this dual zero-shot setting (unseen keywords and unseen speakers) with ZP-KWS, a lightweight framework combining a phoneme-supervised audio encoder with a GE2E-pretrained compact speaker encoder (about 0.9M parameters). Multiplicative late fusion at inference grants each branch independent veto power, supporting modes from conventional detection to strict speaker-gated activation without retraining. On LibriPhrase, Google Speech Commands, and Qualcomm datasets, ZP-KWS reduces target-only FRR at 1% FAR by up to 60% relative to the strongest baseline while maintaining competitive keyword detection, all within a 1.55M parameter budget for edge deployment.
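
Multiplying the two branch scores means that a near-zero score from either branch drives the fused score toward zero, which is the "veto" behaviour described above. One way to sketch it, assuming both scores lie in [0, 1] and using a hypothetical speaker_weight exponent to move between operating modes (the abstract does not say how the modes are selected):

```python
def fused_score(keyword_score: float, speaker_score: float,
                speaker_weight: float = 1.0) -> float:
    """Multiplicative late fusion of keyword and speaker scores.

    speaker_weight = 0 ignores the speaker branch (conventional detection);
    speaker_weight = 1 applies the full speaker gate.
    """
    for s in (keyword_score, speaker_score):
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores are expected in [0, 1]")
    return keyword_score * speaker_score ** speaker_weight
```

Because the fusion happens only at inference, changing the exponent or the decision threshold switches modes without retraining either encoder.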

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD New submission

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

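AI summary: Using acoustic degradation, prosodic errors, and speaker-characteristic perturbations, finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, biased by mean F0, and insensitive to speaking rate and F0 variability.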

Comments Accepted to INTERSPEECH 2026

Abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We collected MOS ratings for these speech samples from human listeners and MOS predictions from the models, and analyzed the differences between them. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.19823 2026-06-19 eess.AS cs.LG New submission

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

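AI summary: To address the scarcity and high variability of dysarthric speech data, uses zero-shot voice cloning (Higgs Audio V2) to generate synthetic data for fine-tuning Whisper-medium, reaching a word error rate on TORGO close to that of real-data fine-tuning while substantially reducing data collection cost.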

Comments Accepted to Interspeech 2026, Sydney, Australia

Abstract

Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot baseline (31.62% WER), Clone FT achieved a competitive 26.00% WER, nearly matching the 24.44% and 25.12% obtained with Real FT and Hybrid FT, respectively. Notably, Clone FT and Hybrid FT outperform Real FT for moderate-severe speakers. In cross-corpus evaluation on SAP-1102, Clone FT achieves the best results (11.45% relative improvement). These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
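
The absolute WERs above can be put on the same footing as the "relative" figure by computing relative WER reduction against the zero-shot baseline. A short worked example using only the TORGO numbers quoted in the abstract; the helper name is illustrative:

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

ZERO_SHOT_WER = 31.62  # zero-shot Whisper-medium on TORGO, from the abstract
fine_tuned = {"Clone FT": 26.00, "Real FT": 24.44, "Hybrid FT": 25.12}

for name, wer in fine_tuned.items():
    reduction = relative_wer_reduction(ZERO_SHOT_WER, wer)
    print(f"{name}: {wer:.2f}% WER, {reduction:.2f}% relative reduction")
```

On these figures Clone FT recovers roughly 17.8% relative against about 22.7% for Real FT, which quantifies how close cloned data comes to real data in this setting.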

2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP New submission

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

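AI summary: To address data scarcity and varying severity in dysarthric speech recognition, explores four data augmentation methods (SRM, PM, FM, VTLP) for fine-tuning a pre-trained Wav2Vec2 model, achieving notable word error rate reductions across severity levels.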

Abstract

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the end-to-end pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM (s = 0.8) for low (9.02%) and medium (38.11%) severities, and with PM (τ = 0.8) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP New submission

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

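AI summary: Systematically studies combinations of spectral features and acoustic models; by adding pitch features and tuning the number of overlapping frames between training chunks, achieves relative improvements of 4.65% and 4.63% on isolated word and sentence recognition with an F-TDNN model.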

Abstract

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different acoustic models, offering suitable feature selections for each. The incorporation of pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

2606.19791 2026-06-19 eess.AS cs.AI cs.SD New submission

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

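AI summary: For low-resource children's speech recognition, systematically analyzes how different fine-tuning strategies generalize across datasets, ages, and genders, and finds that particular strategies substantially improve generalization.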

2606.19987 2026-06-19 cs.SD eess.AS Cross-list

PolSeT: Polish Semantics of Timbre Dataset

Jan Jasiński

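AI summary: Introduces the PolSeT dataset, which collects Polish semantic descriptors and timbre ratings through free-verbalization and semantic differential experiments, filling a gap in open timbre research data and supporting cross-cultural psychoacoustics and MIR research.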

Comments 8 pages, 7 figures. Data descriptor for the PolSeT dataset (Polish Semantics of Timbre), available at https://doi.org/10.5281/zenodo.17830609 under CC BY 4.0

Abstract

This data report introduces PolSeT (Polish Semantics of Timbre), a dataset designed to facilitate research in psychoacoustics and Music Information Retrieval (MIR) in Polish and cross-cultural contexts. The dataset contains data from two sequential experiments. Experiment 1 (N=60) was a free-verbalization task aimed at creating a lexicon of Polish semantic descriptors. Using 11 stimuli, a total of 1901 descriptors (701 unique) were gathered. Experiment 2 (N=105) used this lexicon to conduct a semantic differential study, where participants rated 18 instrument sounds on 8 bipolar scales, with repeated trials for reliability analysis. The released dataset includes raw listener responses, comprehensive demographics (experience, gender, age), audio stimuli, and extracted acoustic features with Python extraction code. This dataset addresses a gap in open timbre research data, providing both the qualitative linguistic groundwork and the quantitative ratings necessary for psychoacoustic research and the training of multilingual semantic embedding models.

2606.19910 2026-06-19 cs.CL cs.SD eess.AS Cross-list

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

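Affiliations: Qatar Computing Research Institute, Doha, Qatar

AI summary: Proposes a lightweight pronunciation assessment framework trained only on native speech resources; it computes surprisal from discretized speech tokens with a language model, combines it with transcript-guided alignment features, and approaches supervised performance either unsupervised or with light calibration.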

Comments Accepted to Interspeech 2026

Abstract

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal, where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit-DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
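
Surprisal is the negative log-probability a language model assigns to each token given its context. The sketch below substitutes a smoothed bigram model over discrete unit IDs for the paper's token language model, purely to show the computation; the class name, smoothing constant, and toy data are assumptions:

```python
import math
from collections import Counter

class BigramUnitLM:
    """Add-alpha smoothed bigram model over discrete speech-unit IDs."""

    def __init__(self, vocab_size: int, alpha: float = 1.0):
        self.vocab_size = vocab_size
        self.alpha = alpha
        self.bigrams = Counter()
        self.contexts = Counter()

    def fit(self, sequences):
        for seq in sequences:
            for prev, cur in zip(seq[:-1], seq[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        return self

    def surprisal(self, prev: int, cur: int) -> float:
        """Surprisal in bits: -log2 P(cur | prev)."""
        num = self.bigrams[(prev, cur)] + self.alpha
        den = self.contexts[prev] + self.alpha * self.vocab_size
        return -math.log2(num / den)

    def mean_surprisal(self, seq) -> float:
        pairs = list(zip(seq[:-1], seq[1:]))
        if not pairs:
            raise ValueError("need at least two tokens")
        return sum(self.surprisal(p, c) for p, c in pairs) / len(pairs)
```

A model fitted on native unit sequences assigns low mean surprisal to native-like transitions and higher values to transitions it rarely saw, which is the signal the framework turns into a pronunciation score.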

2606.19688 2026-06-19 cs.SD eess.AS Cross-list

Latency-Configurable Streaming Speech Enhancement via Asymmetric Temporal Padding

Yunsik Kim, Yoonyoung Chung

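Affiliations: Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH); Intus Co. Ltd.

AI summary: Proposes LaCo-SENet, which uses asymmetric temporal padding and dual-buffer streaming to trade latency against quality through a single hyperparameter; on VoiceBank+DEMAND, a 1.37M-parameter backbone covers 12.5-75.0 ms of latency with PESQ from 3.35 to 3.43.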

Comments 5 pages, 3 figures. Accepted for presentation at Interspeech 2026

Abstract

Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter. First, asymmetric temporal padding redistributes past and future context in convolutions, enabling systematic latency configuration. Second, dual-buffer streaming combines state buffers for past context with lookahead buffers that supply future context at both the input and feature levels. Selective state updates also prevent future-frame leakage into the streaming state, ensuring training-inference consistency. On VoiceBank+DEMAND, a fixed-budget (1.37M parameters) backbone yields a family of models spanning 12.5-75.0 ms, with PESQ rising from 3.35 to 3.43. At just 12.5 ms (fully causal), a PESQ of 3.35 matches or exceeds the prior causal state-of-the-art (3.27 at 46.5 ms).
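
Asymmetric padding decides how many of a convolution's taps look at past frames and how many at future frames. A one-dimensional NumPy sketch, assuming a kernel of length K and a hypothetical lookahead parameter giving the number of future frames; the actual LaCo-SENet layers are not described in enough detail here to reproduce:

```python
import numpy as np

def asymmetric_conv1d(x: np.ndarray, kernel: np.ndarray, lookahead: int) -> np.ndarray:
    """Same-length 1-D convolution with K-1-lookahead past and lookahead future frames."""
    k = len(kernel)
    if not 0 <= lookahead <= k - 1:
        raise ValueError("lookahead must be between 0 and K-1")
    left, right = k - 1 - lookahead, lookahead
    padded = np.pad(x, (left, right))  # zero-pad past and future sides
    # Output frame t sees input frames t-left .. t+right.
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])
```

With lookahead = 0 the layer is fully causal; each extra future frame adds one hop of algorithmic latency for that layer, and in a stack of such layers the lookahead accumulates across layers.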

2606.19398 2026-06-19 cs.SD eess.AS eess.SP 交叉投稿

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

S-JEPA:用于自监督语音表示学习的软聚类锚点

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

发表机构 * Carnegie Mellon University(卡内基梅隆大学) New York University(纽约大学) James Silberrad Brown Center for AI(詹姆斯·西尔伯拉德·布朗人工智能中心) Columbia University(哥伦比亚大学) Northeastern University(东北大学) Stanford University(斯坦福大学) Amazon GenAI(亚马逊生成式人工智能)

AI总结 提出S-JEPA,通过KL散度匹配高斯混合模型的软后验概率训练编码器-预测器对,无需离线重聚类或教师蒸馏,在SUPERB协议下以低于90M参数取得最低WER,并建立新的帕累托前沿。

详情
AI中文摘要

自监督语音编码器主要通过预测掩蔽位置处的离散硬聚类ID进行训练,这种方法会坍缩类别边界处的声学模糊性,并需要在迭代之间中断训练以对整个语料库进行重聚类。我们提出S-JEPA,一种JEPA风格的编码器-预测器对,通过KL散度训练以匹配掩蔽位置处高斯混合模型的软后验概率。训练作为连续优化轨迹分两个阶段进行:首先在MFCC特征上使用固定GMM,然后在编码器特征上使用在线GMM,输入层从无标签信号中自适应选择,从而消除了离线重聚类步骤以及手动选择聚类所在Transformer层的问题。在SUPERB协议下,S-JEPA在评估的低于90M参数的自监督方法中实现了最低的词错误率(WER),并在大约一半参数量的情况下在情感识别任务上与HuBERT-Base相当,无需离线重聚类或教师蒸馏即建立了新的帕累托前沿。对预测器在保留语音上的每帧熵的分析揭示了双峰分布,其中相当一部分帧的熵接近完美两聚类平局的熵,这直接经验性地证明了软目标训练目标保留了硬目标会坍缩的声学模糊性。代码可在以下网址获取:https://github.com/gioannides/s-jepa。

英文摘要

Self-supervised speech encoders are predominantly trained by predicting discrete hard cluster IDs at masked positions, a recipe that collapses acoustic ambiguity at category boundaries and requires interrupting training to re-cluster the entire corpus between iterations. We introduce S-JEPA, a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs as one continuous optimization trajectory in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal, removing both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count, establishing a new Pareto frontier without offline re-clustering or teacher distillation. An analysis of the predictor's per-frame entropy on held-out speech reveals a bimodal distribution with a substantial minority of frames near the entropy of a perfect two-cluster tie, providing direct empirical evidence that the soft-target objective preserves the acoustic ambiguity that hard targets would collapse. Code is available at https://github.com/gioannides/s-jepa.
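The soft-target objective and the "two-cluster tie" entropy mentioned in the abstract can be sketched numerically. The example below is a toy illustration, not the released S-JEPA code: it assumes a 1-D, two-component GMM, takes the KL direction as KL(target || prediction), and all function names are hypothetical.

```python
import math

def gmm_posteriors(x, weights, means, variances):
    """Soft posteriors p(k | x) of a 1-D Gaussian mixture."""
    log_joint = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(l - peak) for l in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) in nats (assumed direction of the loss)."""
    return sum(t * math.log(t / max(q, eps))
               for t, q in zip(target, predicted) if t > 0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A frame exactly between two equally weighted clusters is ambiguous.
boundary = gmm_posteriors(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
hard_prediction = [1.0, 0.0]  # what a hard cluster ID would force
```

For the boundary frame the posterior is an even split, so its entropy equals ln 2, the value of a perfect two-cluster tie. A one-hot prediction incurs a large KL penalty against that soft target, while a prediction that matches the split incurs none.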

2606.18485 2026-06-19 cs.SD cs.AI eess.AS 交叉投稿

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form Data

MagpieTTS-LF:无需长语音数据训练的推理时长语音生成

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

发表机构 * NVIDIA Corporation(英伟达公司)

AI总结 提出MagpieTTS-LF推理时方法,通过软注意力先验、有状态推理和历史感知文本编码,在不重新训练模型的情况下实现连贯的长语音生成。

Journal ref Interspeech 2026

详情
AI中文摘要

神经文本到语音(TTS)系统在短语句上取得了显著质量,但长语音生成表现出韵律漂移、说话人不一致和句子边界伪影。现有方法要么压缩序列、增加上下文长度,要么简单拼接独立合成的片段。我们提出一种称为MagpieTTS-LF的推理时方法,使MagpieTTS能够在不重新训练模型的情况下生成连贯的长语音。我们的方法引入了三个关键创新:(1)软注意力先验,在保留过去和未来上下文的同时引导单调对齐;(2)有状态推理算法,跨句子块维护上下文,确保韵律连续性;(3)历史感知文本编码,利用过去文本进行语篇级韵律规划。在长文本上的实验表明,与其他基线相比,在长距离可懂度、韵律连贯性、说话人一致性和边界自然度方面有显著改进。

英文摘要

Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.

2606.16417 2026-06-19 cs.SD eess.AS 交叉投稿

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Joycent:基于扩散的口音语音合成,无需口音音素预测

Xintong Wang, Ye Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Joycent,一种基于扩散模型的口音TTS方法,直接从标准音素序列和语音参考合成口音语音,无需口音音素预测,通过条件层归一化集成口音和说话人表征,并引入WhisAID口音识别模型,在保持说话人身份的同时提升口音自然度。

详情
AI中文摘要

口音文本到语音(TTS)旨在合成具有目标口音的语音。现有的口音TTS系统通常依赖于两阶段流程,首先将标准音素序列转换为口音音素序列,然后合成口音语音。然而,这种方法存在错误累积问题,并且需要配对的标准-口音音素序列数据,这在实践中往往有限。此外,基于文本的口音音素表示不足以建模韵律和节奏等声学口音特征。在这项工作中,我们提出了Joycent,一种基于扩散的口音TTS模型,它直接从标准音素序列和语音参考合成口音语音,无需口音音素预测。Joycent通过文本编码器中的条件层归一化(CLN)集成口音和说话人表征。我们引入了WhisAID,一种在口音普通话语音上训练的普通话口音识别模型,以提取口音表征。实验结果表明,与基线系统相比,Joycent在保持说话人身份的同时提高了口音自然度。我们在以下网址发布代码和演示:https://github.com/oshindow/Joycent-code。

英文摘要

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.

2606.14784 2026-06-19 cs.SD cs.LG eess.AS 交叉投稿

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

基于上下文学习的音频情感分类的LLM合成真实标签生成

Qing Huang, Pooja Pol, Jianing Zhang

发表机构 * School of Business, Technical University of Applied Sciences Augsburg(奥格斯堡应用技术大学商学院) Data Science und Autonome Systeme Technologietransferzentrum (TTZ)(数据科学与自主系统技术转移中心(TTZ))

AI总结 提出利用大语言模型(LLM)和上下文学习(ICL)从多用户VR环境的流式语音数据中自动生成情感相关合成真实标签,解决团队协作状态标注难题。

Comments https://icaiit.org/paper.php?paper=14th_ICAIIT_2/3_9

详情
AI中文摘要

理解人类状态和交互动态是人机交互(HCI)的核心目标。随着交互范式变得更加沉浸,虚拟现实(VR)已成为研究协作工作的强大平台。在此类环境中,评估团队协作状态(包括团队表现和团队韧性)需要从多模态传感器数据(如语音信号)中连续可靠地推断潜在的团队级认知和情感状态。然而,由于传感器噪声、上下文变异性和稀疏的专家标注,为这些潜在状态生成真实标签仍然具有挑战性。传统的自我报告方法仅提供静态和延迟的测量,因此不足以捕捉连续语音数据中反映的动态团队过程。在这项工作中,我们提出了一种由大语言模型(LLM)驱动的、基于代理的推理工作流,用于从多用户VR环境中的流式语音数据自动生成情感相关的合成真实标签。利用LLM的泛化能力,我们使用上下文学习(ICL)和少量配对的音频样本及其对应转录的演示。ICL倾向于实现与模型微调相当的任务适应,同时避免了参数更新的计算开销。为了构建信息丰富且鲁棒的上下文提示,我们采用基于检索的选择策略,根据声学特征空间中的相似性动态识别相关的音频演示。

英文摘要

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.

2606.05846 2026-06-19 cs.CL eess.AS 版本更新

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

迈向真正的多语言ASR:将代码切换ASR泛化到未见语言对

Gio Paik, Hyunseo Shin, Soungmin Lee

发表机构 * University of Tokyo(东京大学)

AI总结 通过模型合并和领域泛化方法,研究从有限语言对中学到的代码切换能力能否泛化到未见语言对,实验表明双语CS-ASR模型对未见语言对有一定泛化能力但有限。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

自动语音识别(ASR)已成为人机交互的关键技术。然而,由于跨多种语言对的代码切换(CS)语音资源严重稀缺,代码切换ASR(CS-ASR)仍然特别具有挑战性。现有方法主要通过合成CS语音生成或在有限双语数据集上进行特定语言对微调来提高CS-ASR性能。然而,这些方法面临固有的可扩展性限制,因为对CS的支持必须针对语言对单独开发,而语言对的数量随支持的语言数量呈组合增长。在这项工作中,我们研究通过模型合并和领域泛化方法,从一组有限的已见语言对中学到的CS能力是否可以泛化到未见语言对。我们的实验表明,合并的双语CS-ASR模型对未见语言对有一定程度的泛化,表明双语CS能力在语言对之间的迁移有限。

英文摘要

Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.
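The scalability argument rests on simple counting: with n supported languages there are n(n-1)/2 unordered language pairs. A short sketch, with a hypothetical helper name:

```python
import math

def language_pairs(n_languages):
    """Number of unordered language pairs among n supported languages."""
    return math.comb(n_languages, 2)

# Pair counts for a few hypothetical language inventories.
growth = {n: language_pairs(n) for n in (2, 10, 50, 100)}
```

Going from 10 to 100 languages raises the pair count from 45 to 4,950, which is why pair-specific fine-tuning does not scale.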

2605.17443 2026-06-19 cs.CL cs.SD eess.AS 版本更新

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

分析韩语语音问答中ASR-LLM级联中的误差传播

Donghyuk Jung, Youngwon Choi

发表机构 * Korea Culture Technology Institute, Republic of Korea(韩国文化科技研究所) Maum AI Inc., Republic of Korea(马姆人工智能公司)

AI总结 本文研究了韩语语音问答中ASR-LLM级联中误差传播的问题,通过分析下游语义失败,揭示了传统ASR指标无法完全捕捉的误差影响,发现不同性能的LLM在级联降级上的一致性,识别出单字符ASR错误作为语义失败通道,并通过辅助比较表明大音频语言模型在噪声韩语SQA中优于匹配语言模型的ASR-LLM流水线。

Comments Preprint. Submitted to APSIPA ASC 2026

详情
AI中文摘要

我们分析了自动语音识别(ASR)误差如何通过ASR-LLM级联在韩语语音问答(SQA)中传播,重点关注传统ASR指标无法完全捕捉的下游语义失败。我们的分析显示,由ASR误差引起的相对下游降级在不同绝对性能的LLM中保持一致,表明级联降级主要跟踪ASR阶段的信息损失。我们进一步识别出单字符韩语ASR错误是一种韩语特有的信息损失通道:即使是极小的转录差异,也可能改变问题的原意并降低下游问答性能。最后,辅助比较显示,在噪声韩语SQA中,大型音频语言模型优于语言骨干大致匹配的ASR-LLM级联,表明直接音频输入有潜力缓解转录诱导的信息损失。

英文摘要

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a Korean-specific loss channel, where even a minimal transcription difference can change the intended question and degrade downstream QA performance. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM cascade with an approximately matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

2604.18105 2026-06-19 eess.AS cs.CL cs.SD 版本更新

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

NIM4-ASR:迈向高效、鲁棒且可定制的实时基于LLM的语音识别

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

发表机构 * Advanced Intelligent Systems Group, NIO(蔚来智能系统集团)

AI总结 提出NIM4-ASR框架,通过重新设计多阶段训练范式(包括预训练架构优化、迭代异步SFT和ASR专用强化学习)以及生产优化(噪声鲁棒性、流式推理和RAG热词定制),在2.3B参数下实现SOTA性能。

详情
AI中文摘要

将大语言模型(LLM)集成到自动语音识别(ASR)中已成为近年来的主流范式。尽管现有的基于LLM的ASR模型在公共基准上表现出色,但其训练仍然主要依赖数据驱动,未能充分解决关键的实际挑战——特别是在资源受限部署中的有限向下可扩展性以及声学挑战条件下的幻觉问题。为了解决这些问题,我们提出了NIM4-ASR,一个面向生产的、基于LLM的ASR框架,针对效率和鲁棒性进行了优化。基于编码器和LLM之间功能角色的原则性划分,我们重新设计了多阶段训练范式,使每个模块与其预期的能力边界对齐。具体来说,我们重新制定了预训练架构和目标以缓解模态差距并提高参数效率;引入了迭代异步SFT阶段以保持声学保真度并约束表示漂移;设计了ASR专用的强化学习阶段以进一步提高识别质量和鲁棒性。我们还加入了一系列面向生产的优化,包括噪声和静音条件下的鲁棒性、实时流式推理以及通过检索增强生成(RAG)进行的热词定制。实验表明,NIM4-ASR仅用2.3B参数就在多个公共基准上达到了最先进的性能,同时在内部基准上显著优于更大规模的竞争对手——特别是在实体密集的真实场景中。NIM4-ASR进一步通过RAG支持百万级热词定制,检索延迟低于毫秒,从而能够高效适应新兴实体和个性化用户需求。

英文摘要

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

2603.16941 2026-06-19 eess.AS cs.CL cs.SD 版本更新

The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

言语背后的声音:量化语音大语言模型中的交叉偏见

Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki, James Caverlee, Ondřej Klejch, Peter Bell, Gustav Eje Henter, Éva Székely

发表机构 * Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden Centre for Speech Technology Research, University of Edinburgh, UK Texas A&M University, USA

AI总结 本研究通过2880次受控交互,评估三种语音大语言模型在六种英语口音和两种性别呈现中的口音与性别交叉偏见,发现东欧口音(尤其女性)获得更低有用性评分,且人类评估者比LLM评判更敏感。

Comments 5 pages, 3 figures, 1 table, Accepted to Interspeech 2026

详情
AI中文摘要

语音大语言模型直接处理语音输入,保留了之前级联管道中去除的口音和感知性别等线索,这导致了依赖于说话者身份的反应差异。我们使用2880次受控交互(涵盖六种英语口音和两种性别呈现,通过语音克隆保持语言内容不变),对三种语音大语言模型中的口音和性别偏见进行了大规模交叉评估。通过逐点LLM评判评分、成对比较以及经过人工验证的最佳-最差缩放,我们检测到反复出现的定向差异。东欧口音的语音获得较低的有用性评分,尤其是女性呈现的语音。反应保持礼貌但在有用性上存在差异。虽然LLM评判捕捉到了这些偏见的定向趋势,但人类评估者表现出显著更高的敏感性,显示出更强的口音级别对比。

英文摘要

Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect recurring directional disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. Responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, showing stronger accent-level contrasts.
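Best-Worst Scaling judgments are commonly turned into item scores with a counting rule: the number of times an item was chosen best, minus the number of times it was chosen worst, divided by how often it was shown. The sketch below applies that standard rule to made-up trials. The abstract does not state which scoring procedure the authors used, and the labels here are placeholders.

```python
from collections import Counter

def bws_scores(trials):
    """Counting-based Best-Worst Scaling scores in [-1, 1].

    trials: iterable of (items_shown, best_item, worst_item).
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in trials:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Hypothetical trials over three placeholder response conditions.
trials = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
]
scores = bws_scores(trials)
```

An item always picked as best scores 1.0, one always picked as worst scores -1.0, and items never singled out stay at 0.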

2603.04219 2026-06-19 cs.SD cs.AI eess.AS 版本更新

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

ZeSTA:基于领域条件训练的零样本文本转语音增强用于数据高效的个性化语音合成

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

发表机构 * Maum AI Inc.(Maum AI公司) Humelo Inc.(Humelo公司)

AI总结 提出ZeSTA框架,通过轻量领域嵌入区分真实与合成语音,结合真实数据过采样,在极低资源下提升零样本文本转语音增强的说话人相似度,保持可懂度和感知质量。

Comments 6 pages, accepted to INTERSPEECH 2026

详情
AI中文摘要

我们研究了将零样本文本转语音(ZS-TTS)作为低资源个性化语音合成的数据增强源。虽然合成增强可以提供语言丰富且音素多样的语音,但将大量合成语音与有限的真实录音简单混合往往会导致微调过程中说话人相似度下降。为解决这一问题,我们提出了ZeSTA,一个简单的基于领域条件的训练框架,通过轻量领域嵌入区分真实和合成语音,并结合真实数据过采样以在极有限的目标数据下稳定适应,无需修改基础架构。在LibriTTS和一个内部数据集上使用两个ZS-TTS源的实验表明,我们的方法在保持可懂度和感知质量的同时,相比朴素合成增强提高了说话人相似度。音频样本可在我们的网页上获取。

英文摘要

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality. Audio samples are available on our web page.

2509.04390 2026-06-19 eess.AS cs.SD 版本更新

Accelerated Interactive Auralization of Highly Reverberant Spaces using Graphics Hardware

Hannes Rosseel, Toon van Waterschoot

发表机构 * KU Leuven, Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing

Comments 9 pages, 6 figures, submitted to Journal of the Audio Engineering Society

详情
英文摘要

Interactive acoustic auralization allows users to explore virtual acoustic environments in real-time, enabling the acoustic recreation of concert halls or Historical Worship Spaces (HWS) that are either no longer accessible, acoustically altered, or impractical to visit. Interactive acoustic synthesis requires real-time convolution of input signals with a set of synthesis filters that model the space-time acoustic response of the space. The acoustics in concert halls and HWS are both characterized by a long reverberation time, resulting in synthesis filters containing many filter taps. As a result, the convolution process can be computationally demanding, introducing significant latency that limits the real-time interactivity of the auralization system. In this paper, the implementation of a real-time multichannel loudspeaker-based auralization system is presented. This system is capable of synthesizing the acoustics of highly reverberant spaces in real-time using GPU-acceleration. A comparison between traditional CPU-based convolution and GPU-accelerated convolution is presented, showing that the latter can achieve real-time performance with significantly lower latency. Additionally, the system integrates acoustic synthesis with acoustic feedback cancellation on the GPU, creating a unified loudspeaker-based auralization framework that minimizes processing latency.
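Real-time convolution with long filters is usually done block by block, so that output can be produced as input arrives. The sketch below shows the generic overlap-add scheme in plain Python on the CPU. It is not the paper's GPU implementation, whose convolution algorithm is not described in the abstract, and the function names and block size are illustrative assumptions.

```python
def convolve(x, h):
    """Direct linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_add(x, h, block):
    """Convolve x with h one input block at a time (overlap-add).

    Each block's tail of len(h) - 1 samples overlaps the next block's
    output and is summed into it.
    """
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        segment = convolve(x[start:start + block], h)
        for n, value in enumerate(segment):
            y[start + n] += value
    return y

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
taps = [1.0, -1.0, 0.5]          # stand-in for a long synthesis filter
blockwise = overlap_add(signal, taps, block=2)
reference = convolve(signal, taps)
```

The block-wise result matches the one-shot convolution. The per-block cost grows with the number of filter taps, which is the load that long reverberation times create and that the paper offloads to the GPU.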

2507.19137 2026-06-19 eess.AS cs.AI cs.SD 版本更新

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

二元角色扮演场景中跨情境的人格维度评估

Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss

发表机构 * Idiap Research Institute(Idiap研究所) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 研究通过对话语音分析,发现感知人格在不同工作情境下显著变化,并识别出与各人格特质相关的声学特征。

Comments Accepted to IEEE Transactions on Affective Computing

详情
AI中文摘要

先前研究表明,用户偏好与其人格相匹配的辅助技术。这引发了对自动人格感知(APP)的兴趣,旨在预测个体感知到的人格特质。以往的APP研究将人格视为静态特质,独立于情境。然而,心理学研究表明,感知人格会随情境和场景而变化。在本研究中,我们调查了参与两种工作情境(中性面试和压力客户互动)的参与者对话语音与感知人格之间的关系。我们的主要发现是:1)感知人格在不同互动中显著不同;2)响度、声压级和频谱通量特征在中性互动中指示感知的外向性、宜人性、尽责性和开放性,而在压力情境中,神经质与这些特征相关;3)手工声学特征和非语言特征在感知人格推断中优于说话人嵌入;4)压力互动更能预测神经质,这与现有心理学研究一致。

英文摘要

Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual's perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.

2505.18726 2026-06-19 cs.SD cs.LG eess.AS 版本更新

Bioacoustic Geolocation: Species Sounds as Geographic Signals

生物声学地理定位:物种声音作为地理信号

Mustafa Chasmai, Wuao Liu, Subhransu Maji, Grant Van Horn

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 本文研究仅通过声音进行全球尺度地理定位,利用生物声学信号中的物种地理分布线索,提出结合物种范围预测与检索的地理定位方法,并验证多模态融合的潜力。

Comments Accepted to ICML 26

详情
AI中文摘要

我们能否仅通过听到的声音确定某人的地理位置?声学信号是否足以定位到国家、州甚至城市?在这项工作中,我们应对全球尺度音频地理定位的挑战,特别关注野生动物和自然声音。我们假设生物声学信号包含信息丰富的地理定位线索,因为物种具有明确的地理分布范围。为了验证这一假设,我们对图像地理定位和声景映射方法进行基准测试,设计预言机和以物种为中心的基线,并提出一种结合物种范围预测与基于检索的地理定位的混合方法。我们进一步探究地理定位是否随着物种多样性记录和跨邻近样本的时空聚合而改善。最后,我们将研究扩展到多模态地理定位,通过结合音频和视觉内容的电影案例研究。我们的结果突出了将生物声学信号纳入地理空间任务的潜力,为物种识别和音频地理定位的未来工作提供了动力。

英文摘要

Can we determine someone's geographic location solely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? In this work, we tackle the challenge of global-scale audio geolocation, with a particular focus on wildlife and natural sounds. We posit that bioacoustic signals contain informative geolocation cues because of well-defined geographic ranges of species. To test this hypothesis, we benchmark image geolocation and soundscape mapping methods, design oracles and species-centric baselines, and propose a hybrid approach that combines species range prediction with retrieval-based geolocation. We further ask whether geolocation improves with species-diverse recordings and spatiotemporal aggregation across neighboring samples. Finally, we extend our study to multimodal geolocation with case studies from movies that combine both audio and visual content. Our results highlight the potential of incorporating bioacoustic signals into geospatial tasks, motivating future work on species recognition and audio geolocation.