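arXivDaily arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

Multimodal Large Models

Large models and learning methods spanning text, image, video, audio, and other modalities.

Collected today / on the current date: 18 Signal sources: cs.CV, cs.CL, cs.AI, cs.MM, eess.AS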
2606.20418 2026-06-19 cs.SD New submission 90%

MixProLAP: Mixture-Induced Uncertainty Modeling for Probabilistic Language-Audio Pretraining

Yu Nakagome, Jaesong Lee, Soo-Whan Chung

Affiliations * LINE WORKS Corporation; NAVER Cloud Corporation

Topic match Audio-visual multimodal: probabilistic audio-language pretraining that models uncertainty in multimodal alignment

AI summary Proposes MixProLAP, a probabilistic audio-language pretraining framework that mixes audio-text pairs to simulate overlapping sounds, models many-to-many correspondence uncertainty, and adds a multi-level inclusion loss. It outperforms deterministic baselines on audio-text retrieval.

Comments Accepted to Interspeech 2026

Abstract

Acoustic environments often contain multiple overlapping sound events, and the same acoustic scene can be described using diverse textual expressions, making audio-text alignment inherently ambiguous. This paper proposes a probabilistic audio-language pretraining framework to model many-to-many correspondence ambiguity in audio-text alignment. Unlike conventional contrastive methods that learn deterministic point embeddings, our approach represents each modality as a distribution and learns uncertainty-aware cross-modal alignment. Rather than relying on masking-based uncertainty simulation, we mix audio-text pairs to create overlapping sounds that better reflect real acoustic mixtures and capture semantic inclusion relations among sound events. We further introduce a multi-level inclusion loss to enforce representations consistent with these relations. Experiments on audio-text retrieval benchmarks show that the proposed method outperforms deterministic baselines.

2606.19940 2026-06-19 eess.AS New submission 85%

Analyzing Language and Geographical Variation in Speech Representations Across 60 Indic Languages

Pavan Kumar J, Agneedh Basu, Pranav Bhat, Sujith Pulikodan, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh

Topic match Audio-visual multimodal: fine-tuning speech representations with joint language-district supervision, classed as multimodal learning

AI summary Fine-tunes Whisper-base and Wav2Vec2.0 with joint language-district supervision and finds that it improves district discrimination in the embedding space while preserving language classification. The embedding structure is analyzed with Normalized Conditional Mutual Information (NCMI).

Abstract

Self-supervised speech encoders are often fine-tuned with language supervision, which can overlook geographical variation. To understand how representations learned under joint language-district supervision differ from those learned under language-only supervision, we fine-tune Whisper-base and Wav2Vec2.0-base for joint language-district classification (386 classes) and language-only classification (60 languages). The language-district supervision improves district discrimination conditioned on language in the embedding space while maintaining strong marginal language classification. We analyze the structure of the learned embeddings using Normalized Conditional Mutual Information (NCMI), showing that language-district supervision produces global language clusters with structured within-language subclusters aligned to district variation, enhancing geographical separability without degrading language-level organization.

2606.19398 2026-06-19 cs.SD eess.AS eess.SP New submission 85%

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

Affiliations * Carnegie Mellon University; New York University; James Silberrad Brown Center for AI; Columbia University; Northeastern University; Stanford University; Amazon GenAI

Topic match Audio-visual multimodal: self-supervised speech representation learning, which falls under the audio modality

AI summary Proposes S-JEPA, an encoder-predictor pair trained via KL divergence to match the soft posteriors of a Gaussian Mixture Model, with no offline re-clustering or teacher distillation. Under the SUPERB protocol it achieves the lowest WER among the evaluated SSL methods below 90M parameters and establishes a new Pareto frontier.

Abstract

Self-supervised speech encoders are predominantly trained by predicting discrete hard cluster IDs at masked positions, a recipe that collapses acoustic ambiguity at category boundaries and requires interrupting training to re-cluster the entire corpus between iterations. We introduce S-JEPA, a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs as one continuous optimization trajectory in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal, removing both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count, establishing a new Pareto frontier without offline re-clustering or teacher distillation. An analysis of the predictor's per-frame entropy on held-out speech reveals a bimodal distribution with a substantial minority of frames near the entropy of a perfect two-cluster tie, providing direct empirical evidence that the soft-target objective preserves the acoustic ambiguity that hard targets would collapse. Code is available at https://github.com/gioannides/s-jepa.
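The soft-target idea in this abstract can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes a 1-D, two-component GMM, a KL direction of KL(target || prediction), and hypothetical helper names (`gmm_posteriors`, `kl_divergence`). It only shows why a frame that sits on a cluster boundary is penalized differently under soft and hard targets.

```python
import math

def gmm_posteriors(x, weights, means, variances):
    """Soft responsibilities p(k | x) of a 1-D Gaussian mixture (toy stand-in)."""
    log_joint = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(l - top) for l in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); the direction is an assumption, not from the paper."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

# A frame exactly halfway between two equally weighted clusters: a perfect tie.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
soft_target = gmm_posteriors(0.0, weights, means, variances)
hard_target = [1.0, 0.0]           # what a hard cluster ID would assert
predicted = softmax([0.0, 0.0])    # predictor that is itself undecided

soft_loss = kl_divergence(soft_target, predicted)  # undecided prediction is not penalized
hard_loss = kl_divergence(hard_target, predicted)  # same prediction is penalized by log 2
```

With the soft target, an undecided prediction on an ambiguous frame incurs no loss, whereas the hard target forces the predictor toward one cluster.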

2606.19381 2026-06-19 cs.SD cs.AI New submission 85%

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Yue Heng Yeo, Haoyang Li, Yizhou Peng, Shreyas Gopal, Hexin Liu, Leibny Paola Garcia-Perera, Hardik B. Sailor, Jeremy H. M. Wong, Eng Siong Chng

Affiliations * College of Computing and Data Science, Nanyang Technological University; Google DeepMind

Topic match Audio-visual multimodal: improving code-switching speech recognition by combining text and speech

AI summary To address the scarcity of high-quality code-switching text-speech pairs for ASR, proposes a code-mixing guided preference-learning framework that uses the Code Mixing Index to improve the code-switching fidelity of synthetic speech. Fine-tuning Whisper Large on the SEAME corpus reduces Mixed Error Rate from 12.1%/17.8% to 8.9%/14.2%.

Comments Accepted to Interspeech 2026

Abstract

Code-switch (CS) Automatic Speech Recognition (ASR) remains challenging due to limited availability of high quality CS text-speech pairs for training. Although synthetic data augmentation via Text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.
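For readers unfamiliar with the Code Mixing Index, the sketch below computes one commonly cited utterance-level definition: 100 × (1 − max_i w_i / (n − u)), where n is the token count, u the number of language-neutral tokens, and w_i the token count of language i. Whether the paper uses exactly this variant is not stated in the abstract, and the tag names (`"zh"`, `"en"`, `"other"`) are illustrative assumptions.

```python
from collections import Counter

def code_mixing_index(lang_tags, neutral_tag="other"):
    """Utterance-level CMI from per-token language tags (assumed definition)."""
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral_tag)
    u = n - sum(counts.values())  # tokens with no language assignment
    if n == u:
        return 0.0                # nothing language-specific to mix
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))

# 3 Mandarin tokens, 2 English tokens, 1 neutral token: 100 * (1 - 3/5), about 40
tags = ["zh", "zh", "en", "zh", "other", "en"]
mixed_cmi = code_mixing_index(tags)

# A monolingual utterance has no mixing at all.
mono_cmi = code_mixing_index(["en", "en", "en"])
```

A score of 0 means monolingual speech; higher values mean the dominant language accounts for a smaller share of the utterance.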

2606.20266 2026-06-19 eess.AS New submission 80%

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

SooHwan Eom, Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Chang D. Yoo

Topic match Audio-visual multimodal: flow-matching TTS using self-supervised speech representations

AI summary Proposes RTFree-F5, which replaces the reference transcript with self-supervised speech representations mapped into F5-TTS's text-conditioning space by a lightweight adapter, removing the dependency on an external ASR system. On dysarthric speech, WER falls from 24.6% to 10.4%.

Comments Accepted to Interspeech 2026

Abstract

Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most needed. Moreover, we find that text-based reference conditioning can propagate atypical acoustic patterns from atypical speech into synthesis, even when ground-truth transcripts are available. To address this, we propose RTFree-F5, which replaces the reference transcript with continuous self-supervised speech representations mapped into F5-TTS's text-conditioning space via a lightweight adapter, while reusing the pretrained checkpoint. On dysarthric speech, RTFree-F5 reduces WER from 24.6% to 10.4%, surpassing even the ground-truth reference transcript baselines, while improving naturalness and remaining competitive on standard benchmarks without requiring any reference transcript.

2606.20457 2026-06-19 eess.AS cs.AI cs.LG New submission 80%

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

Rostislav Makarov, Timo Gerkmann

Affiliations * University of Hamburg

Topic match Audio-visual multimodal: reusing a speech classifier for diffusion generation

AI summary Proposes using a pretrained speech classifier as the backbone for diffusion generation: a lightweight subnetwork is attached and only that subnetwork is trained, giving high-quality conditional speech generation in a single-backbone model with lower memory and compute cost.

Comments Accepted for publication in the Proceedings of Interspeech 2026

Abstract

Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis resulting in high speech quality within a single-backbone model, with reduced memory footprint and computational cost.
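As background for the two-model drawback mentioned above, the standard classifier-guidance identity follows from Bayes' rule applied to the noisy sample $x_t$ and target class $y$. The guidance scale $w$ in the second line is the usual practical extension and is an assumption here; the abstract does not say how the authors weight the classifier term.

```latex
\nabla_{x_t} \log p_t(x_t \mid y)
  = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t)

% Common guided variant with an assumed scale w >= 0:
\tilde{s}(x_t, y)
  = \nabla_{x_t} \log p_t(x_t) + w \, \nabla_{x_t} \log p_t(y \mid x_t)
```

The first term is normally supplied by the diffusion (score) model and the second by the noise-conditioned classifier, which is why conventional classifier guidance needs two networks.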

2606.20338 2026-06-19 eess.AS New submission 70%

Stuttering Classification and Segmentation with Attention-Based Multiple Instance Learning

Petar Sušac, Sebastian P. Bayerl, Hrvoje Džapo

Topic match Audio-visual multimodal: multiple instance learning for speech classification and segmentation

AI summary Proposes a multiple instance neural network based on fine-tuned wav2vec 2.0, WavLM, and Whisper encoders that uses clip-level data for frame-level stuttering classification and segmentation, improving frame-level F1 by 23%.

Comments Accepted at Interspeech 2026

Abstract

Stuttering detection and classification using deep learning methods has the potential to improve the process of stuttering severity assessment. Most stuttering classification datasets provide clip-level labels, making them unsuitable for fine-grained frame-level classification needed to determine the duration of individual stuttering dysfluencies. To overcome this challenge, we present a multiple instance neural network architecture based on fine-tuned wav2vec 2.0, WavLM and Whisper encoders. We apply instance- and embedding-based multiple instance learning approaches to train models on a clip-level dataset for both clip-level and frame-level stuttering classification tasks. Our results show a 23% improvement in frame-level F1 score and between 2% and 9% in clip-level F1 score, demonstrating the ability of our models to utilize clip-level data for frame-level segmentation.
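To make the embedding-based variant concrete, here is a minimal sketch of attention pooling over frame features, in the widely used form a_k = softmax_k(w · tanh(V h_k)). It is an illustration under assumptions, not the paper's architecture: the weights `v` and `w` are fixed toy values rather than learned parameters, and the frame features stand in for encoder outputs.

```python
import math

def attention_mil_pool(instances, v, w):
    """Pool frame-level features into one clip-level vector with attention.

    instances: list of frame feature vectors (the "bag")
    v: hidden-by-feature projection matrix, w: hidden-sized scoring vector
    Returns the pooled clip vector and the per-frame attention weights.
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(vij * hj for vij, hj in zip(row, h))) for row in v]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    top = max(scores)  # stabilize the softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attn, instances)) for d in range(dim)]
    return bag, attn

# Three toy frames; the middle one carries the strongest activation.
frames = [[0.1, 0.0], [2.0, 1.5], [0.0, 0.2]]
v = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
bag, attn = attention_mil_pool(frames, v, w)
```

Only the clip-level label is needed to train such a model, while the per-frame attention weights give a frame-level signal that can be thresholded for segmentation.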

2606.20001 2026-06-19 eess.AS New submission 70%

Time-Unconditional Generative Speech Enhancement via Autonomous Rectified Flow

Wen Zhang, Wenbin Jiang, Yang Zhang, Xiaofei Zhou

Topic match Audio-visual multimodal: generative speech enhancement in a rectified flow framework

AI summary Proposes the Autonomous Rectified Flow framework: using a linear interpolation path, it shows that the target vector field is time-invariant, and a time-unconditional network infers the denoising direction from spatial relationships alone, significantly improving generation quality, robustness, and inference efficiency.

Abstract

Most generative speech enhancement methods rely on explicit time-step embeddings for temporal conditioning. In this paper, we propose the Autonomous Rectified Flow framework, which challenges the necessity of such conditioning. Using a linear interpolation path, we show that the target vector field is inherently time-invariant. We further introduce a time-unconditional network that eliminates explicit time-step information and infers the denoising direction solely from the spatial relationship between the current state and the noisy observation. Predicting this target vector field is equivalent to modeling the noise distribution. By avoiding overfitting to temporal trajectories, the proposed autonomous design significantly improves generation quality, robustness, and inference efficiency.
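The time-invariance claim can be checked numerically on a toy example. The sketch below assumes the path starts at the noisy observation (x0) and ends at the clean signal (x1), with x_t = (1 − t)·x0 + t·x1; these endpoint choices and the toy vectors are assumptions for illustration. Under them, the velocity dx_t/dt equals x1 − x0 at every t, which is the negative of the additive noise.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def finite_difference_velocity(x0, x1, t, dt=1e-3):
    """Numerical estimate of dx_t/dt along the linear path."""
    ahead = interpolate(x0, x1, t + dt)
    here = interpolate(x0, x1, t)
    return [(a - h) / dt for a, h in zip(ahead, here)]

clean = [0.5, -0.2, 0.1]      # toy clean signal (assumed path endpoint x1)
noise = [0.05, 0.1, -0.03]    # toy additive noise
noisy = [c + n for c, n in zip(clean, noise)]  # assumed path start x0

# Analytic target: x1 - x0, which here is exactly -noise.
target = [c - y for c, y in zip(clean, noisy)]

# The estimated velocity is the same at every time step.
velocities = [finite_difference_velocity(noisy, clean, t) for t in (0.0, 0.3, 0.9)]
```

Because the target does not depend on t, a network has no information to gain from a time-step embedding on this path, which is the intuition behind dropping it.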

2606.19974 2026-06-19 eess.AS New submission 70%

Interpreting Content and Speaker Characteristics in Factorised Self-Supervised Subspaces

Kyle Janse van Rensburg, Herman Kamper

Topic match Audio-visual multimodal: factorisation and interpretation of self-supervised speech features

AI summary Studies an SVD-based factorisation of WavLM features into a content matrix and speaker transformations. The content space mainly encodes intensity, formants, and voicing, whereas the speaker space is strongly associated with pitch and gender; these dimensions can be used for fine-grained control in speech synthesis.

Comments 7 pages, 4 figures

Abstract

Self-supervised speech features encode both content and speaker information. Recent work introduced an SVD-based factorisation that decomposes these features into a shared content matrix capturing temporal variation and speaker-specific transformations capturing static speaker characteristics. However, how information is organised within these components remains unclear. In this paper, we investigate how the dimensions of WavLM-factorised content and speaker subspaces correlate with speech characteristics such as pitch, intensity, and voicing. We find that leading dimensions in the content space primarily capture intensity, higher-order formants, and voicing, while pitch is encoded in a later dimension. In contrast, the highest-variance speaker dimension is strongly associated with pitch and gender, with later dimensions capturing high-frequency variation. Intervention experiments show that manipulating these dimensions enables targeted control of speech characteristics for speech synthesis. Furthermore, modifying the content and speaker representations jointly provides fine-grained control over characteristics such as pitch and intensity.

2606.19453 2026-06-19 eess.AS New submission 70%

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

Jingyu Lu, Yuhan Wang, Jianming Luo, Yifu Chen, Tianle Liang, Shengpeng Ji, Ziyue Jiang, Xiaoda Yang, Yu Zhang, Xize Cheng, Chenyuhao Wen, Changhao Pan, Haoxiao Wang, Chen Ye, Jian Wu, Xiaoxi Jiang, Guanjun Jiang, Zhou Zhao

Topic match Audio-visual multimodal: full-duplex spoken dialogue systems involve multimodal interaction between speech and text

AI summary To resolve ambiguity in the term "full-duplex", proposes three frameworks: an L0-L3 architectural hierarchy, a T×I×R interaction ontology, and an IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine. It identifies a realization gap in how existing systems are trained and evaluated.

Comments 34 pages, 5 figures, 7 tables. Project page and interactive demo: https://github.com/DuplexLM/DuplexSurvey

Abstract

More than a dozen spoken dialogue systems have recently claimed to be "full-duplex," yet the term has been used to describe substantially different capabilities. Existing surveys collapse them onto a single axis (cascaded/end-to-end, or engineered/learned) and miss the distinctions that matter most for builders. We argue that much of this ambiguity is taxonomical: current terminology does not specify where duplex decisions are made, which interaction types are supported, or how a system behaves moment by moment. This paper introduces three complementary frameworks: (i) an L0-L3 Architectural Hierarchy that locates where duplex decisions are made; (ii) a $T\times I\times R$ Interaction Ontology that specifies the temporal relation, user intent, and required system response for each interaction; and (iii) a Decision State Machine (IDLE/LISTEN/SPEAK/WAIT/DUAL) that describes how systems move between states. Across published systems and benchmarks, our audit documents a realization gap: although many architectures can in principle operate in full-duplex states, their observed behavior remains constrained by the interaction patterns represented in training and evaluation. We point to the limited public training-data coverage relative to the (largely undisclosed) industrial corpora, together with the still-unrealized goal of L3 representation-level modeling, as the key frontiers for future research on full-duplex dialogue. The related material is available at https://github.com/DuplexLM/DuplexSurvey.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD New submission 70%

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

Affiliations * LY Corporation

Topic match Audio-visual multimodal: speech quality assessment focused on pitch accent

AI summary Proposes PASQA, which is trained on an accent-controllable synthetic dataset with pseudo accent-quality scores and combines self-supervised representations, mora-conditioned fusion, and other training strategies to assess pitch-accent correctness, outperforming conventional MOS models.

Comments Accepted to INTERSPEECH 2026

Abstract

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

2606.20106 2026-06-19 eess.AS cs.SD New submission 70%

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

Affiliations * Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan; United Link Co., Ltd., Taiwan

Topic match Audio-visual multimodal: personalized keyword spotting with speaker verification

AI summary Proposes ZP-KWS, a lightweight framework that combines a phoneme-supervised audio encoder with a compact speaker encoder and uses multiplicative late fusion for zero-shot keyword detection with speaker verification, reducing the target-only false rejection rate by up to 60% relative to the strongest baseline across several datasets.

Comments Accepted to Interspeech 2026

Abstract

User-defined keyword spotting (UD-KWS) enables zero-shot wake-word detection from text, but existing systems learn speaker-invariant representations that cannot reject impostors uttering the correct keyword. We address this dual zero-shot setting -- unseen keywords and unseen speakers -- with ZP-KWS, a lightweight framework combining a phoneme-supervised audio encoder with a GE2E-pretrained compact speaker encoder (about 0.9M parameters). Multiplicative late fusion at inference grants each branch independent veto power, supporting modes from conventional detection to strict speaker-gated activation without retraining. On LibriPhrase, Google Speech Commands, and Qualcomm datasets, ZP-KWS reduces target-only FRR at 1% FAR by up to 60% relative to the strongest baseline while maintaining competitive keyword detection, all within a 1.55M parameter budget for edge deployment.
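The veto behaviour of multiplicative late fusion is easy to see in a toy sketch. The abstract does not say how ZP-KWS switches between detection modes, so the `speaker_gate` exponent below is a hypothetical mechanism introduced only for illustration, as are the function name and the example scores.

```python
def fused_score(keyword_score, speaker_score, speaker_gate=1.0):
    """Multiply branch scores in [0, 1]; either branch can veto the activation.

    speaker_gate is a hypothetical knob: 0.0 ignores the speaker branch
    (conventional detection), 1.0 applies it in full (speaker-gated).
    """
    effective_speaker = speaker_score ** speaker_gate
    return keyword_score * effective_speaker

# An impostor says the right keyword: high keyword score, low speaker score.
impostor_gated = fused_score(0.9, 0.1)            # speaker branch vetoes
impostor_ungated = fused_score(0.9, 0.1, 0.0)     # conventional detection

# The enrolled speaker says the wrong word: keyword branch vetoes.
wrong_word = fused_score(0.0, 0.99)
```

Because the combination is a product rather than a weighted sum, a near-zero score from either branch cannot be compensated by a high score from the other, and changing the mode needs no retraining.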

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD New submission 70%

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

Affiliations * Nagoya Institute of Technology, Japan; LY Corporation, Japan

Topic match Audio-visual multimodal: study of human-model differences in speech quality assessment

AI summary Using perturbations of acoustic quality, prosody, and speaker characteristics, finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, and that they show a mean-F0 bias while remaining insensitive to speaking rate and F0 variability.

Comments Accepted to INTERSPEECH 2026

Abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.19823 2026-06-19 eess.AS cs.LG New submission 70%

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

Affiliations * DeepNet Discovery Network, University of Auckland, New Zealand; University of Illinois Urbana-Champaign, USA

Topic match Audio-visual multimodal: zero-shot voice cloning to augment dysarthric ASR

AI summary To address the scarcity and high variability of dysarthric speech data, uses zero-shot voice cloning (Higgs Audio V2) to generate synthetic data for fine-tuning Whisper-medium. On TORGO, the word error rate approaches that of real-data fine-tuning while data-collection cost is substantially reduced.

Comments Accepted to Interspeech 2026, Sydney, Australia

Abstract

Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot baseline (31.62% WER), Clone FT achieves a competitive 26.00% WER, nearly matching the 24.44% and 25.12% obtained with Real and Hybrid FT, respectively. Notably, Clone and Hybrid FT outperform Real FT for moderate-severe speakers. Clone FT also achieves the best results (11.45% relative improvement) in cross-corpus evaluation on SAP-1102. These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
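The absolute WERs in this abstract can be put on a common footing as relative reductions against the 31.62% zero-shot figure. The helper below is a plain arithmetic sketch; the resulting percentages are derived here from the reported WERs and are not numbers stated by the authors.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

zero_shot = 31.62  # WER without fine-tuning, as reported in the abstract
reported = {"clone": 26.00, "real": 24.44, "hybrid": 25.12}

# Derived values, roughly 17.8 (clone), 22.7 (real), and 20.6 (hybrid) percent.
reductions = {name: relative_reduction(zero_shot, wer) for name, wer in reported.items()}
```

On this view, cloned data recovers most of the gain that real fine-tuning data provides on TORGO.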

2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP New submission 70%

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Affiliations * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India; Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, USA

Topic match Audio-visual multimodal: in-domain data augmentation to improve dysarthric ASR

AI summary To address data scarcity and severity variation in dysarthric speech recognition, explores four data augmentation methods (SRM, PM, FM, VTLP) for fine-tuning a pretrained Wav2Vec2 model, achieving relative WER improvements of 15.47% to 30.02% across severity levels.

Abstract

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the end-to-end pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM ($s = 0.8$) for low (9.02%) and medium (38.11%) severities, and with PM ($\tau = 0.8$) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP New submission 70%

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Affiliations * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India; Department of Information and Communications Engineering, Aalto University, Finland; Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, USA

Topic match Audio-visual multimodal: study of features and models for dysarthric speech recognition

AI summary Systematically studies combinations of spectral features and acoustic models. By adding pitch features and tuning the number of overlapping frames between training chunks, it achieves relative improvements of 4.65% and 4.63% on isolated-word and sentence recognition with the F-TDNN model.

Abstract

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different acoustic models, offering suitable feature selections for each. The incorporation of pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

2606.19791 2026-06-19 eess.AS cs.AI cs.SD New submission 70%

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Affiliations * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India; Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA

Topic match Audio-visual multimodal: analysis of how fine-tuning strategies for children's speech recognition generalize

AI summary For low-resource children's ASR, systematically analyzes how different fine-tuning strategies generalize across datasets, ages, and genders, and finds that particular strategies markedly improve generalization.


2606.20478 2026-06-19 eess.AS New submission 60%

Beyond Speaker Independence: Evaluating Cross-Lingual Acoustic-to-Articulatory Inversion Across Finnish and Russian

Ruchi Pandey, Tomi Kinnunen

Topic match Audio-visual multimodal: cross-lingual acoustic-to-articulatory mapping involving multimodal features

AI summary Systematically evaluates acoustic-to-articulatory inversion (AAI) under cross-speaker and cross-language domain shift using FROST-EMA, a newly built Finnish-Russian bilingual EMA corpus. It compares articulatory targets, acoustic front-ends, and inversion back-ends, and finds a moderate drop in Pearson correlation for cross-gender transfer (about 0.05 to 0.10) and a larger drop for cross-language transfer (about 0.10 to 0.20).

Abstract

Acoustic-to-articulatory inversion (AAI) remains challenging under domain shifts where changes in speaker attributes and cross-language conditions often degrade performance. We conduct a systematic evaluation under such shifts and establish baseline benchmarks on FROST-EMA, a Finnish-Russian bilingual EMA corpus. FROST-EMA addresses the English bias and limited speaker diversity of existing resources. We benchmark (i) articulatory targets (raw EMA coordinates vs tract variables), (ii) acoustic front-ends (MFCC vs SSL features), and (iii) inversion back-ends (BiLSTM vs a lightweight attention-based sequence model). We further define evaluation protocols for cross-gender transfer (within language) and cross-language transfer (within gender). The results indicate that cross-gender mismatch introduces moderate Pearson correlation declines (approximately 0.05 to 0.10) relative to the in-domain baseline, whereas cross-language mismatch causes larger drops (approximately 0.10 to 0.20).
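The reported declines are in Pearson correlation between measured and predicted articulator trajectories. The sketch below shows that metric on toy data; the trajectory values are invented for illustration, and how the paper aggregates correlation across articulators and utterances is not stated in the abstract.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Toy trajectory of one articulator coordinate over four frames (illustrative).
measured = [0.0, 1.0, 2.0, 3.0]
predicted = [0.1, 0.9, 2.2, 2.8]

r = pearson(measured, predicted)  # close to, but below, 1.0
```

A drop of 0.10 to 0.20 on this scale therefore means the predicted trajectories track the measured movement noticeably less closely, independent of any constant offset or scale difference.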