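arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.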

2606.19381 2026-06-19 cs.SD cs.AI New submission

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Yue Heng Yeo, Haoyang Li, Yizhou Peng, Shreyas Gopal, Hexin Liu, Leibny Paola Garcia-Perera, Hardik B. Sailor, Jeremy H. M. Wong, Eng Siong Chng

Affiliations: College of Computing and Data Science, Nanyang Technological University; Google DeepMind

AI summary: To address the scarcity of high-quality text-speech pairs for code-switching speech recognition, the paper proposes a code-mixing guided preference learning framework that uses a code-mixing index to optimize the switching fidelity of synthetic speech. Fine-tuning Whisper Large on the SEAME corpus reduces the mixed error rate from 12.1%/17.8% to 8.9%/14.2%.

Comments Accepted to Interspeech 2026
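
The summary above refers to a code-mixing index but gives no formula. As a rough illustration only, the sketch below assumes the widely used utterance-level definition, CMI = 100 × (1 − max_i w_i / (n − u)), where n is the token count, u the number of language-independent tokens, and w_i the token count of language i. The paper may use a different variant; the tag names and the `neutral_tag` parameter are hypothetical.

```python
from collections import Counter

def code_mixing_index(lang_tags, neutral_tag="other"):
    """Utterance-level code-mixing index (assumed definition, see lead-in).

    lang_tags holds one language tag per token; tokens tagged `neutral_tag`
    are treated as language-independent and excluded from the ratio.
    """
    counts = Counter(t for t in lang_tags if t != neutral_tag)
    tagged = sum(counts.values())      # n - u in the formula
    if tagged == 0:
        return 0.0                     # nothing language-specific to mix
    dominant = max(counts.values())    # max_i w_i
    return 100.0 * (1.0 - dominant / tagged)

# Hypothetical token-level tags for two utterances.
mono = ["en", "en", "en", "en"]
mixed = ["en", "zh", "zh", "en", "other", "zh"]
print(code_mixing_index(mono))
print(code_mixing_index(mixed))
```

A monolingual utterance scores zero, and the score grows as the dominant language covers a smaller share of the tagged tokens.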

2606.19398 2026-06-19 cs.SD eess.AS eess.SP New submission

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

Affiliations: Carnegie Mellon University; New York University; James Silberrad Brown Center for AI; Columbia University; Northeastern University; Stanford University; Amazon GenAI

AI summary: Proposes S-JEPA, an encoder-predictor pair trained via KL divergence to match the soft posteriors of a Gaussian mixture model, with no offline re-clustering or teacher distillation. Under the SUPERB protocol it achieves the lowest WER among the evaluated self-supervised methods below 90M parameters and establishes a new Pareto frontier.

Abstract

Self-supervised speech encoders are predominantly trained by predicting discrete hard cluster IDs at masked positions, a recipe that collapses acoustic ambiguity at category boundaries and requires interrupting training to re-cluster the entire corpus between iterations. We introduce S-JEPA, a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs as one continuous optimization trajectory in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal, removing both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count, establishing a new Pareto frontier without offline re-clustering or teacher distillation. An analysis of the predictor's per-frame entropy on held-out speech reveals a bimodal distribution with a substantial minority of frames near the entropy of a perfect two-cluster tie, providing direct empirical evidence that the soft-target objective preserves the acoustic ambiguity that hard targets would collapse. Code is available at https://github.com/gioannides/s-jepa.
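
The soft-target objective in this abstract can be sketched in a few lines. The example below is a toy under stated assumptions, not the authors' implementation: it uses a scalar feature, a two-component GMM, and the direction KL(target || prediction), none of which the abstract specifies. It also shows why a frame midway between two clusters has entropy ln 2, the "perfect two-cluster tie" the abstract mentions.

```python
import math

def gmm_posteriors(x, weights, means, variances):
    """Soft responsibilities of a 1-D GMM for a scalar feature x."""
    log_probs = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_probs)                       # subtract max for stability
    exps = [math.exp(lp - top) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred); terms with zero target mass contribute nothing."""
    return sum(t * math.log(t / max(p, eps)) for t, p in zip(target, pred) if t > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A frame exactly halfway between two equally weighted clusters.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
target = gmm_posteriors(0.0, weights, means, variances)

matched = kl_divergence(target, softmax([0.0, 0.0]))        # predictor agrees
overconfident = kl_divergence(target, softmax([4.0, 0.0]))  # predictor picks one
print(target, entropy(target), matched, overconfident)
```

A hard cluster ID would force this frame onto one of the two clusters, whereas the soft target penalizes a predictor that commits to either.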


2606.19996 2026-06-19 cs.SD cs.CL New submission

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

Affiliations: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Key Laboratory of System Control and Information Processing, Ministry of Education of China; Shanghai Key Laboratory of Perception and Control in Industrial Network Systems; Department of Computer Science and Engineering, University of Bologna; Department of Mathematical, Physical and Computer Sciences, University of Parma

AI summary: Proposes a segment-level representation learning framework that combines an autoencoder with contrastive learning, achieving stable binary and three-class cognitive impairment detection on four Mandarin datasets, with particular gains in the clinically difficult three-class setting.

Comments 15 pages, 7 figures, 5 tables

Abstract

Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.


2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP Cross-listing

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI summary: Systematically studies combinations of spectral features and acoustic models. By adding pitch features and tuning the number of overlapping frames between consecutive training chunks, the F-TDNN model achieves relative improvements of 4.65% on isolated-word recognition and 4.63% on sentence recognition.

Abstract

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.


2606.19910 2026-06-19 cs.CL cs.SD eess.AS Cross-listing

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

Affiliations: Qatar Computing Research Institute, Doha, Qatar

AI summary: Proposes a lightweight pronunciation assessment framework trained only on native speech resources. It discretizes speech into tokens, computes surprisal with a language model, and adds text-guided alignment features, approaching the performance of supervised methods when run unsupervised or with light calibration.

Comments Accepted to Interspeech 2026

Abstract

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit--DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
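
The surprisal signal described above can be illustrated with a deliberately small stand-in. This sketch assumes the SSL encoder and K-means codebook have already produced integer token IDs, and it swaps in a bigram model with add-one smoothing for the paper's token language model, whose architecture the abstract does not give. All token sequences are made up.

```python
import math
from collections import Counter

START = -1  # hypothetical start-of-sequence symbol

def train_bigram(sequences, vocab_size):
    """Fit a smoothed bigram model on native token sequences."""
    pair_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        prev = START
        for tok in seq:
            pair_counts[(prev, tok)] += 1
            context_counts[prev] += 1
            prev = tok

    def prob(prev, tok):
        # Add-one smoothing keeps unseen transitions at non-zero probability.
        return (pair_counts[(prev, tok)] + 1) / (context_counts[prev] + vocab_size)

    return prob

def mean_surprisal(seq, prob):
    """Average of -log2 p(token | previous token), in bits per token."""
    prev, total = START, 0.0
    for tok in seq:
        total += -math.log2(prob(prev, tok))
        prev = tok
    return total / len(seq)

native = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 3, 2]]
prob = train_bigram(native, vocab_size=4)

native_like = mean_surprisal([0, 1, 2, 3], prob)
deviant = mean_surprisal([3, 0, 2, 1], prob)
print(native_like, deviant)
```

The sequence that follows native transition patterns gets a lower mean surprisal than the one that violates them, which is the quantity the framework treats as a deviation cue.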


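2606.20106 2026-06-19 eess.AS cs.SD Cross-listing

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

AI summary: Proposes ZP-KWS, a lightweight framework that combines a phoneme-supervised audio encoder with a compact speaker encoder and uses multiplicative late fusion to perform zero-shot keyword detection and speaker verification, reducing the target false rejection rate by up to 60% across multiple datasets.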

Comments Accepted to Interspeech 2026
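
Multiplicative late fusion, as named in the summary above, can be pictured as multiplying the two branch scores before thresholding. The sketch below is only a guess at the general shape: it assumes both scores lie in [0, 1], and the function names and the 0.5 threshold are hypothetical rather than taken from ZP-KWS.

```python
def fused_score(keyword_score, speaker_score):
    """Product of the keyword-match and speaker-match scores, both in [0, 1]."""
    return keyword_score * speaker_score

def accept(keyword_score, speaker_score, threshold=0.5):
    """Trigger only if the fused score clears the (hypothetical) threshold."""
    return fused_score(keyword_score, speaker_score) >= threshold

print(accept(0.9, 0.9))    # right keyword, enrolled speaker
print(accept(0.9, 0.2))    # right keyword, different speaker
print(accept(0.3, 0.95))   # enrolled speaker, wrong keyword
```

Because the scores multiply, a low value on either branch pulls the fused score down, so the system fires only when keyword and speaker both match.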

2604.18105 2026-06-19 eess.AS cs.CL cs.SD Version update

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

Affiliations: Advanced Intelligent Systems Group, NIO

AI summary: Proposes the NIM4-ASR framework, which redesigns the multi-stage training paradigm (pre-training architecture optimization, iterative asynchronous SFT, and ASR-specialized reinforcement learning) and adds production optimizations (noise robustness, streaming inference, and RAG-based hotword customization), achieving SOTA performance with 2.3B parameters.

Abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.


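2605.17443 2026-06-19 cs.CL cs.SD eess.AS Version update

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Donghyuk Jung, Youngwon Choi

Affiliations: Korea Culture Technology Institute, Republic of Korea; Maum AI Inc., Republic of Korea

AI summary: Studies error propagation in ASR-LLM cascades for Korean spoken question answering. By analyzing downstream semantic failures, it shows error effects that conventional ASR metrics do not fully capture, finds that cascade degradation is consistent across LLMs of different capability, identifies single-character ASR errors as a channel for semantic failure, and shows through an auxiliary comparison that large audio language models outperform ASR-LLM pipelines with matched language models on noisy Korean SQA.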

Comments Preprint. Submitted to APSIPA ASC 2026

2606.18485 2026-06-19 cs.SD cs.AI eess.AS New submission

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

Affiliations: NVIDIA Corporation

AI summary: Proposes MagpieTTS-LF, an inference-time method that uses soft attention priors, stateful inference, and history-aware text encoding to generate coherent long-form speech without retraining the model.

Journal ref Interspeech 2026

Abstract

Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.


2606.19629 2026-06-19 cs.SD cs.AI cs.LG New submission

RIVET: Robust Idempotent Voice Attribute Editing

Dareen Alharthi, Bhuvan Koduru, Rita Singh, Bhiksha Raj

Affiliations: Carnegie Mellon University

AI summary: Proposes RIVET, a training framework that uses idempotency regularization to make voice attribute editing models more robust to label noise; it outperforms standard training on datasets with both synthetic and naturally occurring label noise.

Abstract

Voice attribute editing models modify characteristics such as age and gender while preserving speaker identity. In large-scale speech datasets, however, attribute annotations are often noisy or inconsistent, which can cause conditional generative models to produce unstable edits. In this work, we show that idempotency provides an effective mechanism for improving robustness to noisy labels. An idempotent operator is one for which repeated application does not change the result, i.e., f(f(x)) = f(x). Enforcing this property acts as an implicit regularizer that reduces sensitivity to mislabeled examples. We introduce RIVET, a training framework that incorporates an idempotency objective to improve robustness to label noise. We evaluate RIVET under controlled label noise and on the GLOBE dataset with naturally noisy annotations. RIVET improves editing success and better preserves speaker identity than standard training, showing that idempotency improves robustness in voice editing models.
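
The property f(f(x)) = f(x) from this abstract can be turned into a penalty by measuring how far a second application moves the output. The sketch below is a scalar toy with hypothetical operators, not RIVET's loss, whose exact form the abstract does not state.

```python
def idempotency_penalty(f, xs):
    """Mean squared gap between f(f(x)) and f(x) over scalar inputs."""
    gaps = [(f(f(x)) - f(x)) ** 2 for x in xs]
    return sum(gaps) / len(gaps)

def clip_unit(x):
    """Clamp to [-1, 1]; a second application changes nothing."""
    return max(-1.0, min(1.0, x))

def halve(x):
    """Not idempotent: every application shrinks the value again."""
    return 0.5 * x

xs = [-3.0, -0.5, 0.0, 0.7, 2.0]
print(idempotency_penalty(clip_unit, xs))
print(idempotency_penalty(halve, xs))
```

The clamp incurs zero penalty and the halving operator does not. In an editing model, f would be the edit toward a fixed target attribute, so the penalty would discourage edits that keep drifting when reapplied.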


2606.19792 2026-06-19 cs.SD New submission

Exploring Pre-training Benefits on Phoneme Addition through Fine-tuning in Speech Synthesis

Masato Murata, Koichi Miyazaki, Tomoki Koriyama, Tomoki Toda

Affiliations: CyberAgent, Japan; Nagoya University, Japan

AI summary: Studies how pre-trained models behave when new phonemes are added during fine-tuning, and finds that pre-training mainly improves naturalness while offering limited benefit for adding new phonemes.

Comments Accepted by INTERSPEECH 2026

Abstract

Transfer learning is widely used for low-resource text-to-speech. When the target corpus contains phonemes unseen in pre-training, the model must expand its phoneme inventory during fine-tuning; we call the process "phoneme addition." However, it remains unclear whether the pre-trained ability to generate seen phonemes contributes to this process. This study investigates phoneme addition in two settings: (1) a simulation setup using LLM-generated phoneme-controlled corpora that enables investigation without considering confounding factors, and (2) a real-speech cross-lingual transfer setup (English to Japanese) to validate whether the findings hold in practice. Experiments in both settings showed that while fine-tuning achieved higher naturalness than training from scratch, it required as much or more data to achieve comparable PER for new phonemes. These results indicate that pre-training mainly contributes to naturalness improvement, but offers limited benefit for phoneme addition.


2606.20101 2026-06-19 cs.SD cs.AI cs.MM New submission

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Fisheries College, Ocean University of China; College of Information and Electrical Engineering, China Agricultural University

AI summary: Proposes a hybrid two-stage diffusion transformer architecture whose coarse-to-fine strategy balances global semantic alignment against local detail editing, improving performance and efficiency on tasks with overlapping audio events and complex instructions.

Abstract

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
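
For readers unfamiliar with rectified flow, the training target and sampler reduce to a straight-line interpolation and an Euler integration. The sketch below shows that standard formulation on plain lists; it assumes t = 0 is noise and t = 1 is data, uses an oracle velocity in place of a trained network, and says nothing about the paper's transformer or its text conditioning.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (noise) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity model; constant along the path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.5, -1.0, 2.0]   # made-up latent values
data = [1.0, 0.0, -1.0]

v = target_velocity(noise, data)
oracle = lambda x, t: v    # stands in for a trained velocity network
out = euler_sample(noise, oracle, steps=4)
print(interpolate(noise, data, 0.5), out)
```

With an exact velocity the path is straight, so even a few Euler steps land on the data point, which is the reason rectified flow is associated with few-step sampling.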


2606.20218 2026-06-19 cs.SD New submission

Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

Yudong Li, Zihao Fang, Junwen Qiu, Ruihai Jing, Ruixiang Hang, Yingda Shen, Zhizheng Wu

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Shenzhen Transsion Holdings Co., Ltd.

AI summary: To address the difficulty of disentangling timbre from linguistic content in streaming zero-shot voice conversion, the paper uses speaker anonymization as a perturbation mechanism that explicitly mitigates timbre leakage while retaining prosodic utility, enabling a strictly causal, zero-lookahead network.

Comments Accepted to Interspeech 2026

Abstract

Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing models to explicitly inject features like fundamental frequency. This often requires buffering future frames, creating algorithmic lookahead latency. On the other hand, existing perturbation methods largely overlook the crucial trade-off between timbre leakage and utility preservation. Recognizing this neglected trade-off, we find that the inherent objective of Speaker Anonymization (SA) aligns well with balancing these factors. Thus, we introduce SA as a novel perturbation mechanism to explicitly mitigate timbre leakage while retaining prosodic utility. Crucially, SA's robust representations significantly alleviate the generator's reliance on future context, enabling our strictly causal, zero-lookahead network. Audio samples are available at https://amphionteam.github.io/Zero-VC-demo/.


2603.04219 2026-06-19 cs.SD cs.AI eess.AS Version update

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

Affiliations: Maum AI Inc.; Humelo Inc.

AI summary: Proposes the ZeSTA framework, which uses a lightweight domain embedding to distinguish real from synthetic speech and oversamples real data, improving speaker similarity for zero-shot TTS augmentation in extremely low-resource settings while maintaining intelligibility and perceptual quality.

Comments 6 pages, accepted to INTERSPEECH 2026

2606.16417 2026-06-19 cs.SD eess.AS Version update

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Xintong Wang, Ye Wang

Affiliations: University of Science and Technology of China

AI summary: Proposes Joycent, a diffusion-based accent TTS method that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. It integrates accent and speaker representations through conditional layer normalization and introduces the WhisAID accent identification model, improving accentedness while preserving speaker identity.

Abstract

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.


2606.19209 2026-06-19 cs.SD Version update

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

Shuoyi Zhou, Yixuan Zhou, Peiji Yang, Yifan Hu, Yicheng Zhong, Zhisheng Wang, Zhiyong Wu

Affiliations: Shenzhen International Graduate School, Tsinghua University; Inner Mongolia University; Tencent

AI summary: Proposes FineCombo-TTS, a unified framework whose conditional flow matching-based speech variance predictor models fine-grained reference-to-target transformations guided by text descriptions, giving flexible and precise control over acoustic attributes.

Comments Accepted by Interspeech 2026

Abstract

Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.


2606.19688 2026-06-19 cs.SD eess.AS New submission

Latency-Configurable Streaming Speech Enhancement via Asymmetric Temporal Padding

Yunsik Kim, Yoonyoung Chung

Affiliations: Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH); Intus Co. Ltd.

AI summary: Proposes LaCo-SENet, which uses asymmetric temporal padding and a dual-buffer streaming mechanism to trade latency against quality with a single hyperparameter. On VoiceBank+DEMAND, a 1.37M-parameter model covers latencies of 12.5-75.0 ms, with PESQ rising from 3.35 to 3.43.

Comments 5 pages, 3 figures. Accepted for presentation at Interspeech 2026

Abstract

Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter. First, asymmetric temporal padding redistributes past and future context in convolutions, enabling systematic latency configuration. Second, dual-buffer streaming combines state buffers for past context with lookahead buffers that supply future context at both the input and feature levels. Selective state updates also prevent future-frame leakage into the streaming state, ensuring training-inference consistency. On VoiceBank+DEMAND, a fixed-budget (1.37M parameters) backbone yields a family of models spanning 12.5-75.0 ms, with PESQ rising from 3.35 to 3.43. At just 12.5 ms (fully causal), a PESQ of 3.35 matches or exceeds the prior causal state-of-the-art (3.27 at 46.5 ms).
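
Asymmetric temporal padding can be seen on a single stride-1 convolution: the kernel needs kernel_size − 1 padding frames in total, and however many are placed on the right become frames of lookahead. The sketch below is a toy under those assumptions; `split_padding` is a hypothetical helper, and how LaCo-SENet maps its hyperparameter onto layers is not stated in the abstract.

```python
def conv1d_padded(signal, kernel, left_pad, right_pad):
    """Stride-1 cross-correlation with explicit zero padding on each side."""
    padded = [0.0] * left_pad + list(signal) + [0.0] * right_pad
    k = len(kernel)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]

def split_padding(kernel_size, lookahead):
    """Divide the kernel_size - 1 padding frames into (past, future)."""
    total = kernel_size - 1
    if not 0 <= lookahead <= total:
        raise ValueError("lookahead must be between 0 and kernel_size - 1")
    return total - lookahead, lookahead

signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # impulse at frame 3
kernel = [1.0, 1.0, 1.0]

causal = conv1d_padded(signal, kernel, *split_padding(3, 0))
one_ahead = conv1d_padded(signal, kernel, *split_padding(3, 1))
print(causal)
print(one_ahead)
```

With all padding on the left, no output frame before frame 3 reacts to the impulse. Moving one padding frame to the right makes output frame 2 react, so the layer has one frame of lookahead, and the output length is unchanged in both cases.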


2606.18611 2026-06-19 cs.SD cs.AI cs.LG stat.ML Version update

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

Affiliations: The Asahi Shimbun Company; Tokyo Woman's Christian University

AI summary: Proposes the parameter-efficient QC-GAN, which combines a quaternion Conformer generator with MetricGAN training and reduces the parameter count by sharing weights through the Hamilton product. On VoiceBank+DEMAND it reaches a PESQ of 3.48 with 0.89M parameters, matching models twice its size.

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech 2026
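
The weight sharing mentioned in the summary comes from the Hamilton product, which mixes four input components using only four weight components. The sketch below shows the product itself and a one-quaternion "layer"; the `quaternion_linear` helper is hypothetical and much simpler than any layer in QC-GAN.

```python
def hamilton_product(p, q):
    """Product of quaternions p = (a, b, c, d) and q = (e, f, g, h)."""
    a, b, c, d = p
    e, f, g, h = q
    return (
        a * e - b * f - c * g - d * h,   # real part
        a * f + b * e + c * h - d * g,   # i component
        a * g - b * h + c * e + d * f,   # j component
        a * h + b * g - c * f + d * e,   # k component
    )

def quaternion_linear(weight, x):
    """Map one quaternion input to one quaternion output with 4 real weights."""
    return hamilton_product(weight, x)

i, j = (0, 1, 0, 0), (0, 0, 1, 0)
print(hamilton_product(i, j))   # i * j
print(hamilton_product(j, i))   # j * i, the product is not commutative

dense_params = 4 * 4            # unconstrained real 4x4 map
quaternion_params = 4           # one weight quaternion
print(dense_params // quaternion_params)
```

A real-valued layer mapping four inputs to four outputs needs 16 weights, while the quaternion version reuses 4, which is the source of the roughly fourfold parameter saving usually cited for quaternion networks.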

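2606.19568 2026-06-19 cs.SD cs.AI New submission

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

Sinclair Gurny, Ryan Quinn

AI summary: Systematically studies how feature extraction techniques and their parameters affect acoustic gunshot classification. Evaluated with ResNet-18 on a dataset of 23,000 gunshot recordings, choosing the right technique raises top-1 accuracy by 20%, and parameter tuning adds a further 4.7%.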

2505.18726 2026-06-19 cs.SD cs.LG eess.AS Version update

Bioacoustic Geolocation: Species Sounds as Geographic Signals

Mustafa Chasmai, Wuao Liu, Subhransu Maji, Grant Van Horn

Affiliations: University of Massachusetts, Amherst

AI summary: Studies global-scale geolocation from sound alone. It exploits the cues that species' geographic ranges leave in bioacoustic signals, proposes a geolocation method that combines species range prediction with retrieval, and validates the potential of multimodal fusion.

Comments Accepted to ICML 26

Abstract

Can we determine someone's geographic location solely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? In this work, we tackle the challenge of global-scale audio geolocation, with a particular focus on wildlife and natural sounds. We posit that bioacoustic signals contain informative geolocation cues because of well-defined geographic ranges of species. To test this hypothesis, we benchmark image geolocation and soundscape mapping methods, design oracles and species-centric baselines, and propose a hybrid approach that combines species range prediction with retrieval-based geolocation. We further ask whether geolocation improves with species-diverse recordings and spatiotemporal aggregation across neighboring samples. Finally, we extend our study to multimodal geolocation with case studies from movies that combine both audio and visual content. Our results highlight the potential of incorporating bioacoustic signals into geospatial tasks, motivating future work on species recognition and audio geolocation.


2606.20418 2026-06-19 cs.SD New submission

MixProLAP: Mixture-Induced Uncertainty Modeling for Probabilistic Language-Audio Pretraining

Yu Nakagome, Jaesong Lee, Soo-Whan Chung

Affiliations: LINE WORKS Corporation; NAVER Cloud Corporation

AI summary: Proposes MixProLAP, a probabilistic audio-language pretraining framework that mixes audio-text pairs to simulate overlapping sounds, models many-to-many correspondence uncertainty, and introduces a multi-level inclusion loss. It outperforms deterministic baselines on audio-text retrieval.

Comments: Accepted to Interspeech 2026

Abstract

Acoustic environments often contain multiple overlapping sound events, and the same acoustic scene can be described using diverse textual expressions, making audio-text alignment inherently ambiguous. This paper proposes a probabilistic audio-language pretraining framework to model many-to-many correspondence ambiguity in audio-text alignment. Unlike conventional contrastive methods that learn deterministic point embeddings, our approach represents each modality as a distribution and learns uncertainty-aware cross-modal alignment. Rather than relying on masking-based uncertainty simulation, we mix audio-text pairs to create overlapping sounds that better reflect real acoustic mixtures and capture semantic inclusion relations among sound events. We further introduce a multi-level inclusion loss to enforce representations consistent with these relations. Experiments on audio-text retrieval benchmarks show that the proposed method outperforms deterministic baselines.

2606.19791 2026-06-19 eess.AS cs.AI cs.SD Cross-list

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI summary: For low-resource children's ASR, systematically analyzes how different fine-tuning strategies generalize across datasets, ages, and genders, and finds that specific strategies significantly improve generalization.

Abstract

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different acoustic models, offering suitable feature selections for each. The incorporation of pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP Cross-list

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI summary: To address data scarcity and varying severity in dysarthric speech recognition, this paper explores four data augmentation methods (SRM, PM, FM, VTLP) for fine-tuning a pre-trained Wav2Vec2 model, achieving significant word error rate reductions across severity levels.

Abstract

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the end-to-end pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM ($s$=0.8) for low (9.02%) and medium (38.11%) severities, and with PM ($\tau$=0.8) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.
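
The relative improvements quoted in this abstract can be turned back into implied baseline WERs, assuming the usual definition relative improvement = (baseline - new) / baseline. The abstract does not state the baselines, so the values computed here are back-of-the-envelope estimates rather than reported numbers, and the function name is purely illustrative.

```python
def implied_baseline_wer(best_wer: float, relative_improvement: float) -> float:
    """Invert rel = (baseline - best) / baseline to recover the baseline WER.

    Both arguments are percentages, e.g. 9.02 and 30.02.
    """
    if not 0.0 <= relative_improvement < 100.0:
        raise ValueError("relative improvement must be in [0, 100)")
    return best_wer / (1.0 - relative_improvement / 100.0)


# (best WER %, relative improvement %) as quoted in the abstract
reported = {
    "low": (9.02, 30.02),
    "medium": (38.11, 16.64),
    "high": (55.15, 15.47),
}

# Assumption: the improvement is measured against the per-severity baseline.
implied = {sev: implied_baseline_wer(w, r) for sev, (w, r) in reported.items()}

for severity, baseline in implied.items():
    print(f"{severity}: implied baseline WER is about {baseline:.2f}%")
```

Under that assumption the implied baselines come out near 12.9%, 45.7%, and 65.2% for low, medium, and high severity.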

2606.19597 2026-06-19 cs.SD cs.AI cs.LG New submission

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

Junyi Fan, Donald S. Williamson

Affiliations: Department of Computer Science and Engineering, The Ohio State University, USA

AI summary: Proposes PrefSQA, which combines uncertainty-aware logits, an impairment attention head, and a non-matching-reference comparison module, and uses high-quality preference datasets to improve the accuracy of speech quality assessment.

Comments: Accepted to INTERSPEECH 2026

Abstract

Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference prediction reduces this variability as listeners compare signals directly, producing cleaner labels. We study MOS-free preference prediction and propose PrefSQA, which incorporates uncertainty-aware logits, an impairment attention head, and a module based on non-matching-reference comparisons. We use and refine five datasets, including MOS-derived and low-noise simulated sets with matching and non-matching content, experiment with human preference sets, and test on unseen data. Experiments show small improvements on MOS-derived data, while other sets reveal clear improvement over the baselines, highlighting the value of high-quality preference data and demonstrating the effectiveness of the proposed method.
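
As a rough illustration of what pairwise preference prediction means in general, the sketch below scores two signals and converts the score difference into a preference probability with a logistic (Bradley-Terry style) link. This is a generic formulation, not PrefSQA itself: the abstract does not describe the model at this level, and the scores, function names, and loss are assumptions made for illustration.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """P(A preferred over B) under a logistic link on the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_a: float, score_b: float, a_preferred: bool) -> float:
    """Binary cross-entropy against a listener's A-vs-B preference label."""
    p = preference_probability(score_a, score_b)
    return -math.log(p if a_preferred else 1.0 - p)


# Hypothetical quality scores for two speech signals (not from the paper).
p = preference_probability(2.0, 0.5)
print(f"P(A > B) = {p:.3f}")
print(f"loss if A was preferred: {pairwise_loss(2.0, 0.5, True):.3f}")
print(f"loss if B was preferred: {pairwise_loss(2.0, 0.5, False):.3f}")
```

The loss depends only on the difference between the two scores, so no absolute MOS label is needed, which is the property that makes preference labels attractive when scalar ratings are noisy.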

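2606.19987 2026-06-19 cs.SD eess.AS New submission

PolSeT: Polish Semantics of Timbre Dataset

Jan Jasiński

AI summary: Introduces the PolSeT dataset, which collects Polish semantic descriptors and timbre ratings through free-verbalization and semantic differential experiments, filling a gap in timbre research data and supporting cross-cultural psychoacoustics and MIR research.

Comments: 8 pages, 7 figures. Data descriptor for the PolSeT dataset (Polish Semantics of Timbre), available at https://doi.org/10.5281/zenodo.17830609 under CC BY 4.0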

Abstract

This data report introduces PolSeT (Polish Semantic Timbre), a dataset designed to facilitate research in psychoacoustics and Music Information Retrieval (MIR) in Polish and cross-cultural contexts. The dataset contains data from two sequential experiments. Experiment 1 (N=60) was a free-verbalization task aimed at creating a lexicon of Polish semantic descriptors. Using 11 stimuli, a total of 1901 descriptors (701 unique) were gathered. Experiment 2 (N=105) utilized this lexicon to conduct a semantic differential study, where participants rated 18 instrument sounds on 8 bipolar scales, with repeated trials for reliability analysis. The released dataset includes raw listener responses, comprehensive demographics (experience, gender, age), audio stimuli, and extracted acoustic features with Python extraction code. This dataset addresses a gap in open timbre research data, providing both the qualitative linguistic groundwork and the quantitative ratings necessary for psychoacoustic research and the training of multilingual semantic embedding models.

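2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD Cross-list

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

AI summary: Using acoustic degradation, prosodic errors, and speaker-characteristic perturbations, finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, and that they are biased by mean F0 while remaining insensitive to variations in speaking rate and F0.

Comments: Accepted to INTERSPEECH 2026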

Abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD Cross-list

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

AI summary: Proposes PASQA, which uses an accent-controllable synthetic dataset and pseudo accent-quality scores, together with training strategies such as self-supervised representations and mora-conditioned fusion, to assess pitch-accent correctness, outperforming conventional MOS models.

Comments: Accepted to INTERSPEECH 2026

Abstract

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.
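
One plausible way to quantify "ordering by accent-error severity" is pairwise ordering accuracy: the fraction of sample pairs whose predicted scores are ranked consistently with their accent-error rates. The sketch assumes a pseudo score of the form 1 - error rate, which the abstract does not spell out; the score definition, the notion of "accent phrases", and the metric actually used by the authors may all differ.

```python
from itertools import combinations


def pseudo_accent_score(num_accent_errors: int, num_accent_phrases: int) -> float:
    """Hypothetical pseudo score: 1 - accent-error rate (higher is better)."""
    if num_accent_phrases <= 0:
        raise ValueError("need at least one accent phrase")
    return 1.0 - num_accent_errors / num_accent_phrases


def ordering_accuracy(predicted: list[float], reference: list[float]) -> float:
    """Fraction of pairs ranked the same way by prediction and reference.

    Pairs that are tied in the reference are skipped; a tie in the
    prediction counts as a miss.
    """
    agree = total = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue
        total += 1
        if (predicted[i] - predicted[j]) * ref_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


# Four synthetic utterances with 0..3 accent errors out of 4 accent phrases.
reference = [pseudo_accent_score(k, 4) for k in range(4)]
model_a = [4.1, 3.2, 2.5, 1.0]  # made-up scores that preserve the ordering
model_b = [3.9, 4.0, 3.8, 3.9]  # made-up, nearly flat scores

print(ordering_accuracy(model_a, reference))
print(ordering_accuracy(model_b, reference))
```

A model whose scores barely move with the number of accent errors, like the made-up `model_b`, ends up close to chance on this metric even if its absolute scores look reasonable.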

2606.14784 2026-06-19 cs.SD cs.LG eess.AS Version update

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

Qing Huang, Pooja Pol, Jianing Zhang

Affiliations: School of Business, Technical University of Applied Sciences Augsburg; Data Science und Autonome Systeme Technologietransferzentrum (TTZ)

AI summary: Proposes using large language models (LLMs) and in-context learning (ICL) to automatically generate emotion-related synthetic ground truth from streaming speech data in multi-user VR environments, addressing the difficulty of annotating team collaboration states.

Comments: https://icaiit.org/paper.php?paper=14th_ICAIIT_2/3_9

Abstract

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.
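
The retrieval-based selection described at the end of this abstract can be sketched as a nearest-neighbour lookup: rank a pool of labelled demonstrations by cosine similarity to the query in acoustic feature space, then place the top few in the prompt. The feature vectors, emotion labels, prompt layout, and function names here are all hypothetical; the abstract does not specify the features, the similarity measure, or the prompt format.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_demonstrations(query, pool, k=2):
    """Indices of the k pool vectors most similar to the query."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine_similarity(query, pool[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(demo_ids, transcripts, labels, query_transcript):
    """Assemble a few-shot prompt from the selected demonstrations."""
    blocks = [f"Transcript: {transcripts[i]}\nEmotion: {labels[i]}" for i in demo_ids]
    blocks.append(f"Transcript: {query_transcript}\nEmotion:")
    return "\n\n".join(blocks)


# Toy 3-dimensional vectors stand in for real acoustic features.
pool = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
transcripts = ["clip 0", "clip 1", "clip 2", "clip 3"]
labels = ["calm", "calm", "tense", "frustrated"]  # made-up labels

query = [1.0, 0.05, 0.0]
demo_ids = select_demonstrations(query, pool, k=2)
prompt = build_prompt(demo_ids, transcripts, labels, "new utterance")
print(prompt)
```

Because the demonstrations are chosen per query, the prompt changes as the incoming speech changes, which is what lets the approach adapt without any parameter updates.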

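2606.19579 2026-06-19 cs.SD cs.AI New submission

FlowFake: Liquid Networks for Audio Deepfake Detection

Shivaay Dhondiyal, Divyansh Sharma, Dinesh Kumar Vishwakarma

Affiliations: Delhi Technological University

AI summary: To address the failure of cross-dataset generalization in audio deepfake detection, proposes FlowFake, a model based on the Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE with adaptive time constants, surpassing existing methods on a cross-domain benchmark with 34K parameters.

Comments: Accepted at the Workshop on Learning to Listen: Machine Learning for Audio at ICML 2026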

Abstract

Audio deepfakes generated by neural text-to-speech and voice-cloning systems threaten speaker verification and public discourse at scale. The core challenge is cross-dataset generalization: detectors trained on one synthesis pipeline collapse on unseen forgeries. We argue that this failure arises primarily because structural synthetic-speech artifacts are multi-timescale trajectory anomalies, whereas every existing detector aggregates fixed-window frame statistics, which misaligns the architecture with the signal. We propose FlowFake, a Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE, with per-neuron adaptive time constants simultaneously resolving spectral (10ms) and prosodic (2s) cues. At only 34K parameters, FlowFake achieves formal BIBO stability and O(dt^4) integration error. On a four-dataset cross-domain benchmark (ASVspoof2019-LA, FakeOrReal, InTheWild, MLAAD), FlowFake reaches 75.29% on ASVspoof2019 trained only on FakeOrReal and 79.97% trained only on MLAAD. It outperforms RawGAT-ST and Whisper-DF on every evaluated pair and matches SSL Wav2vec2 (300x larger) at 0.01% of its parameter count. The source code is available at https://github.com/GhostRider2023/FlowFake
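
For readers unfamiliar with liquid time-constant cells, the sketch below shows the generic LTC update from the liquid-network literature: a state ODE dx/dt = -(1/tau + f) x + f A, advanced with a fused semi-implicit Euler step. It is not FlowFake's code. The weights, sizes, and input are made up, and the step used here is first order, so it is not the integrator behind the O(dt^4) error quoted in the abstract.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(x, u, w_rec, w_in, bias, tau, A, dt):
    """One fused semi-implicit Euler step of the LTC ODE

        dx/dt = -(1/tau + f) * x + f * A,  f = sigmoid(W_rec x + w_in u + b).

    Because the gate f multiplies x, each neuron's effective time constant
    tau / (1 + tau * f) varies with the input.
    """
    n = len(x)
    new_x = []
    for i in range(n):
        pre = bias[i] + w_in[i] * u + sum(w_rec[i][j] * x[j] for j in range(n))
        f = sigmoid(pre)
        new_x.append((x[i] + dt * f * A[i]) / (1.0 + dt * (1.0 / tau[i] + f)))
    return new_x


# Two neurons with a fast and a slow base time constant; the values echo the
# 10 ms and 2 s scales in the abstract but are otherwise arbitrary.
tau = [0.01, 2.0]
A = [1.0, -1.0]
w_rec = [[0.0, 0.5], [-0.5, 0.0]]
w_in = [1.0, 1.0]
bias = [0.0, 0.0]

x = [0.0, 0.0]
trajectory = []
for step in range(200):
    u = math.sin(2.0 * math.pi * step * 0.01)  # 1 Hz toy input signal
    x = ltc_step(x, u, w_rec, w_in, bias, tau, A, dt=0.01)
    trajectory.append(x)

print(trajectory[-1])
```

With this update each state stays within the magnitude of its `A` entry when started from zero, which gives an intuition for why bounded inputs yield bounded states in this family of cells.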

2603.16941 2026-06-19 eess.AS cs.CL cs.SD Version update

The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki, James Caverlee, Ondřej Klejch, Peter Bell, Gustav Eje Henter, Éva Székely

Affiliations: Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden; Centre for Speech Technology Research, University of Edinburgh, UK; Texas A&M University, USA

AI summary: Using 2,880 controlled interactions, this study evaluates intersectional accent and gender bias in three SpeechLLMs across six English accents and two gender presentations. It finds that Eastern European accents, especially female-presenting voices, receive lower helpfulness scores, and that human evaluators are more sensitive than LLM judges.

Comments: 5 pages, 3 figures, 1 table, Accepted to Interspeech 2026

Abstract

Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect recurring directional disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. Responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, showing stronger accent-level contrasts.
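
Best-Worst Scaling, mentioned in this abstract, is commonly scored by simple counting: for each item, take the number of times it was chosen best minus the number of times it was chosen worst, divided by how often it was shown. The sketch below implements that counting rule on made-up judgments; the paper's exact scoring procedure and data are not given in the abstract, so everything beyond the counting rule is an assumption.

```python
from collections import Counter


def best_worst_scores(trials):
    """Counting-based Best-Worst Scaling.

    trials: iterable of (items_shown, best_item, worst_item).
    Returns {item: (#best - #worst) / #appearances}, a value in [-1, 1].
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, b, w in trials:
        shown.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / n for item, n in shown.items()}


# Made-up judgments over responses to four hypothetical voice conditions.
trials = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "B", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "A", "D"),
]
scores = best_worst_scores(trials)
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, scores[item])
```

Because every judgment is comparative, the resulting scores reflect relative standing among conditions rather than an absolute rating scale.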

2507.19137 2026-06-19 eess.AS cs.AI cs.SD Version update

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss

Affiliations: Idiap Research Institute; The University of Texas at Austin

AI summary: Through analysis of conversational speech, the study finds that perceived personality varies significantly across work situations and identifies acoustic features associated with each personality trait.

Comments: Accepted to IEEE Transactions on Affective Computing

Abstract

Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual's perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.

2509.04390 2026-06-19 eess.AS cs.SD Version update

Accelerated Interactive Auralization of Highly Reverberant Spaces using Graphics Hardware

Hannes Rosseel, Toon van Waterschoot

Affiliations: KU Leuven, Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing

Comments: 9 pages, 6 figures, submitted to Journal of the Audio Engineering Society

1. Speech Recognition and Keyword Spotting (8 papers)

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

S-JEPA : Soft Clustering Anchors for Self-Supervised Speech Representation Learning

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

2. Speech Synthesis and Sound Generation (8 papers)

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

RIVET: Robust Idempotent Voice Attribute Editing

Exploring Pre-training Benefits on Phoneme Addition through Fine-tuning in Speech Synthesis

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

3. Speech Enhancement, Denoising, and Audio Restoration (2 papers)

Latency-Configurable Streaming Speech Enhancement via Asymmetric Temporal Padding

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

4. Audio Event Detection and Scene Understanding (2 papers)

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

Bioacoustic Geolocation: Species Sounds as Geographic Signals

5. Multimodal Audio and Audio-Visual Learning (1 paper)

MixProLAP: Mixture-Induced Uncertainty Modeling for Probabilistic Language-Audio Pretraining

6. Low-Resource, Multilingual, and Dialectal Speech (2 papers)

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

7. Datasets, Benchmarks, and Evaluation (5 papers)

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

PolSeT: Polish Semantics of Timbre Dataset

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

8. Security, Privacy, and Audio Deepfakes (2 papers)

FlowFake: Liquid Networks for Audio Deepfake Detection

The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

9. Other / General Speech and Audio (2 papers)

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

Accelerated Interactive Auralization of Highly Reverberant Spaces using Graphics Hardware