arXivDaily: arXiv daily academic digest, updated Monday to Friday
cs.SD Sound 42

1. Speech Recognition and Keyword Spotting (1 paper)

2606.11279 2026-06-11 eess.AS cs.CL cs.LG cs.SD cross-list

Massive Open-Vocabulary Keyword Spotting

Leonor Barreiros, Raul Monteiro, Afonso Mendes, Gonçalo M. Correia

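AI summary: Proposes an open-vocabulary keyword spotting system with a much smaller memory footprint that handles massive databases without fine-tuning and, even in unseen languages, reaches entity recall comparable to uncompressed solutions.

Comments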
Accepted to Interspeech 2026
Abstract

Automatic speech recognition systems have been shown to underperform when transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms without becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.

2. Speech Synthesis and Sound Generation (3 papers)

2606.11611 2026-06-11 cs.SD new submission

SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

Peijie Chen, Wenhao Guan, Weijie Wu, Kaidi Wang, Daiyu Huang, Zhuanling Zha, Junbo Li, Jun Fang, Qingyang Hong, Lin Li

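Affiliations: School of Informatics, Xiamen University; School of Electronic Science and Engineering, Xiamen University; DiDi Global Inc.

AI summary: Proposes SARA, a dual-stream VAE that fuses a frozen SSL semantic anchor with a residual acoustic encoder to resolve the acoustic-semantic trade-off in speech tokenizers, achieving high-fidelity reconstruction and natural zero-shot TTS synthesis.

Comments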
Accepted to Interspeech 2026
Abstract

Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.

2602.00560 2026-06-11 cs.SD eess.AS updated version

Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

Yong Ren, Jiangyan Yi, Jianhua Tao, Tao Wang, Le Xu, Zhengqi Wen

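AI summary: Proposes a framework that edits content in a stable semantic space and preserves acoustic continuity with a flow-matching decoder, using Self-Consistency Rewards Group Relative Policy Optimization to achieve imperceptible text-based speech editing.

Comments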
Accepted by Interspeech 2026
Abstract

Imperceptible text-based speech editing modifies spoken content through transcript manipulation while preserving acoustic continuity. Prior acoustic-space approaches suffer from content-style entanglement, causing unstable generation and boundary artifacts. We introduce a framework guided by the principle of "Edit Content, Preserve Acoustics". Editing is conducted in a stable semantic space, while acoustic realization is handled by a Flow Matching decoder. To ensure perceptual consistency, we propose Self-Consistency Rewards Group Relative Policy Optimization, which leverages a pre-trained Text-to-Speech model as an implicit critic, together with intelligibility and duration constraints. Experiments demonstrate consistent improvements over state-of-the-art autoregressive and non-autoregressive baselines in intelligibility, robustness, and perceptual quality.
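
The group-relative step in the optimization named above can be sketched in a few lines. This is a minimal illustration that assumes the commonly used GRPO normalization (each reward minus the group mean, divided by the group standard deviation); the function name and the reward values are hypothetical, and the abstract's actual reward terms (TTS critic, intelligibility, duration) are not modeled.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled candidates."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four candidate edits of the same utterance.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Candidates that score above the group average receive positive advantages, so the group itself serves as the baseline.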

2603.11678 2026-06-11 eess.AS cs.SD updated version

RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis

Yongjoon Lee, Jung-Woo Choi

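AI summary: Proposes Relativistic Adversarial Feedback (RAF), a training objective that uses speech self-supervised models and relativistic pairing to improve the in-domain fidelity and generalization of GAN vocoders, outperforming the LSGAN-trained BigVGAN with 88% fewer parameters.

Comments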
Accepted to Interspeech 2026 Long paper track. Code: this https URL
Abstract

We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
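
As a rough illustration of relativistic pairing, the sketch below scores each real waveform against a paired fake one instead of judging them in isolation. It assumes the relativistic standard GAN formulation, loss = -log sigmoid(C(real) - C(fake)); the abstract does not state RAF's exact objective, and the function names and scalar critic scores are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relativistic_d_loss(real_scores, fake_scores):
    """Discriminator loss: wants real scores above their paired fake scores."""
    pairs = list(zip(real_scores, fake_scores))
    # -log(sigmoid(r - f)) equals softplus(f - r)
    return sum(softplus(f - r) for r, f in pairs) / len(pairs)

def relativistic_g_loss(real_scores, fake_scores):
    """Generator loss: wants fake scores above their paired real scores."""
    pairs = list(zip(real_scores, fake_scores))
    return sum(softplus(r - f) for r, f in pairs) / len(pairs)
```

Because only score differences matter, the two losses are mirror images of each other, and both equal log 2 when real and fake are scored identically.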

3. Speaker Recognition, Verification, and Separation (2 papers)

2606.11795 2026-06-11 eess.AS cs.SD cross-list

Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency

Shota Horiguchi, Marc Delcroix, Naohiro Tawara, Takanori Ashihara, Atsushi Ando

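AI summary: Addresses the loose predicted boundaries caused by training on loosely annotated data: causal and anticausal models generate tight pseudo labels that are refined iteratively through co-training, recovering about 70% of the effect of tight-label training and improving downstream performance.

Comments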
Accepted to Interspeech 2026 (Long Paper Track)
Abstract

Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.
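
One way to picture how two directional models could yield tighter labels is sketched below. It rests on an assumption that the abstract does not state: a causal model cannot anticipate speech before it starts, an anticausal model cannot extend speech after it ends, and a tight pseudo label is taken as the frame-wise intersection of the two. The function name and the toy frame sequences are hypothetical.

```python
def tight_pseudo_labels(causal_active, anticausal_active):
    """Frame-wise AND of two binary speech-activity sequences."""
    if len(causal_active) != len(anticausal_active):
        raise ValueError("sequences must have the same number of frames")
    return [int(c and a) for c, a in zip(causal_active, anticausal_active)]

# Toy example: true speech spans frames 3..6 (0-indexed).
causal = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]      # tight onset, loose offset
anticausal = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # loose onset, tight offset
tight = tight_pseudo_labels(causal, anticausal)
```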

2606.10046 2026-06-11 cs.SD cs.AI updated version

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

Yuxuan Chen, Haoyuan Yu, Peize He

AI summary: Uses a causal-intervention protocol to reveal a dual-pathway attention mechanism in flow-matching transformers for audio separation, and proposes LSAC, a training-free acceleration method that cuts self-attention computation by about 25% while preserving quality.

Abstract

Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.
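
The caching idea can be illustrated with a toy sampler that recomputes attention only in layers not marked as stable. This is a sketch under stated assumptions: which layers count as stable, when caching begins, and the `run_sampler` and `attention_fn` names are all hypothetical, and the savings printed by this toy setup are not the paper's figures.

```python
def run_sampler(num_steps, num_layers, stable_layers, attention_fn):
    """Toy sampler that reuses cached attention outputs in stable layers.

    Returns the number of attention computations actually performed.
    """
    cache = {}
    computed = 0
    for step in range(num_steps):
        for layer in range(num_layers):
            if layer in stable_layers and layer in cache:
                continue                     # reuse the cached result
            out = attention_fn(step, layer)  # full attention computation
            computed += 1
            if layer in stable_layers:
                cache[layer] = out           # computed once, reused afterwards
    return computed

# Hypothetical setup: 8 sampling steps, 4 layers, layer 0 treated as stable.
calls = run_sampler(8, 4, {0}, lambda step, layer: (step, layer))
```

In this setup the stable layer is computed once and the other three layers at every step, so 25 of the 32 possible attention calls are made.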

4. Speech Enhancement, Denoising, and Audio Restoration (1 paper)

2606.11631 2026-06-11 eess.AS cs.SD cross-list

Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

Jun Xu, Zhengxue Cheng, Fengxi Zhang, Yuhan Liu, Li Song, Wenjun Zhang

AI summary: Proposes ECC, an entropy-constrained codec that combines scalar quantization with a learned entropy model and achieves better rate-distortion performance than conventional and neural codecs at low bitrates.

Abstract

Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate-distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate-distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely-used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: this https URL
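
The link between latent likelihoods and rate can be made concrete with a small sketch. It assumes a common choice in learned compression, a Gaussian entropy model integrated over each unit-width quantization bin, with rate measured as negative log2 likelihood; the abstract does not specify ECC's distribution family, and all names and numbers here are hypothetical.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def estimated_bits(latents, means, scales, min_likelihood=1e-9):
    """Bits needed to code rounded latents under per-element Gaussian models."""
    total = 0.0
    for y, mu, sigma in zip(latents, means, scales):
        q = round(y)  # scalar quantization to the nearest integer
        # Probability mass of the quantization bin [q - 0.5, q + 0.5].
        p = gaussian_cdf(q + 0.5, mu, sigma) - gaussian_cdf(q - 0.5, mu, sigma)
        total += -math.log2(max(p, min_likelihood))
    return total

cheap = estimated_bits([0.1], [0.0], [0.2])   # well predicted, small scale
costly = estimated_bits([3.2], [0.0], [0.2])  # far from the predicted mean
```

A symbol whose predicted scale is small and whose value sits at the predicted mean costs almost nothing to code, which is the kind of symbol the abstract's entropy skip targets.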

5. Audio Event Detection and Scene Understanding (5 papers)

2606.11915 2026-06-11 cs.SD cs.AI new submission

Quality Adaptive Angular Margin Learning for Respiratory Sound Classification

Yoon Tae Kim, Heejoon Koo, Miika Toikkanen, June-Woo Kim

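Affiliations: RSC LAB, MODULABS, Republic of Korea; Department of Electronic Engineering, Wonkwang University, Republic of Korea; AI Convergence Research Institute, Wonkwang University, Republic of Korea

AI summary: Proposes QLung, a quality-adaptive angular-margin learning framework that derives a no-reference audio quality margin from spectral entropy and root-mean-square energy to adaptively scale angular margins and improve feature generalization, gaining 2.46% on ICBHI and achieving the strongest out-of-distribution performance on SPRSound.

Comments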
Accepted to Interspeech 2026
Abstract

We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at this https URL.
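
The two no-reference quantities named in the abstract are standard signal measures and can be computed as sketched below. This only illustrates the inputs: how QLung maps them to a margin is not described in the abstract and is not attempted here. The function names are hypothetical, and a naive DFT is used so the sketch needs no external libraries.

```python
import cmath
import math

def rms_energy(frame):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def spectral_entropy(frame):
    """Normalized entropy of the power spectrum, in [0, 1]."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):  # naive DFT over non-negative frequencies
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append(abs(s) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(power))

tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]  # pure tone
impulse = [1.0] + [0.0] * 63                                    # flat spectrum
```

A pure tone concentrates its power in one bin and has entropy near 0, while an impulse spreads power evenly and has entropy near 1.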

2606.11922 2026-06-11 cs.SD cs.AI new submission

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

Hemansh Shridhar, Miika Toikkanen, June-Woo Kim

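Affiliations: RSC LAB, MODULABS; Department of Electronic Engineering, Wonkwang University; AI Convergence Research Institute, Wonkwang University

AI summary: To address the insensitivity of AST models to localized abnormal patterns in respiratory sound classification, proposes spectral-aware layer regularization on a state-space model together with Dual-Axis Patch-Mix contrastive learning, reaching a score of 64.48% on the ICBHI benchmark, 5% above the AST baseline.

Comments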
Accepted to Interspeech 2026
Abstract

Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves a score of 64.48%, outperforming the AST baseline by 5%. Code is available at this https URL.

2606.12339 2026-06-11 cs.SD cs.RO new submission

Fast-SDE: Efficient Single-Microphone Sound Source Distance Estimation in Reverberant Environments

Jiang Wang, Runwu Shi, Yaozhong Kang, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

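Affiliations: Institute of Science Tokyo

AI summary: Proposes Fast-SDE, a lightweight single-microphone framework based on subband processing for efficient sound source distance estimation on resource-constrained robot platforms.

Comments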
To appear in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Abstract

Sound source distance estimation (SDE) is a critical capability in human-robot interaction. An inappropriate interaction distance not only reduces the reliability of speech acquisition and understanding, but also compromises the naturalness and comfort of the interaction. Most existing SDE methods rely on microphone arrays; however, multi-microphone systems typically require careful hardware synchronization, geometric calibration, and additional space and computational resources, which limits their applicability to size-constrained and compute-limited embodied platforms. To alleviate these issues, we propose Fast-SDE, a lightweight single-microphone SDE framework that is suited for deployment on robot platforms with limited computational resources and strict size constraints. Specifically, Fast-SDE employs a subband-based backbone that decomposes the frequency axis into multiple subbands, rather than processing the entire spectrum with a wide full-band backbone. A shared subband encoder then maps each subband to a compact latent representation and learns the relationship between acoustic structure and time-frequency patterns. Finally, a lightweight regression head converts the fused subband representations into the estimated distance. Extensive simulation and real-world experiments demonstrate the merits of the proposed method. To benefit the broader research community, we have open-sourced our code at this https URL.
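
The subband decomposition and shared encoder described above can be pictured with a short sketch. It assumes equal-width contiguous subbands and substitutes a trivial mean-energy function for the learned encoder; the band layout, the function names, and the stand-in encoder are all hypothetical.

```python
def split_into_subbands(spectrogram, num_subbands):
    """Split a [freq][time] spectrogram into contiguous frequency subbands."""
    num_bins = len(spectrogram)
    if num_bins % num_subbands != 0:
        raise ValueError("num_bins must be divisible by num_subbands")
    width = num_bins // num_subbands
    return [spectrogram[i * width:(i + 1) * width] for i in range(num_subbands)]

def encode_subbands(subbands, encoder):
    """Apply one shared encoder function to every subband."""
    return [encoder(band) for band in subbands]

def mean_energy(band):
    """Hypothetical stand-in encoder: mean squared magnitude of a subband."""
    values = [v for row in band for v in row]
    return sum(v * v for v in values) / len(values)

# Toy spectrogram: 8 frequency bins by 3 time frames.
spec = [[float(f)] * 3 for f in range(8)]
codes = encode_subbands(split_into_subbands(spec, 4), mean_energy)
```

Because the same encoder is reused for every band, its parameter count does not grow with the number of subbands.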

2603.03855 2026-06-11 cs.SD updated version

A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs

Taehan Lee, Jaehan Jung, Hyukjun Lee

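AI summary: A large-scale evaluation finds that, for audio LLMs in complex acoustic scenes, a higher event count lowers the true-positive rate and raises the false-positive rate, that prompts induce a trade-off between the two, and that models grow more uncertain on multi-event audio.

Comments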
6 pages, Accepted to Interspeech 2026
Abstract

Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work limited to small scale or less controlled query construction, we present a large-scale evaluation of event grounding and false alarms as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries for ground-truth detection and absent-event queries to probe hallucinations, using similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four SOTA Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers true-positive rate and raises false-positive rate, while prompts induce a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.
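
The two rates tracked in this evaluation follow directly from the yes/no answers to present-event and absent-event queries. The sketch below shows the bookkeeping; the record format and function name are hypothetical.

```python
def tpr_fpr(records):
    """Compute (true-positive rate, false-positive rate).

    records: iterable of (event_is_present, model_said_yes) booleans.
    """
    tp = fn = fp = tn = 0
    for present, said_yes in records:
        if present and said_yes:
            tp += 1          # present event correctly confirmed
        elif present:
            fn += 1          # present event missed
        elif said_yes:
            fp += 1          # absent event hallucinated
        else:
            tn += 1          # absent event correctly rejected
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Hypothetical answers: 4 present-event queries, 4 absent-event queries.
answers = [(True, True), (True, True), (True, False), (True, True),
           (False, False), (False, True), (False, False), (False, False)]
rates = tpr_fpr(answers)
```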

2606.06921 2026-06-11 cs.SD updated version

Towards Event-Robust Acoustic Scene Classification

Yiqiang Cai, Bohan Hu, Yu Yang, Pengwei Lu, Shengchen Li, Xi Shao

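Affiliations: Xi'an Jiaotong-Liverpool University; Zhongdian Zhiheng Information Technology Service Co., Ltd; China Telecom Jiangsu Branch; Nanjing University of Posts and Telecommunications

AI summary: To address the performance drop of existing acoustic scene classification systems under unknown sound events, proposes the Event-Shifted Acoustic Scene (ESAS) dataset, which injects foreground events with the help of large language models to simulate real-world environments, in order to evaluate and advance event-robust ASC research.

Comments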
Accepted to Interspeech 2026. The ESAS dataset is available at: this https URL
Abstract

This paper introduces the Event-Shifted Acoustic Scene (ESAS) dataset, a novel benchmark for evaluating the robustness of Acoustic Scene Classification (ASC) systems against unknown sound events. Existing ASC datasets typically contain recordings of clean and consistent audio, while real-world environments often include diverse and unexpected sound events. To bridge this gap, ESAS simulates real-world acoustic variability by injecting foreground sound events into background scenes with the assistance of large language models. In this work, we present the construction methodology, dataset statistics, and evaluation protocols. Furthermore, a comprehensive evaluation of state-of-the-art ASC systems is conducted using the ESAS benchmark. Experimental results reveal that existing ASC models suffer significant performance degradation when facing the event-shift challenge. The introduction of the ESAS dataset aims to drive future research toward event-robust ASC.

6. Music Information Retrieval and Music Generation (3 papers)

2606.11886 2026-06-11 cs.SD cs.OS new submission

Real-Time Language Model Jamming: A Case Study for Live Music Accompaniment Generation

Bowen Zheng, Andrew H. Yang, Jiaqi Ruan, Jia He, Xinyue Li, Yuan-Hsin Chen, Ziyu Wang, Xiaosong Ma

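Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

AI summary: Proposes StreamMUSE, a system for frame-synchronous streaming inference in a client-server architecture, and validates real-time synchronization under different latency conditions on a live music accompaniment task.

Comments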
Accepted to RTAS 2026. 14 pages, 5 figures, 3 tables
Abstract

Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synthesis, also require precise alignment between generation and external signals, both in terms of generation content and timing. We refer to this problem as frame-synchronous streaming inference. To address it, we present StreamMUSE, an inference system that performs LM generation in response to an external signal stream within a client-server architecture. The client continuously sends high-frequency inference requests based on the most recent inputs and receives outputs synchronized to the external clock, while the server executes model inference. We demonstrate the framework through a live music accompaniment task, showing how real-time synchronization can be achieved across different deployment environments with varying round-trip latencies. We further model the relationship between system hyperparameters and round-trip latency, and evaluate how different environments affect optimal configurations to achieve real-time performance. Experimental results show a consistent correspondence between system real-time performance and music quality, demonstrating the effectiveness of the proposed framework. The project is open source. Relevant code and the latest updates are available at this https URL.

2606.11903 2026-06-11 cs.SD new submission

Snapping Matters: Context-Aware Onset Refinement for Automatic Music Transcription

Abhirup Saha, Hans-Ulrich Berendes, Meinard Müller, Ben Maman

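AI summary: For weakly aligned score-audio data, proposes a context-aware onset refinement method based on bipartite graph matching that improves onset alignment and transcription accuracy in automatic music transcription.

Comments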
Published in International Computer Music Conference (ICMC) 2026
Abstract

Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level score-audio alignment methods such as dynamic time warping provide only coarse correspondence, making a local refinement step necessary. This refinement step, known as snapping, adjusts aligned score onsets using peaks in a neural onset posteriorgram and often determines whether weakly aligned score-audio pairs become usable training data at all. Despite its practical importance, snapping is typically treated as a simple post-processing heuristic and implemented with greedy local decisions. We present a systematic analysis of snapping strategies for training instrument-agnostic transcribers, demonstrating that snapping is essential for learning from weakly aligned data. Building on this, we formulate snapping as a per-pitch assignment problem and solve it via bipartite graph matching, yielding context-aware onset decisions under overlapping refinement windows and uncertain initial alignments. Extensive cross-dataset experiments across piano, chamber, and orchestral recordings show improved onset alignment and transcription accuracy over greedy snapping, with gains increasing for wider snapping windows and coarser initial alignments. Qualitative examples are provided on our project page: this https URL
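
The difference between greedy snapping and a joint assignment can be seen on a two-note toy case for a single pitch. The sketch below is illustrative only: it assumes a cost equal to the time distance between a score onset and a posteriorgram peak, charges the window width for an unsnapped onset, and uses exhaustive search in place of a real bipartite matching solver. The cost function, the penalty, and the function names are hypothetical.

```python
from itertools import permutations

def snap_greedy(onsets, peaks, window):
    """Each onset, in order, takes its nearest unused peak within the window."""
    used, result = set(), []
    for onset in onsets:
        candidates = [(abs(p - onset), j) for j, p in enumerate(peaks)
                      if j not in used and abs(p - onset) <= window]
        if candidates:
            _, j = min(candidates)
            used.add(j)
            result.append(peaks[j])
        else:
            result.append(None)  # no peak left inside the window
    return result

def snap_matching(onsets, peaks, window):
    """Exhaustive min-cost one-to-one assignment; fine for tiny inputs only."""
    options = list(range(len(peaks))) + [None] * len(onsets)
    best_cost, best = float("inf"), None
    for choice in permutations(options, len(onsets)):
        cost = 0.0
        for onset, j in zip(onsets, choice):
            if j is None:
                cost += window                    # penalty for an unsnapped onset
            elif abs(peaks[j] - onset) <= window:
                cost += abs(peaks[j] - onset)
            else:
                cost = float("inf")               # outside the refinement window
                break
        if cost < best_cost:
            best_cost = cost
            best = [None if j is None else peaks[j] for j in choice]
    return best

onsets, peaks = [1.00, 1.10], [0.80, 1.06]
greedy = snap_greedy(onsets, peaks, window=0.25)
joint = snap_matching(onsets, peaks, window=0.25)
```

Here the greedy pass lets the first onset take the peak at 1.06 and leaves the second onset unsnapped, while the joint assignment snaps both. For realistic sizes a polynomial-time solver such as SciPy's `linear_sum_assignment` would replace the exhaustive search.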

2606.12282 2026-06-11 cs.SD cs.LG new submission

PianoKontext: Expressive Performance Rendering from Deadpan Context

Dmitrii Gavrilev

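AI summary: Proposes PianoKontext, a flow-matching piano performance rendering model that aligns latent representations of score and performance with dynamic time warping to generate variable-length expressive performances.

Comments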
ICML 2026 Workshop on Machine Learning for Audio (Oral)
Abstract

Expressive performance rendering (EPR) aims to generate realistic performances constrained on sequences of notes. However, flow matching audio editing models manipulate only synchronized music samples of the same duration, limiting their understanding of expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and employ Dynamic Time Warping (DTW) in the latent space to construct paired data for training. The aligned embeddings are concatenated in DiT blocks, allowing for a simple and effective learning of the dependencies between the score and performances. Audio samples are available at our demo page: this https URL.
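
Dynamic Time Warping, used above to pair deadpan and performed sequences of different lengths, can be sketched as follows. This is the textbook algorithm on scalar sequences with an absolute-difference distance; the paper applies DTW to Music2Latent embeddings, so the inputs and the distance function here are stand-ins.

```python
def dtw_path(a, b, dist):
    """Return the minimum-cost DTW alignment path between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the aligned index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    return path[::-1]

score = [0, 1, 2]            # "deadpan" sequence
performance = [0, 1, 1, 2]   # same content, one element held longer
path = dtw_path(score, performance, lambda x, y: abs(x - y))
```

The returned path maps the second score element to two performance frames, which is how a held note shows up in the alignment.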

7. Speech Translation and Spoken Language Models (5 papers)

2606.11400 2026-06-11 cs.SD cs.AI eess.AS new submission

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

Tsung-En Lin, Hung-Yi Lee

AI summary: Proposes instruction-based vector steering, which contrasts activations under different instructions to redirect temporal attention over audio tokens, enabling training-free sound event localization that far outperforms direct prompting and random baselines.

Abstract

Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
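
The readout step, locating an event at the position of maximal steering-induced attention change, can be sketched on per-frame attention values. The fixed-width prediction window and the coverage-style overlap measure below are assumptions, since the abstract does not define how the interval or the overlap score is computed; the function names and attention values are hypothetical.

```python
def locate_event(attn_base, attn_steered, window):
    """Return (start, end) frame indices around the largest attention increase."""
    deltas = [s - b for b, s in zip(attn_base, attn_steered)]
    peak = max(range(len(deltas)), key=lambda i: deltas[i])
    start = max(0, peak - window // 2)
    end = min(len(deltas), start + window)  # end index is exclusive
    return start, end

def overlap_ratio(pred, truth):
    """Fraction of the ground-truth interval covered by the prediction."""
    inter = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    return inter / (truth[1] - truth[0])

base = [0.1] * 10                               # attention without steering
steered = [0.1] * 6 + [0.5, 0.2] + [0.1] * 2    # attention with steering
pred = locate_event(base, steered, window=4)
score = overlap_ratio(pred, (5, 9))
```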

2606.11219 2026-06-11 cs.CL cs.AI cs.SD cross-list

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Chibuzor Okocha, Christan Grant

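Affiliations: University of Florida

AI summary: Proposes five semantic and paralinguistic reasoning tasks (entailment, consistency, plausibility, accent drift, and accent restraint) to evaluate how audio language models reason under accent variation, domain shift, and semantic over-inference, revealing limitations of current evaluations.

Comments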
Accepted to ACL
Abstract

Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, text-to-audio retrieval, captioning, and question-answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over-inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model's ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. Our findings highlight critical limitations in current audio reasoning evaluations, and we hope they provide guidance for more robust and equitable ALM design and assessment.

2606.12199 2026-06-11 eess.AS cs.CL cs.SD cross-list

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Zhen Ye, Xu Tan, Yiming Li, Guangyan Zhang, Chimin Chan, Haohe Liu, Zhengxi Liu, Hongzhan Lin, Zheqi Dai, Xinshen Zhang, Peiwen Sun, Qiuqiang Kong, Wei Xue

AI总结 研究语音与文本模态差异中的时间粒度不匹配问题,提出因子化FSQ和轻量非自回归音频LM头以降低帧率,发现4.17Hz帧率结合中间层表示对齐在语音问答中表现最佳。

详情
Comments
Accepted by Interspeech 2026 long paper
AI中文摘要

口语对话模型通常以文本LLM骨干网络为基础,但在以语音而非文本为条件时,推理能力往往会下降。我们将这种模态差异部分归因于时间粒度不匹配:在语义匹配的情况下,语音标记在时间上是冗余的,且远长于文本,这稀释了每个标记的语义密度,削弱了文本原生的推理动态。我们将语音标记设计视为一个表示选择问题,并在固定信息速率下,在冻结的LLM骨干网络中扫描帧率。为了实现低帧率,我们引入了因子化FSQ和一个轻量级的非自回归音频LM头,在不牺牲高效预测的情况下将容量扩展到近300比特/帧。在消除瓶颈后,我们扫描帧率(50→2.08 Hz)和对齐深度,并观察到在4.17 Hz帧率下,结合中间层表示对齐,语音问答存在一致的最佳区域。

英文摘要

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
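The trade-off behind the frame-rate sweep can be made concrete with a little arithmetic: a lower frame rate yields fewer speech tokens per utterance, so under a fixed information rate each frame has to carry more bits. The following is a minimal sketch, not the authors' code. The 10-second utterance, the 1,200 bits/s budget, and the intermediate 12.5 Hz rate are assumptions chosen for illustration; only 50 Hz, 4.17 Hz, and 2.08 Hz appear in the abstract.

```python
def tokens_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of speech frames (tokens) produced for an utterance."""
    return round(duration_s * frame_rate_hz)


def bits_per_frame(info_rate_bps: float, frame_rate_hz: float) -> float:
    """Bits each frame must carry so the overall information rate stays fixed."""
    return info_rate_bps / frame_rate_hz


if __name__ == "__main__":
    INFO_RATE_BPS = 1200.0  # hypothetical budget, not a figure from the paper
    DURATION_S = 10.0       # hypothetical utterance length
    for hz in (50.0, 12.5, 4.17, 2.08):
        n = tokens_for(DURATION_S, hz)
        b = bits_per_frame(INFO_RATE_BPS, hz)
        print(f"{hz:6.2f} Hz -> {n:4d} tokens, {b:7.1f} bits/frame")
```

Under these assumed numbers, moving from 50 Hz to 4.17 Hz shortens the sequence by roughly a factor of twelve while each frame has to encode roughly twelve times as many bits, which is why a higher-capacity quantizer is needed at low frame rates.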

2509.15680 2026-06-11 cs.SD eess.AS 版本更新

SAM: A Mamba-2 State-Space Audio-Language Model

SAM: 一种基于 Mamba-2 状态空间的音频-语言模型

Taehan Lee, Jaehan Jung, Hyukjun Lee

AI总结 提出 SAM,一种结合 Mamba-2 骨干网络的音频-语言模型,在 AudioSet 和 AudioCaps 上以更少参数达到或超越 7B Transformer 模型性能,并系统分析了 SSM 与音频编码器输出的交互机制。

详情
Comments
6 pages, Accepted to Interspeech 2026
AI中文摘要

我们提出了 SAM,一种状态空间音频-语言模型,它将音频编码器与 Mamba-2 骨干网络集成。SAM-2.7B 在 AudioSet 上达到 21.1 mAP,在 AudioCaps 上达到 17.6 SPICE,以更少的参数匹配或超越更大的基于 Transformer 的 7B 模型。我们进一步首次提供了系统性的、表示级别的分析,研究 SSM 如何与音频编码器输出交互:(1) 联合音频编码器微调是必要的,这由准确率提升以及在不同 SSM 大小下观察到的 token 表示秩和相似性的适应所支持;(2) 尽管线性缩放,SSM 从紧凑、信息丰富的音频 token 表示中获益更多,而非过长的 token 序列;(3) 融入指令跟随监督显著提升了推理能力,将 MMAU-Sound 准确率从 22.8 提升至 56.8。通过全面的实验和分析,我们为 SSM 作为音频-语言模型的强大、可扩展骨干网络建立了实用的设计原则。

英文摘要

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.

2606.06940 2026-06-11 eess.AS cs.SD 版本更新

Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models

超越语义主导:音频语言模型中的认知情感推理与共情响应对齐

Zhixian Zhao, Shuiyuan Wang, Wenjie Tian, Jingbin Hu, Ziyu Zhang, Lei Xie

AI总结 提出CogAudio-LLM框架,通过构建LIME-440K数据集实现声学-语义解耦,设计EIPS思维链机制进行心理推理,并采用DR-SAPO优化策略平衡逻辑严谨性与共情质量,解决音频语言模型中的语义主导和情感认知不足问题。

详情
Comments
Accepted by Interspeech2026
AI中文摘要

虽然音频语言模型(ALM)表现出强大的语义理解能力,但在复杂的情感交互方面仍存在困难。具体来说,文本语义主导常常掩盖声学细微差别,而缺乏认知深度导致生成通用、与情感无关的响应。我们提出了CogAudio-LLM(此 https URL),一种新颖的认知情感推理框架。为了缓解语义主导,我们构建了LIME-440K,一个“词汇相同、多情感”的数据集,旨在促进声学-语义解耦。我们引入了EIPS,一种包含心理推理的4步思维链(CoT)机制。为了提高推理效率,多阶段训练通过监督微调显式建立EIPS,然后将这种逻辑提炼为隐式生成过程。最后,我们设计了DR-SAPO(双路径软自适应策略优化)来动态平衡CoT的逻辑严谨性与直接响应的共情质量。

英文摘要

While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM (this https URL), a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a "lexically-identical, multi-emotion" dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.

8. 多模态音频与视听学习 1 篇

2405.06995 2026-06-11 cs.SD cs.CV cs.MM eess.AS 版本更新

Benchmarking Cross-Domain Audio-Visual Deception Detection

跨域音视频欺骗检测基准测试

Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

AI总结 提出首个跨域音视频欺骗检测基准,评估不同场景下的泛化能力,并设计MM-IDGM算法和Attention-Mixer融合方法提升性能。

详情
Comments
17 pages
AI中文摘要

自动欺骗检测对于帮助人类准确评估真实性和识别欺骗行为至关重要。传统的接触式技术,如测谎仪,依赖生理信号来确定个体陈述的真实性。然而,自动欺骗检测的最新进展表明,从音频和视频模态中提取的多模态特征在公开数据集上可能优于人类观察者。尽管有这些积极发现,现有音视频欺骗检测方法在不同场景下的泛化能力仍基本未被探索。为弥补这一空白,我们提出了首个跨域音视频欺骗检测基准,使我们能够评估这些方法在现实场景中的泛化能力。我们使用了广泛采用的音频和视觉特征以及不同的架构进行基准测试,比较了单到单和多到单域泛化性能。为了进一步探究使用多个源域数据进行训练的影响,我们研究了三种域采样策略,包括域同步、域交替和逐域采样,用于多到单域泛化评估。我们还提出了一种通过最大化模态编码器之间的梯度内积来增强泛化性能的算法,称为“MM-IDGM”。此外,我们提出了Attention-Mixer融合方法来提高性能,并相信这一新的跨域基准将促进未来音视频欺骗检测的研究。

英文摘要

Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize for use in real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three types of domain sampling strategies (domain-simultaneous, domain-alternating, and domain-by-domain) for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.

9. 低资源、多语言与方言语音 5 篇

2606.11429 2026-06-11 eess.AS cs.CL cs.SD 交叉投稿

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

Gumbel-BEARD:低资源领域Whisper自监督自适应的自动层选择

Zilai Wang, Natarajan Balaji Shankar, Mohan Shi, Kaiyuan Zhang, Abeer Alwan

AI总结 提出Gumbel-BEARD框架,通过可训练的Gumbel-Softmax选择器自动选择Whisper编码器层,结合BEST-RQ自监督目标实现低资源领域自适应,在儿童语音和方言数据集上取得最先进词错误率。

详情
Comments
Accepted by Interspeech 2026
AI中文摘要

语音基础模型在低资源领域常因领域不匹配和数据稀缺而表现不佳。我们提出Gumbel-BEARD,一种领域自适应框架,通过端到端可训练的硬Gumbel-Softmax选择器自动选择Whisper编码器层。它利用BEST-RQ目标实现自监督自适应,无需手动调整即可动态适应目标声学特征。在MyST儿童语音语料库上的实验证明了其效率和可扩展性:使用10小时标注数据进行微调,我们的方法匹配了在完整133小时标注集上训练的完全监督基线。我们在MyST上使用Whisper-medium建立了8.21%的新最先进词错误率(WER),在OGI自发言语数据集上使用Whisper-small达到11.06%。在CORAAL上的评估进一步证实了对成人方言领域偏移的鲁棒性,相对WER降低高达6%,突显了我们的方法对多样低资源条件的泛化能力。

英文摘要

Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
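A hard Gumbel-Softmax selector of the kind named in the abstract can be sketched in a few lines. This is an illustrative forward pass only, in plain Python. The abstract does not describe the paper's actual selector, its parameterization, or how it couples to the BEST-RQ objective, so the function name `gumbel_softmax_hard`, the four example logits, and the temperature are all assumptions.

```python
import math
import random


def gumbel_softmax_hard(logits, tau=1.0, rng=random):
    """Forward pass of a hard Gumbel-Softmax sample over `logits`.

    Returns (one_hot, soft). `soft` is the relaxed distribution and `one_hot`
    is its argmax. In an autodiff framework the straight-through trick would
    route gradients through `soft`; plain Python has no gradients, so only the
    forward computation is shown here.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1)
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    best = max(range(len(soft)), key=soft.__getitem__)
    one_hot = [1.0 if i == best else 0.0 for i in range(len(soft))]
    return one_hot, soft


if __name__ == "__main__":
    layer_logits = [0.1, 2.0, -1.0, 0.5]  # hypothetical scores for 4 encoder layers
    choice, probs = gumbel_softmax_hard(layer_logits, tau=0.5, rng=random.Random(0))
    print("selected layer:", choice.index(1.0), "relaxed probs:", probs)
```

Lower temperatures push the relaxed distribution toward one-hot, while the added noise keeps the choice stochastic during training so that more than one layer gets explored.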

2606.11681 2026-06-11 cs.CL cs.SD 交叉投稿

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

UR-BERT:通过通用罗马化和语音标记预测扩展大规模多语言TTS的文本编码器

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

发表机构 * Dept. of Electronics and Electrical Engineering, Yonsei University(延世大学电子与电气工程系)

AI总结 提出UR-BERT,一种基于罗马化转录的TTS编码器,通过统一书写系统为罗马化表示,结合语音标记预测目标,在495种语言上实现高效多语言TTS,优于现有基线并泛化到未见语言。

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

我们提出UR-BERT,一种基于罗马化转录的文本到语音(TTS)编码器,用于大规模多语言TTS系统。传统的字素到音素(G2P)方法由于可靠G2P资源的可用性,仅限于约100种语言。相比之下,UR-BERT通过将多样化的书写系统统一为共享的罗马化表示,扩展到495种语言。为了进一步增强语音保真度和文本-语音对齐,我们在训练过程中引入了一个语音标记预测目标,这促使编码器以数据高效的方式学习语音感知的语音表示。实验表明,基于UR-BERT构建的TTS系统在广泛的语言和资源条件下,始终优于最近的文本编码器基线,并展现出对未见语言的强大泛化能力。

英文摘要

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

2604.16287 2026-06-11 cs.SD 版本更新

NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages

NaijaS2ST:低资源尼日利亚语言的多口音语音到语音翻译基准

Marie Maltais, Yejin Jeon, Min Ma, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Maryam Ibrahim Mukhtar, Daud Abolade, Joel Okepefi, Johnson Sewedo, David Ifeoluwa Adelani

AI总结 针对低资源非洲语言语音翻译数据稀缺问题,构建了涵盖伊博语、豪萨语、约鲁巴语和尼日利亚皮钦语的平行语音数据集NaijaS2ST(每语种约50小时),并系统评估了级联、端到端及音频大模型方法,发现少样本音频大模型在语音到文本翻译中更优,而语音到语音翻译中所有范式性能相近。

详情
Comments
Preprint
AI中文摘要

低资源语言的语音翻译仍然受到高质量、多样化平行语音数据稀缺的根本限制,这一挑战在非洲语言背景下尤为突出。为解决此问题,我们引入了NaijaS2ST,一个平行语音翻译数据集,涵盖伊博语、豪萨语、约鲁巴语和尼日利亚皮钦语,并与英语配对。该数据集每种语言包含约50小时的语音,并捕捉了说话人和口音的显著变化,反映了现实的多语言和多口音条件。利用NaijaS2ST,我们对级联、端到端(E2E)和基于AudioLLM的方法在双向翻译设置中进行了全面基准测试。我们的结果表明,具有少样本示例的音频大模型在语音到文本翻译中比基于微调数据的级联和端到端方法更有效。然而,对于语音到语音翻译,级联和音频大模型范式性能相当,表明在此设置下开发针对性的任务特定模型仍有相当大的改进空间。通过提供高质量数据集和系统基准,我们希望NaijaS2ST能成为推动低资源多语言语音翻译研究的有力基础。

英文摘要

Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope that NaijaS2ST will serve as a strong foundation for advancing research in low-resource, multilingual speech translation.

2606.10360 2026-06-11 cs.SD 版本更新

ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning

ViP-VL:基于向量量化学习的越南语自监督语音预训练模型

Khanh Le, Kiet Anh Hoang, Bao Nguyen, Duy Vo, Dung Vo, Thai Tran, Linh Pham, Khoa D Doan

AI总结 提出ViP-VL模型,通过声学堆叠、感受野对齐和掩码选择策略,在BEST-RQ框架上实现高效自监督预训练,在越南语ASR、情感识别、方言分类和说话人验证四项任务上取得最优结果。

详情
Comments
Accepted to INTERSPEECH 2026
AI中文摘要

我们提出了ViP-VL,一种高效的越南语自监督语音预训练模型,利用向量量化学习。为了弥合高分辨率音频与高效处理之间的差距,ViP-VL在ChunkFormer架构中引入了声学堆叠和感受野对齐,实现了同步的8倍下采样率,同时通过在BEST-RQ框架上的预训练中采用专门的掩码选择策略,进一步增强了表示的鲁棒性。在17,000小时未标注的越南语语音上预训练后,我们的模型在自动语音识别、语音情感识别、方言分类和说话人验证四个主要下游任务上建立了新的最优结果。为了促进未来研究和高性能越南语语音技术的发展,我们在此http URL公开发布了预训练权重和实现。

英文摘要

We present ViP-VL, an efficient Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning. To bridge the gap between high-resolution audio and efficient processing, ViP-VL incorporates Acoustic Stacking and Receptive Field Alignment to enable a synchronized 8x subsampling rate within the ChunkFormer architecture, while further enhancing representation robustness through a specialized Mask Selection Strategy during pretraining on the BEST-RQ framework. Pretrained on 17,000 hours of unlabeled Vietnamese speech, our model establishes new state-of-the-art results across four major downstream tasks: Automatic Speech Recognition, Speech Emotion Recognition, Dialect Classification, and Speaker Verification. To facilitate future research and the development of high-performance Vietnamese speech technologies, we publicly release our pretrained weights and implementation at this http URL.

2606.06065 2026-06-11 cs.CL cs.SD eess.AS 版本更新

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

多任务学习还不够:双输出第二语言语音识别中的表示纠缠

Seung Hwan Cho, Young-Min Kim

AI总结 针对双输出第二语言语音识别,研究发现多任务学习导致表面转录性能下降,归因于编码器级别的表示纠缠,尤其在英语中随表面-意义差异增大而加剧。

详情
Comments
5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for Audio
AI中文摘要

第二语言(L2)语音识别通常需要发音转录和预期意义的转录。多任务学习(MTL)是一种自然的方法,因为它假设共享表示对两个输出都有益。然而,本文表明这一假设在韩语和英语中并不成立。MTL提高了意义转录但降低了表面转录,尤其是在英语中,性能下降与通过Levenshtein编辑距离测量的表面-意义差异成正比。编码器分析将这些模式与编码器级别的纠缠联系起来,韩语保留了不同的任务表示,而英语产生了几乎相同的表示。跨任务解码器分析表明,意义双输出解码器适应了独特的表示,而表面双输出解码器仍受编码器约束。这些发现促使设计能够减轻编码器级别纠缠的MTL框架,以减少双输出L2自动语音识别中的表面性能下降。

英文摘要

Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.
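The surface-meaning divergence mentioned above is measured with Levenshtein edit distance, which can be computed with the standard dynamic-programming recurrence. The sketch below is generic. Whether the paper computes the distance over characters, phonemes, or words is not stated in the abstract, and the example sentences are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b.

    Works on any two sequences (strings for character level, lists of words
    for word level).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    surface = "I go to school yesterday".split()    # hypothetical surface transcript
    meaning = "I went to school yesterday".split()  # hypothetical intended meaning
    print(levenshtein(surface, meaning))
```

A larger distance between the two transcripts of the same utterance corresponds to a larger surface-meaning divergence.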

10. 数据集、基准与评测 5 篇

2606.11260 2026-06-11 cs.SD cs.AI 新提交

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

RAIL: 基于CHC框架重新思考大型音频语言模型中的听觉智能

Hongyu Jin, Siyi Wang, Yang Xiao, Jiaheng Dong, Shihong Tan, Kaiyuan Peng, Georgiana Juravle, Shanquan Chen, Gongping Huang, Hong Jia, Eun-Jung Holden, James Bailey, Ting Dang

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院) Faculty of Psychology and Educational Sciences, Alexandru Ioan Cuza University of Iași(亚历山德鲁伊万库扎大学心理学与教育科学学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院) School of Public Health, The University of Hong Kong(香港大学公共卫生学院) School of Computer Science, The University of Auckland(奥克兰大学计算机科学学院) Department of Data Science and Artificial Intelligence, Monash University(莫纳什大学数据科学与人工智能系)

AI总结 提出RAIL基准,基于CHC认知框架将听觉智能分解为五种核心能力,构建结构化评估任务,系统评测大型音频语言模型的认知行为。

详情
AI中文摘要

人类通过紧密集成的认知能力(如音频感知、音频推理和记忆)处理丰富的听觉环境。尽管大型音频语言模型(LALMs)在语音理解和多模态音频推理方面取得了近期进展,但当前的评估范式仍然主要围绕任务或模态,关注最终性能而忽视了潜在的听觉认知行为。这揭示了人类听觉认知理解与LALMs评估之间的根本差距,特别是缺乏将认知原则操作化到任务级指标之外以系统捕捉模型行为的框架。在这项工作中,我们引入了RAIL,一种基于Cattell-Horn-Carroll(CHC)认知框架的以人为中心的评估范式。RAIL将听觉认知形式化为五种核心能力,并将其发展为结构化评估任务,探究模型如何处理、保留和整合听觉信息。我们进一步构建了一个认知基础的基准,包含原则性数据收集和人类对齐的评估协议。评估26个最先进的LALMs,我们发现当前模型在认知能力上表现出高度不平衡的性能。RAIL建立了一种新的评估范式,从以任务为中心的基准测试转向基于认知的听觉智能评估。

英文摘要

Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.

2606.11514 2026-06-11 cs.SD 新提交

CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech

CS-YODAS:一个挖掘自真实环境的语码转换语音数据集

Brian Yan, Qingzheng Wang, Matthew Wiesner, Anuj Diwan, Olga Iakovenko, Alexander Polok, Injy Hamed, Shuichiro Shimizu, Iris Emerman, Thomas Hain, David R. Mortensen, Peter Viechnicki, Shinji Watanabe

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Johns Hopkins University(约翰霍普金斯大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Sheffield(谢菲尔德大学) Brno University of Technology(布尔诺理工大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Kyoto University(京都大学)

AI总结 本文提出CS-YODAS数据集,通过可扩展的人机协同流程从多语言YouTube数据中挖掘真实语码转换语音,涵盖7种基质语言共313小时,并分析其分布特征与语言对切换模式。

详情
AI中文摘要

我们提出CS-YODAS,一个基于Creative Commons许可的数据集,包含从多语言YouTube数据中挖掘的真实环境语码转换语音。语码转换(CS),即在话语或对话中交替使用不同语言,在多语言环境中很常见,但在现有的CS语音资源中代表性不足,这些资源通常规模小、领域特定或人为构建。基于YODAS语料库,我们开发了一个可扩展的人机协同流程,用于识别和验证自然发生的语码转换。最终数据集总计313小时,涵盖7种基质语言,提供了多样化的真实世界自发性语码转换语音示例。我们进一步分析了真实环境中语码转换的分布和特征,考察了语言对频率和切换模式,并报告了口语语言识别的基线结果。我们希望CS-YODAS能够促进对语码转换语音更广泛和全面的研究。数据集链接:此https URL。

英文摘要

We present CS-YODAS, a Creative Commons-licensed dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: this https URL.

2606.11581 2026-06-11 eess.AS cs.SD 交叉投稿

Sensitivity Analysis of Generative Spatial Audio Metrics: A Study on Responsiveness, Smoothness, and Symmetry

生成式空间音频指标的敏感性分析:响应性、平滑性和对称性研究

Purnima Kamath, Adrian S. Roman, Koichi Saito, Yuki Mitsufuji, Juan P. Bello

AI总结 提出一个框架分析生成式空间音频指标对空间参数变化的敏感性,定义响应性、平滑性和对称性三个期望属性,评估标准指标后发现FAD和声学地图表现最佳。

详情
Comments
Accepted for publication at Interspeech 2026
AI中文摘要

由于对指标如何响应方位角和仰角等空间参数变化的理解有限,评估一阶环绕声(FOA)的生成式空间音频仍然具有挑战性。我们借鉴参数化声音合成中的敏感性分析原理,提出了一个沿连续空间轨迹分析指标敏感性的框架。通过使用复杂度递增的受控FOA场景,我们定义了指标行为的三个期望属性:响应性、平滑性和对称性。我们评估了标准基于分布和基于样本的指标,包括Fréchet音频距离(FAD)、强度向量和声学地图。我们的发现表明,使用定位特定嵌入和声学地图的FAD在不同条件下具有高响应性以及稳健的平滑性和对称性,而强度向量随着场景复杂度的增加而退化。这是研究生成式空间音频指标敏感性的第一步。

英文摘要

Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spatial trajectories, drawing on principles of sensitivity analysis in parametric sound synthesis. Using controlled FOA scenes with increasing scene complexity, we define three desiderata for metric behavior: Responsiveness, Smoothness, and Symmetry. We assess standard distribution-based and sample-based metrics, including Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps. Our findings show that FAD using localization-specific embeddings and acoustic maps yield high Responsiveness and robust Smoothness and Symmetry across conditions, while intensity vectors degrade with increasing scene complexity. This is the first step towards investigating the sensitivity of metrics for generative spatial audio.

2606.05394 2026-06-11 cs.SD eess.AS 版本更新

nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies

nnAudio 2: 克服动态编译障碍与变换不一致性

Abhinaba Roy, Junyi Liang, Dorien Herremans

AI总结 针对 nnAudio 在 TorchScript 编译、逆变换边缘情况和依赖漂移方面的问题,通过移除动态状态变异、限制逆变换适用范围并更新依赖,实现了与现代 PyTorch 和 SciPy 的兼容,提升了可微音频分析的鲁棒性。

详情
AI中文摘要

nnAudio 是一个用于深度学习的开源音频特征提取工具箱,但在当前环境中,其使用受到 TorchScript 不兼容、逆变换边缘情况和依赖漂移的阻碍。我们针对现代 PyTorch 和科学 Python 进行了有针对性的现代化改造。我们通过从脚本化代码路径中移除动态状态变异和模块构造,并收紧逆相关辅助函数中的参数处理,解决了 STFT 和 iSTFT 中的 TorchScript 编译失败问题。我们通过将可靠逆变换限制为均匀 bin 设置(freq_scale='no'),并对不支持的频率尺度引发显式运行时错误,澄清了逆 STFT 行为,防止了静默退化的重构。我们恢复了与现代 SciPy 的 CFP 兼容性,并确保当 gamma = 0 时 VQT 简化为 CQT。回归测试涵盖了新的 STFT/iSTFT 行为,更新后的代码库在现代 Python 环境中通过了完整的仓库测试套件。这些改进为研究和部署中的可微音频分析提供了更坚实的基础。

英文摘要

nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.
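The statement that VQT reduces to CQT when gamma = 0 follows from the usual variable-Q definition, in which the bandwidth of bin k is B_k = alpha * f_k + gamma and the quality factor is Q_k = f_k / B_k. The sketch below is standalone Python and does not call or mirror nnAudio's API. The function name `vqt_q_factors`, the parameter values, and the formula used for alpha are assumptions based on the common VQT formulation, not taken from the nnAudio source.

```python
def vqt_q_factors(fmin, n_bins, bins_per_octave, gamma):
    """Quality factor Q_k = f_k / B_k for each bin of a variable-Q transform.

    Assumes bandwidth B_k = alpha * f_k + gamma, with
    alpha = 2**(1/b) - 2**(-1/b) for b bins per octave.
    """
    alpha = 2.0 ** (1.0 / bins_per_octave) - 2.0 ** (-1.0 / bins_per_octave)
    q_factors = []
    for k in range(n_bins):
        f_k = fmin * 2.0 ** (k / bins_per_octave)  # geometrically spaced centers
        bandwidth = alpha * f_k + gamma
        q_factors.append(f_k / bandwidth)
    return q_factors


if __name__ == "__main__":
    cqt_like = vqt_q_factors(fmin=32.7, n_bins=24, bins_per_octave=12, gamma=0.0)
    vqt_like = vqt_q_factors(fmin=32.7, n_bins=24, bins_per_octave=12, gamma=20.0)
    print("gamma = 0 :", round(min(cqt_like), 4), round(max(cqt_like), 4))
    print("gamma = 20:", round(min(vqt_like), 4), round(max(vqt_like), 4))
```

With gamma = 0 every bin has the same Q (the constant-Q case), while a positive gamma widens low-frequency filters and lets Q grow with frequency.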

2510.23320 2026-06-11 eess.AS cs.CL cs.SD 版本更新

LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization

LibriConvo:从阅读文献模拟对话用于ASR和说话人日志

Máté Gedeon, Péter Mihajlik

AI总结 提出LibriConvo合成对话语音语料库,基于说话人感知模拟对话框架构建,用于说话人日志和ASR基准测试,包含240.1小时音频,基线实验显示Sortformer在日志中优于pyannote,Fast Conformer-CTC在ASR中优于Whisper。

详情
Comments
Accepted by TSD 2026
AI中文摘要

我们介绍了LibriConvo,一个用于说话人日志和自动语音识别(ASR)的合成对话语音语料库,通过在数据集和基准测试设置中实例化先前提出的说话人感知模拟对话(SASC)框架构建而成。本文的主要贡献是基于该框架的语料库构建流程和基准测试。为了使数据更适合下游ASR和说话人日志,我们使用外部语音活动检测从英语CallHome估计对话时间统计信息,压缩长停顿,按书籍分组LibriTTS话语以改善局部语义连续性,并通过空间合理性启发式选择房间脉冲响应。生成的语料库包含240.1小时的音频,涉及830个说话人的1496个对话,划分为说话人不重叠的训练、验证和测试集。我们报告了说话人日志和ASR的基线结果。在测试集上,Sortformer在说话人日志中优于pyannote流水线(DER 11.1%对比24.4%)。对于ASR,使用序列化输出训练微调的Fast Conformer-CTC XLarge模型实现了7.29%的WER和6.97%的cpWER,优于零样本Whisper-large-v3。这些结果使LibriConvo成为研究合成对话语音和评估多说话人语音处理系统的实用基准。

英文摘要

We introduce LibriConvo, a synthetic conversational speech corpus for speaker diarization and automatic speech recognition (ASR), built by instantiating the previously proposed Speaker-Aware Simulated Conversation (SASC) framework in a dataset and benchmarking setting. The main contribution of this paper is a corpus construction pipeline and benchmark derived from that framework. To make the data more suitable for downstream ASR and diarization, conversational timing statistics are estimated from English CallHome using external voice activity detection, long pauses are compressed, LibriTTS utterances are grouped by book to improve local semantic continuity, and room impulse responses are selected with a spatial-plausibility heuristic. The resulting corpus contains 240.1 hours of audio across 1,496 dialogues involving 830 speakers, partitioned into speaker-disjoint train, validation, and test splits. We report baseline results for both diarization and ASR. On the test split, Sortformer outperforms the pyannote pipeline in diarization (11.1% vs. 24.4% DER). For ASR, a Fast Conformer-CTC XLarge model fine-tuned with Serialized Output Training achieves 7.29% WER and 6.97% cpWER, outperforming zero-shot Whisper-large-v3. These results position LibriConvo as a practical benchmark for studying synthetic conversational speech and for evaluating multi-speaker speech processing systems.

11. 安全、隐私与深度伪造音频 4 篇

2606.11666 2026-06-11 cs.SD 新提交

The Hidden Cost of Pairwise Verification in Synthetic Speech Source Tracing

合成语音源追踪中成对验证的隐藏成本

Anton Firc, Zbyněk Lička, Vojtěch Staněk, Kamil Malinka

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 研究比较全局锚定与成对验证在合成语音源追踪中的性能,发现成对验证导致嵌入方差集中、分辨率降低,从而在域内和域外任务中表现更差。

详情
Comments
Accepted at Interspeech 2026
AI中文摘要

开放集源追踪日益被框定为验证问题,促使使用来自生物特征识别的成对度量学习目标。因此,我们在匹配的骨干网络以及固定的数据和epoch预算下,在MLAAD(域内)和STOPA(域外)上比较全局锚定和成对验证。在我们的运行中,全局锚定产生的域内错误率(8.61% EER)低于成对变体(12-15% EER),即使使用对抗挖掘和XLS-R微调也是如此。由于成对目标直接优化相似性,它们将方差集中到更少的嵌入方向上,降低了紧密相关生成器之间的分辨率。为了测试这是否导致了性能下降,我们对全局监督基线施加了类似的瓶颈,但基线仍然具有竞争力。结合嵌入空间分析($k_{99}$),这些结果表明差距不能仅由维度解释,而是由成对目标对保留方向的塑造所致。

英文摘要

Open-set source tracing is increasingly framed as a verification problem, motivating the use of pairwise metric-learning objectives from biometrics. We thus compare global anchoring and pairwise verification under matched backbones and a fixed data and epoch budget on MLAAD (in-domain) and STOPA (out-of-domain). In our runs, global anchoring yields lower in-domain error (8.61% EER) than pairwise variants (12-15% EER), even with rival mining and XLS-R finetuning. Because pairwise objectives optimize similarity directly, they concentrate variance into fewer embedding directions, reducing resolution among closely related generators. To test if this drives the drop, we impose a similar bottleneck to the globally supervised baseline, yet the baseline remains competitive. Together with an embedding-space analysis ($k_{99}$), these results suggest that the gap is not explained by dimensionality alone, but rather by the pairwise objective's shaping of the retained directions.
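The EER figures quoted above are equal error rates: the operating point where the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how such a value can be approximated from two lists of verification scores. The abstract does not describe how the paper forms trials or computes its EER, and the example scores are made up.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER over a finite set of trials.

    Scans every observed score as a threshold (accept if score >= threshold)
    and returns the mean of FAR and FRR where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance: a non-matching trial scores at or above the threshold
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a matching trial scores below the threshold
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


if __name__ == "__main__":
    same_source = [0.9, 0.8, 0.4, 0.7]  # hypothetical scores, same generator
    diff_source = [0.1, 0.2, 0.6, 0.3]  # hypothetical scores, different generators
    print(f"EER = {equal_error_rate(same_source, diff_source):.2%}")
```

On small trial lists the two error rates rarely cross exactly, which is why the sketch reports the midpoint at the closest threshold instead of interpolating.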

2606.11674 2026-06-11 cs.SD cs.LG 新提交

SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing

SpAArSIST: 用于高效可靠反欺骗的稀疏化AASIST

Anton Firc, Vojtěch Staněk, Zbyněk Lička, Kamil Malinka, Martin Perešíni

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出SpAArSIST,通过稀疏化图池化后端,在保持竞争力的同时降低计算量20.7%、模型大小4.1%,并提升域外鲁棒性。

详情
Comments
Accepted at Interspeech 2026
AI中文摘要

我们提出了SpAArSIST,这是对广泛使用的基于自监督学习(SSL)的反欺骗方法AASIST图池化后端的面向部署的改进。受公共实现中冗余操作的启发,我们用显式的轻量级选择替换了学习池化和堆栈节点注意力:分离的训练和推理图池化比率$(k_{\mathrm{tr}},k_{\mathrm{inf}})$、基于幅度的节点评分以及图节点的均值聚合。最佳整体配置(排名第一)将后端计算削减了20.7%(从195.045M MACs降至154.706M MACs),模型大小减少了4.1%(从611.8k参数降至586.4k参数),同时将在In-the-Wild上的域外鲁棒性提升至2.82% EER和0.078 minDCF(原为4.64%和0.133),并在ASVspoof5上保持竞争力。我们还提供了一个综合选择分数,总结了准确性、校准和计算量,以支持平衡的面向部署的模型选择。

英文摘要

We present SpAArSIST, a deployment-oriented refinement of the widely used AASIST graph pooling backend for self-supervised learning (SSL) based anti-spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack-node attention with explicit, lightweight choices: separate train and inference graph pooling ratios $(k_{\mathrm{tr}},k_{\mathrm{inf}})$, magnitude-based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M $\rightarrow$ 154.706M MACs) and model size by 4.1% (611.8k $\rightarrow$ 586.4k params), while improving out-of-domain robustness on In-the-Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment-oriented model choice.
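The percentage savings quoted in the abstract can be checked directly from the before and after figures it reports. The sketch below only redoes that arithmetic. The helper name `relative_reduction` is hypothetical and the inputs are the rounded values printed in the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


# Figures as quoted in the abstract (MACs in millions, parameters in thousands).
macs_saving = relative_reduction(195.045, 154.706)
param_saving = relative_reduction(611.8, 586.4)
eer_saving = relative_reduction(4.64, 2.82)  # In-the-Wild EER, in percent

if __name__ == "__main__":
    print(f"backend MACs:    {macs_saving:.1%}")
    print(f"parameters:      {param_saving:.2%}")
    print(f"In-the-Wild EER: {eer_saving:.1%} relative")
```

The compute figure reproduces the stated 20.7%. The rounded parameter counts give about 4.15%, close to the stated 4.1%, with the small difference presumably coming from rounding in the published counts. The EER change corresponds to roughly a 39% relative reduction.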

2606.11828 2026-06-11 cs.SD cs.AI cs.CR cs.MM 新提交

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

特征对齐的语音水印技术以抵抗重建失真

Haiyun Li, Shuhai Peng, Zhisheng Zhang, Jingran Xie, Xiaofeng Xie, Hanyang Peng, Zhiyong Wu

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Shenzhen Key Laboratory of Intelligent Media and Content Understanding(深圳市智能媒体与内容理解重点实验室) Tencent AI Lab(腾讯人工智能实验室)

AI总结 提出特征对齐水印方法,通过将水印与原始语音特征分布对齐,在保持不可感知性的同时提高水印能量,增强对语音重建模型的鲁棒性。

详情
Comments
Accepted by ICME2026
AI中文摘要

音频水印旨在将可识别信息嵌入音频中同时保持不可感知性。现有方法采用高保真、低能量设计以保持感知质量,但由此产生的水印在语音重建模型的抑制下缺乏鲁棒性。由于现有设计中固有的鲁棒性-保真度权衡,提高鲁棒性具有挑战性,增加水印能量会提高鲁棒性但降低保真度。为解决此问题,我们提出一种特征对齐的水印方法,将水印与原始语音特征分布对齐,允许更高的水印能量以提高鲁棒性同时保持不可感知性。我们使用预训练的语音编解码器生成伪语音水印,并将其融合到输入音频的频谱图中,通过VAD损失和感知损失引导在浊音区域嵌入。实验表明,我们的方法在保持与现有方法相当的不可感知性的同时,在见过和未见过的语音重建模型下均显著提高了鲁棒性。

英文摘要

Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.

2510.01157 2026-06-11 cs.CL cs.CR cs.SD updated version

Where Do Backdoors Live? A Component-Level Analysis of Backdoor Propagation in Speech Language Models

后门藏身何处?语音语言模型中后门传播的组件级分析

Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal, Peter West

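AI summary: Through the lens of backdoor attacks, this paper carries out a component-level analysis of speech language models. It shows how backdoors propagate across components, that backdoor persistence depends heavily on the targeted component, and that poisoned and benign samples are not directly separable in the shared embeddings.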

Details
Comments
Interspeech 2026 (long paper)
AI Chinese abstract

语音语言模型(SLM)是系统的系统:独立组件联合起来实现共同目标。尽管其异构性,SLM 通常被端到端研究;信息如何流经管道仍然模糊。我们通过后门攻击的视角研究这一问题。我们首先确定后门可以通过 SLM 传播,使所有任务高度脆弱。由此,我们设计了一个组件分析来发现每个组件在后门学习中的作用。我们发现后门的持久性或擦除高度依赖于目标组件。除了传播,我们研究了后门如何在共享的多任务嵌入中被编码,表明中毒样本与良性样本不可直接分离,挑战了过滤防御中常用的可分离性假设。我们的发现强调需要将多模态管道视为具有独特脆弱性的复杂系统,而不仅仅是单模态系统的扩展。

English abstract

Speech language models (SLMs) are systems of systems: independent components that unite to achieve a common goal. Despite their heterogeneous nature, SLMs are often studied end-to-end; how information flows through the pipeline remains obscure. We investigate this question through the lens of backdoor attacks. We first establish that backdoors can propagate through the SLM, leaving all tasks highly vulnerable. From this, we design a component analysis to discover the role each component takes in backdoor learning. We find that backdoor persistence or erasure is highly dependent on the targeted component. Beyond propagation, we examine how backdoors are encoded in shared multitask embeddings, showing that poisoned samples are not directly separable from benign ones, challenging a common separability assumption used in filtering defenses. Our findings emphasize the need to treat multimodal pipelines as intricate systems with unique vulnerabilities, not solely extensions of unimodal ones.

12. Other / general speech and audio, 7 papers

2606.11836 2026-06-11 cs.SD cs.AI eess.AS new submission

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

面向语音基础模型的无数据无训练压缩:基于参数聚类的方法

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

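Affiliations: The Chinese University of Hong Kong; National Research Council Canada

AI summary: Proposes a data-free, training-free compression method based on k-means channel clustering, with finer-grained mixed-sparsity pruning obtained by varying the number of parameter clusters per layer. On HuBERT-large and Whisper-large-v3 it yields substantially lower WER than magnitude-based pruning.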

Details
Comments
Accepted by Interspeech 2026
AI Chinese abstract

本文提出了一种新颖的无数据无训练压缩方法,用于语音基础模型,该方法通过k-means进行通道级聚类。还探索了更细粒度的混合稀疏剪枝,通过层间不同数量的参数簇实现。在LibriSpeech数据集上进行的实验表明,当对HuBERT-large进行50%的剪枝稀疏度操作时,在微调前,测试干净和测试其他子集上,相对于基于幅度的剪枝,获得了27.73%/18.61%绝对(34.37%/21.91%相对)的一致WER降低;在仅3个epoch的微调后,获得了0.19%/0.79%绝对(3.36%/4.62%相对)的降低。在Whisper-large-v3上,在10%稀疏度下,相对于基于幅度的剪枝,观察到2.86%/5.02%绝对(59.21%/55.29%相对)的类似WER降低,所有这些相对于未压缩基线均没有显著的WER增加。

English abstract

This paper presents a novel data-free and training-free compression approach for speech foundation models using channel-wise clustering via k-means. Finer-grained, mixed-sparsity pruning, in which the number of parameter clusters varies from layer to layer, is also explored. Experiments on the LibriSpeech dataset suggest that, at a pruning sparsity of 50% on HuBERT-large, the approach yields consistent WER reductions over magnitude-based pruning of 27.73%/18.61% absolute (34.37%/21.91% relative) on the test-clean and test-other subsets before fine-tuning, and of 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning for only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
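To make the idea of channel-wise clustering concrete, the sketch below runs plain Lloyd's k-means over the rows (output channels) of a small weight matrix and replaces each channel with its cluster centroid, so that only `num_clusters` distinct channels need to be stored. This is a generic illustration under stated assumptions, not the paper's procedure: the abstract does not say how clusters map to pruning sparsity, and the function names, the deterministic initialisation, and the toy matrix are all hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_channels(weight, num_clusters, iterations=20):
    """Plain k-means over the rows of `weight`; returns (centroids, assignment)."""
    # Deterministic init (assumption): use the first `num_clusters` channels.
    centroids = [list(row) for row in weight[:num_clusters]]
    assignment = [0] * len(weight)
    for _ in range(iterations):
        # Assignment step: each channel goes to its nearest centroid.
        assignment = [
            min(range(num_clusters), key=lambda c: squared_distance(row, centroids[c]))
            for row in weight
        ]
        # Update step: each centroid becomes the mean of its member channels.
        for c in range(num_clusters):
            members = [row for row, a in zip(weight, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def reconstruct(centroids, assignment):
    """Rebuild a full-size matrix in which every channel is its cluster centroid."""
    return [list(centroids[a]) for a in assignment]


weight = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [4.8, 5.2]]
centroids, assignment = cluster_channels(weight, num_clusters=2)
compressed = reconstruct(centroids, assignment)
```

No data or gradient steps are involved: the clustering looks only at the weights themselves, which is what makes this kind of scheme data-free and training-free. Varying `num_clusters` per layer would give the layer-level mixed setting mentioned in the abstract.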

2606.11197 2026-06-11 eess.AS cs.AI cs.CL cs.SD cross-list

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

MA-DLE: 基于记忆增强的语音自动抑郁程度估计

Xuzhi Wang, Xinran Wu, Ziping Zhao, Jianhua Tao, Björn W. Schuller

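AI summary: Proposes a memory-augmented feature method that selectively integrates historical temporal features and dynamic memory features and fuses them through a hierarchical attention fusion module, achieving state-of-the-art performance on the DAIC-WOZ and E-DAIC datasets.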

Details
Comments
Accepted at IEEE TAC
AI Chinese abstract

基于语音的抑郁程度自动估计对于实现早期检测和及时干预至关重要,尤其是在资源受限的心理健康环境中。近年来,深度学习在包括情感计算和心理健康评估在内的多个领域取得了显著成功。现有方法大多依赖基于RNN的架构(如LSTM和GRU)来建模时间信息以进行抑郁估计。然而,提取的特征往往只强调少数相邻语音片段,限制了其捕捉长程依赖的能力。为克服这一局限,我们引入了一种基于记忆的特征增强方法,以增强GRU提取特征的表示能力。我们的记忆库并非不加区分地整合历史数据,而是设计为选择性整合两类组件以减少冗余和不相关性:(1) 与当前GRU输出高度相似的历史时序特征,提供互补的上下文信息;(2) 基于特征变异性识别的动态记忆特征,捕捉指示抑郁症状的行为和情绪波动。为有效融合记忆增强特征与GRU输出,我们进一步设计了层次注意力融合(HAF)模块。我们的方法在广泛使用的DAIC-WOZ和E-DAIC数据集上进行了评估,取得了最先进的性能。

English abstract

Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.
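The two selection rules described for the memory bank can be sketched roughly as follows. This is only an illustration under assumptions: cosine similarity as the resemblance measure, the step-to-step Euclidean change as the variability measure, and all function names are hypothetical, since the abstract does not specify them.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar(history, current, top_k):
    """Indices of the stored features most similar to the current GRU output."""
    ranked = sorted(range(len(history)), key=lambda i: cosine(history[i], current), reverse=True)
    return sorted(ranked[:top_k])


def select_dynamic(history, top_k):
    """Indices of the stored features that changed most from the previous step."""
    ranked = sorted(
        range(1, len(history)),
        key=lambda i: math.dist(history[i], history[i - 1]),
        reverse=True,
    )
    return sorted(ranked[:top_k])


# Toy sequence of GRU outputs (hypothetical values).
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
current = [1.0, 0.05]
similar_idx = select_similar(history, current, top_k=2)
dynamic_idx = select_dynamic(history, top_k=1)
```

Here the similarity rule picks the two early frames that resemble the current output, while the variability rule picks the frame where the sequence shifts abruptly, which mirrors the split between contextual and fluctuation-oriented memory described above.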

2606.11766 2026-06-11 eess.AS cs.AI cs.CL cs.SD cross-list

Fast Speech Foundation Model Distillation Using Interleaved Stacking

快速语音基础模型蒸馏使用交错堆叠

Eungbeom Kim, Kyogu Lee

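AI summary: Proposes interleaved stacking to speed up distillation training of speech foundation models. Preserving layer positions during stacking addresses the performance degradation of existing stacking methods, and the approach is validated on SUPERB.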

Details
Comments
Accepted by Interspeech 2026
AI Chinese abstract

将大型语音基础模型(SFM)蒸馏为高效的学生模型已成功应用于低资源环境。尽管蒸馏减少了推理延迟,但它需要额外的学生模型训练。然而,SFM蒸馏的训练效率仍未得到充分探索。在这项工作中,我们探索了SFM蒸馏的训练加速以加快模型部署。我们研究了堆叠的潜力,其中模型深度通过训练逐步增加,直到达到目标模型深度。虽然现有的堆叠方法提高了训练速度,但它们遭受性能下降。为了解决这一限制,我们提出了交错堆叠,一种新颖的堆叠方法,在整个堆叠过程中始终保持层位置。这一特性在SFM中尤为关键,因为每一层编码了不同的层特定知识。我们在SUPERB上验证了所提方法的有效性。

English abstract

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of the student model, and the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased during training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To address this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

2606.11875 2026-06-11 cs.CL cs.SD cross-list

I Understand How You Feel: Enhancing Deeper Emotional Support Through Multilingual Emotional Validation in Dialogue System

我理解你的感受:通过对话系统中的多语言情感验证增强深层情感支持

Zi Haur Pang, Yahui Fu, Koji Inoue, Tatsuya Kawahara

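Affiliations: Graduate School of Informatics, Kyoto University

AI summary: Studies emotional validation in dialogue systems. Builds the multilingual corpus M-EDESConv and the test set M-TESC, designs MEGUMI, a multilingual emotion-aware gated unit, for timing detection, and evaluates how current LLMs perform at generating validating responses.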

Details
Comments
This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2026 (SIGDIAL 2026)
AI Chinese abstract

情感验证——明确承认用户的感受是合理的——已被证明具有治疗价值,但很少受到计算方面的关注。对话系统中的情感验证可以分解为:(i) 验证响应识别,(ii) 验证时机检测,以及 (iii) 验证响应生成。为了支持所有三个子任务的研究,我们发布了 M-EDESConv,一个通过混合手动和自动标注创建的 12 万条英日多语言语料库,以及 M-TESC,一个多语言口语对话测试集。对于时机检测,我们提出了 MEGUMI,一种多语言情感感知门控单元用于相互融合,它通过跨模态注意力和门控融合将冻结的 XLM-RoBERTa 语义与特定语言的情感编码器融合。MEGUMI 在 M-EDESConv 和 M-TESC 数据集上均表现出优越的性能,无论是客观还是主观评价。最后,我们的 EmoValidBench 基准测试(使用 GPT-4.1 Nano 和 Llama-3.1 8B)表明,当前的 LLM 能够生成上下文相似且多样化的验证响应,但情感理解仍然是一个需要改进的主要领域。项目页面:this https URL

English abstract

Emotional validation, that is, explicitly acknowledging that a user's feelings make sense, has proven therapeutic value but has received little computational attention. Emotional validation in dialogue systems can be decomposed into (i) validating response identification, (ii) validation timing detection, and (iii) validating response generation. To support research on all three subtasks, we release M-EDESConv, a 120k English-Japanese multilingual corpus created through hybrid manual and automatic annotation, and M-TESC, a multilingual spoken-dialogue test set. For timing detection, we propose MEGUMI, a Multilingual Emotion-aware Gated Unit for Mutual Integration, which fuses frozen XLM-RoBERTa semantics with language-specific emotion encoders via cross-modal attention and gated fusion. MEGUMI shows superior performance on both the M-EDESConv and M-TESC datasets in both objective and subjective evaluation. Finally, our EmoValidBench benchmarks of GPT-4.1 Nano and Llama-3.1 8B indicate that current LLMs generate contextually similar and diverse validating responses, but emotional understanding remains a major area for improvement. Project page: this https URL
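The gated fusion step mentioned for MEGUMI can be pictured with a small sketch. It shows only the general gating pattern, in which a sigmoid gate decides per dimension how much of the semantic feature versus the emotion feature passes through. The per-dimension scalar gate, the parameter names, and the toy values are assumptions; the abstract does not give the actual architecture, and the cross-modal attention stage is not modelled here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, emotion, gate_weights, gate_bias):
    """Per dimension: g = sigmoid(w_s * s + w_e * e + b); output = g * s + (1 - g) * e."""
    fused = []
    for s, e, (w_s, w_e), b in zip(semantic, emotion, gate_weights, gate_bias):
        g = sigmoid(w_s * s + w_e * e + b)
        fused.append(g * s + (1.0 - g) * e)
    return fused


semantic = [1.0, 0.0]  # stands in for a frozen text-encoder feature (hypothetical)
emotion = [0.0, 2.0]   # stands in for an emotion-encoder feature (hypothetical)

# Zero weights and bias give g = 0.5, i.e. a plain average of the two streams.
neutral = gated_fusion(semantic, emotion, [(0.0, 0.0), (0.0, 0.0)], [0.0, 0.0])

# A large positive bias pushes the gate towards the semantic stream.
semantic_heavy = gated_fusion(semantic, emotion, [(0.0, 0.0), (0.0, 0.0)], [10.0, 10.0])
```

In a trained model the gate parameters would be learned, so the balance between the two streams can differ per input and per dimension.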

2605.28882 2026-06-11 cs.CL cs.AI cs.SD updated version

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

GrowLoop: 由人类种子驱动的自进化对话评估

Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Yue Liu

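AI summary: Targeting three challenges in evaluating human-likeness in open-ended conversation (tacit knowledge, disagreement over criteria, and continual evolution), proposes GrowLoop, a self-evolving evaluation system that iteratively extracts evaluation rubrics from minimal human seed annotations via Heuristic Learning and uses a Rubric-Case co-evolution mechanism to keep adapting as models advance and scenarios shift.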

Details
AI Chinese abstract

随着大语言模型的快速发展,评估开放域对话中的类人性变得越来越重要。然而,类人性是一种隐性知识,人类可以直观感知,但其背后的标准难以明确表述。人类判断差异很大,在某些情况下高度一致,在其他情况下则存在合理分歧。同时,人类判断背后的标准仍然是隐性的,没有明确的基础来构建案例。此外,什么算作类人并非一成不变,而是随着模型能力和人类期望而演变。尽管在评估方法上取得了进展,如专家编写的基准、奖励模型和自进化基准,但没有一种方法能同时解决这三个挑战。因此,我们提出了GrowLoop,一个自进化的对话评估系统,能够随着模型进步和场景变化而持续适应。以最小的人工种子标注作为初始动力,LLM代理通过启发式学习迭代提取和细化评估标准。在标注者意见一致的地方要求人机一致,而在意见分歧的地方只要求合理性。此外,标准-案例协同进化机制实现了持续进化,当评估目标发生变化时,通过新的种子进行扩展。应用于开放域对话中的类人性评估,生成的标准不仅在与人判断的一致性上显著优于现有方法,而且还发现了标注者忽略的问题。由此产生的基准能够有效区分不同能力层级的模型,并揭示其不足之处,同时能够泛化到新场景并随着模型进步而适应。我们的工作将基准测试范式从手动更新或难度扩展转变为全面、持续的自我进化。

English abstract

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-likeness is not static but evolves with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, reward models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. Starting from minimal human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution. When the evaluation target shifts, new human seeds expand the system's coverage accordingly. When applied to human-likeness evaluation in open-ended conversation, the AI judge guided by these rubrics not only substantially outperforms existing methods in alignment with human judgments, but also uncovers issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.

2406.07909 2026-06-11 eess.AS cs.CL cs.SD stat.ML

Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation

Eungbeom Kim, Hantae Kim, Kyogu Lee

Details
Comments
Accepted by Interspeech 2024
English abstract

The Transformer encoder with the connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR suffers from disagreement between the teacher and student models in frame-level alignment, which ultimately hinders it from improving the student model's performance. To resolve this problem, this paper introduces a self-knowledge distillation (SKD) method that guides the frame-level alignment during training. In contrast to the conventional method, which uses separate teacher and student models, this study introduces a simple and effective method that shares encoder layers and uses the sub-model as the student model. Overall, our approach is effective in improving both resource efficiency and performance. We also conducted an experimental analysis of the spike timings to illustrate that the proposed method improves performance by reducing the alignment disagreement.
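The alignment disagreement discussed in this abstract can be made concrete with a frame-level distillation loss. The sketch below uses the per-frame KL divergence between teacher and student token posteriors, which is one common choice for frame-level KD. It is an assumption for illustration rather than the loss used in the paper, and the toy logits over a three-symbol vocabulary (blank, "a", "b") are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_level_kd_loss(teacher_logits, student_logits):
    """Mean over frames of KL(teacher || student) on per-frame token posteriors."""
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame)
        q = softmax(s_frame)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)


# Vocabulary order: [blank, "a", "b"]. The teacher emits its "a" spike at frame 1.
teacher = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
# This student emits the same token, but one frame later.
late_student = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]

aligned_loss = frame_level_kd_loss(teacher, teacher)
misaligned_loss = frame_level_kd_loss(teacher, late_student)
```

Both sequences would decode to the same label sequence under CTC, yet the frame-level loss is large for the late student because its spike sits on a different frame. That gap is the kind of teacher-student alignment disagreement the abstract refers to.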

2305.13108 2026-06-11 eess.AS cs.CL cs.LG cs.SD

Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test

Eungbeom Kim, Yunkee Chae, Jaeheon Sim, Kyogu Lee

Details
Comments
Accepted by Interspeech 2023
English abstract

Automatic speech recognition systems based on deep learning are mainly trained under empirical risk minimization (ERM). Since ERM uses the averaged performance over the data samples regardless of group membership, such as healthy or dysarthric speakers, ASR systems are unaware of the performance disparities across groups. This results in biased ASR systems with severe performance differences among groups. In this study, we aim to improve the ASR system in terms of group robustness for dysarthric speakers. To achieve this goal, we present a novel approach, sample reweighting with sample affinity test (Re-SAT). Re-SAT systematically measures the debiasing helpfulness of a given data sample and then mitigates the bias by reweighting samples according to that helpfulness. Experimental results demonstrate that Re-SAT contributes to improved ASR performance on dysarthric speech without performance degradation on healthy speech.
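A toy calculation shows why a plain ERM average can hide a group disparity and how sample reweighting changes the objective. The loss values and weights below are made up for illustration, and the weights are only stand-ins for Re-SAT's helpfulness scores, whose computation the abstract does not describe.

```python
def mean_loss(losses):
    """Plain ERM objective: the unweighted average over all samples."""
    return sum(losses) / len(losses)


def reweighted_loss(losses, weights):
    """Weighted average of per-sample losses, normalised by the total weight."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


healthy = [0.1] * 9   # hypothetical per-sample losses, healthy speakers
dysarthric = [1.0]    # hypothetical per-sample loss, dysarthric speaker
losses = healthy + dysarthric

erm = mean_loss(losses)                   # dominated by the majority group
weights = [1.0] * 9 + [9.0]               # hypothetical up-weighting of the minority sample
balanced = reweighted_loss(losses, weights)
```

The ERM average is about 0.19 even though the dysarthric sample has a loss of 1.0. With the minority sample up-weighted, the objective rises to about 0.55, so training pressure shifts towards the under-served group.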